IDA 7.0: Internationalization (i18n)

Intended audience

This document describes an important change made to the code during the development of IDA version 7.0: the move to UTF-8 everywhere.

This is mostly of interest to plugin authors, whether they write binary or IDAPython plugins.

Before 7.0

Prior to version 7.0, IDA would store the following strings:

into the database, using the local 8-bit encoding. Specifically:

The problem

There were many issues with this, but the most obvious ones are:

The solution

While we were knee-deep in the refactorings we did for 7.0, we decided it was a good time to improve the situation, and we did so by imposing UTF-8 everywhere in IDA: any string that passes through IDA's memory is now encoded in UTF-8.

Plugin writers who need to support more than just the ASCII subset of Unicode code points will certainly find the current situation more comfortable.

What about my older databases?

Because a byte is a byte, and without additional context it's impossible to know how that byte should be interpreted, we had to resort to heuristics when a database is ported to the 7.0 format.
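To illustrate why a heuristic is needed, here is a minimal Python sketch (not IDA code) that decodes the same bytes under three candidate codepages. The byte string and the `interpretations` helper are made up for this example; the codec names are the ones Python ships with.

```python
# One byte, three plausible readings: without knowing the source
# encoding, there is no way to tell which character was intended.
raw = b"pass\x82"  # hypothetical bytes, as an old database might hold them


def interpretations(data, encodings=("cp850", "cp1252", "cp866")):
    """Decode the same bytes under each candidate encoding."""
    return {enc: data.decode(enc) for enc in encodings}


readings = interpretations(raw)
for enc, text in readings.items():
    # ascii() keeps the output printable on any console;
    # the last column is the UTF-8 form of that particular reading.
    print(enc, ascii(text), text.encode("utf-8"))
```

Only the CP850 reading yields the French word "passé"; the other two readings are equally valid byte-wise, but produce a different last character.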

The following will be converted during a database upgrade:

And, for any of those, here's how IDA will decide whether or not a conversion is needed:

...then IDA will convert that data to UTF-8, using the following rule:

  1. if a conversion encoding has been specified (see below), use that
  2. otherwise
    1. if on Windows, assume the source is in the locale's codepage(s)
    2. otherwise (i.e., on Linux or OSX), assume the codepage(s) are those of "Western Europe" -- hopefully covering most databases
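The rule above can be sketched in Python as follows. This is only an illustration of the decision order, not IDA's actual implementation: the function names, the parameters, and the choice of CP850 as the single "Western Europe" fallback are assumptions made for this example (the real upgrade distinguishes OEM and ANSI codepages).

```python
def pick_source_encoding(configured=None, on_windows=False,
                         locale_codepage=None, fallback="cp850"):
    """Return the encoding to assume for pre-7.0 string data.

    Hypothetical helper mirroring the documented order:
      1. an explicitly specified conversion encoding wins;
      2. on Windows, use the locale's codepage (supplied by the caller);
      3. elsewhere, fall back to a "Western Europe" codepage
         (CP850 here, which is an assumption of this sketch).
    """
    if configured:
        return configured
    if on_windows and locale_codepage:
        return locale_codepage
    return fallback


def to_utf8(raw, encoding):
    """Reinterpret legacy bytes in `encoding` and re-encode them as UTF-8."""
    return raw.decode(encoding).encode("utf-8")


# Example: no directive given, not on Windows, so the fallback applies.
encoding = pick_source_encoding()
converted = to_utf8(b"pass\x82", encoding)
print(encoding, converted)
```

If the guessed encoding does not match the one the data was written in, `to_utf8` still succeeds for most single-byte codepages, but produces the wrong characters, which is why IDA shows a sample for you to check.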

And if such a conversion happens (and no conversion encoding was specified), IDA will display a sample of the converted text in the messages list, so that you can check whether it did the right thing. E.g.,

#######################################################################
# It appears some comments in this database contain
# localized characters, that need conversion to UTF-8.
# From your system, IDA guessed the source encoding might be: "CP850"
# Here is a sample of text that was translated into that encoding:
#
#  "Mais, quand d'un passé ancien rien ne subsiste, après la mort des êtres, après la destruction des choses, seules, plus frêles mais plus vivaces, plus immatérielles, plus persistantes, plus fidèles"
#
# If this does not look correct, please see UPGRADE_CPSTRINGS_*
# directives in ida.cfg for more info.
#######################################################################

(In this particular case, IDA converted a test function comment which, in 6.95, would have been stored using the OEM codepage CP850, i.e., "Western Europe".)

As you can see at the bottom of the message above, IDA hints at the UPGRADE_CPSTRINGS_* configuration directives. Let's have a look at those.

Specifying the source encoding for conversion to UTF-8 using UPGRADE_CPSTRINGS_*

In case IDA got it wrong, and either:

...then IDA will improperly interpret those bytes, and the conversion to UTF-8 will yield wrong, possibly garbled results.

In that case, you can help IDA by specifying:

  1. UPGRADE_CPSTRINGS_SRCENC
  2. UPGRADE_CPSTRINGS_SRCENCA

to tell it which encodings/codepages to use for data that's encoded using the OEM codepage or the ANSI codepage, respectively.

Please have a look in the cfg/ida.cfg file for more documentation regarding those directives.

Specifying those on the command-line

It's also worth pointing out that those directives can be passed on the command line, using the following syntax:

    ida -dUPGRADE_CPSTRINGS_SRCENC=CP866 /path/to/some.idb

In this case, IDA will use CP866 (i.e., the Cyrillic OEM codepage) to convert those strings that are stored using the OEM codepage.
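If you upgrade many databases from a script, the command line can be assembled programmatically. The following Python sketch only builds the argument list using the `-d` syntax shown above; `build_ida_command` and its parameters are hypothetical names chosen for this example, and the `ida` executable name is taken from the command above.

```python
def build_ida_command(database, oem_encoding=None, ansi_encoding=None,
                      ida="ida"):
    """Build an argument list that passes UPGRADE_CPSTRINGS_* via -d.

    Hypothetical helper: it does not launch IDA, it only returns the
    arguments, e.g. for use with subprocess.run().
    """
    args = [ida]
    if oem_encoding:
        # Source encoding for data stored using the OEM codepage
        args.append("-dUPGRADE_CPSTRINGS_SRCENC=" + oem_encoding)
    if ansi_encoding:
        # Source encoding for data stored using the ANSI codepage
        args.append("-dUPGRADE_CPSTRINGS_SRCENCA=" + ansi_encoding)
    args.append(database)
    return args


command = build_ida_command("/path/to/some.idb", oem_encoding="CP866")
print(" ".join(command))
```

Keeping the arguments as a list, rather than a single string, avoids quoting problems when the database path contains spaces.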