IDA 7.0: Internationalization (i18n)

Intended audience

This document describes an important change that happened in the code while designing IDA version 7.0: the move to using UTF-8 everywhere.

This is mostly of interest to plugin authors, either binary or IDAPython plugins.

Before 7.0

Prior to version 7.0, IDA would store the following strings:

segment names
function names
function local labels
comments fonr function, segment & address
database notepad contents
...

into the database, using the local 8-bit encoding. Specifically:

on Windows: usually, the OEM codepage, but sometimes the ANSI codepage (for the database notepad)
on Linux & OSX: the locale's 8-bit encoding, which in most cases ended up being UTF-8

The problem

There were many issues with this, but the most obvious ones are:

inability to pass an IDB to someone using a different locale, and have sensible non-ASCII text
it would sometimes be unclear what kind of encoding was used, for what transiting data (e.g., for plugins)

The solution

While we were knee-deep in the refactorings we did for 7.0, we decided it was a good time to improve the situation, and we did so by imposing UTF-8 everywhere in IDA: any string that transits within IDA's memory, is now encoded in UTF-8.

Plugin writers needing to support more than just the ASCII subset of Unicode code points, will certainly find the current situation more comfortable.

What about my older databases?

Because a byte is a byte, and without additional context it's impossible to know how that byte should be interpreted, we had to resort to heuristics when a database is ported to the 7.0 format.

The following will be converted during a database upgrade:

function comments
decompiler pseudo-C function comments
segment comments
hidden areas descriptions
structures comments
structure members comments
enum comments
enum members comments
database notepad contents
script snippets

And, for any of those, here's how IDA will decide whether or not a conversion is needed:

if the contents is not valid UTF-8, or
if a conversion encoding has been specified (see below)

..then IDA will convert that data to UTF-8, using the following rule:

if a conversion encoding has been specified (see below), use that
otherwise
1. if on windows, assume the source is in the locale's codepage(s)
2. otherwise (i.e., on Linux or OSX), assume the codepage(s) are those of "Western Europe" -- hopefully covering most databases

And if such a conversion happens (and no conversion encoding was specified), IDA will display an example, post-conversion, text in the messages list for you to figure out whether it did the right thing or not. E.g.,

#######################################################################
# It appears some comments in this database contain
# localized characters, that need conversion to UTF-8.
# From your system, IDA guessed the source encoding might be: "CP850"
# Here is a sample of text that was translated into that encoding:
#
#  "Mais, quand d'un passé ancien rien ne subsiste, après la mort des êtres, après la destruction des choses, seulesplus frêles mais plus vivaces, plus immatérielles, plus persistantes, plus fidèles"
#
# If this does not look correct, please see UPGRADE*CPSTRINGS**
# directives in ida.cfg for more info.
#######################################################################

(in this particular case, IDA converted a test function comment, which in 6.95 would be stored using the OEM codepage CP850 (i.e., "Western Europe"))

As you can see at the bottom of the message above, IDA hints at the UPGRADE_CPSTRINGS_* configuration directives. Let's have a look at those.

Specifying the source encoding for conversion to UTF-8 using `UPGRADE_CPSTRINGS_*`

In case IDA got it wrong, and either:

your locale's codepages on Windows don't correspond to those of the machine where the IDB was created, or
you are on Linux or OSX and the IDB was created on a Windows machine with a locale that features codepages other than those for "Western Europe"

...then IDA will improperly interpret those bytes, and the conversion to UTF-8 will yield wrong, possibly garbled results.

In that case, you can help IDA by specifying:

UPGRADE_CPSTRINGS_SRCENC
UPGRADE_CPSTRINGS_SRCENCA

in order to instruct it what encodings/codepages should be used for data that's encoded using the OEM codepage or the ANSI codepage, respectively.

Please have a look in the cfg/ida.cfg file for more documentation regarding those directives.

Specifying those on the command-line

It's also worth pointing out that those directives can be passed on the command line, using the following syntax:

    ida -dUPGRADE_CPSTRINGS_SRCENC=CP866 /path/to/some.idb

In this case, IDA will use CP866 (i.e., Cyrillic OEM codepage) to perform the conversion of those string that are stored using the OEM codepage.