Converting Unicode to ASCII using Python

July 18, 2006 at 12:54 pm | In Thoughts

I’ve been working on a data migration task to take data from an ADT repository and put it into a Fedora-based repository. It was the most complex data migration that I’ve undertaken to date, primarily because I needed to integrate with tools such as JHOVE and extract the text from the PDF files. I leveraged the work I’d already done in this area and used modules that I’d created before. I even created a new module to allow me to construct an OAI 2.0 Dublin Core XML datastream.

I had an issue where Unicode characters were not flowing into the repository correctly. I only discovered this when I tested my objects in the repository and noticed some very odd characters. The easiest solution was to encode all strings as ASCII until I can determine exactly where the fault lies. I’m not sure if it is in the source data, the conversion, the ingest into the repository, or the interface on top of the repository. I’m quickly learning that character encoding is a very tangled and murky world.

To achieve the goal of converting the text into ASCII I wrote the following small function.

import unicodedata

def convertText(text, action):
    """
    Convert a UTF-8 encoded string to plain ASCII
    - text, the text to convert (a UTF-8 encoded byte string)
    - action, what to do with characters that have no ASCII equivalent:
      'ignore', 'replace', 'xmlcharrefreplace' or 'backslashreplace'
    If it works return the converted ASCII string
    If it fails print the error and re-raise the exception
    """
    try:
        # Decode the UTF-8 bytes into a unicode object
        temp = unicode(text, "utf-8")
        # Decompose compatibility characters, then encode as ASCII
        fixed = unicodedata.normalize('NFKD', temp).encode('ASCII', action)
        return fixed
    except Exception, errorInfo:
        print errorInfo
        print "Unable to convert the Unicode characters to ASCII"
        raise

The code leverages the existing Unicode support in Python and also uses the unicodedata module.
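The function above is written for Python 2, where `unicode` is a built-in and `print` is a statement. For readers on Python 3, the same idea can be sketched roughly as follows. The name `convert_text`, the default for `action`, and the choice to accept either bytes or str are assumptions made for illustration, not part of the original module.

```python
import unicodedata


def convert_text(text, action="ignore"):
    """Python 3 sketch: reduce text to ASCII after NFKD decomposition.

    text   -- UTF-8 encoded bytes, or an already decoded str (assumption)
    action -- error handler passed to str.encode, e.g. 'ignore' or 'replace'
    """
    if isinstance(text, bytes):
        # Equivalent of unicode(text, "utf-8") in the Python 2 version
        text = text.decode("utf-8")
    decomposed = unicodedata.normalize("NFKD", text)
    # encode() returns bytes in Python 3; decode back so callers get a str
    return decomposed.encode("ascii", action).decode("ascii")


cafe = "caf\u00e9"  # "cafe" with an acute accent on the final letter
ignored = convert_text(cafe.encode("utf-8"), "ignore")
replaced = convert_text(cafe, "replace")
as_entities = convert_text(cafe, "xmlcharrefreplace")
```

Note that with 'xmlcharrefreplace' the entity refers to the combining accent (U+0301) left behind by the decomposition, not to the original precomposed letter.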

The first statement in the try block uses the built-in unicode function to convert the string passed to the convertText function into a variable of the unicode type, as opposed to just a normal string. The conversion decodes the string on the assumption that it is encoded as "utf-8". I’ve found through testing that this covers all of the characters in the metadata.
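Since the fault may lie in the source data, it can help to check whether a given byte string really is valid UTF-8 before converting it. The following is a minimal Python 3 sketch of such a check; the helper name `decode_utf8` and the sample byte strings are hypothetical.

```python
def decode_utf8(raw):
    """Return (text, None) on success, or (None, error) if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8"), None
    except UnicodeDecodeError as exc:
        return None, exc


good = b"caf\xc3\xa9"  # the accented word encoded as UTF-8
bad = b"caf\xe9"       # the same word encoded as Latin-1

good_text, good_error = decode_utf8(good)
bad_text, bad_error = decode_utf8(bad)
```

Here `good` decodes cleanly, while `bad` produces a `UnicodeDecodeError`, which is one sign that the data was not UTF-8 to begin with.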

The second statement can be broken into two parts. The first part is as follows:

unicodedata.normalize('NFKD', temp)

This part uses the normalize function of the unicodedata module to replace all compatibility characters with their equivalents, which minimises data loss when the string is later encoded as ASCII. As the Python documentation says:

normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’.

The Unicode standard defines various normalization forms of a Unicode string, based on the definition of canonical equivalence and compatibility equivalence. In Unicode, several characters can be expressed in various ways. For example, the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and normal form D. Normal form D (NFD) is also known as canonical decomposition, and translates each character into its decomposed form. Normal form C (NFC) first applies a canonical decomposition, then composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms based on compatibility equivalence. In Unicode, certain characters are supported which normally would be unified with other characters. For example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I). However, it is supported in Unicode for compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition, i.e. replace all compatibility characters with their equivalents. The normal form KC (NFKC) first applies the compatibility decomposition, followed by the canonical composition.
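The two examples from the documentation can be tried directly. This is a small Python 3 sketch using only `unicodedata.normalize`; the variable names are chosen here for illustration.

```python
import unicodedata

c_cedilla = "\u00c7"  # LATIN CAPITAL LETTER C WITH CEDILLA
roman_one = "\u2160"  # ROMAN NUMERAL ONE

# Canonical decomposition splits the letter into "C" plus COMBINING CEDILLA
nfd = unicodedata.normalize("NFD", c_cedilla)
# Canonical composition puts the two code points back together
nfc = unicodedata.normalize("NFC", nfd)

# The Roman numeral only has a compatibility mapping, so NFD leaves it
# alone while NFKD replaces it with the plain letter "I"
nfd_roman = unicodedata.normalize("NFD", roman_one)
nfkd_roman = unicodedata.normalize("NFKD", roman_one)
```

This is why NFKD suits the conversion to ASCII: it is the form that turns both kinds of characters into sequences that start with a plain ASCII letter wherever one exists.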

The second part of the line is as follows:

encode('ASCII', action)

This part uses the encode string method to encode the Unicode string into a plain ASCII string. The second parameter, action, selects the error handler. Besides the default ‘strict’, which raises an exception on any character that cannot be encoded, it can be one of four options: ‘ignore’, ‘replace’, ‘xmlcharrefreplace’ or ‘backslashreplace’. The exact meaning of these options is outlined in the Python documentation on the Codec Base Classes. The ones that I’ve used in my code are ‘ignore’, ‘replace’ and ‘xmlcharrefreplace’. To work around the issue that initially set me on this exploration I use either the ‘ignore’ or ‘replace’ options.

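The ‘ignore’ option simply means that if a character in the source string cannot be encoded in the target scheme, in this case ASCII, it is removed from the string. The ‘replace’ option means that the character is replaced with a question mark ‘?’. Because the string has already been decomposed with NFKD, an accented letter keeps its base letter and only the combining accent is removed or replaced.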
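To compare the error handlers side by side, here is a short Python 3 sketch that encodes one sample string with each of them. The sample text (a euro sign followed by the digit 5) is an arbitrary choice; the euro sign has no decomposition, so normalization would not change it.

```python
text = "\u20ac5"  # EURO SIGN followed by the digit 5

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
# In Python 3, encode() returns bytes, hence the b"..." values
results = {handler: text.encode("ascii", handler) for handler in handlers}

# The default handler, 'strict', raises instead of producing output
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

Of the four, only ‘xmlcharrefreplace’ and ‘backslashreplace’ preserve enough information to recover the original character later.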

If I ever get to the bottom of this encoding issue I will post the solution.
