Convert EUC-JP to UTF-8 in Python

So… I study Japanese. I was looking for a way to convert some EUC-JP encoded files to UTF-8 (now standard in most OSes), and found myself stuck with no tools to do so.

Searching the web, I found many language functions that convert lines, strings or chunks of bytes from one encoding to another, but what I have in hand is a plain-text file (a rather large one).

I used the Unicode mapping file for CodePage 936, available from the Unicode site, to do the trick. Not all characters are there, but it mostly works. Since I’m in quite a hurry, this one will do for now. Later I’ll try the BestFit 936 mapping to see what I get (note: the current code does not work out of the box with the BestFit file format; some tweaking is needed).

So, with the codepage in hand, I figured I could write a small Python script to convert my file to another encoding (UTF-8, in this case), and I’m sharing it with everyone (GPL license).
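For readers who want a feel for the table-driven approach, here is a minimal sketch. It is not the downloadable script: the function names `load_mapping` and `convert` are made up for illustration, and it assumes the mapping file uses tab-separated `0xCODE 0xUNICODE #comment` lines and that any byte of 0x80 or above starts a two-byte character. The two sample entries stand in for a real mapping file; the second is the pair quoted in the comments below.

```python
def load_mapping(lines):
    """Parse '0xCODE<TAB>0xUNICODE<TAB>#comment' lines into a dict (assumed format)."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or entry with no Unicode value
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def convert(data, table):
    """Translate bytes to str; bytes >= 0x80 are assumed to be lead bytes."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80 or i + 1 >= len(data):
            code, i = byte, i + 1
        else:
            code, i = (byte << 8) | data[i + 1], i + 2
        # Unknown codes become U+FFFD instead of raising an error.
        out.append(table.get(code, "\ufffd"))
    return "".join(out)


# Hypothetical two-line excerpt standing in for a real mapping file.
SAMPLE = [
    "0x41\t0x0041\t#LATIN CAPITAL LETTER A",
    "0xD0C2\t0x65B0\t#CJK UNIFIED IDEOGRAPH",
]
table = load_mapping(SAMPLE)
utf8_bytes = convert(b"A\xd0\xc2", table).encode("utf-8")
```

With a real mapping file, you would pass the open file object to `load_mapping` instead of `SAMPLE`, then write the encoded result to the destination file.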

Download

It was coded on a Mac using Python 3.1, but it also works on Windows (I’ve tested it on XP). Note that it’s quite crude, and you need to alter some settings at the beginning of the file (at least the source file and destination file).


3 Comments

  1. Posted May 3, 2010 at 5:17 pm | Permalink

    I have some old flashcard files from when I used Kingkanji on my PDA. All those files are saved as EUC-JP text files, and I would like to try and write a script that can change them to use with Anki. I’ll give it a try and let you know how it goes.

  2. Greg
    Posted June 20, 2010 at 2:28 am | Permalink

    Hi, I needed to convert some EUC-JP encoded text to UTF-8 and Google led me here. I had some trouble with code page 936, though. For example, the kanji for “new” (i.e. “shin”, or “atarashii”) should be EUC-JP code BFB7, which is E696B0 in UTF-8. The second column in the cp936 file seems to be UTF-16. “Shin” is 65B0 in UTF-16, which maps to D0C2 in the first column. I think that is the GB2312 Chinese code set. Also, the Wikipedia page on Code page 936 says it’s for Chinese (?)

    BUT, I’m a newb here so I’m just googling around trying to figure it out.

    I did find a nice solution though. “iconv” in UNIX or this Windows version:
    http://gnuwin32.sourceforge.net/packages/libiconv.htm

    Thanks.

  3. Posted June 21, 2010 at 7:50 am | Permalink

    @Greg
    Hi greg!
    Are you using Python 3.1? If you are, there is a much (much!) easier way to convert EUC-JP to UTF-8 characters from a file.
    Take the EDict file (http://www.csse.monash.edu.au/~jwb/edict.html) for example. It’s encoded in EUC-JP, and you can convert it with a few simple lines:

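    # Read raw bytes and decode each line from EUC-JP.
    with open("edict", "rb") as fp:
        for row in fp:
            # Each row keeps its own newline, so don't add another.
            print(row.decode("EUC-JP"), end="")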

    Remember to decompress the edict file first.
    Hope it helps!
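Following up on the comments above, here is a small sketch that writes the converted text to a UTF-8 file instead of printing it, using only Python's built-in codecs. The function name `convert_file` is made up for illustration. The last lines check the byte values Greg quotes for the kanji “shin”.

```python
def convert_file(src_path, dst_path, src_encoding="euc_jp"):
    """Copy a text file line by line, re-encoding it as UTF-8."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Check the values from the comment: EUC-JP BFB7 is U+65B0.
shin = b"\xbf\xb7".decode("euc_jp")
shin_utf8 = shin.encode("utf-8")      # b"\xe6\x96\xb0"
shin_gb2312 = shin.encode("gb2312")   # b"\xd0\xc2", the code found in the cp936 table
```

For the EDict example, the call would be something like `convert_file("edict", "edict.utf8")`, where the output filename is just a placeholder.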
