web developer & system programmer

coder . cl

ramblings and thoughts on programming...


charset detection with python

published: 26-05-2012 / updated: 26-05-2012
posted in: development, programming, python, tips
by Daniel Molina Wegener

How many times have you needed to detect the charset of a given file? This task is quite easy to do with Python. I am currently working on an application that needs to parse CSV files coming from both Linux and Windows systems. The problem, as usual, is the Windows files: instead of using a Unicode encoding such as UTF-8, they are exported as Windows-1250 and similar encodings. This is a big problem when you are trying to import data into Unicode-collated tables, like those using the UTF-8 encoding. In Python, the chardet module does all the magic.

So, the following code opens the file that is passed as an argument to the script, reads its contents, and passes them to the detect() function of the chardet module. The result is a dictionary with a confidence level and the detected encoding, for example: {'confidence': 1.0, 'encoding': 'UTF-8'}. The confidence level is a float between 0.0 and 1.0, and the encoding key holds the name of the detected encoding.


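#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import chardet

# Open in binary mode: chardet.detect() works on raw bytes.
with open(sys.argv[1], 'rb') as file_handle:
    content = file_handle.read()

# The result is a dictionary holding the detected encoding and its confidence.
result = chardet.detect(content)
print(repr(result))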
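
Once detect() has returned its dictionary, the usual next step is to decode the raw bytes into Unicode text. The following is only a minimal sketch of that step and not part of the script above: the helper name decode_detected, the 0.5 confidence threshold, and the UTF-8 fallback are assumptions chosen for illustration. It receives the dictionary as an argument, so it can be tried without chardet installed.

```python
def decode_detected(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a chardet-style result dictionary.

    `result` is expected to look like {'encoding': ..., 'confidence': ...}.
    The threshold and the fallback encoding are illustrative assumptions.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    # The encoding may be None when no guess could be made.
    if encoding is None or confidence < min_confidence:
        encoding = fallback
    # errors="replace" keeps an import running when a byte is undecodable.
    return raw.decode(encoding, errors="replace")


# Hand-written example: the Polish city name "Lodz" (with its diacritics)
# encoded as Windows-1250, plus a result as a detector might report it.
raw = b"\xa3\xf3d\x9f"
text = decode_detected(raw, {"encoding": "windows-1250", "confidence": 0.9})
```

Whether to fall back silently or to reject low-confidence files depends on the application; for a data import, rejecting the file may be the safer choice.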

You do not need to read the entire file. You can read just a portion, but if the file uses an encoding such as UTF-32, which uses 4 bytes per character, you must read a multiple of 4 bytes to get safe input, something like read(512). Also, buffers that start with a BOM (byte order mark) are easier to detect. Finally, to use Python's csv module with this feature, mainly for Windows-1250 and similarly encoded files, you should try creating a UTF8Recoder, which re-encodes the input stream to UTF-8 before the csv module reads it.
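
The three ideas above (reading a prefix whose size is a multiple of 4, checking for a BOM, and recoding the input for the csv module) can be sketched with the standard library alone. This is a hedged illustration for Python 3, not the UTF8Recoder class itself: since the Python 3 csv module reads text, the sketch swaps the recoder for io.TextIOWrapper, which decodes the byte stream directly. The names read_prefix, sniff_bom and iter_csv_rows, and the sample data, are hypothetical.

```python
import codecs
import csv
import io

# Longer BOMs come first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def read_prefix(stream, size=512):
    """Read at most `size` bytes, with `size` rounded down to a multiple of 4."""
    return stream.read(size - size % 4)


def sniff_bom(prefix):
    """Return a codec name if the bytes start with a known BOM, else None."""
    for bom, name in BOMS:
        if prefix.startswith(bom):
            return name
    return None


def iter_csv_rows(stream, encoding, **fmtparams):
    """Decode a binary stream and yield the rows parsed by csv.reader."""
    # newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(stream, encoding=encoding, newline="")
    for row in csv.reader(text, **fmtparams):
        yield row


# Hypothetical sample: a semicolon-separated file exported as Windows-1250.
data = "name;city\r\nJ\u00f3zef;\u0141\u00f3d\u017a\r\n".encode("cp1250")
stream = io.BytesIO(data)
prefix = read_prefix(stream)
stream.seek(0)
# There is no BOM here, so use the encoding a detector such as chardet reported.
encoding = sniff_bom(prefix) or "windows-1250"
rows = list(iter_csv_rows(stream, encoding, delimiter=";"))
```

With a real file, the prefix returned by read_prefix would be passed to chardet.detect() whenever sniff_bom finds no BOM.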

