Who knew that Microsoft Windows had so many different encodings?
There's CP1252, the almost-but-not-quite ISO-Latin-1 that is responsible for the evil breakage of "smart quotes" by encouraging web publishers to act like 0x93 is a valid way to represent a left double quote. At least it encodes É in a sensible place, 0xc9. But why stop at one codepage? There's also CP437, an ancient DOS codepage that is nothing like Latin-1 but contains Latin-1 characters like É at 0x90. Yes, that's a different place than in CP1252. Apparently both of these evil 20th century codepages are still coexisting on my 21st century Windows XP system.

I just dumped a bunch of MP3 files from my WinXP box to Linux and found the filenames hopelessly garbled. I finally guessed they're in CP437. I'm a bit surprised Samba didn't take care of it for me. Python to the rescue:
def cp437ToLatin1(s):
    # Decode the CP437 byte string to Unicode, then re-encode it as Latin-1 (Python 2).
    return unicode(s, 'cp437').encode('latin-1')
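For anyone on Python 3, where `unicode()` no longer exists, the same idea can be sketched with `bytes.decode` and `str.encode`. This is a rough equivalent rather than the code I actually ran; the name `cp437_to_latin1` is made up for illustration, and it assumes the filenames arrive as raw bytes. The assertions also double-check the byte positions mentioned above.

```python
def cp437_to_latin1(raw: bytes) -> bytes:
    """Reinterpret CP437-encoded bytes as Latin-1 bytes (hypothetical helper)."""
    # Decode with the DOS codepage, then re-encode with Latin-1.
    return raw.decode("cp437").encode("latin-1")


# É sits at 0x90 in CP437 but at 0xC9 in CP1252 (and Latin-1).
assert "É".encode("cp437") == b"\x90"
assert "É".encode("cp1252") == b"\xc9"
assert cp437_to_latin1(b"\x90") == b"\xc9"

# 0x93 is a left double quote in CP1252, but an unrelated letter (ô) in CP437.
assert b"\x93".decode("cp1252") == "\u201c"
assert b"\x93".decode("cp437") == "\u00f4"
```

One caveat: CP437 characters with no Latin-1 equivalent, such as the box-drawing glyphs, make the `encode` step raise `UnicodeEncodeError`, so this only works for names that stick to accented Latin letters.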