A Refresher on Encodings

I was hanging out on the beach a couple of weekends ago reading a book of Java puzzles (I definitely don't understand the meaning of "beach reading"), and I reached a chapter about Unicode escapes. I worked through the puzzles, but ultimately I thought, "This doesn't really apply to my work." I should have known I'd be proven wrong as soon as I thought that, and naturally, I was. My team is currently working on multilingual support in our application, and of course that means I'm hanging out again with my old friends Unicode and file encodings. The three of us have always had a rocky relationship (I'm definitely the one to blame for that), but I feel like we've made some real strides in getting along over the past couple of weeks. I'd like to share some of what I remembered and picked up during this refresher course. Some of these points are specific to Java, but others are useful in other languages as well.

Be careful with Unicode escapes in your code. The Java compiler translates Unicode escapes into their respective characters BEFORE it tokenizes the source, so \u0022 (a double quote) inside a string literal ends the string early and will most likely keep your class from compiling. It's something to be mindful of and to share with your fellow developers, so the next person reading your code isn't caught off guard.
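Here's a minimal sketch of that behavior. The class and method names are made up for illustration. The last method is the sneaky case: it happens to compile, but only because the two escaped quotes split the literal into two strings joined by a plus sign.

```java
public class UnicodeEscapeDemo {

    // A backslash-quote is an ordinary string escape. It is processed
    // as part of the string literal, so the quote stays inside the string.
    static String stringEscape() {
        return "She said \"hi\"";
    }

    // A Unicode escape for a letter is harmless: the compiler swaps in
    // the character (an e with an acute accent) before tokenizing.
    static String letterEscape() {
        return "caf\u00e9";
    }

    // The two Unicode escapes below become double quotes BEFORE the
    // tokenizer runs, so the compiler really sees: "a" + "b"
    static String quoteEscape() {
        return "a\u0022 + \u0022b";
    }

    public static void main(String[] args) {
        System.out.println(stringEscape());
        System.out.println(letterEscape().length());
        System.out.println(quoteEscape());
    }
}
```

If the quote escape appeared only once in that last literal, the string would end early and the rest of the line would no longer parse.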

Think about where your text will end up. What encoding are you using to store characters in your properties files and database? Does that match the encoding you're using to present the text to your users? If not, how does the conversion work? Do you even know the encoding on either side? Give these questions some thought whenever you work with Unicode.
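Properties files are a good place to see this in Java. Properties.load(InputStream) always decodes the bytes as ISO-8859-1, while Properties.load(Reader) uses whatever charset the Reader was built with. The sketch below assumes a hypothetical properties file saved as UTF-8 with a made-up "greeting" key, and loads the same bytes both ways.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Loads the hypothetical "greeting" key from the raw bytes of a
    // properties file, either as UTF-8 or with the stream default.
    static String loadGreeting(byte[] fileBytes, boolean decodeAsUtf8) {
        Properties props = new Properties();
        try {
            if (decodeAsUtf8) {
                // The Reader decides how bytes become characters.
                try (Reader reader = new InputStreamReader(
                        new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8)) {
                    props.load(reader);
                }
            } else {
                // The InputStream overload always assumes ISO-8859-1.
                props.load(new ByteArrayInputStream(fileBytes));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        // Stand-in for a file that an editor saved as UTF-8.
        byte[] utf8File = "greeting=caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadGreeting(utf8File, true));
        System.out.println(loadGreeting(utf8File, false));
    }
}
```

The second call doesn't fail. It quietly turns the two UTF-8 bytes of the accented letter into two separate characters, which is exactly the kind of mismatch that's easy to miss until a user sees it.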

Think about the text you're receiving. If you allow user input, you'll need to understand every encoding along the way from the browser to your database and back again. Converting between encodings can be lossy, leading to headaches while you try to figure out why you're displaying ?s and other strange characters to your users.
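To see where those ?s come from, here's a small sketch (the class and method names are my own) that pushes text through a charset and back. String.getBytes(Charset) silently substitutes the charset's replacement byte for anything it can't represent, and CharsetEncoder.canEncode lets you check up front.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyConversionDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    // Characters the charset cannot represent are replaced, not rejected.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    // Reports whether every character in the text fits in the charset.
    static boolean survives(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so this file
        // compiles the same way regardless of its own encoding.
        String text = "\u65e5\u672c\u8a9e";

        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));
        System.out.println(survives(text, StandardCharsets.ISO_8859_1));
    }
}
```

Once the text has been through the ISO-8859-1 step, the original characters are gone for good. No later conversion can bring them back.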

Don't let your IDE trick you. Modern IDEs are great: they perform mundane tasks for developers and allow us to focus on the important problems. But they can hurt us when we don't understand what's happening behind the scenes, and file encodings are a prime example. It's easy not to know what encoding your file uses when the IDE reads and writes it for you automatically, but what if that encoding isn't the same on every developer's machine? Suddenly classes won't compile and text begins displaying incorrectly.
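One way to take the guesswork out is to name the encoding everywhere instead of relying on whatever each machine defaults to. For source files, javac accepts an -encoding option (for example, -encoding UTF-8), and most build tools expose an equivalent setting. For files your code reads and writes, the sketch below (names are illustrative, and the temp file is only a stand-in for a real one) shows the platform default next to an explicit choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingDemo {

    // The charset this JVM falls back to when none is specified.
    // It can differ between machines and Java versions.
    static String platformDefault() {
        return Charset.defaultCharset().name();
    }

    // Writes and reads a temporary file with UTF-8 named on both sides,
    // so the result does not depend on the platform default.
    static String writeThenRead(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Platform default: " + platformDefault());
        System.out.println(writeThenRead("caf\u00e9").length());
    }
}
```

Printing the platform default on a couple of teammates' machines is a quick way to find out whether you've all been agreeing by accident.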

Take the time to understand character sets and encodings. The topic isn't hard to understand, and you'll save yourself plenty of headaches later.

Do you have helpful tips on encodings? Share them with me on Twitter @tylerskippy!