I googled a lot for an answer, but none of the results, even on Stack Overflow, solves the problem simply and gracefully. So we’ll do it ourselves here 🙂

Actually, std::string can hold multibyte characters: it stores raw bytes and does not care which encoding they are in. This is the basis of our solution:

g_cs is a Chinese word (“你好”, which means “hello”) encoded in UTF-8. The code works on both Windows (WinXP + VS2005) and Linux (Ubuntu 12.04 + gcc 4.6). You may want to open a.txt to check whether the string was written correctly.
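Here is a minimal sketch of the idea, not a definitive implementation. It assumes g_cs is spelled out with hex escapes (so the result does not depend on the encoding of the source file) and that the bytes go to a.txt unchanged. The helper names write_utf8_file() and read_utf8_file() are made up for illustration.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// "你好" in UTF-8 is the six bytes E4 BD A0 E5 A5 BD.
// Hex escapes keep the literal independent of the source file encoding.
static const std::string g_cs = "\xE4\xBD\xA0\xE5\xA5\xBD";

// Hypothetical helper: write the raw bytes to a file.
// Binary mode prevents newline translation on Windows.
bool write_utf8_file(const std::string& path, const std::string& text) {
    std::ofstream out(path.c_str(), std::ios::binary);
    if (!out) return false;
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    return out.good();
}

// Hypothetical helper: read the whole file back as raw bytes.
std::string read_utf8_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Calling write_utf8_file("a.txt", g_cs) and then comparing read_utf8_file("a.txt") with g_cs is a quick way to confirm that the bytes survive the round trip. Note that g_cs.size() counts bytes (6), not characters (2).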

NOTE: Under Linux, we print the string directly, since the default console encoding is UTF-8 and the string displays correctly. Under Windows, however, the console DOES NOT support the UTF-8 encoding (code page 65001), so printing to it just produces garbled text. Instead, we convert the string to a std::wstring and use the MessageBox() API to check the result. I may cover the Windows console encoding issue in my next post.

I started investigating this because I could not find a way to read/write a UTF-8 string from/to an XML file using boost::property_tree. It turns out to be a bug that is already fixed in boost 1.47 and later versions. Unfortunately, Ubuntu 12.04 ships with boost 1.46.1. When reading non-ASCII characters, some bytes are incorrectly skipped. The failing function is boost::property_tree::detail::rapidxml::internal::get_index(). My test code looks like:

It has almost the same structure as the previous function. And finally, the utf8_to_ucs2() function:
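A minimal sketch of what such a function could look like, written in portable C++ without any platform API. Only the name utf8_to_ucs2() comes from the text above. The signature and the error handling are assumptions: malformed input and code points above U+FFFF (which UCS-2 cannot represent) are replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UCS-2 code units (Basic Multilingual Plane only).
// Assumption: invalid sequences and code points above U+FFFF become U+FFFD.
std::wstring utf8_to_ucs2(const std::string& utf8) {
    const wchar_t kReplacement = 0xFFFD;
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t extra = 0;  // continuation bytes expected after the lead byte
        unsigned long cp = 0;   // code point being assembled
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte: skip one byte.
            out.push_back(kReplacement);
            ++i;
            continue;
        }

        // The sequence must fit in the input and every continuation
        // byte must have the form 10xxxxxx.
        bool ok = (i + extra < utf8.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) {
            out.push_back(kReplacement);
            ++i;
            continue;
        }

        // UCS-2 has no surrogate pairs, so anything above U+FFFF is replaced.
        out.push_back(cp <= 0xFFFF ? static_cast<wchar_t>(cp) : kReplacement);
        i += extra + 1;
    }
    return out;
}
```

For the six bytes of “你好”, this yields two code units, U+4F60 and U+597D. On Windows, the resulting std::wstring can be passed to the wide-character MessageBox via c_str().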

Please add the necessary header files yourself to make it compile 🙂