
Re: UTF-8 to CESU-8 conversion


Hi, colleagues

 

The following works well for me.

You can also try Python, which makes this easy to implement and test. If you treat the CESU-8 encoded string as a byte stream that is already encoded with UTF-8 and pass it to the unicode function, this will work fine. The problem with CESU-8 only arises for Unicode code points U+10000 and higher. For those code points you need a surrogate pair, which is described on Wikipedia and easy to find via Google. Here is the algorithm to get the UTF-16 representation for code points higher than U+FFFF:

v  = 0x64321

v′ = v - 0x10000

   = 0x54321

   = 0101 0100 0011 0010 0001

vh = v′ >> 10

   = 01 0101 0000 // higher 10 bits of v′

vl = v′ & 0x3FF

   = 11 0010 0001 // lower  10 bits of v′

w1 = 0xD800 + vh

   = 1101 1000 0000 0000

   +        01 0101 0000

   = 1101 1001 0101 0000

   = 0xD950 // first code unit of UTF-16 encoding

w2 = 0xDC00 + vl

   = 1101 1100 0000 0000

   +        11 0010 0001

   = 1101 1111 0010 0001

   = 0xDF21 // second code unit of UTF-16 encoding
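The steps above can be sketched in Python roughly as follows. The function name to_surrogate_pair is only an illustrative choice, not part of any library, and the range check is an added assumption about how invalid input should be handled.

```python
def to_surrogate_pair(v: int) -> tuple:
    """Split a code point above U+FFFF into two UTF-16 code units."""
    if not 0x10000 <= v <= 0x10FFFF:
        raise ValueError("surrogate pairs only exist for U+10000..U+10FFFF")
    v_prime = v - 0x10000          # 20-bit value
    vh = v_prime >> 10             # higher 10 bits
    vl = v_prime & 0x3FF           # lower 10 bits
    w1 = 0xD800 + vh               # first (high) code unit
    w2 = 0xDC00 + vl               # second (low) code unit
    return w1, w2


# Same input as the worked example above.
print([hex(w) for w in to_surrogate_pair(0x64321)])  # ['0xd950', '0xdf21']
```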

 

In other words, once each of the two surrogate code units is written out as its own three-byte UTF-8 style sequence, you get a byte stream that HANA understands perfectly, and you can store the information correctly by using your own codec that is compliant with CESU-8.
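A minimal sketch of such a codec in Python 3 might look like the one below. The names cesu8_encode and cesu8_decode are hypothetical, and the sketch assumes that Python's built-in "surrogatepass" error handler is acceptable for writing and reading the individual surrogates; it has not been tested against HANA itself.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode a str as CESU-8 (sketch, not a registered codec)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Split into a surrogate pair, then encode each
            # surrogate as its own three-byte sequence.
            v = cp - 0x10000
            for unit in (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes, recombining surrogate pairs."""
    units = data.decode("utf-8", "surrogatepass")
    out = []
    i = 0
    while i < len(units):
        hi = ord(units[i])
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(units):
            lo = ord(units[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Reverse of the algorithm above.
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(chr(cp))
                i += 2
                continue
        out.append(units[i])
        i += 1
    return "".join(out)
```

For the example code point U+64321 this produces six bytes (two three-byte sequences) instead of the four bytes that standard UTF-8 would use; text that stays below U+10000 comes out identical to UTF-8.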

To learn more about UTF-8 encoding, you can refer to the utfcpp library (utfcpp.sourceforge.net); the algorithm above can be used to extend it for CESU-8 compatibility.

You do not need to use UTF-16 in Python; that will not work for HANA.

 

Regards,

Vasily Sukhanov

