Erlang UTF8 substring

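Today I needed to get a substring from UTF8 encoded input in Erlang. Strings are lists in Erlang, so lists:sublist(List1, Start, Len) -> List2 is almost what we need. Except that UTF8 is a variable-length encoding, so byte positions do not line up with character positions. So we need to make it fixed-length encoded first. UTF16 to the rescue! Every character in the Basic Multilingual Plane takes exactly two bytes in UTF16. Characters outside it take a surrogate pair (four bytes), so the code below assumes BMP-only input.

Here is my solution, using only standard Erlang components:

    -module(utf8).
    -export([substr_utf8/2, substr_utf8/3, test/0]).

    %% Shortcut: the first Length characters.
    substr_utf8(Utf8EncodedString, Length) ->
        substr_utf8(Utf8EncodedString, 1, Length).

    %% Start (1-based) and Length are counted in characters, not bytes.
    substr_utf8(Utf8EncodedString, Start, Length) ->
        %% Two bytes per character in UTF16 (BMP only), so scale both values.
        ByteStart = 2*(Start - 1) + 1,
        ByteLength = 2*Length,
        Ucs = xmerl_ucs:from_utf8(Utf8EncodedString),
        Utf16Bytes = xmerl_ucs:to_utf16be(Ucs),
        SubStringUtf16 = lists:sublist(Utf16Bytes, ByteStart, ByteLength),
        Ucs1 = xmerl_ucs:from_utf16be(SubStringUtf16),
        xmerl_ucs:to_utf8(Ucs1).

    test() ->
        %% Build the UTF8 byte list explicitly: since OTP 17 a plain string
        %% literal in a UTF8 source file is a list of code points, not bytes.
        LongUtf8 = binary_to_list(<<"最受尊敬的互联网企业"/utf8>>),
        ShortUtf8 = utf8:substr_utf8(LongUtf8, 3),
        iolist_to_binary(ShortUtf8).

The code should speak for itself: I use the xmerl_ucs module for the difficult work. substr_utf8/2 is just a convenient shortcut in my program for taking the first Length characters.

Hope this helps someone else as well.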
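
As a side note, the list returned by xmerl_ucs:from_utf8 already holds one integer per character, so the slice could also be taken on code points directly, without the UTF16 round trip. That would also cover characters outside the Basic Multilingual Plane. Below is a minimal sketch of that idea, not the code from this post: it assumes the standard unicode module, and the module name utf8_cp and function substr/3 are made up for illustration.

```erlang
-module(utf8_cp).
-export([substr/3]).

%% Hypothetical helper, shown only as a sketch.
%% Utf8 is a UTF8 encoded binary or list of bytes.
%% Start (1-based) and Length are counted in characters.
substr(Utf8, Start, Length) ->
    Bin = iolist_to_binary(Utf8),
    %% Decode UTF8 bytes into a list of Unicode code points.
    %% Invalid input yields an error tuple here, and the sublist call
    %% below then fails; this sketch does not handle that case.
    CodePoints = unicode:characters_to_list(Bin, utf8),
    Sub = lists:sublist(CodePoints, Start, Length),
    %% Encode the selected code points back to a UTF8 binary.
    unicode:characters_to_binary(Sub).
```

With the test string from above, utf8_cp:substr(<<"最受尊敬的互联网企业"/utf8>>, 3, 2) should give the UTF8 binary for the third and fourth characters, 尊敬.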
