Erlang UTF8 substring

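Today I needed to get a substring from UTF-8 encoded input in Erlang. Strings are lists in Erlang, so lists:sublist(List1, Start, Len) -> List2 is almost what we need. The catch is that UTF-8 is a variable-length encoding: a character takes between one and four bytes, so a position in the byte list says nothing about a position in the text. We need an encoding where every character has the same width. UTF-16 to the rescue! It uses exactly two bytes for every character in the Basic Multilingual Plane, which covers my input. (Characters outside that plane take four bytes as a surrogate pair, and the approach below does not handle those.)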
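To make the problem concrete, here is a small sketch of what goes wrong when lists:sublist is applied to the raw UTF-8 bytes. The module name utf8_demo and its functions are made up for this illustration, and the /utf8 binary literal assumes the compiler reads the source file as UTF-8 (the default since Erlang/OTP 17).

```erlang
-module(utf8_demo).
-export([lengths/0, naive/0]).

%% Three Chinese characters as a list of UTF-8 bytes.
%% Assumption: this source file is compiled as UTF-8.
sample() ->
    binary_to_list(<<"最受尊"/utf8>>).

%% Returns {NumberOfBytes, NumberOfCharacters}.
lengths() ->
    Bytes = sample(),
    {length(Bytes), length(xmerl_ucs:from_utf8(Bytes))}.

%% Takes the first two *bytes*, which cuts the first character apart.
naive() ->
    lists:sublist(sample(), 1, 2).
```

utf8_demo:lengths() gives {9, 3}: nine bytes for three characters. utf8_demo:naive() gives [230, 156], the first two of the three bytes that encode the first character, which is not valid UTF-8 on its own.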

Here is my solution, using only standard Erlang components:


-module(utf8).
-export([substr_utf8/2, substr_utf8/3, test/0]).

substr_utf8(Utf8EncodedString, Length) ->
    substr_utf8(Utf8EncodedString, 1, Length).
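substr_utf8(Utf8EncodedString, Start, Length) ->
    %% Start and Length count characters. Every character in the Basic
    %% Multilingual Plane is 2 bytes in UTF-16, so convert both to bytes.
    ByteStart = 2*(Start - 1) + 1,
    ByteLength = 2*Length,
    Ucs = xmerl_ucs:from_utf8(Utf8EncodedString),
    Utf16Bytes = xmerl_ucs:to_utf16be(Ucs),
    SubStringUtf16 = lists:sublist(Utf16Bytes, ByteStart, ByteLength),
    Ucs1 = xmerl_ucs:from_utf16be(SubStringUtf16),
    xmerl_ucs:to_utf8(Ucs1).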

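test() ->
    %% The /utf8 binary holds UTF-8 bytes when the compiler reads this file
    %% as UTF-8 (the default since Erlang/OTP 17); a plain string literal
    %% would be a list of code points there, not bytes.
    LongUtf8 = <<"最受尊敬的互联网企业"/utf8>>,
    ShortUtf8 = utf8:substr_utf8(LongUtf8, 3),
    iolist_to_binary(ShortUtf8).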


The code should speak for itself: the xmerl_ucs module does the difficult work. substr_utf8/2 is just a convenient shortcut in my program for taking the first Length characters. Hope this helps someone else as well.
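As a possible variation, not what I used above: xmerl_ucs:from_utf8 already returns a list with one integer per character, so the UTF-16 round trip could be skipped and the sublist taken from that list directly. The sketch below assumes this, and the module name utf8_cp and the function substr/3 are invented for the example. Because each list element is a whole code point, characters outside the Basic Multilingual Plane count as one character here as well.

```erlang
-module(utf8_cp).
-export([substr/3]).

%% Hypothetical helper: substring by character position.
%% Utf8 is a binary or a list of UTF-8 bytes; Start is 1-based.
substr(Utf8, Start, Length) ->
    %% Decode to a list of code points, one element per character.
    CodePoints = xmerl_ucs:from_utf8(Utf8),
    %% Slice by characters, no byte arithmetic needed.
    Sub = lists:sublist(CodePoints, Start, Length),
    %% Encode the slice back to a list of UTF-8 bytes.
    xmerl_ucs:to_utf8(Sub).
```

For example, iolist_to_binary(utf8_cp:substr(<<"最受尊敬"/utf8>>, 2, 2)) should yield the UTF-8 bytes of 受尊, again assuming the source file is compiled as UTF-8.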

 

 
© 2006-2008 StartInChina.com - News and tips from Shenzhen, China