
How do I determine the unicode value of a character?

+1 vote
804 views
asked Aug 30, 2013 by mike-r-7535 (13,830 points)
edited Aug 30, 2013 by mike-r-7535

I have a channel that extracts text from a Word document and sends the text to another system. The Word document has a character that looks like this:

Ω

However, when it gets extracted from the document, I end up with something that looks like this:

Î©

Why is this?  And how do I fix it?

1 Answer

0 votes
 
Best answer

The character in the Word document is encoded in UTF-8 as more than one byte (CE A9 for this one). When the text is extracted, those bytes are decoded one at a time, so each byte shows up as its own character in your conversion. Once you know the character's Unicode value, you can replace it with a single-byte character.
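To see why one character turns into two, here is a minimal sketch in plain JavaScript. It assumes the extraction step reads the UTF-8 bytes with a single-byte encoding such as Latin-1; the variable names are only for illustration.

```javascript
// Greek Capital Letter Omega, written as an escape so the source file's
// own encoding cannot interfere.
var omega = "\u03A9";

// encodeURIComponent percent-encodes the UTF-8 bytes of the character.
var percentEncoded = encodeURIComponent(omega); // "%CE%A9"

// Treat every byte as its own character, which is what a single-byte
// decoder (assumed here: Latin-1) would do.
var misread = percentEncoded.replace(/%([0-9A-F]{2})/g, function (match, hex) {
  return String.fromCharCode(parseInt(hex, 16));
});

// misread now holds two characters: U+00CE followed by U+00A9.
```

The two bytes CE and A9 come out as two separate characters, which matches the "UTF-8 Code" value mentioned in the steps below.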

  1. Select and copy the character from the Word document into your clipboard.
  2. Go to a Unicode converter tool like this one: http://www.endmemo.com/unicode/unicodeconverter.php
  3. Paste the character into the "Unicode Character" section.
  4. Click Convert.
  5. Use the "Escaped Unicode" value in your replaceAll call.
  6. For more information, search Google for the "UTF-8 Code" value plus the keyword "unicode" (for example: unicode CE A9).
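If you would rather not use a web tool, the steps above can be approximated in code. This is a rough sketch, not part of the converter site: toEscapedUnicode is a hypothetical helper name, and it only handles characters in the Basic Multilingual Plane (one UTF-16 code unit), which covers the omega here.

```javascript
// Hypothetical helper: returns the escaped form (backslash, "u", four
// hex digits) of the first character in the given string.
function toEscapedUnicode(ch) {
  var hex = ch.charCodeAt(0).toString(16).toUpperCase();
  // Left-pad to four hex digits.
  while (hex.length < 4) {
    hex = "0" + hex;
  }
  return "\\u" + hex;
}

var escaped = toEscapedUnicode("\u03A9");
```

For the omega character, escaped holds the six characters of the escape sequence for code point 03A9, which is the value to use in the replace call.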

You can then replace the multi-byte character (Greek Capital Letter Omega) with a single-byte character, or with a description if you like:

var textWithOmegaReplaced = textWithOmega.replaceAll("\u03A9", "Omega");
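If the text has already been mis-decoded by the time your code sees it, the two-character form may be what you actually need to match. Here is a hedged sketch in plain JavaScript that handles both cases; replaceOmega is a made-up name, and it assumes the bytes were misread as Latin-1.

```javascript
// Hypothetical helper: replaces the real omega (U+03A9) and its assumed
// mis-decoded form (U+00CE followed by U+00A9) with the given text.
function replaceOmega(text, replacement) {
  return text
    .split("\u00CE\u00A9").join(replacement) // mis-decoded two-character form
    .split("\u03A9").join(replacement);      // correctly decoded omega
}

var cleaned = replaceOmega("10 k\u03A9 and 5 \u00CE\u00A9", "Omega");
// cleaned is "10 kOmega and 5 Omega"
```

split and join are used here because they treat the search text literally, so no regular expression escaping is needed.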

 

answered Aug 30, 2013 by mike-r-7535 (13,830 points)
selected Aug 30, 2013 by mike-r-7535