Documentation for the AscToHTM conversion utility : Working with Unicode

Documentation for the AscToHTM Text to HTML converter

Working with Unicode

(New in version 5.0)
AscToHTM was not originally designed with Unicode in mind. Support for Unicode text has been added gradually over time, so earlier versions of AscToHTM may not support all the features described in this manual. If in doubt, please contact JafSoft for details.

What is Unicode?

Traditional single-byte character sets interpret the 8-bit character values (128-255) as special characters, and which characters those are depends on the character set in use. So on a Russian machine these values would be interpreted as Cyrillic, but on a different machine the same values could be read (wrongly) as Arabic (and vice versa). On most English-based PCs, the 8-bit characters are used for the accented characters found in certain European languages, so a Russian text would appear to have lots of accented 'i's, 'e's and 'a's.
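The ambiguity can be seen with a few lines of Python. This is only an illustrative sketch: the choice of Windows code pages 1251 (Cyrillic), 1256 (Arabic) and 1252 (Western European) is an assumption made for the example, not something this manual specifies.

```python
# A Russian word stored in a single-byte Cyrillic code page (assumed: cp1251).
raw = "Привет".encode("cp1251")

# The very same bytes, read back under three different single-byte code pages.
as_cyrillic = raw.decode("cp1251")                    # the intended reading
as_arabic = raw.decode("cp1256", errors="replace")    # read (wrongly) as Arabic
as_western = raw.decode("cp1252", errors="replace")   # read (wrongly) as Western European

print(len(raw), "bytes:", raw.hex())
print("Cyrillic:", as_cyrillic)
print("Arabic:  ", as_arabic)
print("Western: ", as_western)   # accented Latin letters instead of Russian
```

Nothing in the bytes themselves says which reading is correct; that information has to come from outside the file.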

Unicode is a way of implementing text that supports multiple types of character sets at the same time so that - for example - it is possible to display Chinese and Cyrillic on the same page unambiguously. It does this by allocating each character in each language a unique code value, so that codes used for Cyrillic characters no longer overlap and conflict with those assigned to Arabic.

However, these code values are in most cases larger than can be represented in a single byte. As a result a way has to be chosen to represent each character by one or more bytes.
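As a small illustration (a sketch, with the helper name `code_point` chosen for this example only), the code values of a few characters from different scripts can be printed in Python. Only the ASCII letter has a value small enough to fit in one byte.

```python
def code_point(ch: str) -> str:
    """Return the Unicode code value of a single character in U+XXXX form."""
    return f"U+{ord(ch):04X}"

# One Latin, one Cyrillic, one Arabic and one Chinese character.
for ch in ["A", "Я", "ب", "中"]:
    fits = ord(ch) <= 0xFF
    note = "fits in one byte" if fits else "needs more than one byte"
    print(ch, code_point(ch), note)
```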

The following Unicode representations are commonly used:

UTF-8
Each character is represented by 1 to 4 bytes, depending on which range the Unicode code value falls into (characters in the Basic Multilingual Plane, which covers most languages in everyday use, need at most 3). This has the advantage that all ASCII characters are a single byte, so for example every character of the HTML tags in a document is represented by a single byte. This also means there are no null bytes contained in the text (unless the text itself contains a NUL character), which can make programming software to work with this text easier.
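A quick way to see the variable-length encoding is to count the bytes each character produces. The sample characters and the sample HTML fragment below are assumptions picked for illustration.

```python
# Number of UTF-8 bytes needed for characters from different ranges.
samples = ["<", "A", "Я", "ب", "中"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"{ch!r}: {n} byte(s) -> {ch.encode('utf-8').hex()}")

# HTML tags stay one byte per character, and no null bytes appear.
page = "<p>Привет, 世界</p>"
encoded = page.encode("utf-8")
has_null = b"\x00" in encoded

print(encoded[:3])   # the opening tag is plain ASCII bytes
print("contains null bytes:", has_null)
```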

UTF-16
Each character is represented by a 2-byte pair (characters with code values above U+FFFF require 2 such pairs, known as a surrogate pair). For characters that fit in one pair, the 2-byte pair is just the numerical representation of the Unicode value of the character. This makes the files easier to interpret, but also means that the byte order depends on how the machine stores its bytes - i.e. is the machine big-endian or little-endian. Because ASCII characters have a Unicode value less than 128, they map onto byte pairs in which one of the bytes is null. Because each character requires two bytes, a single byte wrongly inserted into a UTF-16 stream will render all the text that follows as gibberish.
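The byte-order dependence, the null byte in ASCII characters and the effect of a stray byte can all be shown in a short Python sketch. The sample strings and the inserted byte `X` are arbitrary choices for this example.

```python
# The same character in the two byte orders.
for ch in ["A", "Я"]:
    le = ch.encode("utf-16-le")   # little-endian: low byte first
    be = ch.encode("utf-16-be")   # big-endian: high byte first
    print(ch, "LE:", le.hex(), "BE:", be.hex())

# A single stray byte shifts every following 2-byte pair out of alignment.
good = "Hello".encode("utf-16-le")
corrupted = good[:2] + b"X" + good[2:]          # insert one byte after 'H'
garbled = corrupted.decode("utf-16-le", errors="replace")

print(good.decode("utf-16-le"))
print(garbled)   # 'H' survives, everything after it is misread
```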

Unicode Byte Order Marks (BOMs)

Files that contain Unicode can identify themselves by placing a "Byte Order Mark" (BOM) at the top of the file. This is a two-byte marker for UTF-16 files and a three-byte marker for UTF-8 files. Modern applications will test for this byte marker and, if it is present, will then know how to interpret the contents of the file. For example Notepad as supplied with Windows XP can do this, whereas Notepad as supplied with Windows 98 could not. The mark is optional, so its absence does not prove that a file is not Unicode.

In UTF-16 each character is represented by two bytes, and computers can store a two-byte value in different ways (known as "big-endian" and "little-endian"). Each machine uses one method or the other, so it isn't usually an issue on a single machine, but when Unicode files get passed from one machine to another, this becomes important. The BOM allows the two forms of UTF-16 (known as "UTF-16BE" and "UTF-16LE") to be distinguished.
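A BOM check of the kind described above might look like the following Python sketch. The function name `detect_bom` is hypothetical, and this is not AscToHTM's own code; the BOM byte values come from Python's standard `codecs` module.

```python
import codecs
from typing import Optional

# Byte Order Marks for the encodings discussed in this section.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF (three bytes)
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE    (two bytes)
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF    (two bytes)
]

def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(codecs.BOM_UTF8 + b"hello"))
print(detect_bom(codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")))
print(detect_bom(b"hello"))
```

A result of `None` only means no mark was found; the file may still be Unicode without a BOM.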

Auto-detecting Unicode input

The software has some ability to auto-detect Unicode text, and will generally do so under the following circumstances:

a 3-byte Byte Order Mark (BOM) is detected at the top of a UTF-8 input file
a 2-byte Byte Order Mark (BOM) is detected at the top of a UTF-16 input file
the input HTML contains an HTML entity that maps onto a Unicode code value which can't be converted into an ANSI or ASCII equivalent. In this case, although the input HTML may not have been encoded as Unicode, the output will need to be in order to correctly display the Unicode character.
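The third circumstance can be sketched in Python as follows. This is a hypothetical illustration, not AscToHTM's actual detection logic: the function name `needs_unicode_output` is invented for the example, and Windows code page 1252 is assumed to stand in for "ANSI".

```python
import html
import re

# Matches named, decimal and hexadecimal HTML entities such as
# &eacute; &#1071; and &#x42F;
ENTITY = re.compile(r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")

def needs_unicode_output(markup: str, single_byte_encoding: str = "cp1252") -> bool:
    """Return True if any entity decodes to a character that the assumed
    single-byte encoding cannot represent."""
    for match in ENTITY.finditer(markup):
        entity = match.group(0)
        decoded = html.unescape(entity)
        if decoded == entity:
            continue  # not a recognised entity, leave it alone
        try:
            decoded.encode(single_byte_encoding)
        except UnicodeEncodeError:
            return True
    return False

print(needs_unicode_output("caf&eacute; &amp; bar"))   # entities fit in cp1252
print(needs_unicode_output("&#1071;"))                 # Cyrillic letter does not
```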

Creating Unicode output

The software will create Unicode output whenever it detects that the input files were Unicode, or whenever Unicode characters have been detected in the HTML entities of the original.

At present all Unicode output files will be UTF-8.

Controlling Unicode handling through use of policies

The following policies can be used to control the handling of Unicode during the conversion :-

Character encoding. Where the input file is ASCII, it could still contain Unicode-encoded data. You can use this policy to set the HTML character encoding to a suitable value.
