Initial Graphics Exchange Specification (IGES)

RFC 603

Title: Add Unicode as a String Constant

Submitted by: Curtis Parks
NIST
(301) 975-3517


Description of Problem

Adopting the UTF-8 encoding of ISO/IEC 10646-1, together with the new Global Section parameter 27 (Character Set Identifier Flag), would provide full internationalization for IGES while remaining fully backward compatible with file systems, parsers, and other software that rely on US-ASCII values.

While XML, for one, specifies ISO/IEC 10646 in UTF-8 or UTF-16, it is not clear whether UTF-16 will be used much, given the clear advantage of the UTF-8 encoding. An Internet Society RFC presents that community's acceptance of this encoding:

  "Character values from 0000 0000 to 0000 007F (US-ASCII repertoire) correspond to octets 00 to 7F (7 bit US-ASCII values). A direct consequence is that a plain ASCII string is also a valid UTF-8 string."
The above was quoted from ftp://ftp.isi.edu/in-notes/rfc2279.txt
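
The compatibility property quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the sample strings and the helper name `is_plain_ascii` are assumptions made for the example and are not part of the proposal.

```python
def is_plain_ascii(text: str) -> bool:
    """Return True if every character lies in 0000 0000 through 0000 007F."""
    return all(ord(ch) <= 0x7F for ch in text)


# A plain ASCII string produces the same octets in both encodings,
# so software that expects US-ASCII reads it unchanged.
ascii_text = "IGES file"
same_octets = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A string with non-ASCII characters needs more than one octet for
# each such character, and every one of those octets is above 7F.
mixed_text = "Größe"
mixed_octets = mixed_text.encode("utf-8")
ascii_octets_in_mixed = bytes(b for b in mixed_octets if b <= 0x7F)
```

Here `ascii_octets_in_mixed` holds only the octets of the ASCII letters, which is why an ASCII-based parser never mistakes part of a multi-octet character for a delimiter.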

Note: The Proposed Solution implements a proposal originated by Ed A. Reid on 4/24/00.

Proposed Solution

Add to References:
[IETF98] F. Yergeau, UTF-8, a Transformation Format of ISO 10646, Internet Engineering Task Force (IETF) RFC 2279, URL ftp://ftp.isi.edu/in-notes/rfc2279.txt, January 1998 (URL valid November 2000).

Replace the 2.2 title to read: "Section 2.2 File Formats"

Add into 2.2.2.3 String Data Type:
Unicode (ISO/IEC 10646-1) characters are permitted following the Hollerith delimiter, provided that Global parameter 27 (Character Set Identifier Flag) has been set to 2. Unicode shall be encoded using the UTF-8 [IETF98] character encoding scheme. A Unicode string shall not contain control characters (i.e., hexadecimal 0000 0000 through 0000 001F). Unicode characters are restricted to the ASCII subset for strings in Global Section parameters 1-6, 12, 15, 18, 25, and 26, and within the Terminate Section.
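
As a rough illustration of the restrictions in this paragraph, a checker might look like the sketch below. The function name `string_violations`, its parameters, and the message texts are hypothetical; the sketch also assumes the caller already knows which Global Section field (if any) the string belongs to, and it does not address how the Hollerith count is computed, since the proposal does not say.

```python
# Global Section fields whose strings stay restricted to the ASCII subset,
# taken from the list in the proposed text (1-6, 12, 15, 18, 25, and 26).
ASCII_ONLY_GLOBALS = frozenset({1, 2, 3, 4, 5, 6, 12, 15, 18, 25, 26})


def string_violations(text, charset_flag=1, global_index=None,
                      in_terminate=False):
    """Return a list of rule violations for one string constant.

    charset_flag is the value of the Character Set Identifier Flag
    (1 = ASCII, 2 = Unicode encoded as UTF-8).
    """
    problems = []
    is_ascii = all(ord(ch) <= 0x7F for ch in text)

    if charset_flag == 1:
        if not is_ascii:
            problems.append("non-ASCII character while flag is 1")
    elif charset_flag == 2:
        # Control characters 0000 0000 through 0000 001F are not allowed.
        if any(ord(ch) <= 0x1F for ch in text):
            problems.append("control character in Unicode string")
        restricted = global_index in ASCII_ONLY_GLOBALS or in_terminate
        if restricted and not is_ascii:
            problems.append("non-ASCII character in ASCII-only field")
    else:
        problems.append("unknown Character Set Identifier Flag value")
    return problems
```

For example, `string_violations("Größe", charset_flag=2)` returns an empty list, while the same string in Global Section field 3 is reported as a violation.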

Add to Table 1:
27    Integer    Character Set Identifier Flag for string data type

Add Section 2.2.4.3.27 Character Set Identifier Flag. This "required, default" field specifies the character set used in string data types. The default is 1, which is interpreted as the ASCII character set. A value of 2 specifies the Unicode (ISO/IEC 10646-1) multi-octet character set, encoded using the UTF-8 character encoding scheme [IETF98]. Note that UTF-8, one of the transformation formats of the universal character set (UCS), is an 8-bit encoding that also preserves backward compatibility with the full US-ASCII repertoire.
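
One way a reader of the file might act on this field is sketched below. The names `CHARSET_ENCODINGS` and `decode_string_octets` are hypothetical, and treating a missing (defaulted) field as `None` is an assumption of the example rather than something the proposal prescribes.

```python
# Mapping from the flag value to a Python codec name.
CHARSET_ENCODINGS = {1: "ascii", 2: "utf-8"}


def decode_string_octets(octets: bytes, charset_flag=None) -> str:
    """Decode the octets of a string constant according to the flag.

    A flag of None stands for a defaulted field and is treated as 1 (ASCII).
    """
    flag = 1 if charset_flag is None else charset_flag
    if flag not in CHARSET_ENCODINGS:
        raise ValueError(f"unsupported Character Set Identifier Flag: {flag}")
    # bytes.decode raises UnicodeDecodeError if the octets do not fit
    # the selected character set.
    return octets.decode(CHARSET_ENCODINGS[flag])
```

Because ASCII octets decode identically under both settings, an existing ASCII-only file reads the same whether the flag is defaulted or set to 2.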

Change the sentence in Figures 2 and 3 to read:
using ASCII or Unicode/UTF-8 characters in columns 1-72.


Posted for comment 9/20/00

