21 Jun 1995 - Preliminary Information
Character set issues are usually overlooked in the US. However, a World Wide Web has to confront the problem of displaying information in languages other than English. This is a fairly difficult problem that must be approached carefully.
The most complete solution would be Unicode, a two-byte character set that includes every modern language in the world. This may prove important in the future, but its use today is premature. A more modest solution is to use the ISO "8859" family of one-byte character sets. In particular, the ISO 8859-1 "Latin 1" character set supports all the Western European languages from Iceland, to the Nordic countries, to Italy.
There is little perspective in Connecticut about how people overseas actually configure their personal computers. The screen is a more powerful device and can support many different character sets. The keyboard is more constrained. Through the years there have been many different approaches to the keyboard entry of foreign language character sets. If SpHyDir is going to provide an easy to use editing environment, the data entry is an important part of the problem.
Without any user input, SpHyDir now caves in to the OS/2 System design. It embraces the IBM architecture of Code Pages. The assumption is that IBM sells hardware and software overseas and if it insists on pushing an architecture like Code Pages, then that must be how people are actually using the system. A few terms need to be defined:
A character set is a collection of characters that completely addresses a particular need. For example, the upper and lower case alphabet is a character set that can be used to express all the common names of people in the US (since names like "Sally2" and "Bi$$" don't occur). The minimal useful computer character set is the 94 characters in the ASCII set (although for many purposes you can get along without ~ ` { } or ^). Extensions to this character set exist to support particular foreign languages or special purposes (math, APL).
A font is a set of instructions for drawing each character in a character set on a screen or printer. The system normally uses a small set of bitmap fonts to display characters of normal size. Algorithmic fonts such as Microsoft's TrueType or Adobe's ATM fonts can be displayed in any size.
A code is a standard that assigns number values to every character in a character set, allowing those characters to be stored in a computer memory, on disk, or to be transmitted on a communications line. ASCII and EBCDIC are examples of codes. A code always has some control characters to represent the end of a line, a backspace, a tab, and other functions. In ASCII, the control character values are from 0 to 31 and in EBCDIC they are from 0 to 63.
A Code Page is (essentially) a character code in which all the control values have been removed and replaced with additional printable characters. Code Page is mostly an IBM term, though it has rubbed off on Microsoft. It allows a display or printer to have some additional special use characters that can be displayed in contexts where the normal functions of control characters are not needed.
When IBM designed the PC in 1980 there were no general international standards for character sets and code pages beyond the standard ASCII set. The PC created a Code Page by filling in the remaining 256-94=162 code locations with a haphazard collection of box drawing, international, and dingbat (club, face, "small house") characters. Years later this was designated in IBM terms as Code Page 437.
Later on during the 1980s, the International Standards Organization (ISO) finally developed a set of one-byte character sets that extended the ASCII standard to other character sets.
8859-1 covers Western Europe
8859-2 covers "Latin" Eastern Europe
8859-5 Cyrillic
8859-6 Arabic
8859-7 Greek
8859-8 Hebrew
8859-9 like 8859-1 but drop Iceland and add Turkey
The HTML 2.0 standard makes 8859-1 the default encoding for HTML documents. However, the HTTP and MIME standards allow a document to be encoded in any of the ISO 8859 family of code sets. It would be a mistake for SpHyDir to drop its USA-centered perspective only to adopt a slightly broader 8859-1 Western European perspective.
The "Latin 1" character set on which the 8859-1 code is based includes some characters which were not part of the IBM PC Code Page 437. Most of the vendors (Microsoft with Windows and NT, Adobe with PostScript and ATM, DEC) simply adopted 8859-1 as their standard code. IBM decided that it was too important to leave the basic box-drawing characters in their current location. Instead, they created Code Page 850, which includes all the Latin 1 characters but does not assign them to their 8859-1 code values.
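The difference between the two codes can be checked with a short Python sketch. It relies on Python's built-in `latin-1`, `cp850`, and `cp437` codecs, which are an assumption of this illustration and have nothing to do with how SpHyDir itself is written. The helper name `code_value` is hypothetical.

```python
# Compare where Latin 1 characters sit in ISO 8859-1, Code Page 850, and Code Page 437.
# Uses Python's standard codecs only; this is not SpHyDir code.

def code_value(char, encoding):
    """Return the one-byte code value of char in encoding, or None if it is absent."""
    try:
        return char.encode(encoding)[0]
    except UnicodeEncodeError:
        return None

# The extended Latin 1 characters occupy A0..FF in ISO 8859-1.
latin1_extended = [bytes([v]).decode("latin-1") for v in range(0xA0, 0x100)]

# Characters of Latin 1 that each Code Page cannot represent at all.
missing_850 = [c for c in latin1_extended if code_value(c, "cp850") is None]
missing_437 = [c for c in latin1_extended if code_value(c, "cp437") is None]

a_acute = "\u00c1"  # capital A with acute accent
print(hex(code_value(a_acute, "latin-1")))  # value in ISO 8859-1
print(hex(code_value(a_acute, "cp850")))    # a different value in Code Page 850
print(len(missing_850), len(missing_437))   # how many Latin 1 characters each lacks
```

The same character gets a different number in each code, and Code Page 437 simply has no slot for some of them.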
The OS/2 Presentation Manager has a dummy Code Page 1004 that reflects the ISO 8859-1 character values. However, this is not recognized as a "real" Code Page number by most of the commands and OS/2 services that deal with such things.
Before beating up on IBM, it should be noted that the ISO 8859-1 standard may not be quite as useful as it first appears. While it is fairly simple to display 256 different characters (or more) on a computer screen or printer, it is very difficult to squeeze all those characters on the keyboard. Any one-byte code page will have too many characters for easy keyboard input, but not enough characters to handle the total information system requirement.
Long before modern computers and laser printers made a complete 8-bit code set possible, foreign countries had adopted variations on the old 7-bit "ASCII" character set. The idea was to give up a character you don't need for one that is more important in your country. The characters ` ~ ! @ # $ % ^ { } [ ] \ | < > could be replaced with Æ or Þ. These substitutions created other Code Pages in which the foreign use characters have been placed in the familiar ASCII location, and the ASCII characters that they displaced have been put somewhere else.
There is also the problem that in any large publishing project, the character sets quickly expand beyond any 256 character subset. Beyond French and Spanish, there are Hebrew, Arabic, Cyrillic, Greek, and then the problem of mathematical symbols, special punctuation, and the stupid box drawing characters that caused all the trouble in the first place. Some of this you can handle with GIF files, but the rest pose a problem.
HTML and current World Wide Web practice address this issue with Entities. The characters that are not part of the standard ASCII set are referenced by name. An Entity reference to a character begins with "&", then contains the character name, and ends with ";". The special characters used in HTML syntax are converted to Entities, with < > and & referenced as &lt; &gt; and &amp; respectively. The Æ symbol is denoted &AElig; (short for "A-E ligature").
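As a small illustration of the notation, the sketch below builds an Entity reference from a name and looks up the character a name stands for. It uses the `name2codepoint` table from Python's standard `html.entities` module as a stand-in for the list of Entity names; the two helper functions are hypothetical and are not part of SpHyDir.

```python
from html.entities import name2codepoint  # standard table of HTML Entity names

def entity_reference(name):
    """Build an Entity reference: '&', then the character name, then ';'."""
    return "&" + name + ";"

def entity_character(name):
    """Return the character that an Entity name stands for."""
    return chr(name2codepoint[name])

print(entity_reference("AElig"))        # the reference as it appears in HTML
print(entity_character("AElig"))        # the A-E ligature itself
print(entity_character("lt"), entity_character("gt"), entity_character("amp"))
```

Note that nothing in the reference depends on a code value: the name alone identifies the character.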
Going back to the earlier analysis, the Entity name allows HTML to refer to a character in a character set without becoming dependent on any particular code mapping. While a code mapping would limit you to 256 characters, the range of possible names is unlimited. Entities also allow you to accommodate Code Pages that either reflect historical accident (the original PC Code Page 437) or National Use subsets.
The CODEPAGE statement in the OS/2 CONFIG.SYS dataset specifies first the default Code Page number, and then an alternate value. IBM normally makes 437 the default to support obsolete DOS utilities. In modern use, particularly when someone edits HTML files, it makes more sense to at least make 850 the default:
CODEPAGE=850,437
For more information, look up CODEPAGE in the Command Reference file in the Information folder.
SpHyDir does not change the current Code Page. The whole idea behind the current SpHyDir strategy is that whatever Code Page the user has currently selected must be familiar. The user must already know how to deal with it and how to comfortably enter data in the local language. So SpHyDir converts HTML use to the Code Page environment rather than trying to change OS/2 to some other character set.
The number of the current code page is used as a file extension. SpHyDir searches the root directory of the HTML library (determined from the HTMLLIB environment variable or the current directory when SpHyDir starts up). It looks for three files: ENTITIES.xxx, CHARIN.xxx, and CHAROUT.xxx where xxx is the Code Page number. SpHyDir is distributed with *.850 versions of these three files for the recommended Code Page 850.
The ENTITIES file is an ordinary text file with entries to map the Entity names to values in the current code page. For example, ENTITIES.850 begins with the lines:
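The file lookup described above might be sketched as follows. This is only an illustration under stated assumptions: the function name `table_paths` and its parameters are hypothetical, and the sketch treats an unset or empty HTMLLIB as "use the current directory", which is one reading of the rule.

```python
import os

def table_paths(code_page, environ=None, cwd=None):
    """Return the paths of the three table files for a Code Page number.

    The root is taken from the HTMLLIB environment variable, falling back
    to the current directory (fallback on an empty value is an assumption).
    """
    environ = os.environ if environ is None else environ
    root = environ.get("HTMLLIB") or (cwd if cwd is not None else os.getcwd())
    extension = str(code_page)  # the Code Page number is the file extension
    return {
        name: os.path.join(root, name + "." + extension)
        for name in ("ENTITIES", "CHARIN", "CHAROUT")
    }

# Example with an explicit, made-up library root:
paths = table_paths(850, environ={"HTMLLIB": "htmllib"})
for name in sorted(paths):
    print(name, os.path.exists(paths[name]), paths[name])
```

Each file is optional, so a caller would test for its existence before trying to load it.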
b5 Aacute Á Capital A, acute accent
b7 Agrave À Capital A, grave accent
b6 Acirc Â Capital A, circumflex accent
Only the first two items are significant. On the first line, "b5" is the hex representation of the value assigned to the character in the 850 code page and "Aacute" is the name of the entity (with the leading "&" and trailing ";" stripped off). The rest of the line is commentary.
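A reader for this format can be very small. The sketch below is a hypothetical parser, not the one in SpHyDir: it keeps the first two whitespace-separated items of each line and ignores the rest. Skipping lines with fewer than two items is an assumption, and the sample lines are given as Python strings rather than read from a Code Page 850 file.

```python
def parse_entities(lines):
    """Map Entity names to code values from ENTITIES.xxx-style lines."""
    table = {}
    for line in lines:
        fields = line.split(None, 2)   # hex value, Entity name, commentary
        if len(fields) < 2:
            continue                   # blank or short line: skipped (assumption)
        hex_value, name = fields[0], fields[1]
        table[name] = int(hex_value, 16)
    return table

sample = [
    "b5 Aacute \u00c1 Capital A, acute accent",
    "b7 Agrave \u00c0 Capital A, grave accent",
]
entities_850 = parse_entities(sample)
for name, value in entities_850.items():
    print(name, format(value, "02x"))
```

Because only the first two items matter, the commentary can contain any text, including the character itself.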
SpHyDir has a built-in knowledge of the &lt; &gt; and &amp; Entity names. These are also the only Entities that can be mapped to a code value below 80 hex. All the other Entity names that SpHyDir will process specially come from the ENTITIES.xxx file. However, if SpHyDir encounters an Entity name that is not defined in the file, it simply converts the "&" to a Smiley Face dingbat character and retains it in its Entity form in the Workarea and Text Edit windows. Later on the Smiley Face is turned back to "&" when the HTML is generated.
The CHARIN.xxx and CHAROUT.xxx files provide translate tables to handle files encoded in the ISO 8859-1 character set. The CHARIN table translates characters from the HTML file with code values from A0 to FF hex to the corresponding codes in the current Code Page. The CHAROUT file provides a table to translate Code Page characters with a hex value of 80 to FF to the external ISO 8859-1 set.
SpHyDir provides CHARIN.850 and CHAROUT.850. Since the 850 Code Page contains the entire Latin 1 character set, this appears to be a fairly reasonable arrangement. The user is free to create a CHARIN.437 to support the older PC character set, but since it does not contain all the characters in the Latin 1 alphabet, some characters may be lost on input. Also, the CHAROUT table cannot meaningfully translate PC dingbat characters (like the box drawing characters) that are not part of the 8859-1 set.
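To make the two tables concrete, here is a sketch that derives them from Python's standard `latin-1` and `cp850` codecs and applies them with `bytes.translate`. The on-disk format of CHARIN.xxx and CHAROUT.xxx is not described here, so the 256-byte table layout is an assumption, as is the choice to leave a value unchanged when it has no counterpart on the other side. The function name `build_tables` is hypothetical.

```python
def build_tables(code_page="cp850"):
    """Build 256-entry CHARIN and CHAROUT translate tables as bytes objects."""
    charin = bytearray(range(256))    # start from the identity mapping
    charout = bytearray(range(256))
    # CHARIN: ISO 8859-1 values A0..FF -> values in the Code Page.
    for value in range(0xA0, 0x100):
        char = bytes([value]).decode("latin-1")
        try:
            charin[value] = char.encode(code_page)[0]
        except UnicodeEncodeError:
            pass                      # not in the Code Page: left unchanged (assumption)
    # CHAROUT: Code Page values 80..FF -> ISO 8859-1 values.
    for value in range(0x80, 0x100):
        try:
            char = bytes([value]).decode(code_page)
            charout[value] = char.encode("latin-1")[0]
        except (UnicodeDecodeError, UnicodeEncodeError):
            pass                      # e.g. box drawing: no 8859-1 value, left unchanged
    return bytes(charin), bytes(charout)

charin, charout = build_tables()
from_file = bytes([0xC1, 0x20, 0xC6])        # A-acute, space, AE in ISO 8859-1
in_memory = from_file.translate(charin)      # the same text in Code Page 850
print(in_memory.hex(), in_memory.translate(charout).hex())
```

Calling `build_tables("cp437")` produces tables for the older PC Code Page, where the characters that 437 lacks stay untranslated, which is the loss on input mentioned above.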
Assuming that the user adopts this suggestion to make 850 the default Code Page:
If a CHARIN.850 file has been copied to the root directory of the HTML library, then immediately after reading in an HTML file somewhere in that library, SpHyDir uses the table in that file to translate any ISO 8859-1 extended code values to their corresponding Code Page values. Note that this simply shuffles one set of codes above hex 80 to another set of codes also above hex 80. Since all the HTML markup and entity names use standard ASCII characters below hex 80, the initial translation will not affect any of the subsequent syntax analysis.
If there is no CHARIN table, then any code value in the HTML file will be read into SpHyDir untranslated. It will display with whatever character the current Code Page assigns to that code value. However, without a CHAROUT table it will also be written back to HTML with its original code value. SpHyDir will not provide any help displaying or editing such characters, but it will not damage them if the user leaves them undisturbed.
When processing text, SpHyDir identifies an Entity from the leading "&". It will handle the &lt;, &gt;, and &amp; Entities automatically. Without an ENTITIES.850 file in the root directory of the HTML library, those are the only Entities that it knows about. With an ENTITIES file, it will look up any other Entity names that the file defines and replace the Entity reference with the code value in the current Code Page for the corresponding character. The character will then display normally in the Workplace document tree and in the Text Edit Window.
Any Entity name that is not matched against the file remains an Entity. Since SpHyDir wants the "&" character to edit normally, and since a lot of dingbat characters are available, the "&" introducer is replaced by the Smiley Face character whose code value (in all Code Pages) is 01.
When it goes to generate HTML, SpHyDir converts the Smiley Face dingbat back to an "&". Thus anything that displays as a Smiley Face Entity in the Text Edit window will become a regular HTML Entity in the final file. It will not be translated to anything else.
Any text in a Paragraph or Header that contains extended code values (above hex 80) will be checked against the table built from the ENTITIES.xxx file. If a match is found, the character is replaced with an Entity reference to the character name.
Any extended code values that do not match Entity names will be translated by the CHAROUT.xxx table should it exist. These characters will remain as single byte codes. However, if they are proper Latin 1 characters then they should be assigned their 8859-1 values and should display properly with a browser.
If there is no CHAROUT table, any character in SpHyDir memory will be written to the HTML file without translation. If this happens to be a valid 8859-1 character, then it will display on most browsers.
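The input and output steps above can be sketched as a round trip. This is a simplified illustration, not the SpHyDir implementation: it works on Python strings instead of Code Page bytes, it takes the Entity table as a hypothetical dictionary from name to character, it assumes that < > and & in the edited text are written back as Entities, and it leaves out the CHARIN/CHAROUT translation entirely.

```python
import re

SMILEY = "\x01"   # Smiley Face dingbat, code value 01, stands in for "&"
BUILTIN = {"lt": "<", "gt": ">", "amp": "&"}
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def read_text(html_text, entities):
    """Input side: resolve known Entities, mark unknown ones with a Smiley Face."""
    def resolve(match):
        name = match.group(1)
        if name in BUILTIN:
            return BUILTIN[name]
        if name in entities:
            return entities[name]
        return SMILEY + name + ";"    # unknown name stays in Entity form
    return ENTITY.sub(resolve, html_text)

def write_text(edit_text, entities):
    """Output side: turn the edited text back into HTML with Entity references."""
    by_char = {char: name for name, char in entities.items()}
    special = {char: name for name, char in BUILTIN.items()}
    out = []
    for char in edit_text:
        if char == SMILEY:
            out.append("&")                        # Smiley Face back to "&"
        elif char in special:
            out.append("&" + special[char] + ";")  # markup characters
        elif char in by_char:
            out.append("&" + by_char[char] + ";")  # named extended character
        else:
            out.append(char)   # a CHAROUT table, if present, would apply here
    return "".join(out)

entities = {"Aacute": "\u00c1"}                  # hypothetical one-entry table
source = "&Aacute; &lt; &foo; R&amp;D"
edited = read_text(source, entities)
print(repr(edited))
print(write_text(edited, entities))
```

The undefined name `foo` survives the trip untouched, which is the point of the Smiley Face substitution: the "&" that the user types stays an ordinary editable character, while an unknown Entity is never mistaken for one.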
Although 850 is the recommended International Code Page, many users may prefer other OS/2 Code Pages tailored to a particular country. It is trivial to generate another ENTITIES.xxx table file. Generating the CHARIN and CHAROUT tables is a bit more difficult, but the existing tables were generated with a C program and, given a bit more time, it may be possible for SpHyDir to provide these tables for other defined numbers:
852 Latin 2 (Czechoslovakia, Hungary, Poland)
857 Turkish
860 Portuguese
861 Iceland
863 Canada (French-speaking)
865 Nordic
This will remain an exercise unless some real user out on the Web reports that they use one of these Code Pages and would like them to be supported.
It is not clear if more is needed to support the right-to-left characters. For that matter, it is not clear if there is any Web Support for:
862 Hebrew-speaking
864 Arabic-speaking
Again, input from users would be helpful.
I have not studied the scope of the National Use Code Pages. They may not include all of the Latin 1 characters. Lacking official Entity names, and any usable Web standards, and any support from Browsers such as Netscape, it seems premature for PCLT to try to solve this problem all by itself. This is, however, an area where Entity notation has a substantial advantage over CHARIN/CHAROUT single character translation. A user with an Icelandic keyboard can still generate the occasional Turkish character as a named Entity even if that character cannot be natively displayed in the Text Edit Window. This is the reasoning behind the SpHyDir bias to generate output as Entity notation instead of as single byte 8859-1 encoding.
If this isn't exactly what you want, please E-mail Howard.Gilbert@yale.edu with additional suggestions.
Copyright 1995 PCLT -- SpHyDir Web Document Manager -- H. Gilbert
May be distributed with SpHyDir program
This document generated by SpHyDir, another fine product of PC Lube and Tune.