Handling Different Languages in HTML

A person might wonder how the different languages all around the world are represented in HTML. Is the HTML tag <HEAD> rendered as <头> in Chinese HTML, or <CABEÇA> in Portuguese HTML, for example? What a mess that would be if things worked out that way in computing! It turns out, however, that HTML has a remarkable solution to all of this which is at once simple, elegant, clean, and straightforward, and which is strong in both forward and backward compatibility. For those of us who speak English, it is also quite convenient. I will return to that presently.

With HTML 4.01 (and 4.0 as well), five new attributes and a special new tag were introduced in order for the HTML to reflect what language is being used. The first and most obvious of these is the LANG attribute. This attribute was added to nearly every tag or element that HTML 4.01 has, allowing the language of any part of an HTML file to be specifically identified, no matter where it is, in the existing tags used. But from the standpoint of the browser or other user agent, what exactly does the LANG attribute do? Let's start with a few things it doesn't do (and never will, for it was not so intended). It does not set the language to be used for the HTML (e. g. <HEAD> versus <头>). It does not modify the font style to change "A" into "א" for Hebrew, or "P" into "Π" for Greek, for example. It does not set the character encoding to be used (e. g. US-ASCII versus Shift-JIS). Most of all it does not invoke some babelfish translator to cause English text in the file to be displayed on the screen as being in the language called out.

While a mastery of all the different languages of the whole world (or even the tiny subset exampled or mentioned herein) is plainly beyond the reach of any of us mere mortals, mastery of the manner in which their various scripts are encoded in computers and used in HTML pages is not hard to attain at all. This can come in handy with HTML pages in that one may want to sprinkle the occasional foreign language phrase or unassimilated personal name into their site, e. g. René Descartes, Søren Kierkegaard, Ngô Đình Diệm, Ю́рий Андро́пов (Yuri Andropov), 胡锦涛 (Hu Jintao), and أسامة بن لادن (Osama bin Laden). For as they say in France, Vive la différence (or was it C'est la vie?). Technically, except for the one tag and few attributes that touch briefly on this, most of this is not HTML itself at all but a separate and related technology. However, as will be seen the language encoding and HTML technologies dovetail together quite nicely.

LANG merely informs as to what language the contents are written in. Even so, this has several uses. For example, search engines can be set to look only for certain languages, or even to exclude languages. One can set Google to return only pages written in Turkish, and a search engine could in principle do this by favoring pages that include LANG="tr" within one of their tags (in practice, search engines generally also examine the text itself rather than relying on the attribute alone). If using a speech synthesizer user agent (as for example for the blind to be able to web-surf), the speech synthesizer might be able to respond to the language by pronouncing it correctly, not trying to read French words as English, for example. Finally (and this is one you can experiment with here), if you have an editor which can do spell checks, and is sophisticated enough to spell check in multiple languages (such as the more recent versions of Microsoft Word), telling it what language a word or phrase is in will keep it from triggering misspelled word warnings on the foreign language words. See what happens when you cut and paste the following two lines from the screen into such an editor:

Notice that though the two lines look the same on the browser screen, the spell checker has issues with "aquí," "tu," and "Madre" only in the first one ("He" and "a" are OK in English). But with the second one it notes no spelling errors at all, because in the second case the LANG attribute has been used to inform the browser (which carries over to your editor when one does a cut and paste) that the contents of the paragraph are in Spanish. Instead of merely <P> (as is on the first of these two lines) it has <P LANG="es">. Here is a listing of the various two-letter language codes that LANG (and also HREFLANG which only belongs in <A> and <LINK>, plus the XHTML replacement for LANG which is xml:lang) may take. Many of the possible values for the CHARSET (and also ACCEPT-CHARSET which only belongs in <FORM>) attributes are demonstrated throughout this file. The other new attribute DIR and the new tag or element <BDO> are to be explained below when dealing with the encoding of Hebrew and Arabic.
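To make concrete what "informing the user agent" amounts to, here is a minimal sketch in Python of how a program might read LANG attributes out of a page. The class name, the helper function, and the sample markup are all made up for illustration; real browsers, spell checkers, and speech synthesizers do considerably more than this.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record (tag, lang) for every element that carries a LANG attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case,
        # so <P LANG="es"> arrives here as tag "p" with attribute "lang".
        for name, value in attrs:
            if name == "lang":
                self.found.append((tag, value))


def collect_langs(markup):
    """Return the list of (tag, language code) pairs found in the markup."""
    parser = LangCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


# Hypothetical sample markup, not taken from the page itself.
SAMPLE = '<P>Good morning</P>\n<P LANG="es">Buenos d&iacute;as</P>'

print(collect_langs(SAMPLE))  # [('p', 'es')]
```

Only the second paragraph is reported, since the first carries no LANG attribute and so says nothing about its language.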

But what about the data encoding? How can a browser know whether a file given to it is US-ASCII or Shift-JIS or whatever? And what good would it do to identify this within the file unless it already knew how to read the file in the first place? The nifty aspect of HTML is that no matter what the encoding (with only one slight variation in one extreme case), the exact sequence of bytes needed to make this identification will look exactly the same. There are four different places where this information can be contained, two of which are not in the file itself (and the other two are in the file). The first and most powerful place to put this information is in the HTTP header, something seen and used by your user agent but not itself displayed anywhere. This is something done at the server level, something way outside the scope of the HTML itself.
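The claim that the identifying bytes look the same regardless of encoding can be checked directly. The sketch below, in Python, encodes a charset declaration in several encodings and compares the raw bytes. The list of encodings is my own selection, and treating UTF-16 as an example of an encoding where the bytes do differ is an assumption on my part, not something stated above.

```python
# The declaration consists only of characters from the basic ASCII range.
DECLARATION = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">'

# Encodings that keep the ASCII range as single, unchanged bytes
# (an illustrative selection, using Python's codec names).
ASCII_COMPATIBLE = [
    "ascii", "latin-1", "cp437", "mac_roman",
    "cp1252", "iso8859_2", "shift_jis", "utf-8",
]

reference = DECLARATION.encode("ascii")

# For each encoding, does the declaration come out as the very same bytes?
same_bytes = {name: DECLARATION.encode(name) == reference
              for name in ASCII_COMPATIBLE}

# UTF-16 spends two bytes on every character, so its bytes cannot match.
utf16_differs = DECLARATION.encode("utf-16-le") != reference

for name, same in same_bytes.items():
    print(name, "identical" if same else "different")
print("utf-16-le", "different" if utf16_differs else "identical")
```

Because the declaration is byte-for-byte identical in all of the ASCII-compatible encodings, a user agent can find and read it before it knows which of them the rest of the file uses.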

In some servers (including mine), there is a special file used for setting this function, called .htaccess. This text file is used to define a number of settings for the server to use in transmitting your pages. In order to enable all the various encoding types illustrated here, the following are several short extracts from what has been added to that file for these pages:

The first line above sets a normal site-wide default of iso-8859-1 encoding, and each of the following trios of lines deals with specifying a different encoding for files whose names begin with lang. The effect is that if a file is named (for example) lang2abc.html and the server has the above commands in its .htaccess file, the HTTP header will contain a line that reads:

The next two levels down in priority are inside the file. The first of these only applies to XHTML, which has as an opening line something to the effect of <?xml version="1.0" encoding="UTF-8"?> (if for example the encoding type is UTF-8). The next level down is found in a <META>, as for example in this file, which has the line <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8"> to set the same encoding of utf-8. After XHTML 1.0, this method ceases to be acceptable, even as the LANG attribute disappears and gives way to the xml:lang attribute.

Finally, the last such specification in priority (it only has effect if no other means are used to indicate the encoding) is the CHARSET attribute used on the <A> or <LINK> tags or elements to indicate the encoding used on the file thus pointed to. In this file, the <LINK> tags to the foreign language files in the <HEAD> portion and also the <A> link file views at the bottom feature both the CHARSET and HREFLANG attributes. But what actually are these encodings, how do they work, and how can they be used by someone who wants to use them for foreign text in an HTML page? Each of them represents a way to store data in a form that the machine can recognize and know what to do with, how to display, and thereby be useful to the reader no matter what language he speaks, by making it possible for web pages to also be in any language.
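The order of priority among these four sources can be sketched as a small Python function. The function names and the fallback of iso-8859-1 are assumptions made for this illustration; this is a simplification of the order described here, not the algorithm any particular browser implements.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased, or None if absent


def pick_encoding(http_header=None, xml_declaration=None,
                  meta_tag=None, link_charset=None,
                  default="iso-8859-1"):
    """Return (encoding, source) from the highest-priority source present."""
    candidates = (
        ("HTTP header", http_header),          # set by the server
        ("XML declaration", xml_declaration),  # first line of an XHTML file
        ("META tag", meta_tag),                # inside the <HEAD>
        ("CHARSET attribute on the link", link_charset),  # on <A> or <LINK>
    )
    for source, value in candidates:
        if value:
            return value.lower(), source
    return default, "default"


# A hypothetical response whose header and META tag disagree.
header = charset_from_content_type("text/html; charset=ISO-8859-2")
print(pick_encoding(http_header=header, meta_tag="utf-8"))
# ('iso-8859-2', 'HTTP header')
```

When the header and the META tag disagree, as in the example, the header wins; the CHARSET attribute on a link only matters when nothing else is present.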

Data representation for storage and transmission has been a concern for computer and machine processing since Samuel Morse developed his first electric data transmission system, the telegraph, and encoded the alphabetic, punctuation, and numeric data using "dots" and "dashes" of varying lengths. With the age of digital computers, the attempt was made to reduce the numbers and letters to binary digit ("bit") patterns. Several early forms no longer in use include the Hollerith cards, a 5-bit pattern for punched paper tape, a 6-bit pattern called BCD, the Electronic Industries Association (EIA) eight-bit pattern for punched paper tape, and IBM's famous EBCDIC.

All of the above are thankfully relegated to the remote dust bin of ancient data formats no longer in use, having been displaced by an early and universal ANSI standard known as ASCII. Eight bits makes a nice even number, and only seven of them are needed to have a unique value for each letter, punctuation mark, and numerical digit, and still leave plenty of room for quite a selection of invisible (non-printing) control codes. The eighth bit could then be used (and was at various times and in various applications) as a parity bit (set to 0 or 1 depending on whether the remaining seven bits had an even or odd number of ones). Such a parity bit would be useful in data transmission or in data storage where there is some potential for the data to be damaged. An incorrect parity would then serve as a way to know that the data has been damaged. Of course in data storage there is little one can do if the data is damaged (other than not use it, or use it only conditionally, knowing that the data is untrustworthy), but in data transmission one can request that the incorrectly received message be transmitted again, and then detect whether it has arrived correctly this time or else ask for it to be retransmitted yet again.
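The parity scheme just described is easy to sketch in Python. The example below assumes even parity (the eighth bit is chosen so that the total number of one bits in the byte is even); odd parity works the same way with the opposite choice. The function names are made up for this illustration.

```python
def with_even_parity(code):
    """Return the 8-bit value for a 7-bit ASCII code, using the top bit
    as an even-parity bit."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity = ones % 2          # 1 only when the seven bits hold an odd count
    return code | (parity << 7)


def parity_ok(byte):
    """A received byte is plausible only if its count of one bits is even."""
    return bin(byte).count("1") % 2 == 0


sent = with_even_parity(ord("C"))   # "C" is 1000011: three ones, so parity 1
print(hex(sent))                    # 0xc3
print(parity_ok(sent))              # True

damaged = sent ^ 0b00000100         # flip one bit, as line noise might
print(parity_ok(damaged))           # False
```

A single flipped bit is caught, which is the cue to ask for retransmission. Two flipped bits in the same byte cancel out and go unnoticed, which is one reason more general error detection schemes eventually took over.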

Note: In each of the examples given below, the first part of what is contained in the green frame is simply in this file, showing what should come up if your browser understands that part of the utf-8 encoding of this file. Below that is an <OBJECT> window showing a separate file actually encoded in the encoding called out therein. If the two files seen together within a green frame do not look the same, your browser has trouble displaying at least one of them.

The listing and construction tables (and files, where the file needs characters that are illegal in UTF-8) can be used for generating files in the designated character set. When a letter is by itself, just copy the letter as it appears on the screen and paste it into place. When the letter is followed by something else in parentheses, copy and paste the thing in the parentheses to get the letter. This works well with recent versions of Microsoft Notepad using the Windows-1252 Code Page (the default for Notepad in Western Europe and the Americas).

The First and Most Basic Level: US-ASCII

As error detection and protection algorithms have gotten far more general and useful, the need to set aside a whole bit for each character has gone away and now there are effectively eight bits that can be used. Basic ASCII however only used the lower seven bits, with the eighth always being set to zero. This has been the case since long before the first HTML experiments were being made in 1990, and nearly all of the early HTML available on the net has been coded in the ASCII format. This format is still available and commonly used. It has the strength that it is universally accepted by any machine anywhere (not counting museum relics of pure historical interest only), without any trouble.

And this is the key to the universality of HTML as well. The HTML tags and attributes and hard-coded attribute values (such as TOP or JUSTIFY, as seen in ALIGN="JUSTIFY") and even the entity names (such as &aacute; and &#225;) are all coded in straight US-ASCII. No matter what other language any HTML file will be in, all the HTML tags and attributes and so forth will always be in one-byte seven-bit ASCII codes recognized by all actively used computers. And there is room for these basic ASCII one-byte values in all the foreign language codes as well, ensuring total backward compatibility for all the examples below, even the most sophisticated. So HTML has no <头> or <CABEÇA>; there is simply <HEAD> no matter what language the file text is written in. And, using numeric entities, any character that can be represented in the most modern standard, Unicode, can be represented by such things as &#7909; which results in "ụ."
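As a small illustration of how numeric entities keep a file in pure ASCII, here is a sketch in Python. The helper name is hypothetical; html.unescape is the standard library routine for turning entities back into characters.

```python
import html


def to_numeric_entities(text):
    """Replace every non-ASCII character with its decimal numeric entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


word = "fa\u00e7ade"                 # "façade", written with an escape for safety
encoded = to_numeric_entities(word)
print(encoded)                       # fa&#231;ade

# The result survives a strict ASCII encoder, so it fits in a US-ASCII file.
ascii_bytes = encoded.encode("ascii")

# A user agent reverses the process when it displays the page.
print(html.unescape(encoded) == word)           # True
print(html.unescape("&#7909;") == "\u1ee5")     # True
```

The numeric form works for any Unicode character, whereas a named entity exists only for a limited set of characters.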

For all that, the basic ASCII is useful, not only for English, but also for any number of other languages that use the standard English characters, such as Tagalog (and most other Philippine dialects), Indonesian, any number of languages for many of the tiny islands of Oceania, and even some other languages when using a Romanized Popular Alphabet (such as Hmong). Needless to say, other than English most of these languages are from rather small and poorly known regions. To move above this, and even to accommodate some small number of imported words even in these languages (though it is an acceptable custom to simply use the nearest ASCII counterpart, e. g. "facade" for "façade" or "naive" for "naïve" or "pina" (or even "pinya") for "piña", which is used in Tagalog and other Philippine dialects as the borrowed Spanish word for "pineapple"), at least some extension is needed, or else recourse to character or numeric entities. But for basic US-ASCII here is a sampling of how it should look, followed by an <OBJECT> window showing a file in the stated encoding:

A sample of English, Tagalog, and Indonesian using charset=US-ASCII (UTF-8):

Please Read This Message:

It does not seem possible to enjoy happiness on earth even for a short time. Sickness, aging, hunger, crime, insecurity and oppression often make life miserable.

Pakisuyong Basahin Ito:

Parang imposibleng tamasahin kahit ang sandaling kaligayahan sa lupa. Ang sakit, pagtanda, gutom, krimen, kawalang-siguro at kaapihan ang malimit na nagpapalungkot sa buhay.

Silakan Membaca Pesan Ini:

Rasanya seolah-olah tidak mungkin untuk menikmati kebahagiaan di bumi walaupun hanya sementara. Penyakit, usia tua, kelaparan, kejahatan, rasa tidak aman dan penindasan sering kali membuat hidup ini tak tertahankan.

Even in a pure US-ASCII file many non-ASCII characters can still be attained through the use of character entities. These can be specified either as names or as numeric values, but there are very many numeric values for which there is no corresponding name. This is the official and definitive list of the universally-recognized character entities.

The Next Level Up: Supplementing ASCII with Eight-bit Codes

Once it became practical to begin using the eighth bit, this doubled the total number of codes (letters, punctuations, control codes, etc.) possible. The first and most obvious use of these extended code points was to supplement the normal ASCII letters with additional letters, ligatures, and even the same letters, but with various diacritical marks added. This buys one a large range of languages which mostly use the same letters as English, and need only a few additional "letters" to round out their alphabet. The result would be text in which the ASCII letters and the particular needed supplementary "letters" would be intermixed. Furthermore, since many of these languages use many of the same supplementary "letters" (how many different languages use ü or é for example), it would not take all that many to support a tremendous array of the various languages.

Unfortunately there has also been a tremendous diversity as to how to use these upper 128 possible character-byte values. In the earliest days, there were a number of proprietary coding schemes, of which I illustrate two here. The first is that used by the MS-DOS operating system, now known (and sustained for backward compatibility by modern Microsoft products) as simply Code Page 437 ("CP437"). This primitive standard featured not only such letters as are needed for other languages but also any number of graphic characters for box drawing, mathematical symbols, shading, and various dingbats (smiley faces, card suits, etc.) as deemed useful to the programmers of MS-DOS. In this file I have confined my examples and construction tables to letters and punctuation, so the dingbats and box drawing characters are not shown herein. Another is that developed by Apple for the Macintosh, in some ways based on MS-DOS but also more advanced and much closer to present standards.
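The diversity in the upper 128 values is easy to see by asking several code pages for the byte they assign to one and the same letter. This Python sketch uses Python's own codec names (cp437, mac_roman, latin-1, cp1252); the choice of the letter "é" is arbitrary.

```python
LETTER = "\u00e9"   # the letter "é"

CODE_PAGES = ["cp437", "mac_roman", "latin-1", "cp1252"]

# The single byte each code page uses for the same letter.
encoded = {name: LETTER.encode(name) for name in CODE_PAGES}

for name, raw in encoded.items():
    print(f"{name:10s} 0x{raw[0]:02X}")
# cp437      0x82
# mac_roman  0x8E
# latin-1    0xE9
# cp1252     0xE9

# Read with the wrong code page, a correct byte turns into another letter:
print(b"\x82".decode("cp437"))      # é
print(b"\x82".decode("mac_roman"))  # Ç
```

The ASCII letters never suffer from this, which is why only the accented letters turn to nonsense when a file is read with the wrong code page.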

Such ancient and primitive proprietary standards would soon give way to the most thorough single-byte multi-language representation conceived, namely ISO 8859, as well as several special national or linguistic codes. But even these have their limitations, eventually giving way to Unicode, which encompasses all "letters" written in any known sort of language around the world.

Early Proprietary Supplements to ASCII

Code Page 437 can produce the following diacritical letters and ligatures and distinctive punctuation (including financial symbols) as listed here. I have not included it in this file but displayed another file, since some of the characters used in creating these letters are not valid characters for any modern HTML file and this included file cannot be validated. Furthermore, two characters (á and ¡) require additional intervention to be inserted in a file that would serve as MS-DOS format. Note also that there is no distinct character for the German sharp s ("ß") but rather that the Greek Beta ("β") is simply recycled to this purpose. Be all that as it may, here is what is available in CP437:

One should be able to see from the samples below that several languages can be easily covered using CP437.

A sample of German, French, Italian, and Swedish using charset=CP437 (UTF-8):

Lesen Sie bitte diese Botschaft:

Es scheint nicht einmal für kurze Zeit möglich zu sein, in Glück auf der Erde zu leben. Vielen wird das Leben durch Krankheit, Altersbeschwerden, Hunger, Kriminalität, Unsicherheit und Bedrückung schwergemacht.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

A sample of Spanish using charset=CP437 (UTF-8):

Hemos visto en el capítulo anterior qué inmenso tesoro de gracias recibió María en el primer instante de su Inmaculada Concepción; ahora bien, teniendo presentes las enseñanzas de los teólogos sobre la plenitud de la gracia de María, y habiéndose multiplicado en todos los instantes aquel inmenso caudal de gracias, de luces, de sabiduría y de virtudes, ¡cual sería el tesoro de merecimientos con que se hallaría enriquecida María el día de su nacimiento!

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

One should be able to see that despite its crudity, the old MS-DOS could represent much in the way of various languages, though many might be only partially or imperfectly supported. In particular, very few of the special letters appear in their capital forms, but in lower case the assortment is much more varied, and sufficient for the above languages, or at least for the samples of them provided here.

The other proprietary coding to be demonstrated here is that used by Apple Macintosh computers. Unlike CP437, the Macintosh code page can be validated by the World Wide Web Consortium's (W3C) HTML validator. From the following examples one should see that Macintosh encoding is good for these languages and more, but first, here is what is available in Macintosh:

Here, Macintosh is shown supporting all languages that MS-DOS CP 437 supported in the above examples.

A sample of German, French, Italian, and Swedish using charset=macintosh (UTF-8):

Lesen Sie bitte diese Botschaft:

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

In addition, it supports not only Spanish (as does MS-DOS CP 437) but also Danish:

A sample of Spanish and Danish using charset=macintosh (UTF-8):

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

Læs venligst dette budskab:

Selv for en kortere tid synes det umuligt at opnå lykke på jorden. Sygdom, sult, alderdom, kriminalitet, utryghed og undertrykkelsen kaster alt for ofte en mørk skygge over tilværelsen.

The Danish is made possible by the addition of the letter "ø" to the possible characters, but in fact Macintosh is capable of supporting many more languages than CP437. Yet even this record would be outdone by the capacity and flexibility of the first and most official standard today.

The First and Most General Standard, Latin-1 (ISO-8859-1)

ISO-8859-1 had its origins in the first and most basic extension of ASCII used by Digital Equipment Corporation (DEC), as supported by their VT220 terminal. Back then the standard was called the European Computer Manufacturers Association (ECMA) Standard 94: 8-Bit Single Byte Coded Graphic Character Sets - Latin Alphabets No. 1 to No. 4, and No. 1 was the most important since it is exactly what later became ISO-8859-1.

This standard is now so universal that even the most advanced Unicode still bases its first 256 code points on this standard. HTTP headers are presumed to be in this standard (and it is dangerous to use any other standard for HTTP headers), and any HTML file, if there is no calling out of its encoding anywhere, will be displayed by default as ISO-8859-1 unless the user sets special browser settings to something else. Despite the fact that ISO-8859-1 squanders a full 32 possible code points on rather theoretical (and very rarely used) control codes, the remaining 96 code points are specially directed to support at least the following languages: Danish, Dutch (not including the Dutch ligature "Ĳ" and "ĳ"), English, Faeroese, Finnish and French (missing only the very rarely used letters Š, š, Ž, ž, Œ, œ, and Ÿ), German, Icelandic, Irish, Italian, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swedish, Albanian, Indonesian, the Hmong Romanized Popular Alphabet, Tagalog (and most Philippine dialects), Afrikaans and Swahili.
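The statement that Unicode bases its first 256 code points on ISO-8859-1 can be verified mechanically. The Python sketch below decodes every possible byte value as ISO-8859-1 (Python calls the codec latin-1) and checks that each resulting character has a code point equal to the byte it came from.

```python
# Every possible single-byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Interpret them as ISO-8859-1 text.
decoded = all_bytes.decode("latin-1")

# Each character's Unicode code point should equal its original byte value.
matches = all(ord(ch) == value for ch, value in zip(decoded, all_bytes))
print(matches)                      # True

# The code point is shared, but the stored bytes are not: UTF-8 needs
# two bytes for everything above the ASCII range.
e_acute = "\u00e9"                  # the letter "é"
print(e_acute.encode("latin-1"))    # b'\xe9'
print(e_acute.encode("utf-8"))      # b'\xc3\xa9'
```

So the numeric entity &#233; names the same character under either encoding, even though a UTF-8 file and an ISO-8859-1 file store that character with different bytes.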

All the characters found in charset=iso-8859-1:
A, a, À, à, Á, á, Â, â, Ã, ã, Ä, ä, Å, å
B, b
C, c, Ç, ç
D, d, Ð, ð
E, e, È, è, É, é, Ê, ê, Ë, ë
F, f
G, g
H, h
I, i, Ì, ì, Í, í, Î, î, Ï, ï
J, j
K, k
L, l
M, m, µ
N, n, Ñ, ñ
O, o, Ò, ò, Ó, ó, Ô, ô, Õ, õ, Ö, ö, Ø, ø
P, p
Q, q
R, r
S, s, ß
T, t
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü
V, v
W, w
X, x
Y, y, Ý, ý, ÿ
Z, z
Æ, æ
Þ, þ
¡, ¿, «, »
$, ¢, £, ¤, ¥

Notice in the above listing of the letters available there are none in parentheses, since every letter is simply itself, with the same code point in the UTF-8 of this file and in iso-8859-1.

A sample of German, Italian, Swedish, and Icelandic using charset=iso-8859-1 (UTF-8):

Lesen Sie bitte diese Botschaft:

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

Við biðjum þig að lesa þetta:

Ekki virðist unnt að vera hamingjusamur hér á jörðinni einu sinni um skamman tíma. Veikindi, öldrun, hungur, glæpir, öryggisleysi og kúgun gera mönnum tilveruna oft óbærilega.

As you can see from the above, iso-8859-1 supports the above languages plus still more languages. And here is another sample:

A sample of Spanish, Danish, and French using charset=iso-8859-1 (UTF-8):

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

Læs venligst dette budskab:

Selv for en kortere tid synes det umuligt at opnå lykke på jorden. Sygdom, sult, alderdom, kriminalitet, utryghed og undertrykkelsen kaster alt for ofte en mørk skygge over tilværelsen.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

Microsoft's equivalent to ISO-8859-1, Windows-1252

Microsoft utilized some of the code space squandered in ISO 8859 for the mostly mythical additional control codes to add several letters needed to complete the support of French and Finnish, and also to add some punctuation marks and financial symbols.
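The following Python sketch shows where the added letters live. It takes the seven letters named earlier as missing from ISO-8859-1 (Š, š, Ž, ž, Œ, œ, Ÿ), confirms that ISO-8859-1 cannot encode them, and prints the byte Windows-1252 assigns to each. The helper name is made up; cp1252 and latin-1 are Python's names for the two encodings.

```python
# The letters ISO-8859-1 lacks for Finnish and French.
ADDED = "\u0160\u0161\u017d\u017e\u0152\u0153\u0178"   # Š š Ž ž Œ œ Ÿ


def latin1_can_encode(ch):
    """True if the character has a byte value in ISO-8859-1."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# The single byte Windows-1252 uses for each added letter.
cp1252_bytes = {ch: ch.encode("cp1252") for ch in ADDED}

for ch, raw in cp1252_bytes.items():
    print(ch, f"0x{raw[0]:02X}", "in ISO-8859-1:", latin1_can_encode(ch))
```

Every one of these bytes falls between 0x80 and 0x9F, exactly the range that ISO 8859 sets aside for the additional control codes.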

Notice in the above listing of the letters available there are none in parentheses. Every letter is simply itself, since I have been using Windows-1252 for each of the construction files.

A sample of German, Italian, Swedish, and Icelandic using charset=Windows-1252 (UTF-8):

Lesen Sie bitte diese Botschaft:

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

Við biðjum þig að lesa þetta:

Ekki virðist unnt að vera hamingjusamur hér á jörðinni einu sinni um skamman tíma. Veikindi, öldrun, hungur, glæpir, öryggisleysi og kúgun gera mönnum tilveruna oft óbærilega.

As you can see from the above, Windows-1252 supports the same languages as ISO-8859-1 except that its coverage of French and Finnish is completed. And here is another sample:

A sample of Spanish, Danish, and French using charset=Windows-1252 (UTF-8):

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

Læs venligst dette budskab:

Selv for en kortere tid synes det umuligt at opnå lykke på jorden. Sygdom, sult, alderdom, kriminalitet, utryghed og undertrykkelsen kaster alt for ofte en mørk skygge over tilværelsen.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

The French ligature "œ" is not included in the basic vanilla ISO-8859-1, though most user agents recognize the character. Windows-1252 adds this ligature (and a few other letters), making the display of this letter more widely possible. The ISO standard would later make similar corrections by deleting some of the less needed punctuation and fractions, thus still reserving the control code space so carefully spared by all of ISO 8859. See the ISO 8859-15 standard below.

The Other Latin-extension parts of the ISO 8859 Standard

ISO-8859-1 is only the first of quite a series of differing standards which among them provide for a great many more languages than ISO-8859-1 alone can provide. By merely specifying which variant of ISO-8859 one wants to use one can obtain quite a further assortment of other letters with other diacritical marks not contained in the basic vanilla ISO-8859-1. Basically, when one specifies some other variant, e. g. ISO-8859-2, a substantially different set of upper-128 byte value letters is made available, yet it is all considered part of the same standard, vetted and approved by the same body that originally vetted and approved the original basic ISO-8859-1. In planning these other variants of ISO-8859, one thing that was done is that where different language groups share certain letters/diacritical mark combinations with ISO-8859-1, they use the same code points where reasonably possible. For example the German letters "Ä," "Ö," "Ü," "ß," "ä," "ö," and "ü" all occur in the same code point locations in each of ISO-8859-1 through -4, -9 and -10, and -13 through -16, the last. This would allow German to be blended with any language supported by any of these ISO 8859 variants. And of course the basic vanilla ASCII characters needed by English are supported by all ISO-8859 variants, which comes in handy for the English-Language HTML codes needed for the HTML of any language.
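The claim about the German letters can be checked with a few lines of Python, using Python's codec names for the variants (iso8859_1, iso8859_2, and so on). The list of variants is the one given in the paragraph above.

```python
GERMAN = "\u00c4\u00d6\u00dc\u00df\u00e4\u00f6\u00fc"   # Ä Ö Ü ß ä ö ü

# ISO-8859-1 through -4, -9 and -10, and -13 through -16.
VARIANTS = [1, 2, 3, 4, 9, 10, 13, 14, 15, 16]

# Encode the same seven letters in every listed variant.
encodings = {n: GERMAN.encode(f"iso8859_{n}") for n in VARIANTS}

# If the letters share their code points, all the byte strings are equal.
all_same = len(set(encodings.values())) == 1

print(all_same)                 # True
print(encodings[1].hex(" "))    # c4 d6 dc df e4 f6 fc
```

The same test applied to one of the skipped variants, such as iso8859_5, raises a UnicodeEncodeError, since that variant gives these positions to a different alphabet.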

In the ISO-8859 suite of character sets, there are allocated 16 basic variants, with a couple subvarieties of a couple of them, and one other which is not used. In this first level of coding schemes that enable multiple languages, the overall scheme used by the proprietary coding schemes demonstrated above is used, namely that normal ASCII characters would have any non-ASCII letters simply intermixed into the flow of the ASCII text as needed by the demands of the language being used. The first four of these standards are all based on the ECMA Standard 94 Latin Alphabets No. 1 through No. 4. You will note that there are several numbers skipped here. These will be introduced in a later section regarding Dual Language extensions, which represents a different approach, despite being also parts of the same ISO-8859 standard. The versions here are all called "ISO Latin dash numbers," 1 through 10. For the first four, their ISO Latin numbers coincide with the numbers in the dash after the 8859, e. g. ISO Latin-3 is ISO-8859-3, and so forth. However, ISO Latin-5 is actually ISO-8859-9, and ISO Latin-10 is ISO-8859-16. The skip is because of those other versions which are called "ISO Latin slash Languages," Russian, Arabic, Greek, Hebrew, Thai, and a skipped position originally intended for another language. So for example ISO Latin/Hebrew would be the same as ISO-8859-8.

The First Standard for Central and Eastern Europe, Latin-2 (ISO-8859-2)

For Central and Eastern Europe, the following languages are particularly supported by the second variant of ISO 8859, Latin-2. This variant makes it possible to support Bosnian, Polish, Croatian, Czech, Slovak, Slovenian, Hungarian, Irish, Serbian (if one uses the Latin transcription), and Sorbian (Lusatian), along with English, German, Finnish, and Albanian as supported by ISO-8859-1. It is often replaced with the newer ISO-8859-16, which provides much of the same support but drops Bosnian, Czech, and Slovak, and adds Irish Gaelic, Italian, Romanian, and French.

All the characters found in charset=iso-8859-2:
A, a, Á, á, Â, â, Ä, ä, Ą(¡), ą(±), Ă(Ã), ă(ã)
B, b
C, c, Ç, ç, Ć(Æ), ć(æ), Č(È), č(è)
D, d, Ď(Ï), ď(ï), Đ(Ð), đ(ð)
E, e, É, é, Ë, ë, Ę(Ê), ę(ê), Ě(Ì), ě(ì)
F, f
G, g
H, h
I, i, Í, í, Î, î
J, j
K, k
L, l, Ł(£), ł(³), Ľ(¥), ľ(µ), Ĺ(Å), ĺ(å)
M, m
N, n, Ń(Ñ), ń(ñ), Ň(Ò), ň(ò)
O, o, Ó, ó, Ô, ô, Ö, ö, Ő(Õ), ő(õ)
P, p
Q, q
R, r, Ŕ(À), ŕ(à), Ř(Ø), ř(ø)
S, s, Ś(¦), ś(¶), Š(©), š(¹), Ş(ª), ş(º), ß
T, t, Ť(«), ť(»), Ţ(Þ), ţ(þ)
U, u, Ú, ú, Ü, ü, Ů(Ù), ů(ù), Ű(Û), ű(û)
V, v
W, w
X, x
Y, y, Ý, ý
Z, z, Ź(¬), ź(¼), Ž(®), ž(¾), Ż(¯), ż(¿)
$, ¤
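In the list above, the character in parentheses appears to be what the same byte value would display as if the file were read as ISO-8859-1. A small Python sketch of that relationship follows; latin1_view is a hypothetical helper name of my own.

```python
def latin1_view(ch: str, charset: str) -> str:
    """Encode ch in charset, then read the resulting byte as ISO-8859-1."""
    return ch.encode(charset).decode("iso-8859-1")

for ch in "ĄŁŠŽ":
    byte = ch.encode("iso-8859-2")[0]
    shown = latin1_view(ch, "iso-8859-2")
    print(f"{ch} is byte 0x{byte:02X}, which ISO-8859-1 would show as {shown}")
```

The same helper works for the other lists in this section by changing the charset name.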

A sample of Czech, Slovak, Hungarian and Croatian using charset=iso-8859-2 (UTF-8):

Prosím, čtěte toto poselství:

Zdá se, že není možné těšit se ze štěstí na Zemi ani po krátkou dobu. Nemoci, stárnutí, hlad, zločiny, nejistota a útlak - to jsou věci, které činí život strastiplným.

Prosím, čítajte toto posolstvo:

Zdá sa, že nie je možné tešiť sa zo šťastia na Zemi dokonca ani počas krátkej doby. Choroby, starnutie, hlad, zločiny, neistota a útlak často naplňujú život strasťami.

Kérem olvassa el ezt a hírt:

Úgy tűnik, nem lehetséges, hogy ha csak rövid ideig is, boldogságnak örvendhessünk a földön. Betegség, öregség, éhinség, bűnözés, bizonytalanság és elnyomás gyakran egy boldogtalan élet okozói.

Molim Vas pročitajte ovu vijest:

Izgleda nemoguće makar samo za kratko vrijeme radovati se sreći na Zemlji. Bolest, starost, glad, kriminal, nesigurnost i tlačenje, često doprinose nesretnom životu.

A sample of Polish using charset=iso-8859-2 (UTF-8):

Proszę przeczytać

Zaznawanie niczym niezmąconego szczęścia na ziemi - nawet przez krótki czas - wydaje się niemożliwe. Zbyt często uprzykrzają życie choroby, starzenie się, głód, przestępczość, różne niebezpieczeństwa lub ucisk.

Z papieskiej przysięgi koronacyjnej:

Przysięgam nie zmieniać niczego z przekazanej mi tradycji, ani niczego, co było przede mną strzeżone przez mych miłych Bogu poprzedników, ani nie naruszać, ani nie zmieniać, ani nie zezwalać na zmiany.

Microsoft's equivalent to ISO-8859-2, Windows-1250

The First Standard for Southern Europe, Latin-3 (ISO-8859-3)

The third variant ISO-8859-3 is primarily meant to serve Turkish, Maltese, and Esperanto, along with English, Dutch, German, Italian, Spanish, French (minus the same letters), and Afrikaans as supported by ISO-8859-1. No other ISO 8859 variant provides for Maltese and Esperanto, though Turkish is provided for in ISO 8859-9.

All the characters found in charset=iso-8859-3:
A, a, À, à, Á, á, Â, â, Ä, ä
B, b
C, c, Ç, ç, Ċ(Å), ċ(å), Ĉ(Æ), ĉ(æ)
D, d
E, e, È, è, É, é, Ê, ê, Ë, ë
F, f
G, g, Ğ(«), ğ(»), Ġ(Õ), ġ(õ), Ĝ(Ø), ĝ(ø)
H, h, Ħ(¡), ħ(±), Ĥ(¦), ĥ(¶)
I, i, Ì, ì, Í, í, Î, î, Ï, ï, İ(©), ı(¹)
J, j, Ĵ(¬), ĵ(¼)
K, k
L, l
M, m, µ
N, n, Ñ, ñ
O, o, Ò, ò, Ó, ó, Ô, ô, Ö, ö
P, p
Q, q
R, r
S, s, Ş(ª), ş(º), Ŝ(Þ), ŝ(þ), ß
T, t
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü, Ŭ(Ý), ŭ(ý)
V, v
W, w
X, x
Y, y
Z, z, Ż(¯), ż(¿)
$, £, ¤

A sample of Maltese and Turkish using charset=iso-8859-3 (UTF-8):

Jekk Jogħġbok Aqra Dan il-Messaġġ:

Ma tanx tidher li hi ħaġa possibbli li wieħed igawdi l-ferħ fuq l-art anki għal żmien qasir. Il-mard, ix-xjuħija, il-ġuħ, id-delitti, in-nuqqas ta' sigurtà u l-moħqrija spiss jagħmlu l-ħajja miżerja.

Lütfen Bu Mesajı Okuyun:

Şimdi yeryüzünde kısa da olsa, mutlu olarak yaşamanın mümkün olmadığı görülüyor. Hastalık, yaşlılık, açlık, cürüm, güvensizlik ve baskı, hayatı mutsuz hale getiriyor.

The First Standard for Northern Europe, Latin-4 (ISO-8859-4)

ISO-8859-4 is the first of three meant to address the Northern European languages, in this case Estonian, Latvian, Lithuanian, Greenlandic, and Sami. ISO 8859 parts 10 (Latin-6) and 13 (Latin-7) also attempt to cover pretty much the same set of languages.

All the characters found in charset=iso-8859-4:
A, a, Á, á, Â, â, Ã, ã, Ä, ä, Å, å, Ą(¡), ą(±), Ā(À), ā(à)
B, b
C, c, Č(È), č(è)
D, d, Đ(Ð), đ(ð)
E, e, É, é, Ë, ë, Ē(ª), ē(º), Ę(Ê), ę(ê), Ė(Ì), ė(ì)
F, f
G, g, Ģ(«), ģ(»)
H, h
I, i, Í, í, Î, î, Ĩ(¥), ĩ(µ), Į(Ç), į(ç), Ī(Ï), ī(ï)
J, j
K, k, Ķ(Ó), ķ(ó), ĸ(¢)
L, l, Ļ(¦), ļ(¶)
M, m
N, n, Ŋ(½), ŋ(¿), Ņ(Ñ), ņ(ñ)
O, o, Ô, ô, Õ, õ, Ö, ö, Ø, ø, Ō(Ò), ō(ò)
P, p
Q, q
R, r, Ŗ(£), ŗ(³)
S, s, Š(©), š(¹), ß
T, t, Ŧ(¬), ŧ(¼)
U, u, Ú, ú, Û, û, Ü, ü, Ų(Ù), ų(ù), Ũ(Ý), ũ(ý), Ū(Þ), ū(þ)
V, v
W, w
X, x
Y, y
Z, z, Ž(®), ž(¾)
Æ, æ
$, ¤

A sample of Lithuanian using charset=iso-8859-4 (UTF-8):

Iš popiežaus karūnavimo priesaikos:

Aš prisiekiu nieko nekeisti gautoje tradicijoje, ir nekeisti nieko tame, ką iki manęs saugojo mano Dievo palaiminti pirmtakai, prisiekiu nesikėsinti į ją, ir neleisti įvesti jokių naujovių.

Microsoft's equivalent to ISO-8859-4, Windows-1257

The Second Standard for Southern Europe, Latin-5 (ISO-8859-9)

This variant is almost identical to the basic vanilla ISO-8859-1, except that a few letters useful for Icelandic are subtracted in favor of other letters useful for Turkish. As one can also see here, the support of many other European languages is not impaired in comparison to iso-8859-1.
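The swap of Icelandic letters for Turkish ones can be listed mechanically by comparing the two charsets byte by byte. This is a minimal sketch using Python's built-in codecs; differing_bytes is a hypothetical helper name.

```python
def differing_bytes(charset_a: str, charset_b: str) -> list[int]:
    """Byte values that decode to different characters in the two charsets."""
    return [
        b for b in range(256)
        if bytes([b]).decode(charset_a) != bytes([b]).decode(charset_b)
    ]

diff = differing_bytes("iso-8859-1", "iso-8859-9")
for b in diff:
    raw = bytes([b])
    print(f"0x{b:02X}: {raw.decode('iso-8859-1')} becomes {raw.decode('iso-8859-9')}")
```

Only six byte values differ, which is why the support of the other Western European languages is not impaired.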

All the characters found in charset=iso-8859-9:
A, a, À, à, Á, á, Â, â, Ã, ã, Ä, ä, Å, å
B, b
C, c, Ç, ç
D, d
E, e, È, è, É, é, Ê, ê, Ë, ë
F, f
G, g, Ğ(Ð), ğ(ð)
H, h
I, i, Ì, ì, Í, í, Î, î, Ï, ï, İ(Ý), ı(ý)
J, j
K, k
L, l
M, m, µ
N, n, Ñ, ñ
O, o, Ò, ò, Ó, ó, Ô, ô, Õ, õ, Ö, ö, Ø, ø
P, p
Q, q
R, r
S, s, Ş(Þ), ş(þ), ß
T, t
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü
V, v
W, w
X, x
Y, y, ÿ
Z, z
Æ, æ
¡, ¿, «, »
$, ¢, £, ¤, ¥

A sample of Spanish, Danish, French, German, Italian, Swedish, and Turkish using charset=iso-8859-9 (UTF-8):

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

Læs venligst dette budskab:

Selv for en kortere tid synes det umuligt at opnå lykke på jorden. Sygdom, sult, alderdom, kriminalitet, utryghed og undertrykkelsen kaster alt for ofte en mørk skygge over tilværelsen.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

Lesen Sie bitte diese Botschaft:

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

Lütfen Bu Mesajı Okuyun:

Şimdi yeryüzünde kısa da olsa, mutlu olarak yaşamanın mümkün olmadığı görülüyor. Hastalık, yaşlılık, açlık, cürüm, güvensizlik ve baskı, hayatı mutsuz hale getiriyor.

Microsoft's equivalent to ISO-8859-9, Windows-1254

The Second Standard for Northern Europe, Latin-6 (ISO-8859-10)

All the characters found in charset=iso-8859-10:
A, a, Á, á, Â, â, Ã, ã, Ä, ä, Å, å, Ā(À), ā(à), Ą(¡), ą(±)
B, b
C, c, Č(È), č(è)
D, d, Ð, ð, Đ(©), đ(¹)
E, e, É, é, Ë, ë, Ę(Ê), ę(ê), Ė(Ì), ė(ì), Ē(¢), ē(²)
F, f
G, g, Ģ(£), ģ(³)
H, h
I, i, Í, í, Î, î, Ï, ï, Į(Ç), į(ç), Ī(¤), ī(´), Ĩ(¥), ĩ(µ)
J, j
K, k, Ķ(¦), ķ(¶), ĸ(ÿ)
L, l, Ļ(¨), ļ(¸)
M, m
N, n, Ņ(Ñ), ņ(ñ), Ŋ(¯), ŋ(¿)
O, o, Ó, ó, Ô, ô, Õ, õ, Ö, ö, Ø, ø, Ō(Ò), ō(ò)
P, p
Q, q
R, r
S, s, Š(ª), š(º), ß
T, t, Ŧ(«), ŧ(»)
U, u, Ú, ú, Û, û, Ü, ü, Ũ(×), ũ(÷), Ų(Ù), ų(ù), Ū(®), ū(¾)
V, v
W, w
X, x
Y, y, Ý, ý
Z, z, Ž(¬), ž(¼)
Æ, æ
Þ, þ
$

A sample of Lithuanian using charset=iso-8859-10 (UTF-8):

Iš popiežaus karūnavimo priesaikos:

Aš prisiekiu nieko nekeisti gautoje tradicijoje, ir nekeisti nieko tame, ką iki manęs saugojo mano Dievo palaiminti pirmtakai, prisiekiu nesikėsinti į ją, ir neleisti įvesti jokių naujovių.

The Standard for the Baltic Rim of Europe, Latin-7 (ISO-8859-13)

This is the third and last attempt to address the Northern European languages with an ISO 8859 variant. This last adds a few rarely used characters lacking in the other two.

All the characters found in charset=iso-8859-13:
A, a, Ä, ä, Å, å, Ą(À), ą(à), Ā(Â), ā(â)
B, b
C, c, Ć(Ã), ć(ã), Č(È), č(è)
D, d
E, e, É, é, Ę(Æ), ę(æ), Ē(Ç), ē(ç), Ė(Ë), ė(ë)
F, f
G, g, Ģ(Ì), ģ(ì)
H, h
I, i, Į(Á), į(á), Ī(Î), ī(î)
J, j
K, k, Ķ(Í), ķ(í)
L, l, Ļ(Ï), ļ(ï), Ł(Ù), ł(ù)
M, m, µ
N, n, Ń(Ñ), ń(ñ), Ņ(Ò), ņ(ò)
O, o, Ó, ó, Õ, õ, Ö, ö, Ø(¨), ø(¸), Ō(Ô), ō(ô)
P, p
Q, q
R, r, Ŗ(ª), ŗ(º)
S, s, Š(Ð), š(ð), Ś(Ú), ś(ú), ß
T, t
U, u, Ü, ü, Ų(Ø), ų(ø), Ū(Û), ū(û)
V, v
W, w
X, x
Y, y
Z, z, Ź(Ê), ź(ê), Ż(Ý), ż(ý), Ž(Þ), ž(þ)
Æ(¯), æ(¿)
«, », “(´), ”(¡), „(¥), ’(ÿ)
$, ¢, £, ¤

A sample of Polish and Lithuanian using charset=iso-8859-13 (UTF-8):

Proszę przeczytać

Z papieskiej przysięgi koronacyjnej:

Iš popiežaus karūnavimo priesaikos:

Aš prisiekiu nieko nekeisti gautoje tradicijoje, ir nekeisti nieko tame, ką iki manęs saugojo mano Dievo palaiminti pirmtakai, prisiekiu nesikėsinti į ją, ir neleisti įvesti jokių naujovių.

The Standard for the Celtic Languages of Europe, Latin-8 (ISO-8859-14)

All the characters found in charset=iso-8859-14:
A, a, À, à, Á, á, Â, â, Ã, ã, Ä, ä, Å, å
B, b, Ḃ(¡), ḃ(¢)
C, c, Ç, ç, Ċ(¤), ċ(¥)
D, d, Ḋ(¦), ḋ(«)
E, e, È, è, É, é, Ê, ê, Ë, ë
F, f, Ḟ(°), ḟ(±)
G, g, Ġ(²), ġ(³)
H, h
I, i, Ì, ì, Í, í, Î, î, Ï, ï
J, j
K, k
L, l
M, m, Ṁ(´), ṁ(µ)
N, n, Ñ, ñ
O, o, Ò, ò, Ó, ó, Ô, ô, Õ, õ, Ö, ö, Ø, ø
P, p, Ṗ(·), ṗ(¹)
Q, q
R, r
S, s, Ṡ(»), ṡ(¿), ß
T, t, Ṫ(×), ṫ(÷)
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü
V, v
W, w, Ẁ(¨), ẁ(¸), Ẃ(ª), ẃ(º), Ẅ(½), ẅ(¾), Ŵ(Ð), ŵ(ð)
X, x
Y, y, Ý, ý, Ỳ(¬), ỳ(¼), Ÿ(¯), ÿ, Ŷ(Þ), ŷ(þ)
Z, z
Æ, æ
$, £

The Second Attempt at a General Standard, Latin-9 (ISO-8859-15)

This is a revision of ISO-8859-1 which finally supplies the few missing letters for French. The goal was to provide all the same letters as the Microsoft Code Page 1252. It does this by replacing some non-letters (rarely used symbols), so it is not a strict superset of ISO-8859-1.

All the characters found in charset=iso-8859-15:
A, a, À, à, Á, á, Â, â, Ã, ã, Ä, ä, Å, å
B, b
C, c, Ç, ç
D, d, Ð, ð
E, e, È, è, É, é, Ê, ê, Ë, ë
F, f
G, g
H, h
I, i, Ì, ì, Í, í, Î, î, Ï, ï
J, j
K, k
L, l
M, m, µ
N, n, Ñ, ñ
O, o, Ò, ò, Ó, ó, Ô, ô, Õ, õ, Ö, ö, Ø, ø
P, p
Q, q
R, r
S, s, Š(¦), š(¨), ß
T, t
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü
V, v
W, w
X, x
Y, y, Ý, ý, Ÿ(¾), ÿ
Z, z, Ž(´), ž(¸)
Æ, æ, Œ(¼), œ(½)
Þ, þ
¡, ¿, «, »
$, €(¤), ¢, £, ¥

A sample of German, Italian, Swedish, and Icelandic using charset=ISO-8859-15 (UTF-8):

Lesen Sie bitte diese Botschaft:

La prego di leggere questo messaggio:

La felicità sulla terra sembra quasi irrealizzabile. Malattie, vecchiaia, fame, criminalità, insicurezza e oppressione spesso rendono la vita infelice.

Var god läs detta meddelande:

Det verkar inte vara möjligt att åtnjuta lycka på jorden ens en kort tid. Sjukdom, åldrande, hunger, brottslighet, osäkerhet och förtryck gör ofta livet olyckligt.

Við biðjum þig að lesa þetta:

Ekki virðist unnt að vera hamingjusamur hér á jörðinni einu sinni um skamman tíma. Veikindi, öldrun, hungur, glæpir, öryggisleysi og kúgun gera mönnum tilveruna oft óbærilega.

As you can see from the above, ISO-8859-15 supports the same languages as ISO-8859-1, with its coverage of French and Finnish now completed. In this it matches Windows-1252, although the few added characters sit at different code points. And here is another sample:

A sample of Spanish, Danish, and French using charset=ISO-8859-15 (UTF-8):

Sírvase leer este mensaje:

Ni siquiera por poco tiempo parece posible disfrutar de felicidad en la Tierra. La enfermedad, la vejez, el hambre, el crimen, la inseguridad y la opresión suelen causar sufrimiento en la vida.

Læs venligst dette budskab:

Selv for en kortere tid synes det umuligt at opnå lykke på jorden. Sygdom, sult, alderdom, kriminalitet, utryghed og undertrykkelsen kaster alt for ofte en mørk skygge over tilværelsen.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.
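As noted above, ISO-8859-15 and Windows-1252 carry the same added characters but at different code points. The following Python sketch prints both byte values for each of them; the string ADDED is my own listing of the characters marked with parentheses in the ISO-8859-15 list, and in_latin1 is a hypothetical helper.

```python
ADDED = "€ŠšŽžŒœŸ"  # characters ISO-8859-15 adds relative to ISO-8859-1

def in_latin1(ch: str) -> bool:
    """True when ch can be encoded in plain ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

for ch in ADDED:
    iso = ch.encode("iso-8859-15")[0]
    win = ch.encode("cp1252")[0]
    print(f"{ch}: ISO-8859-15 0x{iso:02X}, Windows-1252 0x{win:02X}, "
          f"in ISO-8859-1: {in_latin1(ch)}")
```

A file saved in one of these two encodings and read as the other will therefore show the wrong symbol wherever one of these eight characters occurs.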

The Second Standard for Central and Eastern Europe, Latin-10 (ISO-8859-16)

This is the second attempt at providing for Central and Eastern Europe (chiefly its south-eastern part), with particularly better support of Hungarian and Romanian.

All the characters found in charset=iso-8859-16:
A, a, À, à, Á, á, Â, â, Ä, ä, Ą(¡), ą(¢), Ă(Ã), ă(ã)
B, b
C, c, Ç, ç, Č(²), č(¹), Ć(Å), ć(å)
D, d, Đ(Ð), đ(ð)
E, e, È, è, É, é, Ê, ê, Ë, ë, Ę(Ý), ę(ý)
F, f
G, g
H, h
I, i, Ì, ì, Í, í, Î, î, Ï, ï
J, j
K, k
L, l, Ł(£), ł(³)
M, m
N, n, Ń(Ñ), ń(ñ)
O, o, Ò, ò, Ó, ó, Ô, ô, Ö, ö, Ő(Õ), ő(õ)
P, p
Q, q
R, r
S, s, Š(¦), š(¨), Ș(ª), ș(º), Ś(×), ś(÷), ß
T, t, Ț(Þ), ț(þ)
U, u, Ù, ù, Ú, ú, Û, û, Ü, ü, Ű(Ø), ű(ø)
V, v
W, w
X, x
Y, y, Ÿ(¾), ÿ
Z, z, Ź(¬), ź(®), Ż(¯), ż(¿), Ž(´), ž(¸)
Æ, æ, Œ(¼), œ(½)
«, », „(¥), ”(µ)
$, €(¤)

A sample of Romanian and Hungarian using charset=iso-8859-16 (UTF-8):

Vă rog citiți această veste!

Se pare a nu fi posibil nici pentru un timp scurt să te bucuri de a trăi fericit pe pămînt. Boala, bătrînețea foamea, criminalitatea, nesiguranța și subjugarea contribuie de multe ori la o viața nefericită.

Kérem olvassa el ezt a hírt:

The Most Extreme Single-byte Latin Extension, VISCII

The most extreme example of using the upper 128 possible byte values to add to the basic ASCII character set is the one used for the Vietnamese language. Vietnamese has fully 67 different letter-and-diacritical-mark combinations, each in upper and lower case, in addition to the regular ASCII characters. This totals 134 additional character glyphs that the language needs for computer text. A Vietnamese counterpart to the ANSI association that had produced the original ASCII produced a special variant of ASCII for Vietnam called VISCII. VISCII followed the above illustrated tradition perfectly by using one byte for each character, no matter what. To do this it not only used up all upper 128 byte values, but even allocated the six least-used ASCII control codes to the six remaining (and least used) characters, just so that every one of them has a single-byte representation.

A sample of Vietnamese using charset=VISCII (UTF-8):

Xin mời quý bạn đọc trang này:

Hiện nay dường như không thể sống trong hạnh phúc trên trái đất này, dù chỉ là trong khoảng thời-gian ngắn. Bệnh-tật, già yếu, đói kém, tội ác, sự bất-an và áp-bức thường làm cho đời sống đầy khổ-sở.

Microsoft's equivalent to VISCII, Windows-1258

Microsoft introduced an alternate means of supporting all the Vietnamese characters by using combining diacritic characters for most of them, thus compressing everything into a much smaller code space, and one which is presentable here in this UTF-8 file (no illegal characters).

All the characters found in the Windows-1258 character set:
A, a, À, à, Á, á, Â, â, Ã(AÞ), ã(aÞ), Ả(AÒ), ả(aÒ), Ă(Ã), ă(ã), Ạ(Aò), ạ(aò),
Ầ(ÂÌ), ầ(âÌ), Ấ(Âì), ấ(âì), Ẫ(ÂÞ), ẫ(âÞ), Ẩ(ÂÒ), ẩ(âÒ), Ậ(Âò), ậ(âò),
Ằ(ÃÌ), ằ(ãÌ), Ắ(Ãì), ắ(ãì), Ẵ(ÃÞ), ẵ(ãÞ), Ẳ(ÃÒ), ẳ(ãÒ), Ặ(Ãò), ặ(ãò)
B, b
C, c
D, d, Đ(Ð), đ(ð)
E, e, È, è, É, é, Ê, ê, Ẽ(EÞ), ẽ(eÞ), Ẻ(EÒ), ẻ(eÒ), Ẹ(Eò), ẹ(eò),
Ề(ÊÌ), ề(êÌ), Ế(Êì), ế(êì), Ễ(ÊÞ), ễ(êÞ), Ể(ÊÒ), ể(êÒ), Ệ(Êò), ệ(êò)
F, f
G, g
H, h
I, i, Ì(IÌ), ì(iÌ), Í, í, Ĩ(IÞ), ĩ(iÞ), Ỉ(IÒ), ỉ(iÒ), Ị(Iò), ị(iò)
J, j
K, k
L, l
M, m
N, n
O, o, Ò(OÌ), ò(oÌ), Ó, ó, Ô, ô, Õ(OÞ), õ(oÞ), Ỏ(OÒ), ỏ(oÒ), Ơ(Õ), ơ(õ), Ọ(Oò), ọ(oò),
Ồ(ÔÌ), ồ(ôÌ), Ố(Ôì), ố(ôì), Ỗ(ÔÞ), ỗ(ôÞ), Ổ(ÔÒ), ổ(ôÒ), Ộ(Ôò), ộ(ôò),
Ờ(ÕÌ), ờ(õÌ), Ớ(Õì), ớ(õì), Ỡ(ÕÞ), ỡ(õÞ), Ở(ÕÒ), ở(õÒ), Ợ(Õò), ợ(õò)
P, p
Q, q
R, r
S, s
T, t
U, u, Ù, ù, Ú, ú, Ũ(UÞ), ũ(uÞ), Ủ(UÒ), ủ(uÒ), Ư(Ý), ư(ý), Ụ(Uò), ụ(uò),
Ừ(ÝÌ), ừ(ýÌ), Ứ(Ýì), ứ(ýì), Ữ(ÝÞ), ữ(ýÞ), Ử(ÝÒ), ử(ýÒ), Ự(Ýò), ự(ýò)
V, v
W, w
X, x
Y, y, Ỳ(YÌ), ỳ(yÌ), Ý(Yì), ý(yì), Ỹ(YÞ), ỹ(yÞ), Ỷ(YÒ), ỷ(yÒ), Ỵ(Yò), ỵ(yò)
Z, z
$
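The two-character pairs in the list above are a base letter followed by a combining tone mark. As a sketch of how such a pair turns back into one visible letter, the Python below decodes Windows-1258 bytes and then applies Unicode composition (NFC); decode_cp1258 is a hypothetical helper name, and the composition step is my own addition, not part of the code page itself.

```python
import unicodedata

def decode_cp1258(raw: bytes) -> str:
    """Decode Windows-1258 bytes, then compose base letters with tone marks."""
    return unicodedata.normalize("NFC", raw.decode("cp1258"))

# 0xEA is the letter e with circumflex; 0xF2 is the combining dot below.
pair = b"\xea\xf2"
print(len(pair.decode("cp1258")), "code points before composition")
print(decode_cp1258(pair), "after composition")
```

Without the composition step the text is still correct, but it holds two code points where VISCII held a single byte.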

A sample of Vietnamese using charset=Windows-1258 (UTF-8):

Xin mời quý bạn đọc trang này:

An All-Different Language Character Set in the Extension

Additions and supplements to the basic ASCII character set are fine for a great many languages all based on the Latin characters, but there are other languages that use altogether different character sets. For these languages, a different solution was sought, namely using the upper values for hosting the complete alphabet of the language. This tends to make for a much less universal encoding, since it supports only (or primarily) the script used for the extension; furthermore, in many cases the fullest potential of such an approach was not utilized. In several instances, related languages that could have been supported by the alternate set, with the addition of a few letters for which the standard could easily have spared unassigned code points, were ignored. For example, such a set was introduced for Arabic, but this set is inadequate for Persian (Farsi), Urdu, and several other lesser languages with an "Arabic-like" script, since these languages use several other letters not contained in basic Modern Standard Arabic.

Dual-Language Varieties of ISO 8859

One other aspect of this approach is that even where the alternate language might share a few letters with English, redundant letters are provided in their sequence among the foreign letters. So, unlike the above examples, which blended ASCII letters with the various kinds of extended letters, here the only blending done is with numbers and punctuation, which might still (in some cases) be borrowed from the ASCII range. Still, one might wonder why the regular ASCII letters were not stirred in where they could have fit in smoothly. One reason is that some letters that look the same are actually different letters. For example, the "Ρ" in Greek and the "Р" in Russian both look like the "P" in English but are actually the letter "R." But even where the letter is an exact equivalent to the ASCII letter it looks like, the duplication still makes sense: alphabetical sorting engines for these languages then need no special exceptions for letters that are out of sequence, and one might want a particular font type to apply only to the letters of the foreign language, but not to the regular ASCII letters.
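To see that the three look-alike letters really are separate characters, one can ask Python for their names and for the byte each takes in its own charset. This is only an illustrative sketch; the list LOOKALIKES is my own choice of examples.

```python
import unicodedata

# (character, charset it belongs to); escapes are used so the look-alikes
# cannot be confused in the source code itself.
LOOKALIKES = [
    ("P", "ascii"),
    ("\u03a1", "iso-8859-7"),   # Greek capital rho
    ("\u0420", "iso-8859-5"),   # Cyrillic capital er
]

for ch, charset in LOOKALIKES:
    byte = ch.encode(charset)[0]
    print(f"{ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}  "
          f"byte 0x{byte:02X} in {charset}")
```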

The Standard for Russian and Similar Languages, Latin/Cyrillic (ISO-8859-5)

This is one of the best done dual language ISO 8859 variants since it simply supports all of Cyrillic, which means it is good for all of Bulgarian, Macedonian, Russian, Serbian, Byelorussian, and Ukrainian, plus those cases where other languages (such as Kurdish) may in certain locales use the Cyrillic script.

A sample of Russian and Ukrainian using charset=iso-8859-5 (UTF-8):

Прочтите, пожалуйста, эту информацию:

Кажется, что даже на короткое время не возможно жить в счастье на земле. Многим жизнь делается трудной из-за болезней, старческого недомогания, голода, преступлений, ненадежности и угнетения.

Прошу прочитати цю вістку:

Здається що на землі неможливо втішатись щастям навіть на короткий час. Через хвороби старіння, голодування, злочин, ненадійність і пригноблення життя часто стає нещасним.

Microsoft's equivalent to ISO-8859-5, Windows-1251

Despite how well ISO-8859-5 covers the Cyrillic languages, the proprietary standard Windows-1251 actually finds much more use internationally than the ISO standard. Windows-1251 arranges the letters in a substantially different manner, incompatible with ISO-8859-5.
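The incompatibility is easy to demonstrate with Python's built-in codecs for both encodings. In this sketch the sample word is my own choice; it is encoded as Windows-1251 and then deliberately misread as ISO-8859-5.

```python
word = "Русский"

as_1251 = word.encode("cp1251")
as_8859_5 = word.encode("iso-8859-5")
print("Windows-1251:", as_1251.hex(" "))
print("ISO-8859-5:  ", as_8859_5.hex(" "))

# Reading Windows-1251 bytes with the wrong label gives other Cyrillic
# letters (or symbols), not an error, so the mistake is easy to miss.
misread = as_1251.decode("iso-8859-5")
print("misread as ISO-8859-5:", misread)
```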

The Standard for Modern Standard Arabic, Latin/Arabic (ISO-8859-6)

Modern Arabic is perfectly supported by this variant of ISO-8859, but its fullest potential was clearly not utilized. In particular, with only a few additional letters, several other Arabic-like languages could also have been supported. There certainly are enough unused code points to add all of (for example) پ چ ژ گ ک (useful for Persian), and even ں ھ ہ ے ٹ ڈ and ڑ as well (useful for Urdu, in addition to some of the additional Persian letters), and probably others as well, or also such things as the Arabic numbers, ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ and ٩. (Funny how we call our familiar 0 1 2 3 4 5 6 7 8 and 9 "Arabic" numbers when in Arabia itself they don't use these, though as can be seen the ones and nines do look similar.)

And here is a chance to demonstrate the one new HTML tag/element introduced in HTML 4.0 and 4.01 to accommodate the different languages, along with the DIR attribute. Like LANG, DIR can be used as an attribute to almost any HTML tag. It sets the direction, taking a value of either RTL ("Right-to-Left") or LTR ("Left-to-Right") to specify which direction the enclosed text is meant to take. In some older browsers, if recognized at all, it may force the display of all characters in the specified direction. In newer browsers it applies to everything except the letters of the language itself, e. g. punctuation, numbers, spaced lists, and so forth; the letters of a word are not reversed, because their inherent order has a higher priority than the DIR attribute of every HTML tag except one. So letters in Arabic (or Hebrew or Syriac or any of a few other lesser-known Right-to-Left scripts) automatically display in reversed order: following the sequence of bytes in the file, each later letter appears to the left of the prior letter (or wraps around to the right edge for the next line) instead of following the more familiar Left-to-Right direction of most other scripts. The one exception is the <BDO> tag, which stands for Bi-Directional algorithm Override; it reverses even the letters within the words, overriding the natural direction in which the letters would otherwise be displayed.
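The reason letters keep their own direction while punctuation and spaces follow DIR is that every character carries a directional class of its own. As a rough illustration (this is the Unicode character database as exposed by Python, not anything a browser calls by this name), the classes can be printed like so; bidi_class and SAMPLES are hypothetical names of my own.

```python
import unicodedata

def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)

SAMPLES = {
    "Latin letter": "A",
    "Hebrew letter": "\u05d0",
    "Arabic letter": "\u0647",
    "digit": "1",
    "punctuation": "!",
    "space": " ",
}

for label, ch in SAMPLES.items():
    print(f"{label:14} U+{ord(ch):04X}  {bidi_class(ch)}")
```

Classes L, R, and AL are strong (left-to-right, right-to-left, and Arabic right-to-left). The neutral classes such as ON and WS take their direction from the surrounding context, which is where DIR comes in.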

See here with some simple examples. First I show two paragraphs with each of English and Arabic in their native directions (but I include the DIR attribute to enforce what would be the default for the whole file so that it may work fully consistently):

Next we show what happens when DIR is used on any other HTML element to try to reverse its normal direction, namely here the <P> tag:

And finally we show what happens when DIR is used on both <P> and <BDO> in the same direction, opposite to that of the language:

So, one can see here one of the complexities of Arabic as a language for display on computers. Not only does it go right to left and use different characters than the Latin-based languages (Hebrew does both of those things as well), but its letters are also always blended in a cursive manner. Furthermore, each letter takes on a different shape depending on whether it connects to the previous letter, the next, both, or neither, and a few do other strange things when they combine, as for example a Laam followed by an Alif. By the normal combining rules just described, the two letters together should look something like an English letter "U," but instead they combine to make a "لا." This is also why the word "هذه" looks the same even when the direction of its letters is reversed: the first and third (last) letters are actually the same letter, merely taking on different forms. One begins the word, connecting to the second letter but not to the last letter of the previous word, due to the space in between. The other ends the word, again not connecting to the first letter of the next word due to the space in between, and also not connecting to the middle letter, because in Arabic certain letters do not connect on one side (or the other, depending on which letter) even when used in the same word. In this listing, for the combining vowel points, I have used the letter Daal (Arabic letter "D" or "د") as their "seat"; the letter itself is of course not part of the vowel point, but merely needed for it to "combine" with.
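Unicode happens to list the individual contextual shapes as separate "presentation form" characters, which gives a convenient way to illustrate that the differently shaped letters are one and the same. The sketch below folds three such forms back to their base letters with compatibility normalization (NFKC). Ordinary Arabic text stores only the base letters and leaves the shaping to the display software, so this is an illustration of the idea and not how a browser does it.

```python
import unicodedata

FORMS = {
    "heh, initial form": "\ufeeb",
    "heh, isolated form": "\ufee9",
    "lam-alef ligature": "\ufefb",
}

for label, form in FORMS.items():
    base = unicodedata.normalize("NFKC", form)
    codes = " ".join(f"U+{ord(c):04X}" for c in base)
    print(f"{label:20} {form} -> {base} ({codes})")
```

Both shapes of the heh fold to the same single letter, while the ligature folds to the two letters Laam and Alif.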

A sample of Arabic using charset=iso-8859-6 (UTF-8):

يرجى قراءة هذه الرسالة:

لا يبدو التمتع بالسعادة على الارض مكنا حتى لوقت قصير. فالمرض والشيخوخة والجوع والجريمة والخطر والظلم كثيرا ما يجعل الحياة شقية.

أَبْجَدْ هَوَّزْ حُطِّيَ كَلَمُنْ سَعْفَصِ قُرِشَتْ ثَخَذٌ ضَظَغٌ.

أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ.

بَلْ تُحِبٌ قَرِيبَكَ كَنَفْسِكَ ـ لاَوِيِّين.

Microsoft's equivalent to ISO-8859-6, Windows-1256

Microsoft remedied some of the deficiencies of ISO-8859-6 by introducing their own Windows-1256, which introduces the above listed letters needed for Persian and Urdu, though still no numbers. Some of the letters are in the same places, but others, and all the vowel points, are assigned different code points, so despite some commonality the two encodings are not compatible. In particular, Windows-1256 preserved several of the ISO-8859-1 code points, particularly those needed for French (since many Arabic-speaking countries also speak French), which in ISO-8859-6 are either inactive (unassigned) code points, or else assigned Arabic letters or vowel points.
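A quick way to compare the coverage of the two encodings is to try encoding a few characters in each. This is a minimal sketch with Python's built-in codecs; can_encode is a hypothetical helper and the sample characters are my own selection.

```python
def can_encode(text: str, charset: str) -> bool:
    """True when every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

SAMPLES = {
    "Arabic beh": "\u0628",
    "Persian peh": "\u067e",
    "Urdu tteh": "\u0679",
    "French e-acute": "\u00e9",
    "Arabic-Indic digit one": "\u0661",
}

for label, ch in SAMPLES.items():
    print(f"{label:24} ISO-8859-6: {can_encode(ch, 'iso-8859-6')!s:5} "
          f"Windows-1256: {can_encode(ch, 'cp1256')}")
```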

A sample of Arabic, Persian, Urdu, and French using charset=Windows-1256 (UTF-8):

يرجى قراءة هذه الرسالة:

أَبْجَدْ هَوَّزْ حُطِّيَ كَلَمُنْ سَعْفَصِ قُرِشَتْ ثَخَذٌ ضَظَغٌ.

أبجد هوز حطي كلمن سعفص قرشت ثخذ ضظغ.

بَلْ تُحِبٌ قَرِيبَكَ كَنَفْسِكَ ـ لاَوِيِّين.

دوره تاستان که از آغاز تا انقراض شاهنشاهى هخامنشى، باستان قرن بيستم تا حدود چهارم و سوم پيش از ميلاد را دربرمىگيرد.

اردو انڈو آريائى زبانوں کى انڈو ايرانى شاخ کى ايک زبان ہے جس کا تعلق انڈويورپى زبانوں سے ہے.

Veuillez lire ce message:

Il semble impossible d'être heureux sur la terre, même peu de temps. La maladie, la vieillesse, la faim, la criminalité, l'insécurité et l'oppression rendent la vie pénible.

The Standard for Contemporary Greek, Latin/Greek (ISO-8859-7)

All the characters found in charset=iso-8859-7:
Α(Á), α(á), Ά(¶), ά(Ü)
Β(Â), β(â)
Γ(Ã), γ(ã)
Δ(Ä), δ(ä)
Ε(Å), ε(å), Έ(¸), έ(Ý)
Ζ(Æ), ζ(æ)
Η(Ç), η(ç), Ή(¹), ή(Þ)
Θ(È), θ(è)
Ι(É), ι(é), Ί(º), ί(ß), ΐ(À), Ϊ(Ú), ϊ(ú)
Κ(Ê), κ(ê)
Λ(Ë), λ(ë)
Μ(Ì), μ(ì)
Ν(Í), ν(í)
Ξ(Î), ξ(î)
Ο(Ï), ο(ï), Ό(¼), ό(ü)
Π(Ð), π(ð)
Ρ(Ñ), ρ(ñ)
Σ(Ó), σ(ó), ς(ò)
Τ(Ô), τ(ô)
Υ(Õ), υ(õ), Ύ(¾), ύ(ý), ΰ(à), Ϋ(Û), ϋ(û)
Φ(Ö), φ(ö)
Χ(×), χ(÷)
Ψ(Ø), ψ(ø)
Ω(Ù), ω(ù), Ώ(¿), ώ(þ)
«, », ΅(µ)
$, €(¤), ₯(¥)

A sample of Greek using charset=iso-8859-7 (UTF-8):

Παρακαλω Διαβάστε τό Άγγελμα Αυτό:

Φαίνεται απίθανο νά καταφέρει κανείς νά απολαύσει τήν ευτυχία πάνω στή γη, έστω καί γιά σύντομο χρονικό διάστημα. Ή αρρώστια, τά γηρατειά, η πείνα, τό έγκλημα, η ανασφάλεια καί η καταπίεση κάνουν συχνά τή ζωή θλιβερή.

«Ούτως γαρ ηγάπησεν ο Θεός τον κόσμο ώστε τον Υιόν αυτού τον μονογενή έδωκεν ινα πας ο πιστευων εις αυτον μη αποληται αλλ εχη ζωην αιωνιον.» (Ιωάννην γ΄ 16)

και αγαπήσεις τον πλησίον σου ως εαυτόν - Λευιτικον.

Microsoft's equivalent to ISO-8859-7, Windows-1253

The Standard for Hebrew, Latin/Hebrew (ISO-8859-8)

All the characters found in charset=iso-8859-8:
א אלוף Alef (à)
ב בית Bet (á)
ג גימל Gimel (â)
ד דלת Dalet (ã)
ה הא He (ä)
ו ויו Vav (å)
ז זין Zayin (æ)
ח חית Khet (ç)
ט טית Tet (è)
י יוד Yod (é)
כ כף Kaf (ë)
ך כף ספת Kaf Sophit (final) (ê)
ל למד Lamed (ì)
מ מם Mem (î), µ
ם מם ספת Mem Sophit (final) (í)
נ נון Nun (ð)
ן נון ספת Nun Sophit (final) (ï)
ס סמך Samekh (ñ)
ע עין Ghayin (ò)
פ פא Pe (ô)
ף פא ספת Pe Sophit (final) (ó)
צ צדי Tsadi (ö)
ץ צדי ספת Tsadi Sophit (final) (õ)
ק קוף Quf (÷)
ר ריש Resh (ø)
ש שין Shin (ù)
ת תיו Tav (ú)
‗ (underline) (ß)
<LRM> (Left-to-right-mark) (ý)
<RLM> (Right-to-left-mark) (þ)
$, ¢, £, ¤, ¥

A sample of Hebrew using charset=iso-8859-8 (UTF-8):

קרא נא את השורות הבאות:

נראה שאין זה בהישג-ידנו ליהנות מאושר צל הארץ, אפילו לזמן קצר. חולי, זיקנה, רצב, פשצ, אי-ביטחון ודיכוי מאמללים לצתים קרובות את החיים.

ואהבת לרעך כמוך ‗ ויקרא.

Microsoft's equivalent to ISO-8859-8, Windows-1255

Microsoft remedied one deficiency of ISO-8859-8 by introducing their own Windows-1255 which introduces the Hebrew vowel points. Unlike some of the other Windows code pages, Windows-1255 is nearly totally compatible with ISO-8859-8, except that it adds the vowel points (to code points unassigned in the ISO standard), also adds the very useful high dash, but deletes the Hebrew underline character.
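The near compatibility can be checked directly: plain Hebrew letters produce identical bytes in both encodings, while a vowel point exists only in Windows-1255. The sketch below uses Python's built-in codecs, with sample strings of my own choosing.

```python
letters = "\u05e9\u05dc\u05d5\u05dd"   # shin, lamed, vav, final mem
pointed = "\u05d5\u05b0"               # vav followed by the vowel point sheva

print("letters, ISO-8859-8:  ", letters.encode("iso-8859-8").hex(" "))
print("letters, Windows-1255:", letters.encode("cp1255").hex(" "))
print("pointed, Windows-1255:", pointed.encode("cp1255").hex(" "))

try:
    pointed.encode("iso-8859-8")
except UnicodeEncodeError as err:
    print("ISO-8859-8 cannot encode the vowel point:", err.reason)
```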

A sample of Hebrew using charset=Windows-1255 (UTF-8):

קרא נא את השורות הבאות:

נראה שאין זה בהישג־ידנו ליהנות מאושר צל הארץ, אפילו לזמן קצר. חולי, זיקנה, רצב, פשצ, אי־ביטחון ודיכוי מאמללים לצתים קרובות את החיים.

וְאָֽהַבְתָּ לְרֵֽעֲךָ כָּמוֹךָ ‗ ויקרא.

ואהבת לרעך כמוך ‗ ויקרא.

The Standard for Thai, Latin/Thai (ISO-8859-11)

All the characters found in charset=iso-8859-11:
ก(¡), ข(¢), ฃ(£), ค(¤)
ฅ(¥), ฆ(¦), ง(§), จ(¨)
ฉ(©), ช(ª), ซ(«), ฌ(¬)
ญ[*], ฎ(®), ฏ(¯), ฐ(°)
ฑ(±), ฒ(²), ณ(³), ด(´)
ต(µ), ถ(¶), ท(·), ธ(¸)
น(¹), บ(º), ป(»), ผ(¼)
ฝ(½), พ(¾), ฟ(¿), ภ(À)
ม(Á), ย(Â), ร(Ã), ฤ(Ä)
ล(Å), ฦ(Æ), ว(Ç), ศ(È)
ษ(É), ส(Ê), ห(Ë), ฬ(Ì)
อ(Í), ฮ(Î), ฯ(Ï), ะ(Ð)
ก ั (combining) (Ñ)
า(Ò), ำ(Ó)
ก ิ (combining) (Ô)
ก ี (combining) (Õ)
ก ึ (combining) (Ö)
ก ื (combining) (×)
ก ุ (combining) (Ø)
ก ู (combining) (Ù)
ก ฺ (combining) (Ú)
เ(à), แ(á), โ(â), ใ(ã)
ไ(ä), ๅ(å), ๆ(æ)
ก ็ (combining) (ç)
ก ่ (combining) (è)
ก ้ (combining) (é)
ก ๊ (combining) (ê)
ก ๋ (combining) (ë)
ก ์ (combining) (ì)
ก ํ (combining) (í)
ก ๎ (combining) (î)
๚(ú), ๛(û)
$, ฿(ß)

Instructions for getting ญ character in ISO-8859-11 using MS notepad:
* - to get ญ: Open notepad and enter "íA" and save in UTF-8. Open file as ANSI and look for "Ã-A" string.
The hyphen in the middle can be cut and pasted into notepad (in the ANSI mode) to provide the ญ character.
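The reason this trick works is that the second byte of "í" in UTF-8 is 0xAD, which is an invisible soft hyphen in the Western "ANSI" code page but the letter ญ in ISO-8859-11. The Python sketch below retraces the steps, assuming that "ANSI" in Notepad means Windows-1252 (true on a Western-locale system, not necessarily elsewhere).

```python
import unicodedata

utf8_bytes = "íA".encode("utf-8")        # what gets saved as UTF-8
as_ansi = utf8_bytes.decode("cp1252")    # what shows when reopened as "ANSI"
middle = utf8_bytes[1:2]                 # the byte behind the "hyphen"

print("bytes saved:", utf8_bytes.hex(" "))
print("middle character as ANSI:", unicodedata.name(as_ansi[1]))
print("same byte as ISO-8859-11:", middle.decode("iso8859_11"))
```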

A sample of Thai using charset=iso-8859-11 (UTF-8):

โปรดอ่านเรื่องนี้:

คูเหมือนจะ เป็นไปไม่ไค้ที่จะมีความสุขบนแผ่นคิน โลกแม้กระทั่งในช่วง เวลาสั้นๆ. ความเจ็บป่วย วัยชรา ความหิว อาชญากรรม ความไม่ ปลอคภยและก ารกคขี่มักจะทำให้ชีวิฅลำเค็ญ.

A note on the Standard for Hindī and Similar Languages, Latin/Devanāgarī (ISO-8859-12)

This number was originally intended to be allocated to the Devanāgarī script, which looks something like this:

A sample of Hindi which could have used charset=iso-8859-12 (UTF-8):

कृपया इस संदेश को पढ़ें:

भारतीय भाषाओं के किसी भी शब्द या ध्वनि को देवनागरी लिपि में ज्यों का त्यों लिखा जा सकता है और फिर लिखे पाठ को लगभग 'हू-ब-हू' उच्चारण किया जा सकता है, जो कि रोमन लिपि और अन्य कई लिपियों में सम्भव नहीं है, जब तक कि उनका कोई ख़ास मानकीकरण न किया जाये, जैसे आइट्रांस या आइएएसटी।

In 1997, the ISO committee abandoned the creation of this standard, and so there is no actual iso-8859-12. This script could be used in several Indian languages, including Sanskrit, Hindi, Marathi, Sindhi, Bihari, Bhili, Marwari, Konkani, Bhojpuri, languages from Nepal like Nepali, Nepal Bhasa, Tharu and sometimes Kashmiri and Romani. A pity. But it is handled in Unicode (UTF-8, UTF-16, ...), as can be seen here.

A More Complex Dual Language Encoding, Shift_JIS

The above one-byte-per-character encodings are adequate for the above mentioned languages, and doubtless many others as well, though the rest do not have such well-known established standards by which they are specified. But this approach is insufficient for such languages as Chinese or Korean or Japanese that use tens of thousands of different characters. So there can be no ISO-8859 standard for any of these languages, and there is none. In the case of Japanese, however, one does get close to this with Shift_JIS, in that everything representable with such widely varied characters can also be represented with a special category of Japanese letter known as Katakana. In normal Japanese, the Katakana primarily serve as literal transliteration characters, useful for example for loan words from other languages, or to illustrate the sounds of the other letters (in teaching), or for sound effects (onomatopoeia). Some early attempts at encoding Japanese for computers took the approach of using only the Katakana characters, since there are few enough of them to represent all of them (and all of ASCII as well) in one byte. There is room enough, in fact, that Shift_JIS takes this approach and combines it with double-byte codes to cover the non-Katakana letters or symbols. While the lower range is used for regular ASCII characters (as is done in all the above examples), part of the upper range is used for the Katakana letters, and much of the rest serves as "first bytes" of two-byte codes for all the other letters covered by Shift_JIS. Shift_JIS also includes Greek and Cyrillic letters, numerous punctuation marks, symbols, and dingbats within its two-byte codespace. It even has regular "ASCII" type English letters, but in a special double-width form that can be used (intermixed) with the Japanese, where, for example, an English letter "A" would have the same length and size as a typical Japanese letter.
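The mix of one-byte and two-byte codes described above can be observed by encoding one character of each kind and counting the bytes. This sketch uses Python's built-in shift_jis codec; the sample characters are my own choice and are written as escapes to keep the source unambiguous.

```python
SAMPLES = {
    "ASCII letter": "A",
    "half-width Katakana": "\uff71",
    "full-width Katakana": "\u30a2",
    "Kanji": "\u6f22",
    "double-width Latin A": "\uff21",
    "Greek omega": "\u03a9",
    "Cyrillic ya": "\u042f",
}

for label, ch in SAMPLES.items():
    raw = ch.encode("shift_jis")
    print(f"{label:22} {ch}  {len(raw)} byte(s): {raw.hex(' ')}")
```

Only the ASCII letter and the half-width Katakana fit in one byte; everything else, including the double-width "A," takes two.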

Unlike in the above examples, it is hopelessly out of scope to show all the Japanese characters supported by Shift_JIS, so I will content myself here with showing the single-byte Katakana characters and a minuscule sampling of the sorts of letters possible using two-byte sequences.

The Katakana (and other single-byte) characters as found in the Shift_JIS character set:
｡(¡), ｢(¢), ｣(£), ､(¤),
･(¥), ｦ(¦), ｧ(§), ｨ(¨),
ｩ(©), ｪ(ª), ｫ(«), ｬ(¬),
ｭ[*], ｮ(®), ｯ(¯), ｰ(°),
ｱ(±), ｲ(²), ｳ(³), ｴ(´),
ｵ(µ), ｶ(¶), ｷ(·), ｸ(¸),
ｹ(¹), ｺ(º), ｻ(»), ｼ(¼),
ｽ(½), ｾ(¾), ｿ(¿), ﾀ(À),
ﾁ(Á), ﾂ(Â), ﾃ(Ã), ﾄ(Ä),
ﾅ(Å), ﾆ(Æ), ﾇ(Ç), ﾈ(È),
ﾉ(É), ﾊ(Ê), ﾋ(Ë), ﾌ(Ì),
ﾍ(Í), ﾎ(Î), ﾏ(Ï), ﾐ(Ð),
ﾑ(Ñ), ﾒ(Ò), ﾓ(Ó), ﾔ(Ô),
ﾕ(Õ), ﾖ(Ö), ﾗ(×), ﾘ(Ø),
ﾙ(Ù), ﾚ(Ú), ﾛ(Û), ﾜ(Ü),
ﾝ(Ý), ﾞ(Þ), ﾟ(ß),
¥(\), ‾(~)

Instructions for getting the ｭ character in Shift_JIS using MS Notepad:
* - to get ｭ: Open Notepad, enter "íA", and save as UTF-8. Open the file as ANSI and look for the "Ã-A" string.
The hyphen in the middle (a soft hyphen, which may be invisible in some fonts) can be cut and pasted into Notepad (in ANSI mode) to provide the ｭ character.
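The pairing in the table above (each Katakana character beside the Latin-1 character that shares its byte value) can be reproduced with a few lines of Python. This is only a sketch: it assumes Python's "shift_jis" and "latin-1" codecs, and it assumes that Notepad's "ANSI" mode means Windows-1252 ("cp1252"), which depends on the system locale.

```python
# Sketch: decode each upper-range single byte both as Shift_JIS and as Latin-1.
rows = []
for value in range(0xA1, 0xE0):  # bytes 0xA1 through 0xDF
    raw = bytes([value])
    rows.append((raw.decode("shift_jis"), raw.decode("latin-1")))

# Print four pairs per line, in the same layout as the table.
for start in range(0, len(rows), 4):
    print(", ".join(f"{kana}({latin})" for kana, latin in rows[start:start + 4]))

# The Notepad trick: "íA" saved as UTF-8 is the byte sequence C3 AD 41.
utf8_bytes = "íA".encode("utf-8")
as_ansi = utf8_bytes.decode("cp1252")     # assumption: ANSI means Windows-1252
as_sjis = utf8_bytes.decode("shift_jis")  # the same bytes read as Shift_JIS
print(utf8_bytes.hex(" "), repr(as_ansi), as_sjis)
```

Byte 0xAD is the soft hyphen in Latin-1 and Windows-1252, which is why it is hard to show directly in the table, and the same byte is ｭ in Shift_JIS.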

The Final and Comprehensive Solution: Unicode

As one can see, the above techniques, especially the ISO-8859 standard, but also the Microsoft code pages (and indeed many other lesser-known codes, mostly national or linguistic in nature), provide for quite a range of languages, and yet there are still some serious deficiencies in them. The first and foremost deficiency is their inability to cover those languages with a large character set, such as Chinese, Japanese, or Korean, unless they do so in non-standard ways, as for example the Shift_JIS shown above, which has to go to two bytes per character for nearly all Japanese characters but provides no coverage of Chinese or Korean. Considering that Chinese (Mandarin) is the largest single language bloc in the world, and that the others (Cantonese, Korean, and Japanese) add many more speakers, that leaves well over a billion people unable to use the simple one-byte codes shown above.

Another problem is the mixture of different languages in a single file. As long as one writes in only one language, it is enough to find some encoding that supports that language and one is off and running. But what do you do when you want to combine different languages, for example for a bilingual presentation (though for that one could use separate web pages for the different languages), or for some scholarly work written in one language but discussing in detail (and quoting) a document written in another language? Depending upon what combination of languages you intend to use, you may or may not get lucky. Obviously, since English uses only the standard ASCII letters, it can be combined with any of the above languages without a hitch, as for example in a Greek-English Interlinear Bible (illustrated here), and German is supported by all of the numbered "Latin-N" character sets in ISO-8859, but support for other combinations is often hit or miss. One could easily enough combine Maltese and Turkish text in one file using ISO-8859-3, but how would one combine Maltese and Polish, or Maltese and Lithuanian, when those languages cannot use ISO-8859-3 but only ISO-8859-2 (in the case of Polish) or ISO-8859-4 (in the case of Lithuanian)? However, one can combine Polish and Lithuanian in one file (but not with Maltese) by using ISO-8859-13.
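Whether a given combination of languages "gets lucky" can be tested directly. The following Python sketch assumes Python's built-in ISO-8859 codecs; the helper can_encode and the four sample words are illustrative choices, not anything taken from the encodings' specifications.

```python
# Sketch: test which encodings can hold a given mixture of languages.
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative sample words, one per language.
MALTESE = "Għawdex"
TURKISH = "ağaç"
POLISH = "Łódź"
LITHUANIAN = "ačiū"

ENCODINGS = ["iso8859_2", "iso8859_3", "iso8859_4", "iso8859_13", "utf_8"]

mixtures = {
    "Maltese + Turkish": MALTESE + TURKISH,
    "Maltese + Polish": MALTESE + POLISH,
    "Maltese + Lithuanian": MALTESE + LITHUANIAN,
    "Polish + Lithuanian": POLISH + LITHUANIAN,
}

for label, text in mixtures.items():
    usable = [enc for enc in ENCODINGS if can_encode(text, enc)]
    print(f"{label:22} {usable}")
```

For these sample words, every mixture that includes Maltese and Polish or Lithuanian is left with UTF-8 as the only usable entry in the list.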

The solution to all this is Unicode, the all-encompassing character set for every character of every known language. The above encodings, with the exception of the two-character sequences of Windows-1258 for many (but not all) Vietnamese characters and the non-Katakana characters of Shift_JIS, use one byte of data in the file for each character (letter). But as a byte has only 256 possible values, any alphabet with 257 letters or more absolutely cannot be represented in such an encoding. Unicode abandons that constraint, allowing a single character to occupy two or more bytes.
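How many bytes a single character takes depends on which variety of Unicode is used. The Python sketch below compares UTF-8 and big-endian UTF-16 for a few sample characters; the samples and the name encoded_lengths are illustrative only.

```python
# Sketch: bytes per character in two varieties of Unicode.
samples = {
    "Latin A": "A",
    "Latin e with acute": "é",
    "Hebrew alef": "א",
    "Chinese character": "中",
    "rare CJK ideograph (U+20000)": "\U00020000",
}


def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


lengths = {label: encoded_lengths(ch) for label, ch in samples.items()}

for label, (utf8_len, utf16_len) in lengths.items():
    print(f"{label:30} UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

UTF-8 keeps ASCII at one byte and grows to as many as four, while UTF-16 uses two bytes for most characters and four for the rarest.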

The one obvious positive aspect of this is the ability to handle all languages used anywhere in the world. So in Unicode, all the tens of thousands of characters needed for each of Chinese, Japanese, and Korean can at last be represented, with each letter assigned a unique code point. Furthermore, it becomes possible for any and all languages to be blended together in one file. This file is itself written in Unicode so that the many different languages illustrated could all be sampled within it. Indeed, I see no way to improve upon Unicode's flexibility and coverage of all languages, so one can expect it to last as long as there are computers and electronic text.

But does that mean that all the above encodings are now obsolete, useless as anything but historical curiosities? While that may well happen (if it has not already) in the case of the proprietary encoding schemes, such as the Code Page 437 and Macintosh encodings illustrated above, I think there still exists one real benefit to using the above encodings where possible, and here is an example. Take the Russian file referenced here, which is (as stored on the server) exactly 18,007 bytes long. In the Windows-1251 encoding each character of any kind in the file is always exactly one byte. As the file consists mostly of Russian with a few small bits of English (including the HTML tags therein), both languages are easily and properly accommodated within the limited code space of Windows-1251. In Unicode the same exact file (using the UTF-8 variety of Unicode), character for character, takes up precisely 30,973 bytes, about 1.7 times as large! And if the UTF-16 variety of Unicode is used, this same file (we use here the internet default "big-endian" byte ordering scheme) takes up 36,124 bytes, just over twice the original size! It doesn't take a lot of imagination to realize what that would do to such things as file download time, and for that matter, what it would do to the overall bandwidth load on the internet, were all text files to be suddenly converted to Unicode tomorrow.
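The size difference is easy to reproduce on a small scale. The Python sketch below uses a short made-up fragment of Russian HTML, not the 18,007-byte file discussed above, and Python's "cp1251" codec for Windows-1251, so the exact ratios differ from the figures quoted.

```python
# Sketch: size of the same text in Windows-1251, UTF-8, and UTF-16.
# The fragment is a made-up stand-in, not the file discussed in the text.
text = "<p>Привет, мир!</p>"

sizes = {enc: len(text.encode(enc)) for enc in ("cp1251", "utf-8", "utf-16-be")}

base = sizes["cp1251"]
for enc, size in sizes.items():
    print(f"{enc:10} {size:3} bytes  ({size / base:.2f} times Windows-1251)")
```

In UTF-8 every Cyrillic letter doubles to two bytes while the ASCII tags and punctuation stay at one, so a file that is mostly Russian approaches twice its Windows-1251 size; in UTF-16 every character here takes two bytes.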

А(°), а(Ð)	Ц(Æ), ц(æ)
Б(±), б(Ñ)	Ч(Ç), ч(ç)
В(²), в(Ò)	Ш(È), ш(è)
Г(³), г(Ó), Ѓ(£), ѓ(ó)	Щ(É), щ(é)
Д(´), д(Ô)	Ъ(Ê), ъ(ê)
Е(µ), е(Õ), Ё(¡), ё(ñ)	Ы(Ë), ы(ë)
Ж(¶), ж(Ö)	Ь(Ì), ь(ì)
З(·), з(×)	Э(Í), э(í)
И(¸), и(Ø), Й(¹), й(Ù)	Ю(Î), ю(î)
К(º), к(Ú), Ќ(¬), ќ(ü)	Я(Ï), я(ï)
Л(»), л(Û)	Ђ(¢), ђ(ò)
М(¼), м(Ü)	Є(¤), є(ô)
Н(½), н(Ý)	Ѕ(¥), ѕ(õ)
О(¾), о(Þ)	І(¦), і(ö), Ї(§), ї(÷)
П(¿), п(ß)	Ј(¨), ј(ø)
Р(À), р(à)	Љ(©), љ(ù)
С(Á), с(á)	Њ(ª), њ(ú)
Т(Â), т(â)	Ћ(«), ћ(û)
У(Ã), у(ã), Ў(®), ў(þ)	Џ(¯), џ(ÿ)
Ф(Ä), ф(ä)	$
Х(Å), х(å)

ا الف Alif (Ç)	ه هاء Haa (ç)
ب باء Baa (È)	و واو Waaw (è)
ت تاء Taa (Ê)	ي ياء Yaa (ê)
ث ثاء Thaa (Ë)	ى الف مقصورة Alif Maqsuura (é)
ج جيم Jiim (Ì)	ة تاء مربوطة Taa Marbuuta (É)
ح حاء Haa (Í)	ء همزة Hamza (Á)
خ خاء Xhaa (Î)	أ Hamza for beginning a or u (Ã)
د دال Daal (Ï)	إ Hamza for beginning i (Å)
ذ ذال Thaal (Ð)	ؤ Hamza seated on Waw (Ä)
ر راء Raa (Ñ)	ئ Hamza seated on Yaa (Æ)
ز زاي Zaay (Ò)	آ Madda (Â)
س سين Siin (Ó)	دَ (combining vowel a) (î)
ش شين Shiin (Ô)	دُ (combining vowel u) (ï)
ص صاد Saad (Õ)	دِ (combining vowel i) (ð)
ض ضاد Daad (Ö)	دْ (combining sukoon) (ò)
ط طاء Taa (×)	دّ (combining shadda) (ñ)
ظ ظاء Thaa (Ø)	دً (combining -an nunation) (ë)
ع عين Ayn (Ù)	دٌ (combining -un nunation) (ì)
غ غين Ghayn (Ú)	دٍ (combining -in nunation) (í)
ف فاء Faa (á)	ـ dash (à)
ق قاف Qaaf (â)	، comma (¬)
ك كاف Kaaf (ã)	؛ semicolon (»)
ل لام Laam (ä)	؟ question mark (¿)
م ميم Miim (å)	$, ¤
ن نون Nuun (æ)