
Unicode



Many files posted at sacred texts since the spring of 2002 have
embedded Unicode. Unicode is a character encoding standard which can represent
all major world scripts, and many obscure ones as well.
This solves a major problem for creators of etexts, as it is now
possible to fully transcribe texts in multiple languages without
requiring ASCII transliterations, special fonts or browsing software.
Unicode enabling also takes care of right-to-left scripts more-or-less
automatically.

All modern web browsers support Unicode, provided you have a
decent Unicode font installed
and designate that font as your default font.


That said, this is definitely still on the cutting edge,
and you may need to tweak your browser settings to get the
full character set.
And there are some features which are buggy in particular
browsers, although support seems to be getting better in newer
versions; having an up-to-date version of your
operating system also helps.


For instance, some browsers have a few problems displaying some
subscript and superscript characters
such as Hebrew vowel points (they get displayed to the left of where they
should be, with a space above them).
Some older versions of Internet Explorer do not display medial and final
forms when displaying Arabic (which makes it unusable for this purpose).
Firefox 3 on Windows XP with Code2000 doesn’t display the
entire Quran character set, particularly some of the more obscure characters.
IE8 on Windows XP with Code2000 renders all but three of the
archaic Quranic characters correctly.
We haven’t tested every browser/OS/font combination.
For this reason, we have also posted
a version of the Quran which uses
GIF images to display Arabic,
although this is an exception.
These problems may have been fixed in more recent versions of the browsers.


It appears that Firefox does not render Devanagari ‘i’ correctly:
it places it after the associated consonant, not before.
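As background to this problem: Unicode stores Devanagari text in logical (spoken) order, so the vowel sign ‘i’ follows its consonant in the file, and it is the renderer’s job to draw it to the left. A minimal Python sketch, using only the standard library, shows the stored order for the syllable ‘ki’; the helper name describe is hypothetical and chosen just for this illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]

# The syllable "ki": the consonant KA followed by the dependent vowel sign I.
syllable = "\u0915\u093f"

for code_point, name, category in describe(syllable):
    print(code_point, name, category)
```

The vowel sign comes second in the data even though it is drawn first, so a browser that paints the characters in stored order will show it on the wrong side of the consonant.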


IE and Safari do not display the correct presentation
forms for Unicode Cyrillic italics:
Safari does not even allow Cyrillic to be italicized, whereas
IE shows italicized forms of the base graphemes, which is incorrect.
Opera and Firefox display these presentation forms correctly.
Strangely enough, the italic Cyrillic presentation forms are displayed
correctly in MS Word 2003.


Problems viewing some polytonic Greek files
on the 5.0 CD-ROM under Mac OS X have been reported.
These have been fixed on the website and the 6.0 DVD-ROM, but not
on the 5.0 CD-ROM.


We welcome any comments or questions about the
visibility of Unicode on this site in various browsers, and we will
add advisories on this page.
Extensive Unicode resources can be found at
unicode.org [External Site].

Recommended Unicode Fonts


If you need a Unicode font, we recommend
the Code2000 shareware font [External Site].
This is a very extensive Windows font, and the one we use to test the site.



We also recommend the site https://www.alanwood.net/unicode/fonts.html, which lists dozens of Unicode fonts for a variety of platforms.


A Unicode font, Arial Unicode MS, comes with Windows XP.
It has some good points: it seems to have better
coverage of some of the more obscure Arabic characters than Code2000.
That said, Arial Unicode MS is not pretty, and if reading everything in
a sans serif font isn’t your cup of tea, you may want to look elsewhere.
Note that this font may not be installed on your XP system by default.
If you have XP and don’t see Arial Unicode MS as one of your available fonts,
you may need to dig out your Windows disk.
You also can buy it from Microsoft, but they charge an exorbitant $99 for it.
With so many free and inexpensive Unicode fonts, there is no reason to
pay that much!


There is also a page about
font issues regarding the Unicode Hebrew Bible at sacred-texts which includes a specialized redistributable font.

Enabling Unicode in Your Browser


The most common complaint is ‘I downloaded and installed Code2000 but
I still see little boxes in your files’.
This is because you also have to tell your browser that you want
to view Unicode content using that font.


First of all, we recommend that if you have an older
browser, you should obtain the most recent version.
If you are using AOL or another ISP which has a bundled browser,
you may wish to get the most recent
version of Internet Explorer or Netscape and use it for browsing
Unicode content; the bundled browsers are notoriously buggy,
particularly when it comes to cutting-edge features such as Unicode.

Here’s how to get Unicode working in Internet Explorer using Code2000.
The procedure is very similar for other browsers.



1. Download and Install the Unicode Font


First of all you need to download the font and install it.
For instance, if you are using Windows XP, you start the Control Panel
‘Fonts’ program, and then select ‘Install New Font’ from the ‘File’ menu.


2. Make the Unicode Font Your Default Web Page Font


Let’s assume you have downloaded
and installed the ‘Code2000’ font.
Start Internet Explorer and
go into ‘Tools | Internet Options’ and select the ‘Fonts’ dialog.


In the ‘Web Page Font’ list, Code2000 should show up in the scrolling listbox
if you downloaded and installed it correctly.
Select it.


Unless you do this, some Unicode characters (such as the accented Greek
characters and some Hebrew characters) may not show up.

I’m still seeing little boxes! What to do?


The most common problem is skipping step two in the previous
section.
If you don’t designate a full Unicode font as your default
‘Web Page Font’, you will still only have whatever minimal
Unicode support is built into your operating system.



Typically this will include some of the
simplest extended Latin accented characters,
as well as basic Greek and Hebrew
characters. However, you won’t be able to view
specialized accented Latin characters,
polytonic Greek, or pointed Hebrew.
You won’t be able to see any Arabic or Devanagari characters,
astrological symbols, and so on.
These will show up as the dreaded ‘boxes’
(or question marks in some browsers).


The web pages with heavy Unicode dependencies
at this site don’t have embedded font information
because that would greatly inflate their size;
and in the case of sections such as the Hebrew Bible
and Sanskrit/Transliterated Rig Veda,
that adds up to some serious extra baggage.
Therefore I leave it up to you to tell your browser which
font to use.
You can always switch it back easily if you aren’t reading specialized
Unicode content.

Manually Selecting Unicode Encoding


You may also need to manually select ‘Unicode (UTF-8)’ in certain browsers.
For instance, under Internet Explorer, you can select ‘View | Encoding’,
and ‘Unicode (UTF-8)’.
Under Netscape, this is ‘View | Character Coding’.



Technically, some of these pages don’t use the UTF-8 encoding scheme.
However this seems to be the only way to specify that you are viewing
Unicode content for some browsers.
I’ve started to add UTF-8 META tags to all files which have any amount of
Unicode.
This seems to have helped.
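If you want to check whether a saved copy of a page carries such a tag, a rough Python sketch follows. The exact META tag used on this site is not quoted here, so the sample markup is an assumption based on the common HTML 4 form, and the function name declared_charset is hypothetical. The pattern is a simple heuristic, not a full HTML parser.

```python
import re
from typing import Optional

# Looks for charset=... inside a <meta> tag (heuristic only).
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)

def declared_charset(raw_html: bytes) -> Optional[str]:
    """Return the charset named in a META tag near the top of the file, if any."""
    match = _META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode("ascii").lower() if match else None

# Assumed example markup, not copied from any actual page.
SAMPLE = (
    b"<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b"</head><body></body></html>"
)

print(declared_charset(SAMPLE))
```

A result of None means the page declares no charset in its first 2048 bytes, in which case the browser has to guess or be told manually.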

Unicode Implementation

Technically speaking, the Unicode characters are embedded in 8 bit
HTML using ‘character entities’, for instance:


&#2384; = ॐ

&#1488; = א

&#937; = Ω

If your browser is Unicode-enabled, you should see the Sanskrit letter for
‘Aum’ (see this image);
the Hebrew letter Aleph,
and a Greek capital Omega above.
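The decimal number inside each entity is simply the character’s Unicode code point: 2384 for the Aum sign, 1488 for Aleph and 937 for Omega. As a small illustration (not the tooling actually used to prepare these files), the Python standard library can convert in both directions; the function name to_numeric_entities is hypothetical.

```python
import html

def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Devanagari OM (U+0950), Hebrew ALEF (U+05D0), Greek capital OMEGA (U+03A9).
samples = "\u0950\u05d0\u03a9"

encoded = to_numeric_entities(samples)
print(encoded)                             # &#2384;&#1488;&#937;
print(html.unescape(encoded) == samples)   # True
```

Because the encoded form is pure ASCII, it survives in 8 bit HTML regardless of the page’s declared encoding.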


For disk space and bandwidth reasons, I’ve also started to use
the UTF-8 encoding scheme in the files which are predominantly Unicode,
such as the Greek and Hebrew portions of the Bible and the Rig Veda.
This is a variable-length encoding scheme (not compression as such) which represents
Unicode compactly.
Instead of the six or seven bytes per character that a numeric HTML entity typically requires,
UTF-8 requires one to three bytes for any character in the original 16 bit
Unicode range, and four bytes for characters beyond it.
Most modern browsers handle UTF-8 automatically, assuming you have
installed a complete Unicode font.
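The size difference is easy to verify. The following Python sketch compares the length of a decimal character entity with the length of the same character in UTF-8; the helper names entity_size and utf8_size are hypothetical and used only for this illustration.

```python
def entity_size(ch: str) -> int:
    """Bytes needed for the decimal entity form, e.g. '&#937;'."""
    return len(f"&#{ord(ch)};")

def utf8_size(ch: str) -> int:
    """Bytes needed for the same character encoded as UTF-8."""
    return len(ch.encode("utf-8"))

characters = {
    "Greek capital Omega": "\u03a9",
    "Hebrew Aleph": "\u05d0",
    "Devanagari Om": "\u0950",
}

for label, ch in characters.items():
    print(f"{label}: entity {entity_size(ch)} bytes, UTF-8 {utf8_size(ch)} bytes")
```

For these three characters the entity form takes 6, 7 and 7 bytes, against 2, 2 and 3 bytes in UTF-8, which adds up quickly in a file that is mostly Greek, Hebrew or Sanskrit.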

In some cases Unicode has been used to transcribe Latin
characters with accents outside the ISO-8859-1 HTML character set.
In other cases complete texts or extensive portions of the text are
in Unicode.
Among the Unicode character sets in use currently are Arabic, Chinese,
Extended Latin, Greek, Hebrew, Tibetan, Runic and Sanskrit.


Some of the Unicode-enabled files at sacred-texts include: