Protégé is used by people from around the world who speak many languages and use most of the world's character sets.
The majority of problems in dealing with character sets are handled transparently (thanks to Java's good support for this).
In general, things should "just work". However, there are some issues related to saving, loading, and exchanging files that
users should be aware of. These issues can be very confusing and this FAQ attempts to answer the most commonly asked questions
surrounding file encoding.
What's a file encoding?
An encoding is a way to store a character as a sequence of one or more numbers. Typically, the only numbers
available on most systems are in the range 0-255; each such number is referred to as a byte. The encoding specifies how to map all available
characters on a system into one or more bytes. If the number of characters in use on the system is 256 or fewer (as it is on most
US and Western European systems), then creating such a mapping is not very difficult and is typically done with one byte per
character. If the number of characters available for use on a system is in the thousands or tens of thousands (as on some Asian systems),
then this mapping is more challenging, especially if you desire some sort of compatibility with systems in use elsewhere.
A number of different encoding schemes have been developed and are in use. There are multiple encodings for US English by itself;
two common ones are ASCII and EBCDIC. There are also multiple encoding schemes for most other languages. The confusion that this
causes for applications (particularly multilingual applications) is enormous. An international effort was launched to come up with a
standard encoding to cover all languages. This standard is Unicode.
A file encoding is a specification of the way that a character is stored in a file as a sequence of one or more
bytes. This can be, but is not required to be, the same as the default encoding used on the system for applications. The Unicode
encoding specification describes four different possible (and incompatible) file encodings. Common file encodings are ASCII, Latin-1,
UTF-8, UTF-16, BIG-5, etc. Each of these encodings has advantages and disadvantages and was developed with a specific purpose in
mind. No one encoding scheme is "better" than the others in all contexts. The character "A" could thus be
represented as a single byte "65" in one file encoding (as it is in ASCII) and, hypothetically, as a sequence of three bytes
"132 145 234" in another file encoding.
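To make this concrete, here is a minimal Java sketch that prints the byte values used to store the same characters under a few standard encodings. The class and method names (EncodingBytesDemo, encode) are made up for illustration and are not part of Protégé.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingBytesDemo {

    /** Returns the unsigned byte values (0-255) used to store text in the given charset. */
    static int[] encode(String text, Charset charset) {
        byte[] raw = text.getBytes(charset);
        int[] unsigned = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            unsigned[i] = raw[i] & 0xFF; // Java bytes are signed; mask to get 0-255
        }
        return unsigned;
    }

    public static void main(String[] args) {
        String[] samples = {"A", "\u00e9"}; // "A" and "e" with an acute accent
        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1, // Latin-1
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (String sample : samples) {
            for (Charset charset : charsets) {
                // A character the charset cannot represent is replaced by the
                // charset's default replacement byte ('?' for US-ASCII).
                System.out.println(sample + " in " + charset.name() + ": "
                        + Arrays.toString(encode(sample, charset)));
            }
        }
    }
}
```

With these encodings, "A" is stored as the single byte 65 in ASCII, Latin-1, and UTF-8, but as the two bytes 0 65 in UTF-16BE. The accented character is one byte (233) in Latin-1 and two bytes (195 169) in UTF-8.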
File encodings are important because different systems use different default file encodings and it requires some effort to get
a system with one default file encoding to read a file written with another file encoding. It's fairly common outside of the US for
a machine with an English language version of MS Windows to be sitting right next to a machine with a non-English version of MS
Windows.
Doesn't Unicode handle all of this for you?
The short answer is no. Unicode is a character encoding that's more and more commonly used in applications. All Java applications
(and thus Protégé) use a Unicode encoding for internal purposes. File systems, however, typically support a local character encoding
by default, and Java applications use this default system file encoding for all file input and output unless told otherwise.
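The following Java sketch illustrates the point: it reports the default charset of the running JVM, then shows what happens when text written in one file encoding is read back in another. The class and method names (ExplicitCharsetDemo, roundTrip) are hypothetical. Note that JDK 18 and later default to UTF-8 regardless of the platform, so the first line of output depends on the Java version as well as the system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetDemo {

    /** Writes text to a temporary file in one charset and reads it back in another. */
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(writeAs));
                return new String(Files.readAllBytes(tmp), readAs);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The charset used when no explicit charset is passed to I/O classes.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String name = "Prot\u00e9g\u00e9";
        // Same encoding on both sides: the text survives intact.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(name));
        // Written as UTF-8 but read as Latin-1: each accented letter turns into two wrong characters.
        System.out.println(roundTrip(name, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(name));
    }
}
```

Passing an explicit charset on both the writing and the reading side, as roundTrip does, is what keeps a file portable between machines with different defaults.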
What's the difference between a file encoding and a font?
A font determines what a character looks like on the screen. A file encoding determines what a character looks like in a
file. The character "A" could appear as three straight lines in one font or as a combination of circles and wavy lines in
another (cursive) font. Underneath these visual differences, the same encoding (number) for "A" is carried around by the
system and written into a file according to the file encoding in use.
What file encoding does Protégé use by default?
Protégé uses the UTF-8 encoding. UTF-8 is one of the Unicode file encodings and the encoding that's
compatible with the most common US English language encoding (ASCII). All Unicode characters, and this means essentially all characters in any language, can be
stored in UTF-8 (or indeed in any of the other Unicode file encodings). UTF-8 also appears to be the most commonly used of the
Unicode encodings.
Why doesn't Protégé just use the default file encoding on my system?
It used to. As our user community grew, many users in different countries started trying to exchange Protégé projects and ran
into the problems mentioned above. Users in the same physical location but with different Windows installations had problems moving projects
around. Even users who were composing knowledge bases solely in English ran into this problem because the incompatibility is not at
the language level but at the encoding (and machine) level. A number of our users requested that we deal with this problem. Starting
with release 1.9, we made UTF-8 the default file encoding for Protégé. UTF-8 was selected because (a) it can encode
all Unicode characters, and (b) it's the most compatible with our previous files.
How can I make Protégé use another file encoding?
We suggest that you don't do this. It will make it quite difficult for others to use your projects (or for you to use your
projects on other machines). Nevertheless, it is possible. You need to set the Java system property
"-Dfile.encoding=<encoding_name>" on the Java command line. If you start Protégé by
double-clicking on the executable (or a PPRJ file), then you need to specify this property in the
Protege.lax file in your Protégé installation directory.
How can I translate files between different encodings?
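There's a program available in Sun's Java Development Kit called "native2ascii". In spite
of the strange name, this program can be used to convert a file in any supported encoding to any other. By default it
translates a file from the native file encoding to ASCII, writing each non-ASCII character as a Unicode escape ("\uXXXX"). The
"-encoding" option selects a different source encoding, and the "-reverse" option performs the opposite translation. Converting
between two arbitrary encodings therefore takes two passes, with the escaped ASCII file as the intermediate.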
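If you'd rather not rely on an external tool, the same kind of translation can be sketched in a few lines of Java: read the file with the source encoding and write it with the target encoding. This is an illustrative sketch, not something shipped with Protégé. The class name Transcoder and its command-line arguments are assumptions.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class Transcoder {

    /** Copies source to target, decoding with one charset and encoding with another. */
    static void transcode(Path source, Charset from, Path target, Charset to) {
        // These readers and writers report malformed input and unmappable
        // characters as exceptions instead of silently replacing them.
        try (Reader in = Files.newBufferedReader(source, from);
             Writer out = Files.newBufferedWriter(target, to)) {
            char[] buffer = new char[8192];
            int count;
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 4) {
            System.err.println("usage: Transcoder <source> <from-encoding> <target> <to-encoding>");
            return;
        }
        transcode(Paths.get(args[0]), Charset.forName(args[1]),
                  Paths.get(args[2]), Charset.forName(args[3]));
    }
}
```

Always write to a new target file rather than overwriting the source, so that a wrong guess about the source encoding doesn't destroy the original.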
How can I troubleshoot my problems with fonts or file encodings?
If you're running Protégé from a script or batch file you need to be sure to set the file.encoding property at the
Java command line. Otherwise, you'll get whatever the default is on your system. Starting with release 2.0, Protégé
will print out a warning message on startup if you're not using UTF-8.
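To check for yourself which encoding a given launch configuration ends up with, a small Java program along the following lines can help. It is only a sketch of the kind of check described above, not Protégé's actual startup code. The class name EncodingCheck and the message wording are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {

    /** Returns a warning message, or null if the given encoding name denotes UTF-8. */
    static String warningFor(String encodingName) {
        if (encodingName == null) {
            return "file.encoding is not set";
        }
        try {
            // Charset.forName resolves aliases and ignores case, so "utf8" also matches.
            if (Charset.forName(encodingName).equals(StandardCharsets.UTF_8)) {
                return null;
            }
        } catch (IllegalArgumentException e) {
            // Thrown for illegal or unsupported charset names.
            return "file.encoding names an unknown encoding: " + encodingName;
        }
        return "file.encoding is " + encodingName + ", not UTF-8";
    }

    public static void main(String[] args) {
        String warning = warningFor(System.getProperty("file.encoding"));
        if (warning != null) {
            System.err.println("Warning: " + warning);
        }
    }
}
```

Run it with the same Java command line (including any -Dfile.encoding setting) that your script or batch file uses to start Protégé. If it prints nothing, that configuration is using UTF-8.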
If you're running Protégé from inside of a Java development environment, you also need to set the file.encoding
property. How you do this depends on your development environment.