
Protégé File Encoding FAQ

Protégé is used by people from around the world who speak many languages and use most of the world's character sets. The majority of problems in dealing with character sets are handled transparently (thanks to Java's good support for this). In general, things should "just work". However, there are some issues related to saving, loading, and exchanging files that users should be aware of. These issues can be very confusing and this FAQ attempts to answer the most commonly asked questions surrounding file encoding.




What's a file encoding?

An encoding is a way to store a character as a sequence of one or more numbers. On most systems the basic unit of storage is the byte, which holds a number in the range 0-255. The encoding specifies how to map all available characters on a system into one or more bytes. If the number of characters in use on the system is 256 or fewer (as it is on most US and Western European systems), then creating such a mapping is not very difficult and is typically done with one byte per character. If the number of characters available for use on a system is in the thousands or tens of thousands (as on some Asian systems), then this mapping is more challenging, especially if you want some sort of compatibility with systems in use elsewhere.

A number of different encoding schemes have been developed and are in use. There are multiple encodings for US English alone; two common ones are ASCII and EBCDIC. There are also multiple encoding schemes for most other languages. The confusion that this causes for applications (particularly multilingual applications) is enormous. An international effort was launched to define a standard encoding that covers all languages. This standard is Unicode.

A file encoding is a specification of the way that a character is stored in a file as a sequence of one or more bytes. This can be, but is not required to be, the same as the default encoding used on the system for applications. The Unicode encoding specification describes four different possible (and incompatible) file encodings. Common file encodings include ASCII, Latin-1, UTF-8, UTF-16, and Big5. Each of these encodings has advantages and disadvantages and was developed with a specific purpose in mind. No one encoding scheme is "better" than the others in all contexts. The character "A" could thus be represented as a single byte "65" in one file encoding (as it is in ASCII) and as a sequence of three bytes "132 145 234" in another file encoding.
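To make the idea concrete, the following minimal Java sketch prints the bytes that a few characters become under three standard encodings. The class name EncodingBytes and the helper bytesOf are made up for this illustration; the sample characters are written as Unicode escapes so that the source file itself does not depend on any particular encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Returns the bytes of `text` in `charset` as space-separated decimal values (0-255).
    public static String bytesOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(b & 0xFF); // Java bytes are signed; mask to get 0-255
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "A", e with acute accent, and a CJK character
        String[] samples = {"A", "\u00e9", "\u4e2d"};
        Charset[] charsets = {
            StandardCharsets.ISO_8859_1, // Latin-1
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE
        };
        for (String s : samples) {
            String label = String.format("U+%04X", s.codePointAt(0));
            for (Charset cs : charsets) {
                System.out.println(label + " in " + cs.name() + ": " + bytesOf(s, cs));
            }
        }
    }
}
```

"A" is the single byte 65 in Latin-1 and UTF-8 but the two bytes 0 65 in UTF-16BE, and the accented "e" is one byte (233) in Latin-1 but two bytes (195 169) in UTF-8. The CJK character has no Latin-1 representation at all, so Java substitutes a question mark (byte 63) for it.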

File encodings are important because different systems use different default file encodings, and it takes some effort to get a system with one default file encoding to read a file written with another. This is not a rare situation: it's fairly common outside of the US for a machine with an English language version of MS Windows to be sitting right next to a machine with a non-English version of MS Windows.
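What goes wrong in that situation can be sketched in a few lines of Java. The example below is only an illustration (the class and method names are invented here): it encodes a word as UTF-8, as one machine might when saving, and then decodes the same bytes as Latin-1, as a second machine with a different default might when loading.

```java
import java.nio.charset.StandardCharsets;

public class MismatchDemo {
    // Encodes `text` as UTF-8, then decodes the same bytes as ISO-8859-1 (Latin-1).
    public static String writeUtf8ReadLatin1(String text) {
        byte[] stored = text.getBytes(StandardCharsets.UTF_8);
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e
        String misread = writeUtf8ReadLatin1(original);
        System.out.println("characters written:   " + original.length());
        System.out.println("characters read back: " + misread.length());
        System.out.println("round trip intact:    " + original.equals(misread));
    }
}
```

The accented character occupies two bytes in UTF-8, and the Latin-1 reader treats each byte as a separate character, so four characters go in and five come out. Plain ASCII letters survive the trip unchanged, which is why the problem often goes unnoticed until the first accented or non-Latin character appears.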

Doesn't Unicode handle all of this for you?

The short answer is no. Unicode is a character encoding that's more and more commonly used in applications. All Java applications (and thus Protégé) use a Unicode encoding for internal purposes. File systems, however, typically support a local character encoding by default, and Java applications use that default system file encoding for all file input and output unless told otherwise. Using Unicode in memory therefore says nothing about which bytes end up in a file.
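A Java program can avoid depending on the system default by naming the encoding every time it reads or writes. The sketch below is an assumption-laden illustration, not code from Protégé: the class name and the temporary file are invented for the example. It prints the default that Java picked up and then performs a round trip with UTF-8 named explicitly. Note that on recent Java releases (18 and later) the default is itself UTF-8 on every platform, while older releases take it from the operating system.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetIo {
    // Writes `text` to a temporary file as UTF-8 and reads it back as UTF-8.
    public static String roundTrip(String text) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The encoding Java uses when none is named explicitly.
        System.out.println("default charset: " + Charset.defaultCharset().name());

        String text = "Prot\u00e9g\u00e9";
        System.out.println("round trip intact: " + text.equals(roundTrip(text)));
    }
}
```

Because both the write and the read name UTF-8, the result is the same on any machine, whatever its default encoding happens to be.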

What's the difference between a file encoding and a font?

A font determines what a character looks like on the screen. A file encoding determines what a character looks like in a file. The character "A" could appear as three straight lines in one font or as a combination of circles and wavy lines in another (cursive) font. Underneath these visual differences, the same encoding (number) for "A" is carried around by the system and written into a file according to the file encoding in use.

What file encoding does Protégé use by default?

Protégé uses the UTF-8 encoding. UTF-8 is one of the Unicode file encodings and the encoding that's compatible with the most common US English language encoding (ASCII). All Unicode characters, and this means essentially all characters in any language, can be stored in UTF-8 (or indeed in any of the other Unicode file encodings). UTF-8 also appears to be the most commonly used of the Unicode encodings.
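The compatibility with ASCII can be checked directly: text that contains only ASCII characters produces exactly the same bytes in both encodings. The following small sketch (class and method names are made up for the illustration) compares the two.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8AsciiCompat {
    // True when `text` is stored as the same byte sequence in US-ASCII and in UTF-8.
    public static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII text: identical bytes, so an ASCII file is already a valid UTF-8 file.
        System.out.println(sameBytesInAsciiAndUtf8("Protege project"));
        // Accented text: ASCII cannot represent it, so the byte sequences differ.
        System.out.println(sameBytesInAsciiAndUtf8("Prot\u00e9g\u00e9 project"));
    }
}
```

The first call returns true and the second returns false, because ASCII has no code for the accented character while UTF-8 stores it as two bytes.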

Why doesn't Protégé just use the default file encoding on my system?

It used to. As our user community grew, many users in different countries started to exchange Protégé projects and ran into the problems mentioned above. Users in the same physical location but with different Windows installations had problems moving projects around. Even users who were composing knowledge bases solely in English ran into this problem because the incompatibility is not at the language level but at the encoding (and machine) level. A number of our users requested that we deal with this problem somehow. Starting with release 1.9, we made UTF-8 the default file encoding for Protégé. UTF-8 was selected because (a) it can encode all Unicode characters, and (b) it's the most compatible with our previous files.

How can I make Protégé use another file encoding?

We suggest that you don't do this. It will make it quite difficult for others to use your projects (or for you to use your projects on other machines). Nevertheless, it is possible. You need to set the Java system property "file.encoding" on the Java command line, using the "-D" option: "-Dfile.encoding=<encoding_name>". If you start Protégé by double-clicking on the executable (or a PPRJ file), then you need to specify this property in the Protege.lax file in your Protégé installation directory.

How can I translate files between different encodings?

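There's a program available in Sun's Java Development Kit called "native2ascii". In spite of the strange name, this program can be used to convert a file from any supported encoding to any other, in two steps. By default it translates a file from the native file encoding to ASCII text in which every non-ASCII character is written as a Unicode escape of the form "\uXXXX" (the output is not UTF-8). The "-encoding" option names an encoding to use in place of the native one, and the "-reverse" option converts in the other direction, from the escaped form to the named encoding. Running the program once in each direction, with a different "-encoding" each time, translates a file from the first encoding to the second.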
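If no conversion tool is at hand, the same translation can be sketched in a few lines of Java: decode the bytes using the encoding the file was written in, then encode the resulting text in the encoding you want. This is a minimal illustration with invented class and method names. It reads the whole file into memory, so it suits ordinary project files rather than very large ones, and characters that the target encoding cannot represent are replaced with a substitute character rather than reported as errors.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Transcode {
    // Re-encodes `input`, interpreted as text in `from`, into the bytes of the same text in `to`.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    // Reads `source` as `from` and writes the same text to `target` as `to`.
    public static void convertFile(Path source, Charset from, Path target, Charset to)
            throws IOException {
        Files.write(target, transcode(Files.readAllBytes(source), from, to));
    }

    // Usage: java Transcode <source> <source-encoding> <target>
    // The target file is always written as UTF-8 in this sketch.
    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java Transcode <source> <source-encoding> <target>");
            return;
        }
        convertFile(Paths.get(args[0]), Charset.forName(args[1]),
                    Paths.get(args[2]), StandardCharsets.UTF_8);
    }
}
```

The source encoding has to be known in advance; nothing in the file itself reliably says which encoding was used to write it.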

How can I troubleshoot my problems with fonts or file encodings?

If you're running Protégé from a script or batch file you need to be sure to set the file.encoding property at the Java command line. Otherwise, you'll get whatever the default is on your system. Starting with release 2.0, Protégé will print out a warning message on startup if you're not using UTF-8.

If you're running Protégé from inside of a Java development environment, you also need to set the file.encoding property. How you do this depends on your development environment.
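To see which encoding a particular launch configuration actually produces, you can run a tiny Java program with the same script, batch file, or development environment settings. The sketch below is not Protégé's own startup check; the class and method names are invented for illustration. It reports the "file.encoding" property and the default character set that Java derived from it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // True when the given charset is UTF-8.
    public static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    // Builds a one-line report from the property value and the effective default charset.
    public static String report(String fileEncodingProperty, Charset effective) {
        String status = isUtf8(effective) ? "OK" : "WARNING: not UTF-8";
        return "file.encoding=" + fileEncodingProperty
                + ", default charset=" + effective.name()
                + " -> " + status;
    }

    public static void main(String[] args) {
        System.out.println(report(System.getProperty("file.encoding"),
                                  Charset.defaultCharset()));
    }
}
```

Run it the same way you run Protégé. If the report shows a warning, the property is not reaching the Java virtual machine from that launcher.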