Protege3JDBCSchema

From Protege Wiki
Revision as of 12:16, March 13, 2014 by TaniaTudorache (talk | contribs)

Jump to: navigation, search

Design and Rationale for the Protege 3 database schema

This document discusses the rationale, design, and implementation of the JDBC back-end for Protégé.

This page is still under construction!

Why we did it the way we did

There are a number of ways to map an object-oriented schema (essentially part of what Protégé produces) to an RDBMS schema. These different ways all involve a number of tradeoffs between optimizing for space, time, queries, modifications, insertion, deletion, etc. We wanted to support the possibility that our users might want to map onto a legacy, perhaps even non-relational, database. Different types of mappings also seem more appropriate at different development stages.

Early on, when you are just trying to get the class structure right, you want something where the schema is easy to change. Later on, when you are working with instances, you may want something that is optimal for insertion, modification, and removal of instances and perhaps has a "natural" feel about the mapping of classes to tables. Even later, when you are concentrating on applications, you may want something that is optimized for exactly the queries that are most important to you.

It was clear that no matter which approach we took for the database back-end, we were unlikely either to make very many people happy, or to keep them happy for very long. Thus, there seemed to be two possible approaches: Develop (or use) a mapping language that would allow us to map the ontology to any RDBMS schema Allow users to write code to plugin into our system to access their database with whatever schema and optimizations they choose, and then provide a simple example implementation of this plug-in. There are a number of commercial products that address the first option by parsing Java source files and producing RDBMS schema and interface classes of various types. We don't have Java source laying around for classes created inside of Protégé. More importantly, we allow arbitrary changes to the class structure after we have acquired instances. When the classes are changed we try not to lose any more information from the instances than necessary. We also allow arbitrary "type migration" of instances from one class to another. These features are really central to the way users interact with our tool and we were not aware of any commercial support for such a thing.

Thus, developing something along the lines of the first option was a lot of work, as well as being orthogonal to the main thrust of our research here. We see ourselves as providing usable tools for knowledge base development and use and not as the world's experts on OO-to-relational mapping tools. Also, our bias is against doing such a large project having no experience with simpler systems. We prefer to build big things by extending little things that have proven to be inadequate. In this case we didn't have any experience building the corresponding little things, inadequate or otherwise.

So, we chose the second option, allowing users to create their own database plug-ins. This choice is also consistent with the rest of the Protégé plug-in architecture. We chose the simplest schema that we could think of and focused on the "maximal change" usage where the class structure and hierarchy is undergoing constant change. In this design, therefore, there is no attention paid to things such as query performance of any type.

What we did

We have a single table that stores the entire contents of the knowledge base. This includes classes, slots, facets and instances. The table has a fixed number of columns which are listed below. The Protégé metaclass architecture (which is modeled after the CLOS Meta-Object Protocol) is used explicitly in the table to simplify things: we treat all classes, slots, and facets as frames. Each entry in the database corresponds to a frame in Protege. Classes have slots such as ":DIRECT_SUPERCLASS" to maintain the inheritance hierarchy. All frames have a :NAME slot which contains the name of the frame. The database format is:


TBD

Column Type Description Can be null Can be empty
frame VARBINARY or VARCHAR frame name No No
frame_type SMALLINT same as "value_type" but for the frame column No No
slot VARBINARY or VARCHAR slot frame name No No
facet VARBINARY or VARCHAR facet frame name (0 if not a facet value) No No
is_template BIT or BOOL 0 => value is OKBC "own", 1 => value is OKBC "template" No No
value_index INT number used to maintain relative ordering of slot_or_facet_value entries for a frame-slot(-facet) combination No No
value_type SMALLINT number used to indicate the "type" of the value stored in slot_or_facet_value. The number-to-type conversion is given in the file: edu.stanford.smi.protege.storage.database.DatabaseUtils.java No No
short_value VARCHAR facet value if facet is not 0, slot value otherwise. Holds values of length that will fit in a varchar (typically <= 255) Yes No
long_value MEDIUMTEXT same as slot_or_facet_value but holds values too long to fit in slot_or_facet_value Yes No


Note

If you browse a Protege database you may notice that, in the case of the superclass and subclass relations, we appear to store duplicate information. For example with class A we store that its subclass is B and with B we store that its superclass is A. Maintaining separate records for these relations is necessary because we want to be able to maintain the ordering of both subclasses and superclasses. So while the "slot value" information is indeed duplicated in these records, the "index" information is unique. (Subclass ordering is a user-interface feature that a number of our users have requested. Protege attaches no meaning to the ordering of superclasses or subclasses.)