Difference between revisions of "Protege3JDBCSchema"
(2 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | = Design and Rationale for the | + | = Design and Rationale for the Protégé 3 database schema = |
− | This document discusses the rationale, design, and implementation of the JDBC back-end for Protégé | + | This document discusses the rationale, design, and implementation of the '''JDBC back-end for Protégé 3'''. |
− | |||
− | |||
== Why we did it the way we did == | == Why we did it the way we did == | ||
− | There are a number of ways to map an object-oriented schema (essentially part of what Protégé produces) to an RDBMS schema. These different ways all involve a number of tradeoffs between optimizing for space, time, queries, modifications, insertion, deletion, etc. We wanted to support the possibility that our users might want to map onto a legacy, perhaps even non-relational, database. Different types of mappings also seem more appropriate at different development stages. | + | There are a number of ways to map an object-oriented schema (essentially part of what Protégé 3 produces) to an RDBMS schema. These different ways all involve a number of tradeoffs between optimizing for space, time, queries, modifications, insertion, deletion, etc. We wanted to support the possibility that our users might want to map onto a legacy, perhaps even non-relational, database. Different types of mappings also seem more appropriate at different development stages. |
Early on, when you are just trying to get the class structure right, you want something where the schema is easy to change. Later on, when you are working with instances, you may want something that is optimal for insertion, modification, and removal of instances and perhaps has a "natural" feel about the mapping of classes to tables. Even later, when you are concentrating on applications, you may want something that is optimized for exactly the queries that are most important to you. | Early on, when you are just trying to get the class structure right, you want something where the schema is easy to change. Later on, when you are working with instances, you may want something that is optimal for insertion, modification, and removal of instances and perhaps has a "natural" feel about the mapping of classes to tables. Even later, when you are concentrating on applications, you may want something that is optimized for exactly the queries that are most important to you. | ||
It was clear that no matter which approach we took for the database back-end, we were unlikely either to make very many people happy, or to keep them happy for very long. Thus, there seemed to be two possible approaches: | It was clear that no matter which approach we took for the database back-end, we were unlikely either to make very many people happy, or to keep them happy for very long. Thus, there seemed to be two possible approaches: | ||
− | Develop (or use) a mapping language that would allow us to map the ontology to any RDBMS schema | + | * Develop (or use) a mapping language that would allow us to map the ontology to any RDBMS schema |
− | Allow users to write code to plugin into our system to access their database with whatever schema and optimizations they choose, and then provide a simple example implementation of this plug-in. | + | * Allow users to write code to plugin into our system to access their database with whatever schema and optimizations they choose, and then provide a simple example implementation of this plug-in. |
+ | |||
There are a number of commercial products that address the first option by parsing Java source files and producing RDBMS schema and interface classes of various types. We don't have Java source laying around for classes created inside of Protégé. More importantly, we allow arbitrary changes to the class structure after we have acquired instances. When the classes are changed we try not to lose any more information from the instances than necessary. We also allow arbitrary "type migration" of instances from one class to another. These features are really central to the way users interact with our tool and we were not aware of any commercial support for such a thing. | There are a number of commercial products that address the first option by parsing Java source files and producing RDBMS schema and interface classes of various types. We don't have Java source laying around for classes created inside of Protégé. More importantly, we allow arbitrary changes to the class structure after we have acquired instances. When the classes are changed we try not to lose any more information from the instances than necessary. We also allow arbitrary "type migration" of instances from one class to another. These features are really central to the way users interact with our tool and we were not aware of any commercial support for such a thing. | ||
Thus, developing something along the lines of the first option was a lot of work, as well as being orthogonal to the main thrust of our research here. We see ourselves as providing usable tools for knowledge base development and use and not as the world's experts on OO-to-relational mapping tools. Also, our bias is against doing such a large project having no experience with simpler systems. We prefer to build big things by extending little things that have proven to be inadequate. In this case we didn't have any experience building the corresponding little things, inadequate or otherwise. | Thus, developing something along the lines of the first option was a lot of work, as well as being orthogonal to the main thrust of our research here. We see ourselves as providing usable tools for knowledge base development and use and not as the world's experts on OO-to-relational mapping tools. Also, our bias is against doing such a large project having no experience with simpler systems. We prefer to build big things by extending little things that have proven to be inadequate. In this case we didn't have any experience building the corresponding little things, inadequate or otherwise. | ||
− | So, we chose the second option, allowing users to create their own database plug-ins. This choice is also consistent with the rest of the Protégé plug-in architecture. We chose the simplest schema that we could think of and focused on the "maximal change" usage where the class structure and hierarchy is undergoing constant change. In this design, therefore, there is no attention paid to things such as query performance of any type. | + | So, we chose the second option, allowing users to create their own database plug-ins. This choice is also consistent with the rest of the Protégé 3 plug-in architecture. We chose the simplest schema that we could think of and focused on the "maximal change" usage where the class structure and hierarchy is undergoing constant change. In this design, therefore, there is no attention paid to things such as query performance of any type. |
== What we did == | == What we did == | ||
− | We have a single table that stores the entire | + | We have a single table that stores the entire content of the knowledge base. This includes classes, slots, facets and instances. The table has a fixed number of columns which are listed below. The Protégé 3 metaclass architecture (which is modeled after the CLOS Meta-Object Protocol) is used explicitly in the table to simplify things: we treat all classes, slots, and facets as frames. Each entry in the database corresponds to a frame in Protege. Classes have slots such as <code>":DIRECT_SUPERCLASS"</code> to maintain the inheritance hierarchy. All frames have a <code>:NAME</code> slot which contains the name of the frame. The database format is: |
− | |||
− | |||
{| class="wikitable" style="text-align: left; color: green; width:100%" | {| class="wikitable" style="text-align: left; color: green; width:100%" | ||
Line 36: | Line 33: | ||
|frame | |frame | ||
|VARBINARY or VARCHAR | |VARBINARY or VARCHAR | ||
− | |frame name | + | |frame name (unique in the KB, used as identifier for the frame) |
|No | |No | ||
|No | |No | ||
Line 72: | Line 69: | ||
|value_type | |value_type | ||
|SMALLINT | |SMALLINT | ||
− | |number used to indicate the "type" of the value stored in slot_or_facet_value. The number-to-type conversion is given in the | + | |number used to indicate the "type" of the value stored in slot_or_facet_value. The number-to-type conversion is given in the files: Frames - [http://smi-protege.stanford.edu/svn/protege-core/trunk/src/edu/stanford/smi/protege/storage/database/DatabaseUtils.java?view=markup edu.stanford.smi.protege.storage.database.DatabaseUtils.java] and the OWL extensions: [http://smi-protege.stanford.edu/svn/owl/trunk/src/edu/stanford/smi/protegex/owl/model/factory/FrameTypeId2OWLJavaClass.java?view=markup edu.stanford.smi.protegex.owl.model.factory.FrameTypeId2OWLJavaClass] |
|No | |No | ||
|No | |No | ||
Line 89: | Line 86: | ||
|} | |} | ||
+ | The <code>Type</code> column shows the column types used in the table schema. However, different relational engines use different types. You can find the exact mapping in the file: [http://smi-protege.stanford.edu/svn/protege-core/trunk/src/edu/stanford/smi/protege/storage/database/KnownDatabase.java?view=markup edu.stanford.smi.protege.storage.database]. | ||
'''Note''' | '''Note''' | ||
− | If you browse a Protege database you may notice that, in the case of the superclass and subclass relations, we appear to store duplicate information. For example with class A we store that its subclass is B and with B we store that its superclass is A. Maintaining separate records for these relations is necessary because we want to be able to maintain the ordering of both subclasses and superclasses. | + | If you browse a Protege 3 database you may notice that, in the case of the superclass and subclass relations, we appear to store duplicate information. For example with class A we store that its subclass is B and with B we store that its superclass is A. Maintaining separate records for these relations is necessary because we want to be able to maintain the ordering of both subclasses and superclasses. So while the "slot value" information is indeed duplicated in these records, the "index" information is unique. (Subclass ordering is a user-interface feature that a number of our users have requested. Protege attaches no meaning to the ordering of superclasses or subclasses.) |
Latest revision as of 11:13, March 25, 2014
Design and Rationale for the Protégé 3 database schema
This document discusses the rationale, design, and implementation of the JDBC back-end for Protégé 3.
Why we did it the way we did
There are a number of ways to map an object-oriented schema (essentially part of what Protégé 3 produces) to an RDBMS schema. These different ways all involve a number of tradeoffs between optimizing for space, time, queries, modifications, insertion, deletion, etc. We wanted to support the possibility that our users might want to map onto a legacy, perhaps even non-relational, database. Different types of mappings also seem more appropriate at different development stages.
Early on, when you are just trying to get the class structure right, you want something where the schema is easy to change. Later on, when you are working with instances, you may want something that is optimal for insertion, modification, and removal of instances and perhaps has a "natural" feel about the mapping of classes to tables. Even later, when you are concentrating on applications, you may want something that is optimized for exactly the queries that are most important to you.
It was clear that no matter which approach we took for the database back-end, we were unlikely either to make very many people happy, or to keep them happy for very long. Thus, there seemed to be two possible approaches:
- Develop (or use) a mapping language that would allow us to map the ontology to any RDBMS schema
- Allow users to write code to plugin into our system to access their database with whatever schema and optimizations they choose, and then provide a simple example implementation of this plug-in.
There are a number of commercial products that address the first option by parsing Java source files and producing RDBMS schema and interface classes of various types. We don't have Java source laying around for classes created inside of Protégé. More importantly, we allow arbitrary changes to the class structure after we have acquired instances. When the classes are changed we try not to lose any more information from the instances than necessary. We also allow arbitrary "type migration" of instances from one class to another. These features are really central to the way users interact with our tool and we were not aware of any commercial support for such a thing.
Thus, developing something along the lines of the first option was a lot of work, as well as being orthogonal to the main thrust of our research here. We see ourselves as providing usable tools for knowledge base development and use and not as the world's experts on OO-to-relational mapping tools. Also, our bias is against doing such a large project having no experience with simpler systems. We prefer to build big things by extending little things that have proven to be inadequate. In this case we didn't have any experience building the corresponding little things, inadequate or otherwise.
So, we chose the second option, allowing users to create their own database plug-ins. This choice is also consistent with the rest of the Protégé 3 plug-in architecture. We chose the simplest schema that we could think of and focused on the "maximal change" usage where the class structure and hierarchy is undergoing constant change. In this design, therefore, there is no attention paid to things such as query performance of any type.
What we did
We have a single table that stores the entire content of the knowledge base. This includes classes, slots, facets and instances. The table has a fixed number of columns which are listed below. The Protégé 3 metaclass architecture (which is modeled after the CLOS Meta-Object Protocol) is used explicitly in the table to simplify things: we treat all classes, slots, and facets as frames. Each entry in the database corresponds to a frame in Protege. Classes have slots such as ":DIRECT_SUPERCLASS"
to maintain the inheritance hierarchy. All frames have a :NAME
slot which contains the name of the frame. The database format is:
Column | Type | Description | Can be null | Can be empty |
---|---|---|---|---|
frame | VARBINARY or VARCHAR | frame name (unique in the KB, used as identifier for the frame) | No | No |
frame_type | SMALLINT | same as "value_type" but for the frame column | No | No |
slot | VARBINARY or VARCHAR | slot frame name | No | No |
facet | VARBINARY or VARCHAR | facet frame name (0 if not a facet value) | No | No |
is_template | BIT or BOOL | 0 => value is OKBC "own", 1 => value is OKBC "template" | No | No |
value_index | INT | number used to maintain relative ordering of slot_or_facet_value entries for a frame-slot(-facet) combination | No | No |
value_type | SMALLINT | number used to indicate the "type" of the value stored in slot_or_facet_value. The number-to-type conversion is given in the files: Frames - edu.stanford.smi.protege.storage.database.DatabaseUtils.java and the OWL extensions: edu.stanford.smi.protegex.owl.model.factory.FrameTypeId2OWLJavaClass | No | No |
short_value | VARCHAR | facet value if facet is not 0, slot value otherwise. Holds values of length that will fit in a varchar (typically <= 255) | Yes | No |
long_value | MEDIUMTEXT | same as slot_or_facet_value but holds values too long to fit in slot_or_facet_value | Yes | No |
The Type
column shows the column types used in the table schema. However, different relational engines use different types. You can find the exact mapping in the file: edu.stanford.smi.protege.storage.database.
Note
If you browse a Protege 3 database you may notice that, in the case of the superclass and subclass relations, we appear to store duplicate information. For example with class A we store that its subclass is B and with B we store that its superclass is A. Maintaining separate records for these relations is necessary because we want to be able to maintain the ordering of both subclasses and superclasses. So while the "slot value" information is indeed duplicated in these records, the "index" information is unique. (Subclass ordering is a user-interface feature that a number of our users have requested. Protege attaches no meaning to the ordering of superclasses or subclasses.)