| |||
|
|
|
Introduction Subtypes can be found in many data models. They occur when an entity (the supertype) consists of other entities (subtypes) that exist at lower levels of abstraction and have their own particular attributes. For instance, in a model of a medical database, Person could be a supertype, and Male Person and Female Person could be subtypes, as shown in Figure 1. One particular attribute in the supertype is used as a category discriminator. The value in this column in the physical database indicates which subtype each record in the supertype is associated with. A special reference data table houses these values, and contains one record for each value that can be used as a category discriminator. In the physical implementation of Figure 1, the table Gender will contain two records, one with a Gender Code of “M” (Male Person) and the other with a Gender Code of “F” (Female Person). Each record in the Person table is associated with a record either in the Male Person table or the Female Person table depending on the value of Gender Code in that particular Person record.
All this is well supported by data modeling methodologies and tools. However, in my personal experience subtypes are not seen all that often in data models. This is somewhat surprising because many sets of business rules operate on only a specific subset of records in a table. This can easily be seen in SQL statements that update different columns in a table, but which have identical, and often rather complex, WHERE clauses. Similarly, sets of reports are often found which deal with only a specific subset of records in a table. The structure and behavior of applications would, therefore, imply that there should be a lot more subtypes in a database design than are apparent. In reality, the subtypes are there, but are hidden, and because they are hidden, they are not properly administered. Understanding hidden subtypes is necessary if the enterprise’s information assets are to be properly managed on a number of different levels. Why Do Hidden Subtypes Exist? The reason that so few real or “formal” subtypes are found in data models appears to be partly due to the complexity they introduce into programming. For instance, if a new Person record has to be inserted in the design shown in Figure 1, then either a new Male Person, or a new Female Person record should simultaneously be inserted in the appropriate subtype table. However, a system operator does not always know the gender of a patient right away, making it impossible to do the two inserts. The alternative of inserting a record in the supertype table and waiting until the gender is known before inserting a record in the appropriate subtype table is difficult to program. Given these complexities, programmers greatly prefer having only one table that incorporates the supertype and the subtypes. A second problem is that one supertype can have different sets of subtypes, which is difficult to produce a data model for, even more difficult to program against, and incredibly difficult to use for writing reports. For instance, the Person entity shown in Figure 1 may break down into a number of independent sets of subtypes in the context of a medical database as follows:
Each of these subtypes may have its own unique attributes, and may have some business rules that apply uniquely to it. If a data modeler rigorously implements these subtypes, the Person entity will have over 10 subtype entities. Any programmer presented with such a design would not be very happy, and any user trying to query a database built on such a design would probably not know where to begin. A third issue that often emerges is that a subtype can exist with only one category. That is, the supertype exists with only one subtype that has its own unique attributes. Perhaps the Person entity could have a Juvenile subtype, with some special attributes, but no corresponding Non-Juvenile subtype. In such a case, the category discriminator can be a simple indicator. There is no need for a separate reference data table to hold what would only be a single record for the category discriminator. Cases of single subtypes are quite common, and do not fit well with the theory behind formal subtypes. All of these factors contribute to reducing the inclusion of formal subtypes in data models. Instead, a single entity is implemented that includes the supertype and all subtypes. In reality subtypes still exist, but they are hidden within entities that have no associated subtype entities. The Role of Reference Data and IndicatorsWhen hidden subtypes are implemented instead of formal subtypes we typically see a database table surrounded by a cluster of reference data tables. The term reference data is used for data that represents things that the enterprise does not manage (e.g. Country), or data that is used only to categorize other data found in the database. Reference data tables are sometimes called “lookup tables” and “domain values”, and usually consist of a single code column and a single description column.
Figure 2 shows how reference data tables are used to identify hidden subtypes in the Person record. This design resembles the “fact table” and “reporting dimensions” that are commonly found in dimensional data modeling approaches used for building certain kinds of data warehouses and marts. Although the reference data tables in Figure 2 can be used as reporting dimensions, they represent much more than this. They identify the subsets of records in the Person table that represent hidden subtypes. Specific attributes belong to each of these hidden subtypes, and unique sets of business rules apply to each hidden subtype. One big advantage of using reference data tables to identify hidden subtypes is that an additional subtype category can very easily be accommodated in a physically implemented database simply by adding a new record to a reference data table. This makes it much easier for databases to change with business needs. By contrast, formal subtypes would require introduction of a new table. Of course completely new hidden subtypes (as opposed to an additional category for an existing hidden subtype) will require a new reference data table. In addition to reference data, attributes that are indicators can also identify hidden subtypes. In Figure 2, Juvenile Status Indicator serves this purpose. Indicators are attributes that have two values like “True” and “False”, or “Yes” and “No”, to indicate the presence or absence of something. In terms of hidden subtypes, indicators can only identify a subtype that contains one category, like Juvenile in Figure 2. As noted above, there is no Non-Juvenile subtype. One problem with indicators can occur if the hidden subtype needs an additional category added to it. When this happens an indicator no longer works. There is now a need to distinguish between two different subtypes, not simply a need to identify a single subtype. Unfortunately, indicators tend to be corrupted when this happens. They are usually populated with different codes, but no corresponding parent reference data table is introduced. This is a problem that exists in many databases. One way to eliminate it is to avoid using indicators to identify hidden subtypes, and to use reference data tables instead, even if they contain only one record. It should be noted that both reference data and indicators play roles other than identifying hidden subtypes in data models. Just because they are present does not mean that hidden subtypes are present. Loss of Subtype MetadataHidden subtypes make database implementation easier in general terms than when formal subtypes are used. However, they also create a lot of problems that are typically not mitigated by data modelers, DBA’s, or other IT professionals. These problems originate from the fact that a subtype is something that needs to be tracked. If a subtype is not present in the metamodel of a database design, then no information can be tracked for it. IT professionals then have to rely on personal knowledge, or information dispersed in documentary form. If a subtype exists, then its existence should be recorded via some basic metadata. Each subtype should have a name, a description, and should be related to other subtypes of the same category. Each subtype should also be directly related to the record in the reference data table (or indicator) that serves as a category discriminator. Unfortunately, this last requirement crosses the boundary between metadata and physical data in a database, making it unpalatable for many data professionals. A major problem with hidden subtypes is that it is impossible to tell from a data model if a particular attribute belongs exclusively to one hidden subtype, or if the attribute applies to the supertype, and thus to every record in the implemented database. Some data modelers try to use naming conventions, or groupings of attributes, or even color schemes to show which attributes belong together. However, data modelers do not always understand why they are doing this. The reality is that a hidden subtype exists to which a subset of attributes uniquely applies. Since there is no formal metadata for hidden subtypes, these groups of attributes cannot be related to the subtypes to which they belong. This problem also makes a complete nonsense of the NULL / NOT NULL options that exist for physically implemented columns in databases. Everyone involved in database design eventually comes across a database column that in some circumstances must contain a value, but in other circumstances must never contain a value. The attribute Person Responsible for Juvenile in the Person entity in Figure 2 is like this. If a person is a juvenile then a person who is responsible for them must be identified, and Person Responsible for Juvenile must be NOT NULL. However, if a person is not a juvenile, they are responsible for themselves and Person Responsible for Juvenile must be NULL. Forcing a data modeler to choose between NULL and NOT NULL for an attribute misses this point entirely. Poorly Identified Hidden SubtypesThe situation in Figure 2 is actually somewhat ideal. Sometimes, hidden subtypes can be so well hidden that it is nearly impossible to detect them. This happens when neither a single record in a reference data table, nor a single indicator column, identifies a hidden subtype. An example might be found in a customer database where customers who have not placed an order in the past 90 days and still owe an outstanding balance are subject to special business rules and special reporting. Such customers could belong to the hidden subtype Customers At Risk of Default. The Customer entity could have one attribute Days Since Last Order Placed, and another attribute Balance Outstanding that identify this hidden subtype. Each time this hidden subtype needs to be processed the value in Days Since Last Order Placed has to be compared to 90, and the value in Balance Outstanding has to be compared to 0. This may have to occur in many business rules, reports, and queries. The risk is that programmers, and perhaps users will make mistakes when repeatedly implementing the logic to identify the members of the hidden subtype Customers At Risk of Default. Furthermore, what happens if the definition of the hidden subtype changes to customers who have not placed an order in the past 60 days and still owe an outstanding balance? Hidden subtypes are poorly identified if complex logic alone is used to identify them. It is quite difficult to be aware of their existence. In these circumstances an indicator column should be added to identify the hidden subtype. Thus, an attribute Customer At Risk of Default Indicator could be added to the Customer table. It should be set once, by one business rule, and then used to identify members of the hidden subtype. Use of indicators in this way not only makes it easier to identify hidden subtypes, but it reduces the risk that comes from repeatedly implementing complex logic to identify hidden subtypes. In particular, it is very useful for keeping business rules atomic. A business rule that has to contain complex logic to identify a member of a hidden subtype, and which also contains a calculation for update purposes is really two business rules and is not atomic. Explicit identification of hidden subtypes through indicators or reference data helps to keep business rules atomic. ConclusionHidden subtype are frequently found in database designs, but are not supported by data modeling tools and techniques. Furthermore, few data professionals are aware of them or the implications that hidden subtypes have. An increased awareness of the existence of hidden subtypes could have a very positive impact on the quality of both the designs of databases and the applications built on these databases. Malcolm Chisholm is an independent consultant focusing on reference data management and building business rules engines. He is an author of two books on these topics and is a frequent author and speaker on data administration issues. More information can be found at http://www.refdataportal.com/ and http://www.bizrulesengine.com/. The Data Administration
Newsletter (TDAN.com) © Copyright 1997-2004 - The Data
Administration Newsletter (TDAN.com) - Disclaimer |