RJLRef: IVPR\COOL\NotationForDataModels.txt Reasons for COOL's data modeling conventions: (a technical supplement to 20100907MetadataMeetingNotes.txt) RJLRef: IVPR\NotationForDataModels.txt (Also online at: www.cs.uml.edu/~lechner/COOL-GEN/NotationForDataModels.txt I. Front-end Presentation Benefits II. Back-end Implementation Benefits III. Desirable COOL framework extensions IV Online Links This note outlines the many advantages of COOL's graphic data modeling conventions for metadata AND data. It is also a case for transforming datasets to Third Normal Form (3NF). It remains to be seen if any of this is relevant to metadata for Weave I separate (I) presentation advantges and (II) implementation advantages. These are followed by (III) a list of desirable framework extensions The last section (IV) is online links to info on my home page and to David Hay's.site... I Front-end Presentation Benefits ========================= I(1) Notation is a proper subset of most diagram styles Supersets to which BDE is upward-compatible include ER diagrams for RDBs and UML class models. BDE has two minor differences from Oracle's Case Designer/Barker notation (see next 2 notes) which Hay recommends in http://www.cs.uml.edu/~lechner/DavidHay/DHay_ComparingDModTechniques.htm [or .pdf] I(2) High-density models are possible using mnemonic data-type abbreviations This feature could be used with all modeling notations, but it is ubiquitous in the metadata captured by the COOL-BDE block diagram editor and in the code generated from this metadata by COOL-GEN. Acronyms are used to label nodes on the data model so that many more nodes and edges can be drawn in smaller space. A glossary of node names is provided on a caption of the same diagram. [Namespaces are TBD to avoid name clashes to support multiple dataset sources and content.] I(3) Consistent use of the directed graph pattern in data models BDE uses a simple directed graph (nodes + edges) metamodel. An edge in metadata defines a one-to-many relation. A many-to-many relation is expanded to a pair of 1 to many relations and a child entity type whose instances include [references to] an ordered pair of nodes. A repeating group of attribute values is also modeled as a child entity type at the end of a 1 to many relationship. The metadata captures all relationship links as foreign keys (or XML IDRefs) in the child-set. These fkeys match primary key values (or XML IDs) in the parent nodes. These suffice to represent metamodels as well as data models for many COOL framework applications at UML/CS. I(4) Notation allows mixed use of both an arrow head or tail ('Crowsfoot'). These can represent two semantic options: e.g., forward- or reverse-directed access preference, or a selected spanning tree among the directed edges for XML or other encoding as a nested tree structure. (This direction indicator is not available in 1 to 1 relations with no arrowhead. If desired, BDE can split such a link and add an optional arrowhead at its center (--->--- or ---<---) for this purpose.) In BDE, an arrowhead or an arrow-tail ( 'CrowsFoot' ) at the child end of a relation link both imply the same multiplicity and permit many children per parent instance. An arrow-head or an arrow-tail (not both) is used exclusively in other graphic styles. (BDE is consistent with UML and others where an arrowhead implies maxmult = M; however some theory texts (e.g. Ullman) use it to denote just the the opposite: a single-valued functional range or instance is available in that direction (maxmult = 1).) I(5) Chen's 'diamond' symbol is superfluous Peter Chen's ERA data model placed a diamond on a relational link and allowed this symbol to have attributes. (Ternary relations connect three entity types to the diamond instead of two.) BDE's diagram convention (like many others) eliminates this symbol. .When relations have their own atributes (and especially when they have child-sets of their own) modelers must augment the 'diamond' to provide spacefor attributes that depend on both keys (e.g. by hanging them off of the diamond or in its bounding box. ). This box implies a third entity type which can hold with associated fkey attributes to reference parents the fkeys of participating parents (plus non-key attributes as well). If you erase the diamond and retain only its bounding box you get the familiar entity type box with two 1 to many links entering it - thus 'proving' sufficiecy of entity box/class symbol to represent all nodes. The diamond is redundant because it is implied in any entity that has at least two incoming arrows I(6) Multiplicity can be represented graphically and/or textually Our notation has a graphic block diagram Editor COOL-BDE, which can capture textual labels for min-max multiplicity range at both ends of a link. As on UML class diagrams, these can be specified by two meta-attributes: where minmult = 0 or 1 and maxmul = 1 or M. UML also permits the default shorthand 1 for 1..1 at source end and * for 0..* at destination end. (Some tools also support + for 1..* as well.) Our notation can also support dashed half-line to represent minmult=0 as in Oracle/Barker notation. Dashes mean that this relation MAY (or may NOT) return a value, because it is only a partial function. A solid half-line represents minmult=1 (a total function which always has a non-null return a value[s]. There is still the question of interpreting the half-line's placement: I recommend putting dashes at the destination (range or co-domain) end of a relation where UML places the label minmult=0. This is consistent with UML in placing both minmult and maxmult indicators at the same end where they indicate partial and multi-valued function returns, respectively Oracle/Barker places the dashed half-line at the source (or domain) end of the link to represent minmult=0 at the opposite destination end. This is consistent with 'May or Must participate' constraints, expressed in natural English, as used in some tools. Both interpretations are reasonable. A case can also be made for using a simple directed arrow link style (without minmult). One reason is that default values of 1..1 at source end and 0..* (or 0..1 with no arrowhead) at the destination end are almost always valid when many-to-many relations are expanded (see # ). (As a limiting case, the sparse relationship matrix becomes an empty set.) Another reason for ignoring the minmult notation is the risk of getting some of the minmult wrong. (Design reviews must add them before implementation.) The directed arrow default (maxmult only) should be interpreted as an 'exactly one to indefinitely many' relation link: 'For each source instance there are ZERO or more related destination instances, while for each destination instance there MUST be a related source instance'. This is consistent with the principle of 'Referential Integrity', which implies a single-valued total inverse function for evary 1 to many relation: Every existing child must have exactly one parent. [This default arrow link style implies that every child instance of an N-ary relation MUST have a valid non-null reference to each of its N parents. This does not support what are called functional or existence dependencies: For N = 2, a child may exist because a repeating group that is expanded to a child-set has sub-structure that includes some optional foreign keys. For N >= 3 a child may sometimes relate only K < N of its parent types, with null-valued references to another N-K optional but non-existent parent[s].] These are esoteric implications which are beyond the scope of Third Normal Form (3NF). I (7) Clear contrast between inheritance and associative links. Inheritance links must be easily distinguished from relational associations. For maximum comprehension, BDE uses an inverted T-shape (organization chart style) to draw each single-inheritance tree structure. This was common in early class diagrams. The T-bar hangs below the superclass and a link extends down to each subclass. [BDE uses the node/edge digraph model for all block diagrams (data, metadata, and state models). A work-around is needed for T-shape links: BDE adds a horizontal bar as a subtype of node, to provide a 'fanout' point for superclass partitioning into subclasses. The BDE2TTA (or b2t) converter from nodes to entities can recognize this bar and bypass it to declare subclasss relations. I(8) First Normal Form (1NF) Diagrams [For back-end advantages of 1NF, see II(2) The recommended data model diagram is in First Normal Form because it removes repeating groups from the internal content of entities. Flat tables or spreadsheet grids can display each table's content, It is possible to physically nest repeating groups of child instances below [at most one of] their parent instance[s]. Note that this non-nested layout is not the only choice: The existence of even one binary relation means that the model is not a tree structure: more than one one spanning tree exists. There is no need to choose one such tree (even in XML) for nesting, particularly if many links exist outside of the spanning tree. Then all links might as well be retained as fkey-valued reference types, not just those on the tree.. Retained fkeys should be redundantly represented as IDRef's, if XML is to represent the full RDB content; to ignore them gives XML format incomplete semantics and makes it useful only for transport until it is supplemented by metadata. BDE allows free-form text line comment per TA instance declaration inside a node. However, GEN supports only a few data types (see II(20). These have proven adequate for all COOL-GEN applications including GEN, BDE and LCP themselves. GEN model conventions expand repeating groups and many-to-many relations to force explicit design choices at the expense of fragmenting the data entities. This is a worthwhile tradeoff from two perspectives:integrating GEN with BDE, and unifying user-level with implementation-level models. It has the advantage of explicitly revealing structural features by exposing the semantic implications of internal repeating groups and multi-valued relations. This is significant both to users who specify a need to deal with multi-valued and/or partial access, and to programmers who implement their semantics, as implied by multiplicity meta-attributes. I(9) Reverse engineering risks Note that tools which reverse engineer class declarations or database schemas into graphic data models may or may not detect all code generation semantics. [IBM's Rhapsody tool had problems reverse-engineering data declarations in BDE's main memory database management code, because GEN's API converts external fkeys to internal list structure pointers. These added complexity to the diagrams that were reverse-engineered from the code base. Users and analysts may not need or want to know about this complexity.] I(10) One-to-many links have only meta-attributes. Many-to-many relations (which do have attributes) are transformed in GEN's 1NF data models. What remains are 1-to-1 or 1-to-many associations, whose instance-level attributes can be merged into the box representing the child entity type. [Links DO have class-level meta-attributes like multiplicity and role-names and constraints beyond third-normal-form (3NF); these can be stored as metadata of the fkey type or subclass in the TA-instance. They do not vary over instances in the dataset that store fkey values.] II. Back-end Implementation Benefits ============================ II (1) Support for automatic code generation COOL-GEN is a code generator whose input is metadata that can come from (or go to) BDE. The generated code has an API to import and export a relational database as a single dataset with instances of mixed table types in any order. The internal representation is a main-memory database wth linked lists to support navigation within and among all its related structs or objects. The most complete version (CHGEN) generates C-code. Two other GEN versions are GENCPP and GENJAVA. They all use the same metdata input. CHGEN and GENCPP are boot-strapped to use their own API to manage GEN input metadata. BDE is a CHGEN application, in the sense that its database operations use the GEN API, althnough its source code is C++. [BDE could use the slightly different API from GENCPP (TBD - see II(20). BDEJAVA has a different architecture and was never integrated with GENJAVA.] The same metadata provides input to all GEN versions. II(2) 1NF database (no list- or set-valued attributes) (for front-end advantages see I(8)) GEN's metadata and API support a First-Normal-Form (1NF) data model. 1NF removes all repeating groups of property values from all records and attaches them as child table types. This encourages uniform regular and predictable table structures that expose all sub-structure which is otherwise masked from the casual reader. It permits ASCII data and metadata files to be managed by the same line-oriented version control system along with source code. Import and export of datasets whose child tables are nested under parent ones is supported by GEN API utilities import pr_load() and export pr_dump() utilities. These advantages come at the expense of more numerous but less complex entity types. A repeating group is replaced by a set of child entity instances associated with each parent by fkeys. (The pr_load utility produced by GEN adds first- and last-child pointers in the parent, and parent and next-child references in each child during data import.) (See references in IV below.) II(3 ) Consistent coding conventions across all data and metadata. Applications can use the generated API to import/export 1NF RDBs with an ASCII text line for each table row. This API includes macros to access data fields and navigate relationship links. Uniform and consistent naming conventions based on the metadata make such applications easy to read, write and document. II(4) A common version-controlled table format (line-oriented ASCII text) All data and metadata have the same line-oriented, column-aligned record format, thanks to 1NF constraints. All of it can reside in the same version-controlled repository along with source code. Imported datasets can have mixed record types in any order. Exported datasets can use depth-first order (nested child-sets) or breadth-first order (sequential tables). II (5) Support for data model evolution Evolving schema versions may change over time. Relations can change from single- to multi-valued upon discovery of exceptional cases. Objects that were in 1 to 1 correspondence may have been declred with imbedded cross-references. Instead, a many-to-many associative entity may be needed to relate multiple instances indirectly. It may be preferable to err on the side of caution by creating this table early on. On the other hand, efficient or distributed access to datasets may require that both direct and indirect access cases coexist in the same relation. GEN can simply regenerate the API after metadata revision, but changing the application code to match the new API is still a manual problem. Writing converters for existing datasets is also a manual problem. Associative entities are desirable because they indirectly couple related parents. This child instance captures mutual dependencies among parents without changing the data content of either parent. A 1 to 1 isomorphic relationship link (maxmult = minmult =1 at both ends) probably relates two independent peers (e.g. synonyms, code conversions, different contexts, or remote data access). Although each related instnace could contain an fkey to the other one, the benefits of decoupling may justify placing an associative entity between them for indirect traversal. Declaring a set of references to the other parent in both parent types will not support attributes that depend on both parents. These are replaced by a child-set containing one tuple of references. Child instances become part of a sparse matrix relational structure. For a 1 to 1 relation, minmult=1 suggests the existence of an independent parent, and minmult=0 at the other end implies a dependent child[-set]. (A parent can exist without children, but no child can exist without a parent). TO maintain a consistent topological ordering of containment relations I recommend placing the dependent type at a lower level than the independent one. III. Desirable COOL framework extensions =============================== These are motivated by potential OIC applications and by past projects for which designs and partial code exists. III(1) Augment GEN to produce XML inport/export format. XML is not desirable for local storage or programming, but it is useful for data transport. A generic converter between 3NF tables and XML should be simple to write: it could interpret metadata that is input by GEN's API to read compatible datasets and produce the equivalent XML, and vice versa. An incomplete prototype exists in project GENXML, which tried (without success) to produce an XML formatted metadata from metadata tables SV, TT and TA. This should be trivially possible for any data model and its metadata as processed by GEN. III(2 ) Extend metadata to support Views (see IV(4)) (See IV(4) for these metamodel extensions). This requires adding tables ViewVersion (VV) and TableStatistics (TS) and the metaschema relationships [SV]--->[VV]----->[VT]>------[TT]; (For each schema version there can be multiple ViewVersions defined as subsets of the TT-child-set of SV; Table VT captures the many-to-many association between VV and TT. The Admin Console GUI could select TT rows to populate VV and VT from the data model diagram. (Projections onto subsets of TableAttributes (TA) are not defined here but ViewAttribute (VA) could be added to associate VT with [a subset of] the TA chidren of one TT.) The big problem with subschema views is maintaining the dataset to support View-based updates. How should the missing columns be filled in if the View does not include them? Presumably, datasets are read-only so updates are not needed in OIC/Weave. [Caveat: This type of View does not support SQL Joins that compose new tables from columns of related Tables.] III(3) Extend metadata to define Candidate Keys (see IV(4) (See IV(4) for these metamodel extensions). This requires adding a new table CK as a child of TT, and its child table CA to associate each CK with the set of attributes (TA-rows) that define it. Note these attributes may reside in multiple TT parents (typically along a hierarchical chain of ancestor levels). One purpose of Candidate Keys is to guide SQL query formulation. User-visible keys may be required for this. Another important use of CK definitions is to guide a verifier that ensures that candidate keys are unique over all TT-rows whichi they purport to identify, whenever a dataset is imported. III(4) Augment GENCPP with object fkey and simple value subclasses of TA. Metadata type TA (TableAttribute) instances in principle have two subclasses. These capture simple value properties and foreign key reference properties, respectively. Conceptually, ForeignKey FK is that subclass of TA which holds an entity reference datatype. BDE is written in C++ but uses the library of .c and .h code produced by CHGEN from its data model (without inheritance). Therefore, its fkeys are simply TA's with a metadata tagfield that asserts is_key = 1 (or s or c, for a superclass). A new BDE version which used the C++ code API from GENCPP could introduce a new FK subclass of TA. III(5) Support additional non-key data types declarable in TA. Non-key attributes declared in table TA are simple values of three distinct types: int, float, word and textline. A word is a fixed-length char array with no internal whitespace. Textline is a variable-length char array with no line-feeds, but is only allowed in the last field where it is line-terminated. (Preceding key and non-key fields are white-space terminated.) All record types declared in table TT contain an initial pkey, optional fkeys, non-key numeric and word values and the possibly empty text string as its last TA-child. This format is exported.by the GEN API utility pr_dump(). . Numeric and char[k] variables are declared by GEN for conversion by the import utility pr_load(). Single-line records without quoted strings or multi-line values simplify tabular display which also improves readability. III(6) Support 'tokenized' multiline text and generate concordances. (See the last 2 slides in the BDE User Guide III(2)) . The RDB's managed by GEN-produced code are line-oriented for readability and performance. In particular, BDE uses the GEN API to manage the metadata it collects from diagrams. However, multi-line text may be needed for human consumption, particularly for OIC metadata derived from ontologies, and for Weave datasets. GEN's API already supports data models whose multi-line paragraphs are stored in a child-set of text-lines. Alternately it can be parsed into 'token' (TK) instances which associate each word position in the text with a data dictionary entry. The token table is a binary relation between word positions in a database record and words in a dictionary. A concordance of word usage is available by following thie relation in the revese direction. COOL's SYMTXT and BDESYM prototypes extended BDE's data model and source code to lexically analyze the content of a text paragraph. The text was replaced by creating dictionary entries and a binary 'token' relation that associated each text position with an fkeyh to a dictionary entry (or to a punctuation token). The grammar for parsing text could be application-specific. III(7) Support Namespace extensions to metadata and the GEN API Separate namespaces must be associataed with Schema Versions to avoid naming ambigities to manage distributed databases and multiple data model sources. Name clashes are prohibited in table short name mnemonics as well. COOL projects so far have avoided these by keeping all TT abbreviations disjoint. This will not be possible without dispersing 'Use Namespace ' statements throughout the source code. This assumes that the SV instance remains constant during data operations that do not cross SV or data model boundaries. The problem gets worse if tables from multiple RDB and metadata schemas are to be merged into one file or processed concurrently. This requirees the space overhead of a namespace mnemonic prefix tag that is prepended to each pkey and fkey TT-type tag and instance number in eacdh dataset to be processed concurrently. GEN's API code parametrizes the key encoding process, which simplifies introducing this prefix. III(8) Support candidate, preferred, and alternate keys: GEN produces code in which integer-valued surrogate pkeys and fkeys are used exclusively and stored explicitly. This improves search efficiency for surrogate key comparison, provided their values are known. Users re aware of composite keys comprised of multiple attributes. These are called candidate keys; some are composites of multiple separate partial key fields (e.g., in a hierarchical datacube dimension, selectors at each level of the tree may be combined into a unique key for a fine-grained record instance; in a family services database, a temporary family unit might be identifiable by both parent names plus an interval of time when they were living together.) To implement SQL queries, various candidate keys might be used as row-selectors or search fields within child tables. Pre-defined keys could be used for regression testing of database queries. David Hay's paper IV(6) describes different ways in which data models can be annotated to identify such preferred and alternate keys. GEN model diagrams visualize the multi-level network which must be navigated to locate an entity instance. Application programmers then write the navigation code. GEN metadata can be augmented with candidate key metadata by adding tables CK and CA (a CK to TA association). Appending these does not change existing metadata tables TT and TA. They provide a meta-database that stores candidate key info. This database can be populated in an eager or lazy fashion, manually or graphically or by a process of discovery. A proposal to add alternate keys and namespace extensions to GEN metadata is at IV(4): http://www.cs.uml.edu/~lechner/COOL-GEN/TTTA_metadata_jk_rl.ppt It adds to the existing metaschema [SV]-->[TT]--->[TA] the following additional tables and links: ([TT])---->[CK]---->[CA]<----([TA]) Each table type TT owns a new set of candidate keys in CK. Each CK entry has a place (and optional name) for each useful candidate key of the class defined in its parent TT-instance. Each CK is defined by one child-set of CA-entries. CA defines the set of TA attributes whose values define this key. (These key components may range over all TT's in the data model.). The new CA child-set of a TA also define the set of candidate keys in which that particular TA participates. IV Online References: ===================================== [An earlier draft was emailed to CurranK and JimG) A vast amount of documentation exists on COOL Framework components, thanks to numerous student projects at UML/CS. Many hundreds of students played some role and are responsible for multiple prototype versions and a few robust and stable ones..These fill many directories under my homepage http://www.cs.uml.edu/~lechner. (another advantageous COOL feature :-) This can be a blessing if you find what you want, or a curse when you don't see the forest for the trees. I began a partial reorganization of this tree (from projects within each semester to semesters within projects). Some back-references exist and some are TBD or lost.. GEN's metadata and API support a First-Normal-Form (1NF) RDB. 1NF requires repeating groups of property values to be removed from all records and tabulated separately. This encourages uniform regular and predictable table structures that expose all sub-structure which is otherwise masked from the casual reader. It permits ASCII data and metadata files to be managed by the same line-oriented version control system along with source code. These advantages come at the expense of more numerous but less complex entity types. A repeating group is replaced by a set of child entity instances associated with each parent by fkeys. (The pr_load utility produced by GEN adds first- and last-child pointers in the parent, and parent and next-child references in each child during data import.) - See IV(4) Here are some useful links on my homepage.http://www.cs.uml.edu/~lechner/: IV(1) A list of API user guides for CHGEN versions V8-13 is at http://www.cs.uml.edu/~lechner/COOL-GEN/genDocsList.txt IV(2) A Users Guide to BDE is at http://www.cs.uml.edu/~lechner/COOL-BDE/BDE_UG2000/bdeUG_2000.htm IV(3) Slides showing the evolution of BDE's graphic data model are at http://www.cs.uml.edu/~lechner/COOL-BDE/BDEUG_2005DataModels.ppt IV(4) For some metrics and pairings of GEN internal subroutines and the pr_* routines which they generate, see http://www.cs.uml.edu/~lechner/COOL-GEN/chgenv12r1.doc IV(5) For details of the metadata content produced by CHGEN and how it is used, see the Data Modeling Tutorial at http://www.cs.uml.edu/~lechner/DataModels2htm/index.htm IV(6) Possible extensions to BDE's data model for subschema views, candidate keys, and namespaces http://www.cs.uml.edu/~lechner/COOL-GEN/TTTA_metadata_jk_rl.ppt Links to many David Hay papers are at http://www.cs.uml.edu/~lechner/DavidHay IV(7) I recommend Hay's comparison of alternate data modeling styles: http://www.cs.uml.edu/~lechner/DavidHay/DHay_ComparingDModTechniques.pdf [and .htm] (52 pp) Hay recommends the Barker/Oracle notation for analysts who must communicate with users, and the UML notation for designers and programmers. IV(8) In particular, http://www.cs.uml.edu/~lechner/DavidHay//DHayReply030907.doc expresses his regrets for not knowing more about Bachman's DSD, and expresses his frustraton at OMG for their UML class diagrams which 'reinventing the wheel' for analysts.). GEN data model diagrams are close to Barker notation. They are a superset of Bachman's Data Structure Diagram (DSD) notation. DSD's were used in all network database products, which preceded relational ones. [I worked at Honeywell IS Chen visited there one summer in the 1980's. He learned from Bachman.] . DSD's are unfairly criticized as being 'too close to the metal' of hard disk storage allocation. I believe DSD's are still the most clear and simple way to describe the properties of relational databases to users as well as implementers. DSD's arguably simplify the analyst's job if they become a common metamodel that is shared with users and implementers.