This report is not official - it merely adds bracketed comments about possible clarifications of gendb. See my message re: these comments at the end. -RJL 8/92.

GENDB - A Data-Modeling Tool For Database Application Development
=================================================================

by Stephen C. Smith & Craig E. Smith
University of Massachusetts - Lowell
May 1, 1992

Abstract -

GENDB is a library of C++ routines designed to allow programmers to develop database applications. It provides the convenience of the relational model along with the performance advantages of network databases. The programmer can modify the data model (the Schema) on-the-fly, since it is not static or compiled into the application. Child tables can inherit attributes from parent tables, either automatically or explicitly as specified by the programmer. It remains to be seen how GENDB performs in comparison to its predecessor, CHGEN. Although CHGEN featured schema-specific compiled code, it produced very large programs, and was rather inflexible and unforgiving.

Background -

CHGEN was developed over three years' time at the University of Massachusetts - Lowell (formerly the University of Lowell). T.C. Cheng and C.Y. Chou created the initial version during the summer of 1990. S.C. Smith and C.E. Smith continued development during the spring and summer of 1991. The name CHGEN stands for "C" and "H" generator, since it started as a sub-project under a much larger CASE tool project. The purpose of CHGEN was to aid programmers in developing database code for the CASE tool project.

In the Spring of 1992, Smith and Smith began a complete rewrite of CHGEN, using the object oriented language C++. This version, called GENDB, would be completely dynamic and run-time interpreted; there would be no compiled schema-specific code. All access to the stored data would be through access routines.
The object oriented features of the C++ language would be exploited to simplify the application programmer's task.

The Data Model -

The data-model used is that of a network database, where tables are arranged in a parent-child tree. Implicitly added to each table's explicit definition are relation chains. These chains are collections of pointers that allow programmers to freely traverse between records of related data in the tables. The model requires that, when the relationships are drawn as a tree of nodes (tables), parents are above and children below. The cardinality of a relation can be 1-1 or 1-M, but never M-M. This restriction is common in the data-modeling literature, and an M-M relation can easily be replaced by two 1-M relations. In addition, the parent side can never be the M side. The relation is explicitly defined in the child table; CHGEN implicitly defines it in the parent table's internal implementation.

Finally, each record in a table is assigned a unique object identifier, called the primary key (or simply pkey). The pkey is an 8 character alphanumeric ASCII string. The first 2 letters identify the table the record belongs to (called the table abbreviation). The next two digits are the version of the dataset the record belongs to, and the final 4 digits are the record number, or sequence number, of the record within the table and version. For example, the pkey "AA030042" belongs to record number 42, in version 3 of the data within the "AA" table. Do not assume that the record number implies the records are stored in an array. In fact, a doubly linked pre-sorted list is used. The number simply comes from an ever increasing counter used to assign record numbers as records are created (within each table/version set).

An Example -

To support further examples, the following sample schema will be used.
This schema (students.sch) is taken from the CHGEN User's Manual:

    students AA   /* table AA holds student records */
    {
      AAid   student_id    c8   1   /* primary (surrogate) key field */
      fname  first_name    c30  0   /* first name of student */
      lname  last_name     c30  0   /* last name of student */
    }

    roster BB     /* table BB associates students with courses */
    {
      BBid   roster_id     c8   1   /* primary (surrogate) key field */
      AAid   student_id    c8   1   /* foreign key of student record */
      cname  course_name   c30  0   /* catalog name of course */
    };

Schema Files -

A schema file is used to define all tables in the data model. In the example, each "block" defines a table. The first line for each table identifies the name of the table and its (all too important) two character abbreviation (table-id). Each line within the braces identifies a field within the table. The first field in the line specifies the name of the field. The second field is the field's alternate name, or language-specific type name (currently unused). The third field is the type of the field (fixed length character string, integer, float, date, or text). The last (fourth) field indicates whether the field is a data field (0) or a key field (1).

The first field in every table must be the pkey for that table. Next come the foreign keys, each of which specifies a relationship chain upwards in the schema to a parent table. For optimum run-time performance, it is recommended (although not required) that no forward references exist in the schema. That is, foreign keys should always point back (upwards) to previously defined tables. Finally, a naming convention is enforced for foreign key field names. The name must start with a table's abbreviation, and, if multiple foreign key fields point to the same table, their names must end with a digit. Some examples of valid foreign keys are AAid, AAptr, AAid1, and AAid2. By convention, data fields are listed as the last fields in the table.

Note: The fourth field (the is-key field) actually takes on values other than 0 and 1.
Information about the relation itself is coded into this value, such as cardinality, conditionality, and whether code should be generated for this field to handle forward references. These concepts are discussed fully in the CHGEN User's Manual, and will not be addressed further in this paper.

Data Files -

The data that makes up the database is stored in ASCII files. Each line in the file represents a single record in a table. Since the first field in each table is always the record's pkey, the system can determine which table the record belongs to. Next, it parses the rest of the data line into the fields of the properly typed record it is building up in memory. After the full record has been read, it is linked into the system data-structures representing the database. If any relationships exist to other tables, the appropriate records are found and linked in.

A number of restrictions apply to the format of the data. First, whitespace (tabs, blanks, and form-feeds) is ignored between records and fields. Because of this, fields cannot contain spaces. This is a serious restriction. For example, the sample schema could not store a single "name" field, since the space between first and last names would be interpreted as a field break. To help alleviate this, the last field of any table is considered "free form". Specifically, everything between the first character of the field (in the data-file) and the end-of-line character is considered to be the contents of the field. Thus, developers typically make the last field some form of record description or comment.

A sample data file for the sample schema follows:

    AA030001 John Doe
    AA030002 Sue Smith
    BB030001 AA030001 Calculus
    BB030002 AA030001 Physics
    BB030003 AA030002 Calculus

CHGEN Programming Model -

Developing programs for use with CHGEN is a difficult task. First, a schema file is created manually. Next, CHGEN is run on this schema file.
This produces a collection of "C" modules that contain code (specific to that schema) for manipulating data of that schema. Routines are provided to load the data from a file, dump it back out to a file, add rows to tables, delete rows from tables, and traverse the data in various ways. Users can simply traverse pointer chains (always doubly linked) to navigate around related data. Macros are provided to simplify this, so the programmer does not have to know the actual internal pointer names and meanings.

Advantages of CHGEN -

The biggest advantage of CHGEN is that schema-specific code is generated. Actual "C" structures are used to represent records for each table, and the data manipulation routines are custom written (by CHGEN itself) to manage the links between related data (keeping all the pointers up to date). For example, when a record is deleted, the delete routine has a custom block of code for deletes from that table. This code specifies exactly what other tables, if any, may have pointers into this record. The code then goes out and un-links this record from its relatives.

Another advantage is that the programmer has direct access to the data. No intermediate routines need to be called to work with the data, so performance is not sacrificed.

One of CHGEN's most useful aspects is that of "access macros". Access macros are generated by CHGEN, enumerating all unambiguous paths upward from child to parent tables. This provides a form of attribute inheritance. When working with a lower level (child) table, if the code needs to obtain an attribute from a table above it (even if it is not the direct parent), it simply places the access macro in front of the attribute in the code. The macro expands to a chain of parent pointer references whose length depends on the path length up to that parent. This will fail if any intermediate parents are not loaded in memory (specifically, the program will crash after trying to traverse a null pointer).
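The expansion described above can be pictured with a small sketch. It is written in C++ for consistency with the rest of this report, although CHGEN itself generated "C". The record layouts, the grandchild table CC, the macro name CC_TO_AA and the guarded helper are all hypothetical; CHGEN's real generated names are not shown in this report.

```cpp
#include <cassert>
#include <cstring>   // strcmp, used in the usage comment below

// Hypothetical record layouts (not CHGEN's actual generated structs).
struct AA_rec { char AAid[9]; char lname[31]; };                 // students
struct BB_rec { char BBid[9]; AA_rec *AAptr; char cname[31]; };  // roster, child of AA
struct CC_rec { char CCid[9]; BB_rec *BBptr; char grade[3]; };   // assumed child of BB

// An access macro in the spirit of the text: it expands to a chain of
// parent-pointer dereferences, one per level between child and ancestor.
// It crashes if any intermediate parent pointer is null.
#define CC_TO_AA(c) ((c)->BBptr->AAptr)

// A guarded variant (an assumption, not something CHGEN is said to provide):
// returns nullptr instead of dereferencing a missing intermediate parent.
inline const AA_rec *cc_to_aa_checked(const CC_rec *c) {
    if (c == nullptr || c->BBptr == nullptr) return nullptr;
    return c->BBptr->AAptr;
}

// Usage: strcmp(CC_TO_AA(&some_cc)->lname, "Doe") compares the inherited
// last name without the caller naming BBptr or AAptr directly.
```

If a table were later inserted between BB and AA, only the macro body would change, which is the limited schema independence discussed next.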
The advantage to using access macros is that they provide the programmer with a limited form of schema independence. If more tables develop along the path up to the parent, re-running CHGEN will produce new access macros, and the application code will not have to be changed.

Disadvantages of CHGEN -

Unfortunately, each of CHGEN's advantages can also be shown to be a disadvantage. Because schema specific code is generated, the size of the generated code grows rapidly as the size of the schema grows (either in number of tables or in number of relationships between tables). This expansion seriously limits how much logic can go into the generated code. For example, suppose the repeating piece of code needed to add a row to a table is 10 lines long. For 4 tables, it expands to 40 lines. Now suppose the underlying logic expands to 50 lines (to support, say, storing the records in pkey sorted order automatically). Suddenly the generated code expands to 200 lines! The authors consider this problem the most severe, and it inspired the creation of GENDB.

Allowing programmers direct access to the data is also a disadvantage. There is no way for the system to enforce many things, such as indices, access modes (readonly, nullable) and so on. In fact, a programmer could simply change primary keys of records behind the system's back. The overall effect is that developing and debugging applications becomes extremely difficult. Even if logic to enforce certain conditions could be developed, it would again lead to the growing code size problem discussed earlier.

The final disadvantage relates to the access macros. First, CHGEN only provides macros for unambiguous paths upward in the schema. If there are multiple paths to a parent, CHGEN cannot resolve the ambiguity, even when all of them lead to the same goal. The end result is that few macros are generated and they are usually short. The authors also argue that the form of schema independence supported by the macros is not usually very useful.
Specifically, it only helps when tables along the access path are added or removed. However, adding a table usually means changes for the application anyway, or, more likely, makes the path ambiguous. Similarly, removing a table may actually remove the destination of the path in the first place, forcing the application to change. The point we wish to make is that any change to the schema requires programmers to, at a minimum, review the applications involved.

Overview of GENDB -

GENDB represents an attempt to develop a completely new implementation of the functionality originally provided by CHGEN. C++ was chosen as the development language. No attempt was made to remain syntactically compatible with CHGEN applications, since the language changed and object oriented features were to be used. It was important, however, to support the same fundamental features that CHGEN supported. The goals of the system were to overcome as many of the drawbacks of CHGEN as possible, while providing acceptable performance.

C++ objects are used to represent certain objects in GENDB. For example, the "database" object is the highest level object, and it holds the schema definition. The object's constructor takes in a schema filename and creates an internal schema data model. Similarly, it provides methods to load the database from a file and to dump it to a file. Within the database object are several "table" objects, one for each table in the schema. The object used most often by programmers is the "iterator". An iterator is the programmer's only means of accessing the data in the database. One is declared for each table, although several can be declared for the same table when performing complex operations. Iterators will be discussed in great detail during the remainder of this paper.

The key difference between GENDB and CHGEN is that GENDB is fully dynamic: there is no schema-specific pre-compiled code.
Instead, a run-time model of the schema is developed, and generic logic is used to manipulate data described by that model. The programmer never has direct access to the data, in the spirit of data-hiding and abstract data types. In this way, intermediate logic can be placed between the user and the data, providing a means to support access control, indices, etc. Because there is no schema specific code, the size of the system routines is static, and does not change with the complexity of the schema. Other previously impossible ideas are now possible, such as loading multiple databases at once.

As always, there is a price to pay: performance. GENDB does not perform (speed-wise) nearly as well as CHGEN. For this reason, performance optimization is always a key issue during GENDB development. It remains to be seen whether the added power, simplicity and flexibility of GENDB outweigh its loss of performance relative to CHGEN.

GENDB Application Programming
-----------------------------

Schema File Format -

The first step in application development is to create the Schema file. The format of schema files is slightly different from that of CHGEN.
>I would say completely different! and that is too bad. There should be NO
>differences; each should parse the same .sch file, ignore irrelevant fields.
Using the example shown earlier in this paper, consider the following schema file (stored in, say, students.sch) :

    create table students AA
    {
      first_name   c(30)
      last_name    c(30)
    }

    create table roster BB
    {
      member       parent(students,classes,1,100)
      course_name  c(30)
    };

>You left off the explanatory dictionary comment again! NO GOOD!
>The 2nd arg of parent(.) is not explained or defined below.
>What justifies the omission of other cardinality meta-attribs of fkeys?

The words "create table" are required; they indicate the beginning of a
>The word 'create' is wasteful and misleading. It should be eliminated.
table's definition. Next is the name of the table, followed by its two character abbreviation.
As mentioned earlier, this abbreviation is
>ttabbrev is CURRENTLY 2 chars; it should be #defined so it can be changed.
>If internal keys are integers, external keys can be 3 chars.
>Version number is still missing!
>What are your plans for versions and views?
extremely important. The fields that make up the table are now listed between the curly brace pair. First is listed the field's name; next is its type. In the example, the first_name field of the students table is defined to be of type c(30). This is simply a character string of any length from 0 (null string) to 30 characters. This is a true 30 characters: the system separately allocates space for the terminating null character. Other types include "i" (integer) and "f" (floating point). User defined types
>Are different sizes avail? e.g. i2, i4, f4, d8?
can also be created.

It is important to mention at this point that the programmer does not explicitly assign a primary key (pkey) field for each table. The system creates one automatically for each table, called pkey. It can only be accessed through access routines, and cannot be manipulated as a data field.
>Is it visible after pr_dump? Is it preserved thru pr_dump and pr_load?

The roster table demonstrates how relationships are described in the schema. The field member is a relation field, not a data field. It points to a parent table, students (the first parameter within the parentheses). Up in the students table, this relationship can be accessed as a child, under the name classes (the second parameter). In fact, a relationship field is actually added to the students table under that name automatically. The names given to the relationship fields do not have to be different from the table names. It is wise, however, to choose names that describe the relationship itself.

The two numbers shown as the third and fourth parameters represent the cardinality of the relationship. The first number is the cardinality at the
>No - it is the minimum cardinality: the max card is always 1.
parent side.
It must be the value 0 or 1. A 0 implies that children can still exist without the parent. The second number represents the cardinality at the child side. It can be any non-negative number. A 0 implies that children are not mandatory. A value of 100 means that up to 100
>But how are BOTH min and max childcounts defined simultaneously??????
>I believe you need TWO child-end card fields
>And how is occupancy definable as sparse (list) or dense (array)???
>Two dense parents (a direct product relation) allow a matrix rep.
children can exist for each parent. Please note, however, that cardinality is not enforced in the current implementation of GENDB (i.e., the values in the schema are parsed for correctness, but then ignored).

Data File Format -

Data file format is the same under GENDB as it was for CHGEN.
>Can it support the meta-schema? i.e. old tables TT and TA as data?
Each line of a datafile represents one record in a table in the schema. Blank lines and other extraneous whitespace are ignored. Whitespace is the separator of fields within a record (thus, fields cannot hold whitespace within them, except for the last field in a record). The fields expected to be found in the data file correspond exactly to those specified in the schema file, except that the record's primary key is always listed first. When a parent relation is specified in a table definition, the parent's primary key must be specified in the data file.
>Unclear! add: as a fkey in each child table row
The rules for primary keys (and hence, foreign keys) are the same as in CHGEN, although versions are not currently supported. It is suggested that "01" be used as the version (third and fourth positions in pkeys) for
>Why? are views and version numbers difficult extensions?
>(bytes 3-4 may become bytes 4-6 in a 12-byte external key format.)
all data.
The example datafile given earlier in this paper is repeated here for convenience (as version one of the dataset, so both the primary keys and the foreign keys carry version "01") :

    AA010001 John Doe
    AA010002 Sue Smith
    BB010001 AA010001 Calculus
    BB010002 AA010001 Physics
    BB010003 AA010002 Calculus

Defining the Schema and Loading / Dumping the Data -

To define the schema in your program, you declare a C++ variable, an instance of the database object type. For example, in the program's global variable section, the following statement is used:

    database db("students","students.sch");

The first parameter is simply a name for the database, and doesn't really affect anything. The second parameter is the name of the file that describes the schema. When this variable is declared, the system will read this file and build a data model for the database in memory. The resulting database object (db in our example) is the programmer's "hook"
>You have a bad habit of omitting the apostrophe so every possessive
>adjective becomes a plural noun!
into the data model, and will be passed around in subsequent calls and
>What data model? the metamodel is not addressable, is it?
declarations.

To load the datafile into memory, the Load method of the database object is used. This is an executable statement, so it appears in the body of the program, not in a declaration section. In our example, this might be:

    db.Load("students.dat");

>Your OOP naming conventions (UC/lc use) should be defined up top
This statement will open the specified file and read record values from it into memory, filling the tables of the schema. You are not required to store all of the data in one file: multiple calls to Load are allowed. Duplicate pkeys, however, will generate an error.
>What error: where documented? What about empty .dat files? undeclared tables?
>There is no mention of gathering statistics. How does pr_gen_pkey know?
>if you delegate new key generation to c++, how will version numbers be added?
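As a rough illustration of the record format that Load consumes, the sketch below splits a pkey into its three parts, splits a data line into whitespace-separated fields with a free-form last field, and detects duplicate pkeys. It is not GENDB's loader: the names Pkey, parse_pkey, split_record and register_pkey are hypothetical, and the field count is passed in by hand where GENDB would take it from the schema.

```cpp
#include <cassert>
#include <cctype>
#include <set>
#include <string>
#include <vector>

// The three parts of an 8-character pkey such as "AA030042".
struct Pkey { std::string table; int version; int sequence; };

// Split a pkey into table abbreviation (2 chars), version (2 digits) and
// sequence number (4 digits). Returns false if the shape is wrong.
bool parse_pkey(const std::string &s, Pkey &out) {
    if (s.size() != 8) return false;
    for (std::size_t i = 2; i < 8; ++i)
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    out.table = s.substr(0, 2);
    out.version = std::stoi(s.substr(2, 2));
    out.sequence = std::stoi(s.substr(4, 4));
    return true;
}

// Split one data line (without its newline) into at most nfields fields.
// Fields are separated by whitespace, except that the last field is
// "free form" and runs to the end of the line, spaces included.
std::vector<std::string> split_record(const std::string &line, std::size_t nfields) {
    std::vector<std::string> fields;
    std::size_t pos = 0;
    while (fields.size() < nfields) {
        while (pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos]))) ++pos;
        if (pos >= line.size()) break;               // ran out of data
        if (fields.size() + 1 == nfields) {          // last field: take the rest
            fields.push_back(line.substr(pos));
            break;
        }
        std::size_t end = pos;
        while (end < line.size() && !std::isspace(static_cast<unsigned char>(line[end]))) ++end;
        fields.push_back(line.substr(pos, end - pos));
        pos = end;
    }
    return fields;
}

// Record a pkey as loaded; returns false if it was already present,
// which is the situation the text says produces an error.
bool register_pkey(std::set<std::string> &seen, const std::string &pkey) {
    return seen.insert(pkey).second;
}
```

For a roster line, nfields would be 3 (pkey, the member foreign key, course_name), so a course name containing spaces survives intact because it is the last field.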
After the program has changed the data, the programmer may wish to dump the data back out to a file (saving the changes made). The Dump method is used for this:

    db.Dump("students.out");

This will write out all of the records in all of the tables to the specified file. A second (optional) parameter specifies the file mode. This is the same as the mode specifier for the "C" language fopen() call. For example, "a" means to append at the end of an existing file.
>What happens if the input file is to be rewritten? a warning?
>Is a mode param of the .Load method appropriate? It would serve as a lock
>(or unlock) mechanism to enable/disable write-protection on the input file.

Iterators -

The "iterator" is the single most important object type in the GENDB system. An iterator is a programmer's "hook" into tables and individual records in tables. Iterators are declared in a program's declaration section (local iterators within functions are allowed and very useful as well), usually where the database is declared. For example, these lines might follow the database declaration referenced previously in the example:

    iterator AA(&db,"students");
    iterator BB(&db,"roster");

The first line declares an iterator called AA for the students table, in the previously defined database db (passed by address). The name of the iterator can be any valid variable name, but we prefer to use the table's abbreviation. This is strictly out of habit from previous CHGEN programming. The table's name could also be used.

The purpose of an iterator is to be the single point of access to data within a table. It provides methods to append and delete records, read and write the values of fields, and traverse along relationship chains. An iterator is tied to one, and only one, table. In addition, it maintains a "context" within the table. This can be thought of as an index in an array: it marks a certain point in the array, and implies a current record.
When you declare the iterator, its context is reset to the
>the possessive "its" should NOT have an apostrophe; "it's" means "it is"
beginning of the table it is connected to (i.e., the first record in primary key sorted order). After a reset, the current record is the first record. Calling the Next method advances the context to the next record. The Done method checks to see if the context points past the last record in the table. The following complete program demonstrates counting the number of records in the students table:

    #include <stdio.h>
    /* the GENDB header is also needed; its name is not given in this report */

    int main()
    {
      database db("students","students.sch");
      iterator AA(&db,"students");
      int count;

      db.Load("students.dat");
      count = 0;
      AA.Reset();
      while (!AA.Done())
      {
        count++;
        AA.Next();
      };
      printf("The number of rows in the students table is %d\n",count);
      return 0;
    }

Looping Macros -

Because traversing a table in this fashion is very common, a convenience macro is provided. This macro simply handles the looping. In this example, the six lines after count = 0; (the Reset line down to the closing brace of the while loop) can be replaced with:

    table_loop(AA)
      count++;

Note that if multiple statements needed to be run within the loop, curly braces could be used. They would be redundant, however, in this example. All of the logic to support the loop is within the macro. To remove any mystery from the macro, here is its definition:

    #define table_loop(x) for (x.Reset(); !x.Done(); x.Next())

As you can see, it is simply a for loop without a body; the statement that follows the macro becomes the body. It is important to understand that it is in fact a for loop, so the break and continue statements apply. This use of the for loop in a macro is an incredibly powerful programming technique!

Reading the Data in Records -

As mentioned previously, an iterator identifies a record as being the current record for that iterator. While a record is current, many other iterator methods can be applied to that record. For example, the Delete method deletes the current record.
It is also at this point that the user can access the values of the individual fields in the record. An "access" method is provided for each possible field type. For example, to print out the last name of each student in the above program, the loop might be:

    table_loop(AA)
    {
      count++;
      printf("Student last name is %s\n",AA.StrVal("last_name"));
    };

In this example, we wish to get the value of the last_name field of the current student record being traversed. Since the field is a string type field, the StrVal method is used. The method itself takes in the name of the field being accessed (as a quoted string, since it is not technically a variable in the program). The method returns a pointer to the string value itself. Although this is not enforced, it is recommended that you do not change the data pointed at directly. Other methods are provided to write into field values. If you write through the pointer instead of using those methods, the heap storage (memory) of your program can be corrupted, and your program will probably crash.

Other field access methods are IntVal for integer fields, FloatVal for floating point fields, and UserVal for user defined data type fields. Accessing the values of non-data fields (relationship fields) will be discussed later.

Writing Data into Records -

Just as for reading fields, methods are provided to write into the values of fields. In this case, however, only one method is needed, called Set. For example, the following loop code will traverse the student table as before. This time, however, the goal is to change Sue Smith's first name to Susan:

    table_loop(AA)
    {
      count++;
      if ((strcmp(AA.StrVal("last_name"),"Smith") == 0) &&
          (strcmp(AA.StrVal("first_name"),"Sue") == 0))
        AA.Set("first_name","Susan");
    };

It is important that you understand this example at this point. If you do not, please take the time to review the examples.

Technical Note: You may be wondering why there are four methods to read values of fields, but only one is needed to write values.
The answer lies in the limits of function overloading in C++. There are in fact four versions of the Set method, each distinguished from the others by the types of its parameters. For reading, however, four versions of a single Value method would all take one parameter of the same type and differ only in return type, and C++ does not allow overloading on return type alone. The effect can be approximated (for example, by returning an object that converts itself to whatever type the caller uses), but this is not recommended, since several (very typical) cases would remain ambiguous to the compiler. For example, if there were a Value(fld_name) method, consider the following:

    printf("the value is %s\n",AA.Value("last_name"));

Even though the programmer or anyone else can tell that the "string" version of the method should be called, the compiler cannot (it doesn't correlate the method call with the %s specifier). To resolve it, the programmer would have to manually type-cast the value returned by the call.
>do you mean type-cast the value returned by the AA.Value call? Please say so.

Accessing Relationship Fields -

The previous sections discussed how to access the values of data fields. In the example schema, however, the roster table has a parent relation up to the students table. Several methods are provided for dealing with relation fields.

The most important relation concept is the child loop. As mentioned earlier, an iterator is connected to a table, and maintains a current record concept. Actually, an iterator has another purpose: to traverse a chain of children under a parent record. When the Reset method is called, optional parameters can be used to cause the iterator to traverse children instead of the entire table. For example:

    BB.Reset(&AA,"member");

This statement sets the current record of the BB iterator to point to the first child under the current AA record, using the relation field member (got that?). The BB and AA iterators are now connected to some extent.
When the Next method is called, it moves the BB current record pointer to the next child under AA. Similarly, the Done method checks to see if the last child has been passed (i.e., the context is beyond the end of the chain). Again, a convenience macro is provided, called child_loop, to simplify the programming and to minimize the chance of error. The following example, which prints the courses each student is taking, further demonstrates this:

    table_loop(AA)
    {
      printf("Student %s %s :\n",AA.StrVal("first_name"),
             AA.StrVal("last_name"));
      child_loop(AA,BB,"member")
        printf("    %s\n",BB.StrVal("course_name"));
    };

Again, to remove any mystery, the child_loop macro is defined as:

    #define child_loop(x,y,fld) for (y.Reset(&x,fld); !y.Done(); y.Next())

In some cases it is necessary to obtain the primary-key value of the parent. The ParentVal method is provided for this. It is syntactically similar to the StrVal method. In the previous example, another print statement in the child_loop might be:

    printf("parent students pkey is: %s\n",BB.ParentVal("member"));

Notice, however, that this would print the parent student's primary-key multiple times, once for each course. A better approach would be to access the AA current record's pkey value outside the child_loop. This introduces another method, called PkeyVal. This method also returns a string, but takes no parameters. The outer printf statement in the example could be modified to be:

    printf("Student %s %s (pkey : %s) :\n",AA.StrVal("first_name"),
           AA.StrVal("last_name"), AA.PkeyVal());

Deleting Records -

The Delete method is used to delete the current record in an iterator. When it is used, the current pointer is backed up to the previous record so that, when Next is called, it will point to the record just after the deleted one. In the case where the first record is deleted, current is
>WHY? why not set current to the parent, so its first-child ptr is returned
>by .Next?
set to the next record, and the Next method is smart enough to realize this and does not actually advance the pointer immediately after the first record has been deleted. The important point is that all these details are handled automatically inside the methods.

As an example of using Delete, suppose the course Calculus has been cancelled. The program must delete all roster entries that refer to it. The following example does this:

    table_loop(BB)
      if (strcmp(BB.StrVal("course_name"),"Calculus") == 0)
      {
        printf("Please notify student %s\n",BB.ParentVal("member"));
        BB.Delete();
      };

Inheritance -

The example above for Delete actually points out the need for inheritance. It would be much clearer which student needed to be notified if the student's name could be printed instead of his/her primary key! GENDB transparently supports attribute inheritance: the programmer simply refers to the attribute. For example, consider this rewritten version of the preceding example's printf statement:

    printf("Please notify student %s %s\n",BB.StrVal("first_name"),
           BB.StrVal("last_name"));

When the system attempts to find a field called first_name in the roster table (remember, BB is an iterator on roster, not a table itself), it is not found. Instead of producing a "field not found" error, it embarks on a search for the field upwards in the schema lattice. Specifically, it performs a breadth-first search backwards, up the lattice. In the example, it
>It is only a tree if the backward tree is a pure chain.
checks the students table, finds the field, and returns its value. Notice that this is all done automatically. In later sections it will be shown how to use paths to specify exactly where to look in the inheritance, improving performance dramatically.

In this system, there is no such concept as an ambiguous inheritance:
>BUT: schema diagrams should allow the programmer to predict which name
>will be found first! Otherwise, chaos???? gendb should give an info msg..
the first field found with the specified name completes the search.
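The upward search can be sketched as a breadth-first walk over table definitions. This is an illustration under assumptions, not GENDB's code: the TableDef structure and find_field_owner function are hypothetical, and the sketch resolves only which table owns a field name, whereas GENDB follows the parent links of the current record to fetch a value.

```cpp
#include <cassert>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical run-time description of one table in the schema.
struct TableDef {
    std::string name;
    std::vector<std::string> fields;        // data field names
    std::vector<const TableDef *> parents;  // parent tables, in relation-field order
};

// Breadth-first search upward from 'start': the table itself is checked
// first, then all of its parents, then their parents, and so on. The first
// table found to own 'field' ends the search; nullptr means "field not found".
const TableDef *find_field_owner(const TableDef *start, const std::string &field) {
    std::deque<const TableDef *> queue;
    std::set<const TableDef *> visited;     // a lattice can reach a table twice
    queue.push_back(start);
    visited.insert(start);
    while (!queue.empty()) {
        const TableDef *t = queue.front();
        queue.pop_front();
        for (const std::string &f : t->fields)
            if (f == field) return t;
        for (const TableDef *p : t->parents)
            if (visited.insert(p).second) queue.push_back(p);
    }
    return nullptr;
}
```

Because parents are queued in relation-field order, two parents at the same level are tried in the order their relation fields were declared, so the result depends on declaration order.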
Paths can also be used to force the resolution of an ambiguous inheritance to be what the programmer really wants, not simply the luck of the search!
>Luck and how! that's why its relationship to the schema should be UNambiguous.

NOTE: Because of inheritance, naming conventions become important. You should try to give unique names to attributes of different tables. For example, if your schema has 50 tables, each with a field called description, it becomes impossible to inherit any description field: the internal field search will always find the description field of the current record.

Paths -

A path is a mechanism to explicitly identify how inheritance should be performed. Specifically, it is a list of parent field names that should be followed from a starting point, upwards through the schema. A path has no actual finishing point in GENDB: inheritance stops at whatever level a match is found. The path simply specifies which relations should be followed while trying to find a match. Paths are created inside the schema file, after any tables referenced by the path. The formal syntax is:

    create path pathname from tblname via fldname [fldname...]

>Again, 'create' is superfluous.
In our example schema, suppose another relation field, called partner, was added to the roster table. In our hypothetical school system, many
>This is unclear - partner relates students in a M:N relation, because
>one may have a new partner for each course.
>But what about the cardinality?
courses require each student to pick a partner for labs and so on. The table is:

    create table roster BB {
        member parent(students,classes,1,100)
        partner parent(students,partner,1,100)
        course_name c(30)
    };

>Again, what are classes and partner? why is partner used twice?
With this new table, a path would have to be used if an application needed to print the last names of each of the "teams" in Physics.
Default inheritance would find the name of the member student (since member is the first relation field listed, it would be found first by the search).
>Aha! here is where the schema diagram correspondence comes alive.
Consider the following two paths added to the schema file:

    create path member_student from roster via member
    create path partner_student from roster via partner

These new paths can be used to implement the example as follows:

    table_loop(BB)
        if (strcmp(BB.StrVal("course_name"), "Physics") == 0)
            printf("Physics team is %s %s with %s %s\n",
                   BB.StrVal("first_name", "member_student"),
                   BB.StrVal("last_name", "member_student"),
                   BB.StrVal("first_name", "partner_student"),
                   BB.StrVal("last_name", "partner_student"));

Notice that the field access methods take an optional second parameter, the path name. Also, the first two StrVal calls (those that use the member_student path) did not technically need to specify the path, since default inheritance would have given the correct result. Relying on that is not good practice, however: applications should never depend on the order of field definitions within tables.
>I agree - as long as defaults are obscure (invisible) or arbitrary ones.
>View-specific defaults are better: hypothesis: downward iteration
>followed by upward (inherited) access is a transaction-specific
>paradigm, and each transaction group with similar tables and access paths
>can have its own view. So declare a direction of each fkey in a viewdef.
>Then the schema diagram can add arrowheads or tails at child ends
>of links to draw a view-specific schema.

Appending New Records -

Several methods must be called to append a new record to a table. First, the Append method appends a record that is empty except for its primary key. The primary key to be used can be passed in as a parameter. If it is not specified, the next sequential primary key for that table is assigned automatically. Append also makes the new record the current record.
The following example appends a new roster record, and demonstrates a new method for generating primary keys:
>Why not make keys 32-bit integers? proj3/92s523/gen/base/doc/* suggests a way,
>specifies i-convert and o-convert routines and adds pr_*.c calls to
>them which worked on a hand-programmed test case.

    char temp[9];
    BB.GenNextPkey(&temp);
    BB.Append(temp);

In fact, the program did not have to do this. If Append had been called without a value, the system would have called GenNextPkey itself. Now that the record has been appended, values for its fields should be assigned. This is done with the Set method discussed earlier. Setting the relation fields, however, requires a new method: SetParent. SetParent takes the parent relation field name and the primary key of the record
>Still not clear whether a 1:M relation has two names, which differ
>between child and parent? That's good for naming up and down
>directions of the same relation independently (and more sensibly).
(up at the parent) that is being referred to. The parent does not have to be loaded in memory at this point (although in most applications it usually is). Here is an example:
>If not loaded, its existence cannot be verified.????Warning msg?

    BB.SetParent("member", AA.PkeyVal());

In this example, it is clear why the PkeyVal method is needed. The call specifies the "current" student as the parent record of this new roster entry.
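To make the key layout concrete (a 2-letter table abbreviation, a 2-digit version, and a 4-digit sequence number, as in "AA030042"), here is a small sketch of how a next sequential pkey could be formed. The names format_pkey and PkeyCounter are hypothetical, as is the choice to start numbering at 1; GenNextPkey itself may work differently internally.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: builds an 8-character pkey such as "AA030042"
// from a table abbreviation, a dataset version, and a sequence number.
std::string format_pkey(const std::string& abbr, int version, int sequence) {
    assert(abbr.size() == 2);
    assert(version >= 0 && version <= 99);
    assert(sequence >= 0 && sequence <= 9999);
    char buf[9];  // 8 characters plus the terminating NUL
    std::snprintf(buf, sizeof buf, "%s%02d%04d", abbr.c_str(), version, sequence);
    return std::string(buf);
}

// Hypothetical counter kept per table/version set: numbers only ever increase,
// so a deleted record's number is never handed out again.
struct PkeyCounter {
    std::string abbr;
    int version;
    int last_used;
    std::string next() { return format_pkey(abbr, version, ++last_used); }
};
```

A counter initialized as PkeyCounter{"BB", 1, 0} would hand out "BB010001", then "BB010002", and so on. The fixed field widths also make the limits of the scheme visible: at most 100 versions and 10000 records per table and version.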
The following complete example signs up Sue Smith for Physics, without a partner for now (using the revised schema):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    /* plus the GENDB library header */

    int main()
    {
        database db("students", "students.sch");
        iterator AA("students");
        iterator BB("roster");

        /* find Sue Smith; the break leaves AA positioned on her record */
        table_loop(AA)
            if ((strcmp(AA.StrVal("first_name"), "Sue") == 0) &&
                (strcmp(AA.StrVal("last_name"), "Smith") == 0))
                break;

        if (AA.Done()) {
            printf("Sue Smith not found\n");
            exit(1);
        }

        BB.Append();    /* no pkey given: the next sequential one is used */
        BB.SetParent("member", AA.PkeyVal());
        BB.Set("course_name", "Physics");
        printf("Sue Smith has been signed up successfully.\n");
        return 0;
    }

Miscellaneous Methods -

This section discusses some of the other methods that have not been mentioned above. There is a Print method at the iterator level that simply prints out the contents of the record, exactly as it would appear in a data file (in fact, this is how the Dump method for a database works). It prints to the screen, unless the optional file variable parameter is specified. This must be a "C" file variable, already opened and positioned for writing. The same method is provided at the table level and at the database level. Table objects have not yet been discussed since they are rarely used directly, but they do exist and will be discussed later.

The AddRow method of an iterator can be used to append and fill in a new record in a table. It takes a single string as a parameter, formatted exactly as a line of a data file would be (the database Load method uses this method too). Using AddRow is not recommended since it promotes the "hard-coding" of the format of the string. If a new field is added to the table, the specified string would no longer be valid. On the other hand, using Append, SetParent and Set would fill in all fields known to the programmer at the time, and any new fields would safely remain blank.
>This implies std default values - please add a section specifying them.

The TopDownOnly method for a database tells the system what behavior is allowed on the database as a whole.
Specifically, it determines whether all references must be strictly top-down; in other words, whether parents must be loaded into memory first, and then children. Under top-down-only behavior, if a child is ever added specifying a parent that does not exist, the call to SetParent fails. The Load fails too, since it is simply a collection of calls to Append, SetParent, and Set.

Finally, the Help method of a database prints to the screen a very cryptic description of the database and all its components. It is provided only as a development tool, since its output is not very well organized.
>Do you mean the schema declarations here or row instances?
>The schema SHOULD have a comment field and should be displayable.
>Metatables should be instantiated for runtime computations
>e.g. integrity checking for valid field value (membership in a domain).

Future Enhancements -

This section discusses some of the future enhancements and work that could be done to the system.

The most beneficial enhancement would be the addition of indices to tables/fields. A new "create" statement syntax could be
>YOU might explain why - i.e. what is current search strategy?
invented for the schema. For example:

    create index on table students field last_name

>This looks anomalous - not like other schema file appendages.???
Some initial work on indices already exists in the code, but is disabled.

Another enhancement would help to minimize the amount of space used to hold records at run-time. Currently, static, fixed-size arrays are used to hold various aspects of each record. A better but more complex approach would size each array according to the data that will be stored there. This should not be implemented with a dynamic array class (for example), due to the poor performance of such classes. Instead, the schema should be analyzed in a "first pass" over the schema file (to determine how many relationships will finally exist for each table). In a second pass, the arrays for each case can be sized appropriately.
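The two-pass idea can be sketched as follows, in present-day C++. Everything here is illustrative rather than taken from GENDB: RelationDecl, RecordLayout, count_relations, and make_chain_slots are hypothetical names, the schema is assumed to be already parsed into a list of (child table, parent table) relation declarations, and the assumption that the per-record arrays hold one pointer slot per relationship is a guess at the kind of data the text refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical parsed form of one relation field: 'child' declares a link to 'parent'.
struct RelationDecl {
    std::string child;
    std::string parent;
};

// Hypothetical per-table result of the first pass.
struct RecordLayout {
    std::size_t parent_slots = 0;  // relation fields declared in this table
    std::size_t child_slots = 0;   // relations in which this table is the parent
};

// Pass 1: read every declaration once and count relationships per table.
// A parent's count is only final after the whole schema has been seen,
// because relations are declared in the child tables.
std::map<std::string, RecordLayout> count_relations(const std::vector<RelationDecl>& decls) {
    std::map<std::string, RecordLayout> layouts;
    for (const RelationDecl& d : decls) {
        assert(!d.child.empty() && !d.parent.empty());
        ++layouts[d.child].parent_slots;
        ++layouts[d.parent].child_slots;
    }
    return layouts;
}

// Pass 2: allocate a record's pointer array once, at its final size,
// with every slot starting out as a null pointer.
std::unique_ptr<void*[]> make_chain_slots(const RecordLayout& layout) {
    return std::unique_ptr<void*[]>(new void*[layout.parent_slots + layout.child_slots]());
}
```

For the revised roster table, which has two relation fields pointing at students, the first pass would report two parent slots for roster and two child slots for students. Allocating a plain array of exactly that size avoids both the waste of a worst-case static array and the regrowth cost of a general dynamic array class.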
The area of user-defined types could be improved significantly by a developer more fluent in C++, perhaps using parameterized classes (templates) introduced in version 2.1 of the C++ language proper.

A sound method to support versioning of data-sets must be designed. This problem exists for both CHGEN and GENDB. Only then can a good implementation be developed.
>What's wrong with CHGEN's approach?
>Some of your previous criticisms of CHGEN may also apply to GENDB,
>and are therefore misleading.

Benchmarks could be developed that compare CHGEN and GENDB. Sample applications could be specified and then implemented in both systems. The benchmarks would have to be fair, and push the limits of performance AND functionality in both systems for a proper comparison. One advantage of this enhancement is that it doesn't actually change either of the systems, and it may make a nice project for a student (or team).

Finally, production-quality DBMS features such as concurrency, locking, transactions, and recovery management could be added (although this is clearly not a simple task).
>Why? do you envision shared access to the same version of one view of a db?
>I suppose high transaction rates require this - such as an OS resource
>manager - but probably not a CASE toolset.

Conclusions -

GENDB provides a set of mechanisms to manipulate related sets of data in C++, and is an experimental successor to the CHGEN system currently being used at the University of Massachusetts - Lowell. The new system sacrifices performance for simplicity and flexibility. It remains to be seen if this trade-off leads to acceptable results.

Messages to the Smiths re: gendb.doc:

#1 10-MAR-1992 01:21:32.73 GENDB
From: CS::"eagle.ulowell.edu!lechner"
To: @92s523.dis
CC: CS::LECHNER
Subj: CHGEN in C++

Steve Smith will be here Wed. around 5:30PM for a status report on their C++ version of CHGEN. There is only C++ source code and handwritten notes right now. It runs on Zortech C++ currently.
Anyone wanting to sit in can do so. However, there won't be much to look at - one copy of source text. He is a good explainer however. HRachapa, AFrederics, and the GEN team members are particularly invited.

From our phone talk tonight on GENPP(?): Their new version is fully dynamic - less efficient to gain flexibility. GENPP is linked to the application and reads the schema at runtime. It builds data structures that define the schema as states of very general classes, which build lists of tuples of each table type, and lists of field names/types which access methods depend on. Fields of tables are concatenated in a tuple, but accessed separately through their own field class definitions. This class structure supports attribute inheritance from ancestors of is-part-of relations, on top of C++ classes but not using their compiled is-a inheritance from superclasses.

My initial reaction is that a happy medium between compiled C and fully dynamic C++ has not yet been reached. However, they and others at GE/Lynn feel that if CHGEN generated compilable/linkable C++ code without runtime dynamic binding, it would not be able to support the polymorphism which current CHGEN macros and ttabbr-dependent switched procedures provide. This question needs further study. Meanwhile we will have a much more sophisticated product to work with next year.

--------------------------------------------------------------------
#2 15-MAY-1992 04:53:05.99 GENDB
From: EAGLE::LECHNER
To: CSMITH,SSMITH
CC: LECHNER
Subj: gendb.doc and test.cc

I went over gendb.doc and corrected many spelling errors while making comments (lines starting with >....) on your design. Generally I picked on differences between schema files (CHGEN vs GENDB) and suggestions for clarification. Also possible extensions. C++ simplifies many things, but lack of reveng tools increases documentation importance. e.g. a call tree? perhaps your PC tools can generate such doc aids for inclusion in gendb.doc?
------------------------------------
NOTE ADDED 9/25: Detailed specs have been produced by HRachapalli and A Frederick for gendbd.cc in Spring 1992. The documentation spans two files in MSWord and MACWrite.
------------------------------------

I didn't read test.cc but I wonder about its length? It seems very wordy when the goal of CHGEN was to simplify programming. Again, documentation of driver structure and functionality would help.

You did a great job as usual. But future students will have a hell of a problem comprehending gendb to maintain/enhance it, as usual. This semester's GEN team did some useful things, but not very extensive due to their learning curve on chgen.c code.