saturn.cs.uml.edu(291)> cat  NamespaceAndSchemaIntegration050915.txt

From lechner@cs.uml.edu  Thu Sep 15 02:35:29 2005

Subject: NamespaceAndSchemaIntegration (.msdat with SV, TT, TA and app tables)

To: jtan@cs.uml.edu, jingyoko@hotmail.com

Cc: 05f523

 

Notes on integration of metaschema- and schema-derived

metadata in chgen/gencpp applications:

RJLRef: $PH/COOL-GEN/NamespaceAndSchemaIntegration.htm

 

[This began as a response to comments by Jing Tan in 05f523.

 It ended up articulating a proposal for adding namespace ids

as prefixes to schema ttabbrevs (table-type mnemonics) in the

12-byte-to-32-bit pfkey compression mode supported since genv8

but never used.]

 

 

Metaschema is tables SV, TT and TA + tables proposed in

$PH/COOL-GEN/hcg_struct_migrationR1.ppt (discussed later).

Schema is any other tables needed by the application.

These may be spread over system service and application

domains (e.g. JPsim has tables from BDE and LCP as well

as passive layout and active class subschema within its

process-control application).

 

.sch and .msdat files are information-equivalent,

so one of them is redundant.  Both .dat files and .msdat

files consist of flat-file tables with the same format

(field name and type sequence), determined by the table

row (object instance) type or class.

 

Since .msdat and .dat tables have identical meta-data

formats, .msdat should be used instead of .sch.

Entity types are assigned int codes as surrogates

based on their declared order in the schema (or in the TT table).

The [meta-data] tables (SV,TT,TA) are read-only, and should

come first so that they have the same position in every application.

 

So we can append .msdat for application tables AFTER

.msdat tables SV,TT,TA. Both levels of specification

will have the same format.

 

ALL tables are parsed by chgen in the same way:

------------------------

Read the pkey field to identify the table type of each row.

IFF an internal object with the same pkey exists,

skip this record. This prevents TT-rows 1 2 3 and their

TA-children from being overwritten or duplicated

if chgen already has pre-loaded copies of these tables

to refer to (as a boot-strapped version will).
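
A minimal sketch of this skip rule, not chgen's actual code. The
fixed-size index, the names PKEY_LEN, MAX_ROWS, pkey_exists() and
load_row(), and the sample key strings are all hypothetical; a real
loader would look the pkey up in its own object tables.

```c
#include <assert.h>
#include <string.h>

#define PKEY_LEN 12          /* assumed: 12-byte ASCII pfkey */
#define MAX_ROWS 1024        /* assumed capacity, for the sketch only */

/* Hypothetical in-memory index of the pkeys loaded so far. */
static char loaded[MAX_ROWS][PKEY_LEN + 1];
static int  nloaded = 0;

/* Return 1 if an object with this pkey is already loaded, else 0. */
int pkey_exists(const char *pkey)
{
    assert(pkey != NULL);
    for (int i = 0; i < nloaded; i++)
        if (strncmp(loaded[i], pkey, PKEY_LEN) == 0)
            return 1;
    return 0;
}

/* Accept a row unless its pkey is already present.
   Returns 1 if accepted, 0 if skipped, -1 if the index is full. */
int load_row(const char *pkey)
{
    if (pkey_exists(pkey))
        return 0;                 /* skip: keep the pre-loaded copy */
    if (nloaded == MAX_ROWS)
        return -1;
    strncpy(loaded[nloaded], pkey, PKEY_LEN);
    loaded[nloaded][PKEY_LEN] = '\0';
    nloaded++;
    return 1;
}
```

With this rule, reading a pre-loaded TT row a second time returns 0
and leaves the first copy untouched.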

 

The parser allocates memory for an instance (row) of that

table-type, then repeats fscanf for the intermediate fields

in that table-type's TA-child sequence. When the last TA-child

is reached, it does the same thing unless the field is of type tnn

(text string of length nn). In that case, it reads up to end of line

into the last field's buffer. (All tnn or cnn fields are

truncated if necessary to avoid buffer overflow.)

 

// pr_create(): field of type specified in ttabbrev byte of pkey

child_loop(TT, TA, TAid, TTid){         // Syntax not checked - RJL

        if (first_child(TT, TA))

                fscanf(fp, "%s", keybuf); // get pkey

        else if (! last_child(TT, TA, TAid, TTid))

                // get middle children (fkeys first, if any):

                // format string first, then destination field

                fscanf(fp,

                        formatfor(TAcurr->fieldtype),

                        stringfor(TAcurr->fieldname));

        else // last child - needed for field-type tnn

                getchars_untilEOL(fp, lastfieldbuf);

}
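
One possible shape for getchars_untilEOL(), as a sketch only. The
bufsize parameter is an assumption added here (the pseudocode above
passes just the file and the buffer); it is what lets the function
truncate an over-long tnn field instead of overflowing the buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Read characters from fp up to (not including) end of line or EOF.
   At most bufsize-1 characters are stored in buf; any excess on the
   same line is consumed and discarded. buf is always NUL-terminated.
   Returns the number of characters stored. */
size_t getchars_untilEOL(FILE *fp, char *buf, size_t bufsize)
{
    size_t n = 0;
    int c;

    assert(fp != NULL && buf != NULL);
    if (bufsize == 0)
        return 0;
    while ((c = fgetc(fp)) != EOF && c != '\n') {
        if (n + 1 < bufsize)      /* keep one slot for the terminator */
            buf[n++] = (char)c;
    }
    buf[n] = '\0';
    return n;
}
```

Because the rest of the line is consumed even when it is truncated,
the next call starts cleanly at the following row.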

 

 

Note 1: formatfor() is a compile-time or run-time lookup

of a format string equivalent to the TA field type.

stringfor() needs the stringized value of the TA field name.

 

(This is a run-time-data-to-compile-time identifier translation,

traditionally done by a switch(data){...case ident: ... } .

You win a prize if you can avoid some form of switch

or indexed table lookup here :-).
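
No prize claimed here: a run-time formatfor() can be sketched as an
indexed table lookup. The type codes "int", "float" and "key" are
hypothetical placeholders, since only tnn and cnn are named in these
notes, and this version writes the format into a caller-supplied
buffer (an assumption) so that the nn width can be built in.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type-code table; chgen's real codes may differ. */
static const struct { const char *type; const char *fmt; } fmt_table[] = {
    { "int",   "%d"   },
    { "float", "%f"   },
    { "key",   "%12s" },
};

/* Write a scanf format for fieldtype into out (capacity outsz).
   Returns 0 on success, -1 if the type is not recognized. */
int formatfor(const char *fieldtype, char *out, size_t outsz)
{
    assert(fieldtype != NULL && out != NULL);

    /* cNN / tNN: string of at most NN characters -> "%NNs" */
    if ((fieldtype[0] == 'c' || fieldtype[0] == 't') && fieldtype[1] != '\0') {
        char *end;
        long width = strtol(fieldtype + 1, &end, 10);
        if (*end == '\0' && width > 0) {
            snprintf(out, outsz, "%%%lds", width);
            return 0;
        }
    }
    for (size_t i = 0; i < sizeof fmt_table / sizeof fmt_table[0]; i++) {
        if (strcmp(fmt_table[i].type, fieldtype) == 0) {
            snprintf(out, outsz, "%s", fmt_table[i].fmt);
            return 0;
        }
    }
    return -1;
}
```

The width in "%NNs" is what keeps a middle cnn or tnn field from
overrunning its buffer when it is read with fscanf.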

 

Note 2: C++ Stream I/O could absorb all middle fields

by >> field2 >> field3 >> field4, etc., since operator>> skips

whitespace field separators by default. However, this list of typed

field names is only known AFTER the subclass is detected

whose method overrides the abstract one;

this class is unknown until the first (pkey) field, field1, is read.

 

------------------

Name-space extensions (chgen14?):

 

Real apps will have both system service domains and

application domains, with separate schemas. chgen13 assumes their

table-type mnemonics (2- or 4-letter abbrevs) are disjoint and non-conflicting;

then the schemas can be concatenated (always in the same order).

 

If they overlap, or the order is not predictable, or if you just want

to meet common-sense scale-up requirements, then a namespace abbrev

must be assigned to each domain and be prepended to each table-type

abbrev; chgen CAN support this by its option ??? to use an alternate

4-letter table-type mnemonic as part of a 12-byte pfkey format

(256 subschemas, each with 256 tables). The remaining 8 bytes

hold 3 version digits [0..999] and 5 row digits [0..99999].
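
A sketch of splitting such a 12-byte ASCII pfkey into its parts.
The field order (4 letters, then 3 version digits, then 5 row
digits), the struct pfkey type and the pfkey_parse() name are
assumptions made for illustration; they are not taken from chgen.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

struct pfkey {
    char     ttabbrev[5];   /* 4-letter namespace + table-type mnemonic */
    unsigned version;       /* 0..999   */
    unsigned row;           /* 0..99999 */
};

/* Parse a 12-byte ASCII key laid out as "TTTTvvvrrrrr".
   Returns 0 on success, -1 if the length or characters are wrong. */
int pfkey_parse(const char *s, struct pfkey *k)
{
    assert(s != NULL && k != NULL);
    if (strlen(s) != 12)
        return -1;
    for (int i = 0; i < 4; i++)
        if (!isalpha((unsigned char)s[i]))
            return -1;
    for (int i = 4; i < 12; i++)
        if (!isdigit((unsigned char)s[i]))
            return -1;

    memcpy(k->ttabbrev, s, 4);
    k->ttabbrev[4] = '\0';
    k->version = 0;
    for (int i = 4; i < 7; i++)
        k->version = k->version * 10 + (unsigned)(s[i] - '0');
    k->row = 0;
    for (int i = 7; i < 12; i++)
        k->row = k->row * 10 + (unsigned)(s[i] - '0');
    return 0;
}
```

For the made-up key MSTT00100042 this gives ttabbrev "MSTT",
version 1 and row 42.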

 

[chgen's current 32-bit unsigned int hcg_key typedef

for the binary encoding of pfkeys (since chgenv7) constrains

these fields to 8+8+16 bits: enough to hold only 256, not 26*26 = 676,

namespace codes and table-type codes. 64-bit keys would scale up:

12-byte ASCII keycodes are limited to 676*676*100,000,000

instances if 2 letters each define the (26*26) namespaces and table types.

 

Instead of 676 domains, I would prefer 26 domains [A..Z],

each with 10 [schema] versions [0..9]: this constrains

the first 4 type-selecting bytes to 260 domain*version

codes, each with 26*26=676 table types.]

 

Domains could have up to 10 schema versions; the first 2 bytes

define schema and version (stored in a row of table SV).

The next two bytes define table type (stored as a row of table TT).
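
The arithmetic behind the 260 and 676 counts, as a sketch. The
function names sv_code() and tt_code() are hypothetical, uppercase
ASCII letters are assumed, and nothing here says how chgen itself
would number these codes.

```c
#include <assert.h>

/* Map a domain letter 'A'..'Z' plus a version digit '0'..'9'
   to a code in 0..259 (26 domains * 10 versions).
   Returns -1 for any other input. */
int sv_code(char domain, char version)
{
    if (domain < 'A' || domain > 'Z' || version < '0' || version > '9')
        return -1;
    int code = (domain - 'A') * 10 + (version - '0');
    assert(code >= 0 && code < 260);
    return code;
}

/* Map a 2-letter table-type mnemonic to a code in 0..675 (26 * 26).
   Returns -1 for any other input. */
int tt_code(char c1, char c2)
{
    if (c1 < 'A' || c1 > 'Z' || c2 < 'A' || c2 > 'Z')
        return -1;
    int code = (c1 - 'A') * 26 + (c2 - 'A');
    assert(code >= 0 && code < 676);
    return code;
}
```

Note that 260 and 676 both exceed 256, so neither code fits in one
byte of a binary key; that is consistent with the remark above that
wider keys are needed to scale up.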

 

 

---------------------------------------------------------

 

To appreciate the compactness of this representation, go to

the .ppt show at $PH/DataModels05fr1.ppt.

[In an earlier email I suggested you all need to be familiar

with it this month.]  First, read slide 51: Reflective databases

and slide 52: Metatables TT and TA, of $PH/DataModels05fr1.ppt.

 

Slides 55-58 illustrate how schema tables are drawn in bde,

converted to .sch form by b2t|t2s, and augmented by chgen -metafile

with next-row, parent, first-child and next-sibling pointers,

before generating schema.h and pr_*.c and schema.msdat.

 

Slide 59: MetaSchema Tables TT and TA, defines the .msdat file

CONTENT of tables TT and TA for the sample application schema

(tables SU, WH, IT on slides 55-59) as produced by chgen -metafile.

 

Slide 60: Meta-tables TT, TA are Self-Describing, defines the

content of tables TT and TA when they are DESCRIBING THEMSELVES!

 

What I meant by putting meta-schema tables first is to make chgen

concatenate the content of slide 59 AFTER the content of slide 60.

(SV was later defined as the FIRST table and parent of TT

so it should PRECEDE what's in slide 60.) This can be done

by concatenating meta-schema.sch and application.sch and

feeding it to chgen.

 

BTW, the fkey is_key values are 1/-1 and s (not c - that was my typo).

They are explained in slide 41: Pkeys and fkeys, of

$PH/DataModels05fr1.ppt.

 

==================================================

hcg_structure migration to .msdat.

 

$PH/COOL-GEN/TTTA_metadata_jk.ppt (2 slides) documents

the 'hcg-structures' currently used in chgen to

store schema.sch content (including views and versions).

 

From these, chgen11+ builds the .msdat file, but does not use it.

 

$PH/COOL-GEN/hcg_struct_migration.ppt (7 slides)

outlines a project (not done but TBD - any volunteers?)

to migrate chgen's own source code away from

hcg_structs and use the meta-schema content instead.

 

This set also shows other possible additions to

the runtime metadata  (View and Version info, table

statistics such as row count and pkey range,

and fkey traversal paths for inheritance.)

These can be saved and reloaded to avoid calling

pr_init to find out the same information.

(chgenv14 should make pr_init obsolete.)

 

 

Slide 18: Current Work-arounds, of $PH/COOL-FAQ/COOL_FAQv6.PPT

shows chgen(v14?) as two phases: GENmeta and GENcode

(TBD: partition genv13 this way).

 

The (little) chgen phase 1 (GENmeta) would parse the .sch file,

load metadata tables [SV,] TT and TA, and write them to the .msdat file.

 

 

The (big) chgen phase 2 (GENcode) would reload

the .msdat file and work from it instead of hcg-structures.

 

(chgen since v11 has been using pr_*.c code internally:

i.e. it is a (bootstrapped) application of itself.)

 

(grep hcg_ in chgen/src to see how complex

the hcg_struct references are. That may motivate

you to want this refactoring too ;-):

 

PS: You don't want ALL 1299 hcg_ refs - (most are trivial):

mercury.cs.uml.edu(44)> cd $CASE/gen/ver_13/chgen/src

mercury.cs.uml.edu(45)> grep hcg_ *.c | wc

   1299    8403  107160

 

But search for the 16 refs to items below (from TTTA_metadata_jk.ppt),

and only in the pr_*.c files which process tables TT and TA in chgen/src:

mercury.cs.uml.edu(70)> grep '(hcg_ts_list|hcg_table_seqlist|ts_list|ts_type)' pr_*.c | wc

     16      68    1553

----------------------------------