Every component that can be annotated. Review Note: It's no longer clear that this separation is strictly speaking needed. It's possible that this could be collapsed back into AnnotatedSchemaComponent or made smaller anyway.
Shared characteristics of any annotated schema component.
Not all components can carry DFDL annotations.
Convenience class for implementing the AnnotatedSchemaComponent trait.
A flat chain of format annotations connected by dfdl:ref (short form) or ref (long form) references to named defined formats.
Could be the nonDefault formats being chained together, or could be the default formats being chained together.
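As a rough illustration of such a chain, the sketch below follows ref links from one format annotation through named defined formats and looks a property up along the way. The names Format, chain, and lookup are hypothetical and are not Daffodil's API. The rule that the nearest binding in the chain wins is an assumption made for the sketch.

```scala
object FormatChainSketch {
  // Hypothetical model: a format annotation holds its own property bindings
  // and an optional reference to a named defined format.
  final case class Format(props: Map[String, String], ref: Option[String] = None)

  // Follow the references and return the flat chain, starting format first.
  // Names already visited stop the walk, so a circular reference cannot loop forever.
  def chain(start: Format, defined: Map[String, Format]): List[Format] = {
    @annotation.tailrec
    def loop(current: Format, seen: Set[String], acc: List[Format]): List[Format] =
      current.ref match {
        case Some(name) if !seen(name) && defined.contains(name) =>
          val next = defined(name)
          loop(next, seen + name, next :: acc)
        case _ => acc.reverse
      }
    loop(start, Set.empty, List(start))
  }

  // Assumption for this sketch: the first format in the chain that binds the property wins.
  def lookup(prop: String, start: Format, defined: Map[String, Format]): Option[String] =
    chain(start, defined).collectFirst { case f if f.props.contains(prop) => f.props(prop) }
}
```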
ChoOrd = Choice or Ordered
This is a tree representing elements as they appear in ordered sequences or as alternatives to each other.
This does not directly correspond to the xs:choice and xs:sequence constructs of a DFDL Schema, though that can be the origin.
The purpose of these ChoOrd trees is to establish some robust invariants about the way these are nested.
These invariants include:
* Elements are at the leaves.
* There is no unnecessary nesting. So an Ord only contains leaves or Cho, and a Cho only contains leaves or Ord.
* There are no degenerate Cho or Ord with only 1 thing in them. Those get created but squeezed out.
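The invariants above can be sketched as a normalization pass over a small tree type. This is only an illustration: the Node, Leaf, Ord, and Cho types and the normalize function are hypothetical, not the classes Daffodil actually uses.

```scala
object ChoOrdSketch {
  sealed trait Node
  final case class Leaf(name: String) extends Node      // an element
  final case class Ord(children: List[Node]) extends Node // ordered sequence
  final case class Cho(children: List[Node]) extends Node // alternatives

  // Normalize bottom-up: splice Ord-in-Ord and Cho-in-Cho into the parent,
  // then squeeze out any Ord or Cho left with exactly one child.
  def normalize(n: Node): Node = n match {
    case l: Leaf => l
    case Ord(cs) =>
      val flat = cs.map(normalize).flatMap {
        case Ord(inner) => inner
        case other => List(other)
      }
      flat match {
        case single :: Nil => single
        case _ => Ord(flat)
      }
    case Cho(cs) =>
      val flat = cs.map(normalize).flatMap {
        case Cho(inner) => inner
        case other => List(other)
      }
      flat match {
        case single :: Nil => single
        case _ => Cho(flat)
      }
  }
}
```

Because children are normalized before they are spliced, a degenerate Cho that collapses into an Ord is then flattened into an enclosing Ord in the same pass.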
Captures concepts common to all choice definitions, which includes both local and global choice definitions, but not choice group references.
Choices are a bit complicated.
They can have initiators and terminators. This is most easily thought of as wrapping each choice inside a sequence having only the choice inside it, and moving the initiator and terminator specification to that sequence.
That sequence is then the term replacing the choice wherever the choice is.
Choices can have children which are scalar elements. In this case, that scalar element behaves as if it were a child of the enclosing sequence (which could be the one we just injected above the choice).
Choices can have children which are recurring elements. In this case, the behavior is as if the recurring child was placed inside a sequence which has no initiator nor terminator, but repeats the separator specification from the sequence context that encloses the choice.
All that, and the complexities of separator suppression too.
There are also issues like this:
<choice> <element .../> <sequence/> </choice>
In the above, one alternative is an empty sequence. So this choice may produce an element which takes up a child position, and potentially requires separation, or it may produce nothing at all.
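The idea of moving a choice's initiator and terminator onto an injected wrapper sequence can be sketched as follows. The Term, Elem, SeqTerm, and Choice types and the liftFraming function are hypothetical names for illustration only. The sketch does not attempt the separator handling described above.

```scala
object ChoiceFramingSketch {
  sealed trait Term
  final case class Elem(name: String) extends Term
  final case class SeqTerm(
      children: List[Term],
      initiator: Option[String] = None,
      terminator: Option[String] = None) extends Term
  final case class Choice(
      alternatives: List[Term],
      initiator: Option[String] = None,
      terminator: Option[String] = None) extends Term

  // If the choice carries framing, wrap it in a sequence that holds only the
  // choice, and move the initiator/terminator to that sequence.
  def liftFraming(c: Choice): Term =
    if (c.initiator.isEmpty && c.terminator.isEmpty) c
    else SeqTerm(
      List(c.copy(initiator = None, terminator = None)),
      c.initiator,
      c.terminator)
}
```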
Base class for any DFDL annotation
Note about SchemaComponent as a base class: Many things are now derived from SchemaComponent that were not before. Just turns out that there is a lot of desirable code sharing between things that aren't strictly-speaking SchemaComponents and things that previously were not. Accomplishing that sharing with mixins and self-typing etc. was just too troublesome. So now many things are schema components. E.g., all annotation objects, the Include and Import objects which represent those statements in a schema, the proxy DFDLSchemaFile object, etc.
Unlike say GlobalElementDecl, Defining annotations don't have a factory, because they don't have any characteristics that depend on context, i.e., that have to access the referring context to compute.
Base class for annotations that carry format properties
Represents one schema document file.
Manages loading it and keeping track of validation errors.
Base class for assertions, variable assignments, etc
Shared by all forms of elements, local or global or element reference.
Shared by all element declarations local or global
elementFormDefault is an attribute of the xs:schema element. XML Schema defaults it to 'unqualified'. When it is 'qualified', the names of nested local element declarations are in the target namespace. So, if you have
<schema elementFormDefault='qualified' targetNamespace="myURI" xmlns:tns="myURI"...> <element name='foo'...> <complexType> <sequence> <element name='bar'.../> ...
then a DFDL/XPath expression to reach that 'bar' element looks like /tns:foo/tns:bar. Conversely, if the same schema has elementFormDefault='unqualified', a path to reach element bar looks like /tns:foo/bar. Note that 'bar' isn't preceded by the tns prefix. That's because the child elements are all 'no namespace' elements.
This also affects what a result document looks like from a namespace perspective. Suppose the above 'bar' element is an xs:int. Then with elementFormDefault='qualified', an instance would look like:
<tns:foo><tns:bar>42</tns:bar></tns:foo>
or the possibly nicer (for a large result)
<foo xmlns="myURI"><bar>42</bar></foo>
But if elementFormDefault='unqualified', the instance document would look like:
<tns:foo><bar>42</bar></tns:foo>
In this case you really don't want to set up xmlns='myURI', because then you have to write:
<foo xmlns="myURI"><bar xmlns="">42</bar></foo>
That is, you must explicitly go to the no-namespace syntax. It doesn't happen implicitly.
This trait is mixed into things that are affected by elementFormDefault, namely the local element declaration class.
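The effect of elementFormDefault on path steps can be sketched in a few lines. The Form type and the localStep and pathTo helpers are hypothetical names for illustration, and they assume a root element 'foo' with a local child 'bar', as in the example schema.

```scala
object ElementFormDefaultSketch {
  sealed trait Form
  case object Qualified extends Form
  case object Unqualified extends Form

  // Path step for a local element: prefixed when qualified, bare otherwise.
  def localStep(name: String, tnsPrefix: String, form: Form): String = form match {
    case Qualified => s"$tnsPrefix:$name"
    case Unqualified => name
  }

  // The global root element is always in the target namespace,
  // so it always carries the prefix.
  def pathTo(root: String, child: String, tnsPrefix: String, form: Form): String =
    s"/$tnsPrefix:$root/" + localStep(child, tnsPrefix, form)
}
```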
There are 3 first-class concrete children of ElementBase. Root, LocalElementDecl, and ElementRef
For unit testing purposes, the element argument might be supplied as null.
All global components share these characteristics. The difference between this and the not-Global flavor has to do with the elementFormDefault attribute of the xs:schema element. Global things are always qualified
Factory to create an instance of a global element declaration either to be the root of the data, or when referenced from an element reference, in which case a backpointer from the global element decl instance will point back to the element reference.
This backpointer is needed in order to determine some attributes that refer outward to what something is contained within. E.g., nearestEnclosingSequence is an attribute that might be the sequence containing the element reference that is referencing this global element declaration.
Global Group Defs carry annotations that are combined with those of the corresponding group reference that refers to them.
These are not carried on the xs:group element itself, but the xs:sequence or xs:choice XML child. When we refer to the annotations on a global group definition, we are referring to the annotations on the xs:sequence or xs:choice.
The instance type for global simple type definitions.
The factory is sharable even though the global object it creates cannot be shared.
Call forElement(element) and supply the element referring to the global type, then you get back an instance that is one-to-one with the element.
This then allows attributes of the type to refer to the element in deciding things. I.e., the context is clear and kept separate for each place a global type is used.
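A minimal sketch of this factory pattern follows. The class names and fields are made up for illustration and do not reproduce Daffodil's actual signatures. Only the forElement idea comes from the description above.

```scala
object GlobalTypeFactorySketch {
  final case class Element(name: String)

  // One instance per referring element, so anything computed here can
  // consult the element that uses the type.
  final class GlobalSimpleTypeDef(val typeName: String, val referringElement: Element)

  // The factory holds only context-free information and can be shared.
  final class GlobalSimpleTypeDefFactory(val typeName: String) {
    def forElement(element: Element): GlobalSimpleTypeDef =
      new GlobalSimpleTypeDef(typeName, element)
  }
}
```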
Common concepts for components that define groups.
This includes both global group definitions, and local sequence and choice groups.
A GroupRefFactory (group reference) is an indirection to factories to create a SequenceGroupRef, or ChoiceGroupRef.
The refXMLArg is the xml for the group reference.
This factory exists in order to make error messages refer to the right part of the schema.
Include/Import = "II" for short
An import statement.
The enclosingGoalNamespace argument is Some(noNamespace) for a topLevel schema file that has no targetNamespace attribute.
Now consider that we could be an import which is inside an included schema which includes another included, etc. A nest of included schemas the innermost of which then contains an import. We have to verify that the ultimate goal namespace at the start of that chain of includes is different from this imported schema's goalNamespace.
enclosingGoalNS is None if this include is being included (by one include hop, or several) into a schema having 'no namespace'
enclosingGoalNS is Some(str) if this include is being included (by one include hop, or several) into a schema having a targetNamespace.
A single annotation which combines short form, long form, and element form property bindings together.
From this perspective, there are no ref chains connecting format annotations together.
Common to local element decls and element references
Base class for all model groups, which are term containers.
There are ultimately 4 concrete classes that implement this: Sequence, Choice, SequenceGroupRef, and ChoiceGroupRef
Common Mixin for things that have a name attribute.
Mixin for all schema factories and schema components with no backpointers, just a lexical parent. This means all the non-global schema components.
Mixin for all global schema components
PrimType nodes are part of the runtime. For compilation, we need a notion of primitive type that derives from the same base as SimpleTypeBase and ComplexTypeBase, and it needs to have methods that take and return compiler-only object types. We can't define such a base in the runtime because it can't have those methods, so we can't achieve the polymorphism over all sorts of types there.
So for the compiler, a PrimitiveType is just a wrapper around a PrimType object.
The other kind of DFDL annotations are DFDL 'statements'. This trait is everything shared by schema components that can carry statements.
Factory for creating the corresponding DFDLAnnotation objects.
Only objects from which we generate processors (parsers/unparsers) can lookup property values.
This avoids the possibility of a property being resolved incorrectly by not looking at the complete chain of schema components contributing to the property resolution.
The only objects that should resolve properties are ElementRef, Root, LocalElementDecl, Sequence, Choice, SequenceRef, ChoiceRef
These are all the "real" terms. Everything else is just contributing properties to the mix, but they are not points where properties are used to generate processors.
A schema component for simple type restrictions
Root is a special kind of ElementRef that has no enclosing group.
This is the entity that is compiled by the schema compiler.
A schema is all the schema documents sharing a single target namespace.
That is, one can write several schema documents which all have the same target namespace, and in that case all those schema documents make up the 'schema'.
The core root class of the DFDL Schema object model.
Every schema component has a schema document, and a schema, and a namespace.
Anything that can be computed without reference to the point of use or point of reference can be computed here on these factory objects.
Mixin for all SchemaComponents
Mixin for SchemaDocument
Handles only things specific to DFDL about schema documents.
I.e., default format properties, named format properties, etc.
Common to both types we use for dealing with schema documents.
A schema set is exactly that, a set of schemas. Each schema has a target namespace (or 'no namespace'), so a schema set is conceptually a mapping from a namespace URI (or empty string, meaning no namespace) onto schema.
Constructing these from XML Nodes is a unit-test interface. The real constructor takes a sequence of file names, and you can optionally specify a root element via the rootSpec argument.
A schema set is a SchemaComponent (derived from that base), so as to inherit the error/warning accumulation behavior that all SchemaComponents share. A schema set invokes our XML Loader, which can produce validation errors, and those have to be gathered so we can give the user back a group of them, not just one.
A schema set is, however, a kind of fake SchemaComponent in that it doesn't correspond to any user-specified schema object. And unlike other schema components, it does not live within a schema document.
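The "mapping from namespace onto schema" view can be sketched by grouping schema documents by target namespace. The SchemaDocument and Schema case classes here are simplified stand-ins, not Daffodil's classes. The empty string is assumed to stand for 'no namespace', as in the description above.

```scala
object SchemaSetSketch {
  // targetNamespace == "" means 'no namespace' in this sketch.
  final case class SchemaDocument(fileName: String, targetNamespace: String)
  final case class Schema(namespace: String, documents: List[SchemaDocument])

  // All documents sharing a target namespace make up one schema.
  def schemaSet(docs: List[SchemaDocument]): Map[String, Schema] =
    docs.groupBy(_.targetNamespace).map { case (ns, ds) => ns -> Schema(ns, ds) }
}
```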
Mixin for SchemaSet
Represents a local sequence definition.
Captures concepts associated with definitions of Sequence groups.
Used by GlobalSequenceGroupDef and local Sequence, but not by SequenceGroupRef. Used on objects that can carry DFDLSequence annotation objects.
Term, and what is and isn't a Term, is a key concept in DSOM.
From elements, ElementRef and LocalElementDecl are Term. A GlobalElementDecl is *not* a Term. From sequences, Sequence and SequenceGroupRef are Term. GlobalSequenceGroupDef is *not* a Term. From choices, Choice and ChoiceGroupRef are Term. GlobalChoiceGroupDef is *not* a Term.
Terms are the things we actually generate parsers/unparsers for. Non-Terms just contribute information used by Terms.
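The Term versus non-Term distinction can be pictured as a marker trait. The types below are simplified stand-ins that reuse the names from the description. Their fields and the terms helper are assumptions for illustration.

```scala
object TermSketch {
  trait SchemaComponent { def name: String }

  // Marker for the components we would generate parsers/unparsers for.
  trait Term extends SchemaComponent

  // A global declaration only contributes information; it is not a Term.
  final case class GlobalElementDecl(name: String) extends SchemaComponent
  final case class ElementRef(name: String, target: GlobalElementDecl) extends Term
  final case class LocalElementDecl(name: String) extends Term

  // Select only the points where processors would be generated.
  def terms(components: List[SchemaComponent]): List[Term] =
    components.collect { case t: Term => t }
}
```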
Captures concepts around dfdl:encoding property and Terms.
Just factored out into a trait for isolation of related code.
A schema component for simple type unions
Handles everything about schema documents that has nothing to do with DFDL. Things like namespace, include, import, elementFormDefault etc.
Note about DSOM design versus say XSOM or Apache XSD library.
Some XSD object models have a single Element class, and distinguish local/global and element references based on attributes of the instances.
Our approach is to provide common behaviors on base classes or traits/mixins, and to have distinct classes for each instance type.
Maps an optional namespace and optional schemaLocation to an Include or Import object.
As we include/import schemas, we append to one of these, and before we include/import we check to see if it is already here.
About use of Delay[T]:
This is fairly deep functional programming stuff, but it lets us have our cake and eat it too. In processing include/import statements like <xs:include schemaLocation="..."/>, a chicken-and-egg problem arises about namespaces. We have to read the file just to know the namespace, which we need in order to decide whether we have seen this (NS, URL) pair before and therefore don't need to load the file.
So we maintain this growing map of (NS, URL) => file called an IIMap.
We use delay on this, because it lets us construct the DFDLSchemaFile, construct the XMLSchemaDocument object, both of which require that we pass in the IIMap. Then we can ask the XMLSchemaDocument for the targetNamespace of the file, which will cause the file to be read. But none of this needs the IIMap argument yet.
We then look at this new (tns, url) pair, and see if it is already in the map. If not, we extend the IIMap,... and by the magic of Delayed evaluation, that map is the one being passed to the DFDLSchemaFile and XMLSchemaDocument above.
Seems cyclical, but it isn't. We can call the constructors, passing them a promise (aka Delayed IIMap) to deliver the IIMap when it is needed. Turns out it isn't needed for the constructed object to answer the question "what is the targetNamespace". But that target namespace information IS needed to determine the IIMap which will be supplied when demanded.
From an object-oriented programming perspective, we don't pass an IIMap, we pass an IIMap factory (a delayed IIMap is effectively that). That factory isn't being called yet, and by the way it has pointers back to data structures that will be filled in later, so it can't be called yet. You wouldn't usually write an OO program this way.
Note that we must use a map that maintains insertion order, such as ListMap.
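The delayed-map trick can be sketched with a by-name parameter and lazy vals. Everything here is hypothetical: this Delay class is a simplified stand-in and not Daffodil's Delay[T], SchemaFileSketch stands in for the file and document objects, and the declaredNamespace argument stands in for actually reading the file.

```scala
import scala.collection.immutable.ListMap

object DelaySketch {
  type IIMap = ListMap[(String, String), String] // (namespace, url) -> url
  val emptyMap: IIMap = ListMap.empty

  // A promise: the body is not evaluated until value is first demanded.
  final class Delay[T](body: => T) {
    lazy val value: T = body
  }

  final class SchemaFileSketch(val url: String, declaredNamespace: String, delayedMap: Delay[IIMap]) {
    // Answering this question does not touch the delayed map.
    lazy val targetNamespace: String = declaredNamespace
    def iiMap: IIMap = delayedMap.value
  }

  def load(url: String, declaredNamespace: String, seen: IIMap): (SchemaFileSketch, IIMap) = {
    // The file is constructed with a promise of the map...
    lazy val file: SchemaFileSketch =
      new SchemaFileSketch(url, declaredNamespace, new Delay(extended))
    // ...and the map is computed from the file's namespace.
    lazy val extended: IIMap = {
      val key = (file.targetNamespace, url)
      if (seen.contains(key)) seen else seen + (key -> url)
    }
    (file, extended)
  }
}
```

The two lazy vals refer to each other, but evaluation terminates because constructing the file never forces the map.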
A factory for model groups.
Takes care of detecting group references, and constructing the proper SequenceGroupRef or ChoiceGroupRef object.
Factory for Terms
Information needed specifically for unparsing.
DSOM - DFDL Schema Object Model
Overview
DSOM is the abstract syntax "tree" of a DFDL schema. It is not actually a tree, it is a graph, as there are back-pointers, and shared objects.
A schema is made up of SchemaComponent objects. A SchemaSet is a collection of Schema. A schema is a collection of SchemaDocument that have a common namespace. The SchemaSet is the ultimate root of all the objects in a compilation unit. The Term class is the base for everything that can have a representation in the data stream.
Many SchemaComponent carry DFDL annotations; hence, AnnotatedSchemaComponent is a key base trait.
UML Class Diagram
See the Daffodil Wiki for class diagrams.
Terminology
Parsing - in this description we are talking about the Daffodil Schema Compiler. So when we refer to "parsing" the XML, we are referring to the recursive descent walk of the DFDL schema, with that schema represented as Scala's scala.xml.Node objects. Strictly speaking, the string text in files of the DFDL schema's XML is already parsed into Scala's scala.xml.Node objects, but it is the walk through that structure constructing the DSOM tree/graph that we refer to as "parsing" the DFDL schema.
Principles of Operation
Constructing the DSOM Graph
The DSOM object graph must be constructed by looking at only the XML without examining any DFDL annotations. The DSOM structure is required in order to implement DFDL's scoping rules for finding annotations including both properties (like dfdl:byteOrder) and statements (like dfdl:assert); hence, one must have the DSOM graph before one can begin accessing DFDL annotations or you end up in cycles/stack-overflows.
This requires a careful consideration of class/trait members and methods that are used when constructing the DSOM graph, and those used after the DSOM graph has been created, in order to compile it into the runtime data structures.
There are a few exceptions to the above. The dfdl:hiddenGroupRef attribute is one such. It must be local to the Sequence object, and has implications for the parsing of the XML as it implies there should be no children of that xs:sequence. Since it is not scoped, the DSOM graph is not needed in order to access it, only the local Sequence object and its DFDLSequence annotation object. The AnnotatedSchemaComponent trait provides methods for this local-only property lookup.
The DSOM object graph is also needed in order to issue good diagnostic messages from the compiler; hence, Daffodil validates the DFDL schema before parsing it into the DSOM graph. Careful consideration must be given if a SchemaDefinitionError (SDE) is issued while constructing the DSOM graph.
If you run into stack-overflows while the DSOM graph is being constructed, the above is a common cause of them, as the SDE diagnostic messaging uses DSOM graph information to construct context information about the error for inclusion in the messages. If the DSOM graph is still being constructed at that time, then this can be circular.
Using the DSOM Graph
DSOM supports Daffodil schema compilation by way of the OOLAG pattern, which is an object-oriented way of using the attribute grammars compiler technique.
Many attributes (in the attribute grammar sense, nothing to do with XML attributes) are simply Scala lazy val definitions, but some are declared as OOLAG attributes (using org.apache.daffodil.oolag.OOLAG.OOLAGHost.LV), which provides for gathering of multiple diagnostic messages (SchemaDefinitionError) before abandoning compilation.
DFDL schema compilation largely occurs by evaluating lazy val members of DSOM objects. These include the members of the grammar traits (see the org.apache.daffodil.grammar package), which are mixed in to the appropriate DSOM traits/classes.
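The general idea of lazily evaluated attributes that gather several diagnostics can be sketched as below. This is not the OOLAG API. Host, attempt, SketchSDE, and ElementSketch are invented for the illustration, and the real LV mechanism may differ considerably.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.{Failure, Success, Try}

object OolagSketch {
  // Stand-in for a schema definition error; not Daffodil's actual class.
  final class SketchSDE(msg: String) extends Exception(msg)

  // Hypothetical host that records diagnostics instead of stopping at the first one.
  final class Host {
    val diagnostics: ListBuffer[String] = ListBuffer.empty

    def attempt[T](name: String)(body: => T): Option[T] =
      Try(body) match {
        case Success(v) => Some(v)
        case Failure(e: SketchSDE) =>
          diagnostics += s"$name: ${e.getMessage}"
          None
        case Failure(other) => throw other
      }
  }

  // Attributes are lazy vals: each is computed at most once, on demand.
  final class ElementSketch(host: Host, lengthText: String, encodingText: String) {
    lazy val length: Option[Int] = host.attempt("length") {
      Try(lengthText.toInt).getOrElse(throw new SketchSDE(s"not an integer: $lengthText"))
    }
    lazy val encoding: Option[String] = host.attempt("encoding") {
      if (Set("UTF-8", "US-ASCII").contains(encodingText)) encodingText
      else throw new SketchSDE(s"unknown encoding: $encodingText")
    }
  }
}
```

Evaluating both attributes of a bad element leaves two messages in the host, so a caller can report them together before abandoning compilation.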
FAQ
Q: Why invent this? Why not use XSOM, or the Apache XML Schema library?
A: We had trouble with other XML-schema libraries for lack of adequate support for annotations, non-native attributes, and schema documents as first-class objects, so DSOM is specific to Daffodil. Basically, these libraries are more about implementing XML Schema and validation, and not so much about a complex language built on the annotations of the schema. DSOM is really mostly about the annotations.