Friday, November 6, 2009

Schematron: Enforce String Patterns in Schematron

In the general area of XML schemas, XSD “patterns” are commonly used to enforce special string formatting constraints.  This is a very powerful tool when a document recipient wishes to ensure that the sender provides string data in a consistent format.  A common example of a string constraint is validating the structure of a Social Security Number (SSN).  This would be expressed in a typical schema in the following manner:

<xsd:simpleType name="SsnSimpleType">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{3}[\-][0-9]{2}[\-][0-9]{4}" />
    </xsd:restriction>
</xsd:simpleType>

As with most parts of NIEM, much of the model is based on inheritance, which makes enforcement of simple data types, such as that shown above, cumbersome and awkward.  Semantically, the correct element for an SSN would be:

nc:Person/nc:PersonSSNIdentification/nc:IdentificationID

Since nc:PersonSSNIdentification is an nc:IdentificationType, if one were to enforce SSN formatting on nc:IdentificationID, any other part of the schema that is derived from nc:IdentificationType would also need to abide by the same rules (e.g. Driver License Number, State ID Number, Document Identification, etc.).  In the past this situation led to one thing. . . extension.

With Schematron, extension for this purpose could be avoided.  Rather than enforcing the string constraints in the XSD file, the IEPD publisher could enforce them within the Schematron rules document.  The following is an example of the Schematron code required to accomplish this (it relies on XPath 2.0 functions, so the schema must use an XPath 2.0 query binding such as xslt2):

<pattern id="ePersonSSN">
  <title>Verify person social security number is in the correct format.</title>
  <rule context="/ns:SomeDocument/nc:Person/nc:PersonSSNIdentification">
    <!-- The ^ and $ anchors require the whole value to match, so leading
         or trailing characters around an SSN are rejected. -->
    <assert test=
      "matches(nc:IdentificationID, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')">
       Social security number must be in the proper format (e.g. 111-22-3333).
    </assert>
  </rule>
</pattern>

By using the Schematron approach, the semantically equivalent element is preserved in the schema and only the appropriate identifier is subjected to the constraint.
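For readers who want to try the SSN rule, here is a minimal sketch of a complete schema document around it.  The namespace URIs are assumptions: the nc URI is the one I would expect for NIEM Core 2.0, and the ns URI is purely hypothetical, so substitute whatever your IEPD actually declares.  The sketch uses an anchored matches() call so that the entire value must fit the pattern.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- queryBinding="xslt2" makes XPath 2.0 functions such as matches() available -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <title>SSN format rules (illustrative sketch)</title>

  <!-- Assumed namespace URIs; replace with the ones used by your exchange -->
  <ns prefix="nc" uri="http://niem.gov/niem/niem-core/2.0"/>
  <ns prefix="ns" uri="http://example.com/some-document"/>

  <pattern id="ePersonSSN">
    <title>Verify person social security number is in the correct format.</title>
    <rule context="/ns:SomeDocument/nc:Person/nc:PersonSSNIdentification">
      <!-- Anchors force the whole string to match, not just a substring -->
      <assert test="matches(nc:IdentificationID, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')">
        Social security number must be in the proper format (e.g. 111-22-3333).
      </assert>
    </rule>
  </pattern>
</schema>
```

Because the rule context names nc:PersonSSNIdentification specifically, other elements of nc:IdentificationType are not touched by this assertion.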

This approach can be further extended to address any number of string constraints.  Another example would be ensuring an identification number only contains digits and has a string length of 5 or more.  Counting tokens alone is not enough here, since a value with letters between the digits would still produce enough tokens.  This could be done by using the following XPath 2.0 test instead:

matches(nc:IdentificationID, '^[0-9]{5,}$')
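As a rough sketch, a digits-only, minimum-length check could be wrapped in its own pattern as shown here.  The pattern id and the choice of nc:PersonStateIdentification as the context are assumptions made only for illustration; point the rule at whichever identifier your exchange needs to constrain.  An anchored matches() test is used so that any non-digit character causes the assertion to fail.

```xml
<pattern id="eStateIdDigits">
  <title>Verify the identifier is all digits and at least 5 characters long.</title>
  <!-- Hypothetical context: any element of nc:IdentificationType would work -->
  <rule context="/ns:SomeDocument/nc:Person/nc:PersonStateIdentification">
    <!-- [0-9]{5,} means five or more ASCII digits; ^ and $ cover the whole value -->
    <assert test="matches(nc:IdentificationID, '^[0-9]{5,}$')">
      The identification number must contain only digits and be at least 5 digits long.
    </assert>
  </rule>
</pattern>
```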

This very powerful approach to constraining strings is yet another reason to take a real good look at Schematron in conjunction with your NIEM IEPDs.

Wednesday, November 4, 2009

Schematron: Correct nc:DateRepresentation Usage

The inherent flexibility of NIEM proves to be incredibly beneficial when used correctly; however, this benefit can also be one of its largest banes.  Sometimes this flexibility can lead to confusion when implementers attempt to deploy a NIEM exchange which is “valid” according to the XSD, yet not what the recipient is expecting.

One such example is NIEM’s usage of substitution groups, where a variety of data elements are legal according to the schema, but rarely are all of these legal options accounted for by the recipient’s adapter.  Take NIEM’s DateType as an example.  It employs the explicit substitution group (abstract data element) of nc:DateRepresentation, which can be one of several different data types.  This representation can be replaced with a date (2009-01-01), a date/time (2009-01-01T12:00:00), a year and a month (2009-01), etc.

Let's assume for a minute that a document has two different dates: a document filed date, and a person’s birth date.  The publisher’s intention is that the filed date be a “timestamp” which includes both a date and a time, while the birth date is simply a date including a month, day and year.  A valid sample XML payload would look something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<ns:SomeDocument>
  <nc:DocumentFiledDate>
    <nc:DateTime>2009-01-01T01:00:00</nc:DateTime>
  </nc:DocumentFiledDate>
  <nc:Person>
    <nc:PersonBirthDate>
      <nc:Date>1970-01-01</nc:Date>
    </nc:PersonBirthDate>
  </nc:Person>
</ns:SomeDocument>

The Schematron code to enforce the publisher’s intentions could appear as the following:

<pattern id="eDocumentDateTime">
  <title>Verify the document filed date includes a date/time.</title>
  <rule context="ns:SomeDocument/nc:DocumentFiledDate">
    <assert test="nc:DateTime">
      A date and a time must be provided as the document filed date.
    </assert>
  </rule>
</pattern>
<pattern id="ePersonBirthDate">
  <title>Ensure the person's birth date is an nc:Date.</title>
  <rule context="ns:SomeDocument/nc:Person/nc:PersonBirthDate">
    <assert test="nc:Date">
      A person's birth date must be a full date.
    </assert>
  </rule>
</pattern>
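
For contrast, here is a sketch of a payload that these rules are meant to catch.  Since nc:Date is one of the legal substitutes for nc:DateRepresentation, the filed date would be expected to pass XSD validation, yet the eDocumentDateTime assertion fails because no nc:DateTime child is present.  The namespace URIs are assumptions (the ns URI is purely hypothetical); use the ones your exchange declares.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed namespace URIs, shown only so the sample is namespace-well-formed -->
<ns:SomeDocument xmlns:ns="http://example.com/some-document"
                 xmlns:nc="http://niem.gov/niem/niem-core/2.0">
  <!-- Date only, no time: legal per the substitution group, rejected by eDocumentDateTime -->
  <nc:DocumentFiledDate>
    <nc:Date>2009-01-01</nc:Date>
  </nc:DocumentFiledDate>
  <nc:Person>
    <!-- Full date: satisfies the ePersonBirthDate assertion -->
    <nc:PersonBirthDate>
      <nc:Date>1970-01-01</nc:Date>
    </nc:PersonBirthDate>
  </nc:Person>
</ns:SomeDocument>
```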

This is a great example of how Schematron can help clarify a publisher’s intent as NIEM-conformant services are developed and deployed.