Monday, November 23, 2009

Schematron: Nesting XPath Values Within an XPath Predicate

In previous examples, we have seen a temporary variable (a <let> tag) used to store a value that is later referenced in an XPath predicate (the square brackets that filter the nodes selected by a step).  It is important to note that the variable is not required.  An XPath expression can be used directly inside the predicate of another XPath expression.  For example, see the following:

<pattern id="wEmptyMetadataComment">
  <title>Ensure person metadata comment is not blank.</title>
  <rule context="/ns:SomeDocument/nc:Person">
    <assert test="string-length(/ns:SomeDocument/nc:Metadata[@s:id=current()/@s:metadata]/nc:CommentText) &gt; 0">
      Comments regarding a person should not be blank.
    </assert>
  </rule>
</pattern>

In the above example, the predicate @s:id=current()/@s:metadata is used to identify which specific Metadata element should be examined.  With the context defined as /ns:SomeDocument/nc:Person, the rule is evaluated once for each nc:Person element, and current() refers to the nc:Person being evaluated, so the appropriate @s:metadata value is used on each pass.
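
For comparison, the same check can be written with a temporary variable, as in the earlier examples.  The following is a minimal sketch rather than part of the original rule set: the pattern id and the variable name sMetadataId are made up for illustration, and it assumes that s:metadata carries a single ID (NIEM allows the attribute to list more than one ID reference, in which case a simple equality test would not match).

```xml
<pattern id="wEmptyMetadataCommentLet">
  <title>Ensure person metadata comment is not blank (variable form).</title>
  <rule context="/ns:SomeDocument/nc:Person">
    <!-- Hypothetical variable name; holds this person's metadata reference -->
    <let name="sMetadataId" value="@s:metadata"/>
    <assert test="string-length(/ns:SomeDocument/nc:Metadata[@s:id = $sMetadataId]/nc:CommentText) &gt; 0">
      Comments regarding a person should not be blank.
    </assert>
  </rule>
</pattern>
```

Both forms select the same nc:Metadata element.  The nested form saves a line, while the variable form can be easier to read when the value is reused in several assertions.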

[Updated: Corrected Syntax on 04-01-2010]

Friday, November 13, 2009

Schematron: Validating NIEM Documents Against Non-Conformant Code Lists

Schematron rules and assertions are based upon XPath statements, which allow for a number of powerful XML querying capabilities.  Two XPath capabilities leveraged and outlined in this section are doc() and XPath predicates, which allow us to validate data captured in a NIEM XML instance against an external XML code list, even one that is not NIEM-conformant.

Let’s assume a scenario where we would like to validate an exchange document’s category against a predefined list of enumerated values.  This list is maintained by an outside party in a format other than NIEM and changes on a fairly regular basis.

Traditionally, a NIEM practitioner would take this list and define an enumeration within an extension schema to enforce this code list.  Each time the third party makes a change to that code list, an updated NIEM extension schema would be created and redistributed.  This maintenance-intensive process could become overwhelming, so the team chose instead to simply adopt the third-party list, keep it in the following non-conformant format, and rely on Schematron to perform the validation:

<?xml version="1.0" encoding="UTF-8"?>
<!-- List of Valid code Values -->
<CategoryList>
  <Category>a</Category>
  <Category>b</Category>
</CategoryList>

As shown above, the valid categories include the values “a” and “b”.  An example of a NIEM-conformant XML payload would look something like the following:

<ns:SomeDocument
    xmlns:nc="http://niem.gov/niem/niem-core/2.0"
    xmlns:ns="http://www.niematron.org/SchematronTestbed"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.niematron.org/SchematronTestbed ./SomeDocument.xsd">
  <nc:DocumentCategoryText>A</nc:DocumentCategoryText>
  <!-- Remaining Document Elements Omitted -->
</ns:SomeDocument>

In this example, the developers would like to perform the validation ignoring case.  The Schematron rule to validate nc:DocumentCategoryText against the third-party-provided list would therefore look something like the following:

<pattern id="eDocumentCategory">
  <title>Verify the document category matches the external list of valid categories.</title>
  <rule context="/ns:SomeDocument">
    <let name="sText" value="lower-case(nc:DocumentCategoryText)"/>
    <assert test="count(doc('./CategoryList.xml')/CategoryList/Category[. = $sText]) &gt; 0">
      Invalid document category.
    </assert>
  </rule>
</pattern>

Let’s look at some of the key statements in the above Schematron example, breaking it into individual parts.

  • lower-case(nc:DocumentCategoryText) – This statement, encapsulated in a <let> tag, converts the text in the NIEM payload to lower case, so that differences in case alone do not cause a mismatch with the code list.  The result is stored in a temporary variable named $sText.
  • doc('./CategoryList.xml')/… – This points the parser at the third-party-provided file (in this example assumed to be in the same directory as the .sch file) so that elements from that file can be referenced in the XPath in addition to elements in the source payload document.
  • …/Category[. = $sText] – Square brackets ([ and ]) in an XPath statement enclose a predicate.  Any number of predicates can be used to filter the nodes selected by an XPath, but in this case the expression tells the parser to select all of the Category elements whose value equals the value of the variable $sText.
  • count(…) &gt; 0 – The XPath count function returns the number of nodes selected by the XPath.  If no matching category exists, count returns zero, so we test that the value is greater than zero, meaning a match exists in the external code list.
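
Putting the pieces together, the following sketch places the rule inside a complete schema and compares in a case-insensitive way on both sides, in case the third-party list itself contains upper-case values.  It is an illustration under assumptions: the Schematron namespace and the queryBinding value depend on the Schematron implementation in use (lower-case() and doc() need an XPath 2.0 binding), and the pattern id and the validCodes variable are hypothetical names.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Prefixes used in the XPath expressions must be declared with <ns> -->
  <ns prefix="nc" uri="http://niem.gov/niem/niem-core/2.0"/>
  <ns prefix="ns" uri="http://www.niematron.org/SchematronTestbed"/>

  <pattern id="eDocumentCategoryBothSides">
    <title>Verify the document category against the external list, ignoring case.</title>
    <rule context="/ns:SomeDocument">
      <!-- Payload value, lower-cased -->
      <let name="sText" value="lower-case(nc:DocumentCategoryText)"/>
      <!-- All Category elements from the external code list -->
      <let name="validCodes" value="doc('./CategoryList.xml')/CategoryList/Category"/>
      <!-- Passes if at least one list entry equals the payload value, ignoring case -->
      <assert test="some $code in $validCodes satisfies lower-case($code) = $sText">
        Invalid document category "<value-of select="nc:DocumentCategoryText"/>".
      </assert>
    </rule>
  </pattern>
</schema>
```

The "some … satisfies" expression plays the same role as count(…) &gt; 0: the assertion holds as soon as one matching entry is found.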

Friday, November 6, 2009

Schematron: Enforce String Patterns in Schematron

In the general area of XML schemas, XSD “patterns” are commonly used to enforce special string formatting constraints.  This is a very powerful tool when a document recipient wishes to ensure that the sender provides string data in a consistent format.  A common example of a string constraint is validating the structure of a Social Security Number (SSN).  This would be expressed in a typical schema in the following manner:

<xsd:simpleType name="SsnSimpleType">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{3}[\-][0-9]{2}[\-][0-9]{4}" />
    </xsd:restriction>
</xsd:simpleType>

As with most parts of NIEM, much of the model is based on inheritance, which makes enforcement of simple data types, such as that shown above, cumbersome and awkward.  Semantically, the correct element for an SSN would be:

nc:Person/nc:PersonSSNIdentification/nc:IdentificationID

Since nc:PersonSSNIdentification is an nc:IdentificationType, if one were to enforce SSN formatting on nc:IdentificationID, any other part of the schema that is derived from nc:IdentificationType would also need to abide by the same rules (e.g. Driver License Number, State ID Number, Document Identification, etc.).  In the past this situation led to one thing. . . extension.

With Schematron, extension for this purpose could be avoided.  Rather than enforcing the string constraints in the XSD file, the IEPD publisher could enforce them within the Schematron rules document.  The following is an example of the Schematron code required to accomplish this:

<pattern id="ePersonSSN">
  <title>Verify person social security number is in the correct format.</title>
  <rule context="/ns:SomeDocument/nc:Person/nc:PersonSSNIdentification">
    <assert test=
      "count(tokenize(nc:IdentificationID,'[0-9]{3}-[0-9]{2}-[0-9]{4}')) 
      - 1 = 1">
       Social security number must be in the proper format (e.g. 111-22-3333).
    </assert>
  </rule>
</pattern>

By using the Schematron approach, the semantically correct element is preserved in the schema and only the appropriate identifier is subjected to the constraint.
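
One caveat with the tokenize() test is that it only checks that the pattern occurs once somewhere in the value, so a value such as "x123-45-6789y" would also pass.  Where an XPath 2.0 binding is available, a stricter variant can be sketched with matches() and an anchored expression.  The pattern id below is a hypothetical name, and this is offered as an alternative, not as the rule used in the original IEPD.

```xml
<pattern id="ePersonSSNAnchored">
  <title>Verify person social security number is in the correct format.</title>
  <rule context="/ns:SomeDocument/nc:Person/nc:PersonSSNIdentification">
    <!-- ^ and $ anchor the expression so the whole value must match -->
    <assert test="matches(nc:IdentificationID, '^[0-9]{3}-[0-9]{2}-[0-9]{4}$')">
      Social security number must be in the proper format (e.g. 111-22-3333).
    </assert>
  </rule>
</pattern>
```

Unlike XSD patterns, which always apply to the whole value, matches() looks for a match anywhere in the string unless the expression is anchored.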

This approach can be extended to other string constraints.  Another example would be ensuring an identification number contains five or more digits.  Because tokenize() splits the value at every digit, counting the resulting tokens counts the digits; note that the expression does not by itself reject non-digit characters.  This could be done by using the following XPath count() expression instead:

count(tokenize(nc:IdentificationID, '\d')) &gt; 5
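
If the identifier must consist of digits only, an anchored matches() test is one way to check both conditions at once.  The following assertion is a sketch that assumes an XPath 2.0 binding; the message text is illustrative.

```xml
<!-- Whole value must be five or more digits, with no other characters -->
<assert test="matches(nc:IdentificationID, '^[0-9]{5,}$')">
  Identification number must contain only digits and be at least 5 digits long.
</assert>
```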

This very powerful approach to constraining strings is yet another reason to take a really good look at Schematron in conjunction with your NIEM IEPDs.

Wednesday, November 4, 2009

Schematron: Correct nc:DateRepresentation Usage

The inherent flexibility of NIEM proves incredibly beneficial when used correctly; however, this flexibility can also be one of its largest banes.  It can lead to confusion when implementers attempt to deploy a NIEM exchange which is “valid” according to the XSD, yet not what the recipient is expecting.

One such example is NIEM’s usage of substitution groups, where a variety of data elements are legal according to the schema but rarely are all of these legal options accounted for by the recipient’s adapter.  Take NIEM’s DateType as an example.  It employs the explicit substitution group (abstract data element) nc:DateRepresentation, which can be replaced by one of several different elements: a date (2009-01-01), a date/time (2009-01-01T12:00:00), a year and a month (2009-01), etc.

Let’s assume for a minute that a document has two different dates: a document filed date, and a person’s birth date.  The publisher’s intention is that the filed date be a “timestamp” which includes both a date and a time, while the birth date is simply a date including a month, day and year.  A valid sample XML payload would look something like the following:

<?xml version="1.0" encoding="UTF-8"?>
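<ns:SomeDocument
    xmlns:nc="http://niem.gov/niem/niem-core/2.0"
    xmlns:ns="http://www.niematron.org/SchematronTestbed">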
  <nc:DocumentFiledDate>
    <nc:DateTime>2009-01-01T01:00:00</nc:DateTime>
  </nc:DocumentFiledDate>
  <nc:Person>
    <nc:PersonBirthDate>
      <nc:Date>1970-01-01</nc:Date>
    </nc:PersonBirthDate>
  </nc:Person>
</ns:SomeDocument>

The Schematron code to enforce the publisher’s intentions could appear as the following:

<pattern id="eDocumentDateTime">
  <title>Verify the document filed date includes a date/time.</title>
  <rule context="ns:SomeDocument/nc:DocumentFiledDate">
    <assert test="nc:DateTime">
      A date and a time must be provided as the document filed date.
    </assert>
  </rule>
</pattern>
<pattern id="ePersonBirthDate">
  <title>Ensure the person's birth date is an nc:Date.</title>
  <rule context="ns:SomeDocument/nc:Person/nc:PersonBirthDate">
    <assert test="nc:Date">
      A person's birth date must be a full date.
    </assert>
  </rule>
</pattern>
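
To see what these rules catch, consider the following hypothetical payload.  It assumes the same namespaces as the earlier sample and uses nc:Date and nc:YearMonth, both of which are taken here to be legal substitutions for nc:DateRepresentation, so the dates would be acceptable to the XSD.  Each date nevertheless fails its Schematron assertion.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ns:SomeDocument
    xmlns:nc="http://niem.gov/niem/niem-core/2.0"
    xmlns:ns="http://www.niematron.org/SchematronTestbed">
  <nc:DocumentFiledDate>
    <!-- A date without a time: fails eDocumentDateTime (no nc:DateTime child) -->
    <nc:Date>2009-01-01</nc:Date>
  </nc:DocumentFiledDate>
  <nc:Person>
    <nc:PersonBirthDate>
      <!-- Year and month only: fails ePersonBirthDate (no nc:Date child) -->
      <nc:YearMonth>1970-01</nc:YearMonth>
    </nc:PersonBirthDate>
  </nc:Person>
</ns:SomeDocument>
```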

This is a great example of how Schematron can help clarify a publisher’s intent as NIEM-conformant services are developed and deployed.