Friday, November 6, 2009

Schematron: Enforce String Patterns in Schematron

In the general area of XML schemas, XSD “patterns” are commonly used to enforce special string formatting constraints.  This is a very powerful tool when a document recipient wishes to ensure that the sender provides string data in a consistent format.  A common example is the usage of a string constraint is to validate the structure of a Social Security Number (SSN).  This would be expressed in a typical schema in the following manner:

<xsd:simpleType name="SsnSimpleType">
    <xsd:restriction base="xsd:string">
        <xsd:pattern value="[0-9]{3}[\-][0-9]{2}[\-][0-9]{4}" />
    </xsd:restriction>
</xsd:simpleType>

As with most parts of NIEM, much of the model is based on inheritance which makes enforce of simple data types, such as that shown above, cumbersome and awkward.  Semantically, the correct element for an SSN would be under:

nc:Person/nc:PersonSSNIdentification/ nc:IdentificationID

Since nc:PersonSSNIdentification is an nc:IdentificationType, if one were to enforce SSN formatting on nc:IdentificationID, any other part of the schema that is derived from nc:IdentificationType would also need to abide by the same rules (e.g. Driver License Number, State ID Number, Document Identification, etc.).  In the past this situation led to one thing. . . extension.

With Schematron, extension for this purpose could be avoided.  Rather than enforcing the string constraints in the XSD file, instead the IEPD publisher could enforce this constraint within the Schematron rules document instead.  The following is an example of what code would be required in Schematron to accomplish this purpose:

<pattern id="ePersonSSN">
  <title>Verify person social security number is in the correct format.</title>
  <rule context="/ns:SomeDocument/nc:Person/nc:PersonSSNIdentification">
    <assert test=
      "count(tokenize(nc:IdentificationID,'[0-9]{3}-[0-9]{2}-[0-9]{4}')) 
      - 1 = 1">
       Social security number must be in the proper format (e.g. 11-222-3333).
    </assert>
  </rule>
</pattern>

By using the Schematron approach, the semantically equivalent element is preserved in the schema and only the appropriate identifier is subjected to the constraint.

This approach can be further extended to address any number of string constraints.  Another example would be ensuring an identification number only contains digits and has a string length of 5 or more.  This could be done by using the following XQuery count() query instead:

count(tokenize(nc:IdentificationID, '\d')) &gt; 5

This very powerful approach to constraining strings is yet another reason to take a real good look at Schematron in conjunction with your NIEM IEPDs.

No comments:

Post a Comment