Document Type

Top Contents Index Glossary

5a. Creating a Document Type Definition (DTD)

Link Summary

Exercise Links

API Links

Resolver

Glossary Terms

declaration, DTD, external subset, local subset, mixed-content model, prolog, root, URL, URN, valid

After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid, and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.

Basic DTD Definitions

When you were parsing the slide show, for example, you saw that the characters method was invoked multiple times before and after comments and slide elements. In those cases, the whitespace consisted of the line endings and indentation surrounding the markup. The goal was to make the XML document readable -- the whitespace was not in any way part of the document contents. To begin learning about DTD definitions, let's start by telling the parser where whitespace is ignorable.

Note: The DTD defined in this section is contained in slideshow1a.dtd.

Start by creating a file named slideshow.dtd. Enter an XML declaration and a comment to identify the file, as shown below:

<?xml version='1.0' encoding='us-ascii'?>

<!-- DTD for a simple "slide show". -->

Next, add the text highlight below to specify that a slideshow element contains slide elements and nothing else:

<!-- DTD for a simple "slide show". -->

<!ELEMENT slideshow (slide+)>

As you can see, the DTD tag starts with <! followed by the tag name (ELEMENT). After the tag name comes the name of the element that is being defined (slideshow) and, in parentheses, one or more items that indicate the valid contents for that element. In this case, the notation says that a slideshow consists of one or more slide elements.

Without the plus sign, the definition would be saying that a slideshow consists of a single slide element. Here are the qualifiers you can add to an element definition:

Qualifier Name Meaning

?
Question Mark Optional (zero or one)

*
Asterisk Zero or more

+
Plus Sign One or more

You can include multiple elements inside the parentheses in a comma separated list, and use a qualifier on each element to indicate how many instances of that element may occur. The comma-separated list tells which elements are valid and the order they can occur in.

You can also nest parentheses to group multiple items. For an example, after defining an image element (coming up shortly), you could declare that every image element must be paired with a title element in a slide by specifying ((image, title)+). Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.

Defining Text and Nested Elements

Now that you have told the parser something about where not to expect text, let's see how to tell it where text can occur. Add the text highlighted below to define the slide, title, item, and list elements:

<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >

The first line you added says that a slide consists of a title followed by zero or more item elements. Nothing new there. The next line says that a title consists entirely of parsed character data (PCDATA). That's known as "text" in most parts of the country, but in XML-speak it's called "parsed character data". (That distinguishes it from CDATA sections, which contain character data that is not parsed.) The "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name.

The last line introduces the vertical bar (|), which indicates an or condition. In this case, either PCDATA or an item can occur. The asterisk at the end says that either one can occur zero or more times in succession. The result of this specification is known as a mixed-content model, because any number of item elements can be interspersed with the text. Such models must always be defined with #PCDATA specified first, some number of alternate items divided by vertical bars (|), and an asterisk (*) at the end.

Limitations of DTDs

It would be nice if we could specify that an item contains either text, or text followed by one or more list items. But that kind of specification turns out to be hard to achieve in a DTD. For example, you might be tempted to define an item like this:

<!ELEMENT item (#PCDATA | (#PCDATA, item+)) >

That would certainly be accurate, but as soon as the parser sees #PCDATA and the vertical bar, it requires the remaining definition to conform to the mixed-content model. This specification doesn't, so you get can error that says:

Illegal 
mixed content model for 'item'. Found &#x28; ...

, where the hex character 28 is the angle bracket the ends the definition.

Trying to double-define the item element doesn't work, either. A specification like this:

<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) >

produces a "duplicate definition" warning when the validating parser runs. The second definition is, in fact, ignored. So it seems that defining a mixed content model (which allows item elements to be interspersed in text) is about as good as we can do.

In addition to the limitations of the mixed content model mentioned above, there is no way to further qualify the kind of text that can occur where PCDATA has been specified. Should it contain only numbers? Should be in a date format, or possibly a monetary format? There is no way to say in the context of a DTD.

Finally, note that the DTD offers no sense of hierarchy. The definition for the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to restrict the size of an item title compared to a slide title, for example. But the only way to do that would be to give one of them a different name, such as "item-title". The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All of these limitations are fundamental motivations behind the development of schema-specification standards.

Special Element Values in the DTD

Rather than specifying a parenthesized list of elements, the element definition could use one of two special values: ANY or EMPTY. The ANY specification says that the element may contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements could occur in any order in such a document, so specifying ANY makes sense.

The EMPTY specification says that the element contains no contents. So the DTD for email messages that let you "flag" the message with <flag/> might have a line like this in the DTD:

<!ELEMENT flag EMPTY>

Referencing the DTD

In this case, the DTD definition is in a separate file from the XML document. That means you have to reference it from the XML document, which makes the DTD file part of the external subset of the full Document Type Definition (DTD) for the XML file. As you'll see later on, you can also include parts of the DTD within the document. Such definitions constitute the local subset of the DTD.

Note: The XML written in this section is contained in slideSample05.xml.

To reference the DTD file you just created, add the line highlighted below to your slideSample.xml file:

<!--  A SAMPLE set of slides  -->

<!DOCTYPE slideshow SYSTEM "slideshow.dtd">

<slideshow

Again, the DTD tag starts with "<!". In this case, the tag name, DOCTYPE, says that the document is a slideshow, which means that the document consists of the slideshow element and everything within it:

<slideshow>
  ...
</slideshow>

This tag defines the slideshow element as the root element for the document. An XML document must have exactly one root element. This is where that element is specified. In other words, this tag identifies the document content as a slideshow.

The DOCTYPE tag occurs after the XML declaration and before the root element. The SYSTEM identifier specifies the location of the DTD file. Since it does not start with a prefix like http:/or file:/, the path is relative to the location of the XML document. Remember the setDocumentLocator method? The parser is using that information to find the DTD file, just as your application would to find a file relative to the XML document. A PUBLIC identifier could also be used to specify the DTD file using a unique name -- but the parser would have to be able to resolve it

The DOCTYPE specification could also contain DTD definitions within the XML document, rather than referring to an external DTD file. Such definitions would be contained in square brackets, like this:.

<!DOCTYPE slideshow SYSTEM "slideshow1.dtd" [
  ...local subset definitions here...
]>

You'll take advantage of that facility later on to define some entities that can be used in the document.

Note:
If a public ID (URN) is specified instead of a system ID (URL), then the parser has to be able to resolve it to an actual address in order to use it. To do that, the parser can be configured with a com.sun.xml.parser.Resolver using the parser's setEntityResolver method, and the URN can be associated with a local URL using the resolver's registerCatalogEntry method.

Top Contents Index Glossary

Qualifier	*Name*	Meaning
?	Question Mark	Optional (zero or one)
*	Asterisk	Zero or more
+	Plus Sign	One or more