After the XML declaration, the document prolog can include a DTD, which lets you specify the kinds of tags that can be included in your XML document. In addition to telling a validating parser which tags are valid, and in what arrangements, a DTD tells both validating and nonvalidating parsers where text is expected, which lets the parser determine whether the whitespace it sees is significant or ignorable.
When you were parsing the slide show, for example, you saw that the characters
method was invoked multiple times before and after comments and slide elements.
In those cases, the whitespace consisted of the line endings and indentation
surrounding the markup. The goal was to make the XML document readable -- the
whitespace was not in any way part of the document contents. To begin learning
about DTD definitions, let's start by telling the parser where whitespace is
ignorable.
Note: The DTD defined in this section is contained in slideshow1a.dtd.
Start by creating a file named slideshow.dtd. Enter an XML declaration and a comment to identify the file, as shown below:
<?xml version='1.0' encoding='us-ascii'?>
<!-- DTD for a simple "slide show". -->
Next, add the text highlighted below to specify that a slideshow element contains slide elements and nothing else:
<!-- DTD for a simple "slide show". -->
<!ELEMENT slideshow (slide+)>
As you can see, the DTD tag starts with <! followed by the tag name (ELEMENT). After the tag name comes the name of the element that is being defined (slideshow) and, in parentheses, one or more items that indicate the valid contents for that element. In this case, the notation says that a slideshow consists of one or more slide elements.
Without the plus sign, the definition would be saying that a slideshow consists of a single slide element. Here are the qualifiers you can add to an element definition:
Qualifier  Name           Meaning
?          Question Mark  Optional (zero or one)
*          Asterisk       Zero or more
+          Plus Sign      One or more
You can include multiple elements inside the parentheses in a comma-separated list, and use a qualifier on each element to indicate how many instances of that element may occur. The comma-separated list tells which elements are valid and the order in which they can occur.
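As a small illustration of qualifiers in a comma-separated list, here is a sketch that extends the slide definition. The subtitle element is hypothetical and is not part of the DTD built in this section.

```xml
<!-- Sketch only: "subtitle" is a hypothetical element added for illustration. -->
<!-- A slideshow holds one or more slides. -->
<!ELEMENT slideshow (slide+)>
<!-- A slide holds exactly one title, then an optional subtitle,
     then zero or more items, in that order. -->
<!ELEMENT slide (title, subtitle?, item*)>
```

Under this definition, a slide that contains only a title is valid, while a slide that places an item before the title, or that has two subtitle elements, is not.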
You can also nest parentheses to group multiple items. For example, after defining an image element (coming up shortly), you could declare that every image element must be paired with a title element in a slide by specifying ((image, title)+). Here, the plus sign applies to the image/title pair to indicate that one or more pairs of the specified items can occur.
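The grouped declaration might look like the following sketch. The image element has not been defined at this point in the tutorial, so declaring it as EMPTY here is an assumption made only to keep the example short.

```xml
<!-- Sketch only: the EMPTY content model for image is an assumption. -->
<!ELEMENT slide ((image, title)+)>
<!ELEMENT image EMPTY>
<!ELEMENT title (#PCDATA)>
```

A slide fragment with two complete pairs would then satisfy the definition (the title text is a placeholder):

```xml
<slide>
  <image/>
  <title>First picture</title>
  <image/>
  <title>Second picture</title>
</slide>
```

A slide with an image that is not followed by a title, or with a title that comes first, would not match the content model.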
Now that you have told the parser something about where not to expect text, let's see how to tell it where text can occur. Add the text highlighted below to define the slide, title, and item elements:
<!ELEMENT slideshow (slide+)>
<!ELEMENT slide (title, item*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT item (#PCDATA | item)* >
The first line you added says that a slide consists of a title followed by zero or more item elements. Nothing new there. The next line says that a title consists entirely of parsed character data (PCDATA). That's known as "text" in most parts of the country, but in XML-speak it's called "parsed character data". (That distinguishes it from CDATA sections, which contain character data that is not parsed.) The "#" that precedes PCDATA indicates that what follows is a special word, rather than an element name.
The last line introduces the vertical bar (|), which indicates an or condition. In this case, either PCDATA or an item can occur. The asterisk at the end says that either one can occur zero or more times in succession. The result of this specification is known as a mixed-content model, because any number of item elements can be interspersed with the text. Such models must always be defined with #PCDATA specified first, some number of alternate items divided by vertical bars (|), and an asterisk (*) at the end.
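To make the mixed-content model concrete, here is a fragment that should be valid against the item definition above. The text content is invented for illustration.

```xml
<!-- Valid against <!ELEMENT item (#PCDATA | item)* > -->
<!-- Text and nested item elements interspersed in any order: -->
<item>Why WonderWidgets are great
  <item>They are small</item>
  and they are
  <item>inexpensive</item>
</item>

<!-- Also valid: text only, or no content at all,
     because the asterisk allows zero occurrences. -->
<item>Overview</item>
<item/>
```

Note that whitespace inside a mixed-content element is never treated as ignorable, since text is allowed anywhere within it.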
It would be nice if we could specify that an item contains either text, or text followed by one or more list items. But that kind of specification turns out to be hard to achieve in a DTD. For example, you might be tempted to define an item like this:
<!ELEMENT item (#PCDATA | (#PCDATA, item+)) >
That would certainly be accurate, but as soon as the parser sees #PCDATA and the vertical bar, it requires the remaining definition to conform to the mixed-content model. This specification doesn't, so you get an error that says: Illegal mixed content model for 'item'. Found &#x28; ..., where the hex character 28 is the left parenthesis that opens the nested group.
Trying to double-define the item element doesn't work, either. A specification like this:
<!ELEMENT item (#PCDATA) >
<!ELEMENT item (#PCDATA, item+) >
produces a "duplicate definition" warning when the validating parser runs. The second definition is, in fact, ignored. So it seems that defining a mixed content model (which allows item elements to be interspersed in text) is about as good as we can do.
In addition to the limitations of the mixed content model mentioned above, there is no way to further qualify the kind of text that can occur where PCDATA has been specified. Should it contain only numbers? Should it be in a date format, or possibly a monetary format? There is no way to say in the context of a DTD.
Finally, note that the DTD offers no sense of hierarchy. The definition for the title element applies equally to a slide title and to an item title. When we expand the DTD to allow HTML-style markup in addition to plain text, it would make sense to restrict the size of an item title compared to a slide title, for example. But the only way to do that would be to give one of them a different name, such as "item-title". The bottom line is that the lack of hierarchy in the DTD forces you to introduce a "hyphenation hierarchy" (or its equivalent) in your namespace. All of these limitations are fundamental motivations behind the development of schema-specification standards.
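A "hyphenation hierarchy" might look like the following sketch. The item-title element and the way it is placed inside item are hypothetical. They only show how a second name lets the two kinds of title carry separate definitions.

```xml
<!-- Sketch only: item-title is a hypothetical element name. -->
<!ELEMENT slide      (title, item*)>
<!-- Title of a slide. -->
<!ELEMENT title      (#PCDATA)>
<!-- An item may now carry its own, separately defined, kind of title. -->
<!ELEMENT item       (#PCDATA | item-title | item)*>
<!ELEMENT item-title (#PCDATA)>
```

Both titles hold plain text here, but because they have different names, their definitions could later diverge without affecting each other.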
Rather than specifying a parenthesized list of elements, the element definition could use one of two special values: ANY or EMPTY.
The ANY specification says that the element may contain any other defined element, or PCDATA. Such a specification is usually used for the root element of a general-purpose XML document such as you might create with a word processor. Textual elements could occur in any order in such a document, so specifying ANY makes sense.
The EMPTY specification says that the element contains no contents. So a DTD for email messages that lets you "flag" a message with <flag/> might have a line like this:
<!ELEMENT flag EMPTY>
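As a quick illustration of what EMPTY permits, the fragment below shows forms that a validating parser should accept and reject for the flag element. The word "urgent" is just placeholder text.

```xml
<!-- Given <!ELEMENT flag EMPTY>, both of these forms are valid: -->
<flag/>
<flag></flag>

<!-- These are not valid, because an EMPTY element may not contain
     anything, not even whitespace: -->
<flag>urgent</flag>
<flag> </flag>
```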
In this case, the DTD definition is in a separate file from the XML document. That means you have to reference it from the XML document, which makes the DTD file part of the external subset of the full Document Type Definition (DTD) for the XML file. As you'll see later on, you can also include parts of the DTD within the document. Such definitions constitute the local subset of the DTD (the XML specification calls it the internal subset).
Note: The XML written in this section is contained in slideSample05.xml.
To reference the DTD file you just created, add the line highlighted below to your slideSample.xml file:
<!-- A SAMPLE set of slides -->
<!DOCTYPE slideshow SYSTEM "slideshow.dtd">
<slideshow
Again, the DTD tag starts with "<!". In this case, the tag name, DOCTYPE, says that the document is a slideshow, which means that the document consists of the slideshow element and everything within it:
<slideshow> ... </slideshow>
This tag defines the slideshow element as the root element for the document. An XML document must have exactly one root element. This is where that element is specified. In other words, this tag identifies the document content as a slideshow.
The DOCTYPE tag occurs after the XML declaration and before the root element. The SYSTEM identifier specifies the location of the DTD file. Since it does not start with a prefix like http:/ or file:/, the path is relative to the location of the XML document. Remember the setDocumentLocator method? The parser is using that information to find the DTD file, just as your application would to find a file relative to the XML document. A PUBLIC identifier could also be used to specify the DTD file using a unique name, but the parser would have to be able to resolve it.
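The alternatives can be compared side by side in the sketch below. The URL and the public identifier are hypothetical, and a real document would contain only one DOCTYPE declaration.

```xml
<!-- Relative system identifier, resolved against the document's location: -->
<!DOCTYPE slideshow SYSTEM "slideshow.dtd">

<!-- Absolute system identifier (hypothetical URL): -->
<!DOCTYPE slideshow SYSTEM "http://www.example.com/dtds/slideshow.dtd">

<!-- Public identifier (hypothetical name), followed by a system
     identifier that gives a location for the DTD: -->
<!DOCTYPE slideshow PUBLIC "-//Example//DTD Slide Show 1.0//EN" "slideshow.dtd">
```

In the PUBLIC form, XML requires the system identifier to follow the public one, so a parser that cannot resolve the public name still has a location to try.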
The DOCTYPE specification could also contain DTD definitions within the XML document, rather than referring to an external DTD file. Such definitions would be contained in square brackets, like this:
<!DOCTYPE slideshow SYSTEM "slideshow1.dtd" [ ...local subset definitions here... ]>
You'll take advantage of that facility later on to define some entities that can be used in the document.
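For comparison, here is a sketch of a small, self-contained document that carries the element definitions from this section entirely in the square-bracketed subset, with no external file. The slide content is placeholder text, and the slideshow element is shown without attributes because none have been declared.

```xml
<?xml version='1.0' encoding='us-ascii'?>
<!DOCTYPE slideshow [
  <!-- Same declarations as slideshow.dtd, placed inside the document. -->
  <!ELEMENT slideshow (slide+)>
  <!ELEMENT slide (title, item*)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT item (#PCDATA | item)* >
]>
<slideshow>
  <slide>
    <title>Overview</title>
    <item>First point</item>
    <item>Second point</item>
  </slide>
</slideshow>
```

Because the declarations travel with the document, a validating parser can check this file without having to locate slideshow.dtd.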
Note: If a public ID (URN) is specified instead of a system ID (URL), then the parser has to be able to resolve it to an actual address in order to use it. To do that, the parser can be configured with a com.sun.xml.parser.Resolver using the parser's setEntityResolver method, and the URN can be associated with a local URL using the resolver's registerCatalogEntry method.