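The next thing we want to do with the parser is customize it a bit, so you can see how to get information it usually ignores. But before we can do that, you need to learn a few more important XML concepts. In this section, you'll learn how to handle special characters with entity and character references, and how to handle text that contains XML-style syntax with CDATA sections.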
In XML, an entity is an XML structure (or plain text) that has a name. Referencing the entity by name causes it to be inserted into the document in place of the entity reference. To create an entity reference, the entity name is surrounded by an ampersand and a semicolon, like this:
  &entityName;
Later, when you learn how to write a DTD, you'll see that you can define your own entities, so that &yourEntityName; expands to all the text you defined for that entity. For now, though, we'll focus on the predefined entities and character references that don't require any special definitions.
An entity reference like &amp; contains a name (in this case, "amp") between the start and end delimiters. The text it refers to (&) is substituted for the name, like a macro in a C or C++ program. The following table shows the predefined entities for special characters.

| Character | Reference |
|-----------|-----------|
| &         | &amp;     |
| <         | &lt;      |
| >         | &gt;      |
| "         | &quot;    |
| '         | &apos;    |
A character reference like &#8220; contains a hash mark (#) followed by a number. The number is the Unicode value for a single character, such as 65 for the letter A, 8220 for the left curly quote, or 8221 for the right curly quote. In this case, the "name" of the entity is the hash mark followed by the digits that identify the character.
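To see the substitution happen, here is a minimal sketch that parses a one-element document and prints its text. It assumes the standard JAXP DOM parser that ships with the JDK rather than the tutorial's Echo program, and the class and method names (EntityExpansionSketch, textOf) are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityExpansionSketch {
    // Parses a small document and returns the text of its root element.
    // (Hypothetical helper, not part of the tutorial's Echo program.)
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // &amp; and &lt; are predefined entities; &#65; is a character reference.
        System.out.println(textOf("<item>R&amp;D &lt; &#65;</item>")); // prints: R&D < A
    }
}
```

By the time the application sees the text, the references are gone and only the characters they stand for remain.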
Suppose you wanted to insert a line like this in your XML document:
Market Size < predicted
The problem with putting that line into an XML file directly is that when the
parser sees the left-angle bracket (<), it starts looking for a tag name,
which throws off the parse. To get around that problem, you put &lt; in the file instead of "<".
Note: The results of the modifications below are contained in slideSample03.xml. The results of processing it are shown in Echo07-03.log.
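If you want to confirm that a raw left-angle bracket really does stop the parser, the following sketch may help. It assumes the JDK's default SAX parser, and WellFormedSketch and isWellFormed are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedSketch {
    // Returns true if the parser accepts the text, false if it reports a parse error.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // DefaultHandler rethrows fatal parse errors as exceptions
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        // Raw "<" in character data: prints false
        System.out.println(isWellFormed("<item>Market Size < predicted</item>"));
        // Same text with the entity reference: prints true
        System.out.println(isWellFormed("<item>Market Size &lt; predicted</item>"));
    }
}
```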
If you are following the programming tutorial, add the new <slide type="exec"> element shown below to your slideSample.xml file:
  <!-- OVERVIEW -->
  <slide type="all">
    <title>Overview</title>
    ...
  </slide>

  <slide type="exec">
    <title>Financial Forecast</title>
    <item>Market Size &lt; predicted</item>
    <item>Anticipated Penetration</item>
    <item>Expected Revenues</item>
    <item>Profit Margin </item>
  </slide>

</slideshow>
When you run the Echo program on your XML file, you see the following output:
ELEMENT: <item>
CHARS: Market Size
CHARS: <
CHARS: predicted
END_ELM: </item>
The parser replaced the reference with the character it stands for and passed that character to the application. In this run, the text of the item arrived as three separate chunks.
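Because the parser is free to deliver an element's text in several pieces, an application that needs the whole string usually accumulates the pieces itself. The sketch below shows one way to do that with a standard SAX DefaultHandler. The names ChunkedTextSketch and collectText are assumptions made for this example, and how the text is split into chunks depends on the parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ChunkedTextSketch {
    // Gathers every chunk of character data in the document into one string.
    static String collectText(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may call this several times for one run of text.
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // prints: Market Size < predicted
        System.out.println(collectText("<item>Market Size &lt; predicted</item>"));
    }
}
```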
When you are handling large blocks of XML or HTML that include many of the
special characters, it would be inconvenient to replace each of them with the
appropriate entity reference. For those situations, you can use a CDATA
section.
Note: The results of the modifications below are contained in slideSample04.xml. The results of processing it are shown in Echo07-04.log.
A CDATA section works like <pre>...</pre> in HTML, only more so: all whitespace in a CDATA section is significant, and characters in it are not interpreted as XML. A CDATA section starts with <![CDATA[ and ends with ]]>. Add the text shown below to your slideSample.xml file to define a CDATA section for a fictitious technical slide:
  ...
  <slide type="tech">
    <title>How it Works</title>
    <item>First we fozzle the frobmorten</item>
    <item>Then we framboze the staten</item>
    <item>Finally, we frenzle the fuznaten</item>
    <item><![CDATA[Diagram:
    frobmorten <------------ fuznaten
        |            <3>        ^
        | <1>                   |   <1> = fozzle
        V                       |   <2> = framboze
      staten--------------------+   <3> = frenzle
                   <2>
    ]]></item>
  </slide>
</slideshow>
When you run the Echo program on the new file, you see the following output:
ELEMENT: <item>
CHARS: Diagram:
    frobmorten <------------ fuznaten
        |            <3>        ^
        | <1>                   |   <1> = fozzle
        V                       |   <2> = framboze
      staten--------------------+   <3> = frenzle
                   <2>
END_ELM: </item>
You can see here that the text in the CDATA section arrived as one entirely uninterpreted character string.
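One way to convince yourself that a CDATA section is only a different way of writing the same character data is to parse both spellings and compare the results. This is a small sketch under the assumption that the JDK's JAXP DOM parser is available. CdataTextSketch and textOf are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CdataTextSketch {
    // Returns the character data of the root element, CDATA sections included.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same text written once as a CDATA section and once with entity references.
        String viaCdata = textOf("<item><![CDATA[<1> = fozzle & <2> = framboze]]></item>");
        String viaEntities = textOf("<item>&lt;1&gt; = fozzle &amp; &lt;2&gt; = framboze</item>");
        System.out.println(viaCdata);                     // prints: <1> = fozzle & <2> = framboze
        System.out.println(viaCdata.equals(viaEntities)); // prints: true
    }
}
```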
The existence of CDATA makes the proper echoing of XML a bit tricky. If the text to be output is not in a CDATA section, then any angle brackets, ampersands, and other special characters in the text should be replaced with the appropriate entity reference. (Replacing left angle brackets and ampersands is most important; the other characters are interpreted properly without misleading the parser.)
But if the output text is in a CDATA section, then the substitutions should not occur, to produce text like that in the example above. In a simple program like our Echo application, it's not a big deal. But any realistic kind of XML-filtering application will want to keep track of whether it is in a CDATA section, in order to treat characters properly.
One other area to watch for is attributes. The text of an attribute value can also contain angle brackets and ampersands that need to be replaced by entity references, along with whichever quote character delimits the value. (Attribute text can never be in a CDATA section, though, so there is never any question about doing that substitution.)
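The substitutions described above can be sketched as a pair of small helpers. These are illustrative only: XmlEscapeSketch, escapeText, and escapeAttribute are names invented for this example, and the attribute helper assumes the value will be written inside double quotes.

```java
final class XmlEscapeSketch {
    // Escapes character data that is NOT inside a CDATA section.
    static String escapeText(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    // Escapes an attribute value that will be delimited by double quotes.
    static String escapeAttribute(String s) {
        // The quote is replaced after & so the new &quot; is not escaped again.
        return escapeText(s).replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // prints: Market Size &lt; predicted
        System.out.println(escapeText("Market Size < predicted"));
        // prints: R&amp;D &quot;exec&quot; &lt;draft&gt;
        System.out.println(escapeAttribute("R&D \"exec\" <draft>"));
    }
}
```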
Later in this tutorial, you will see how to use a LexicalEventListener
to find out whether or not you are processing a CDATA section. Next, though,
you will see how to define a DTD.
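As a preview of that CDATA tracking, here is a hedged sketch that uses the standard SAX2 org.xml.sax.ext.LexicalHandler interface (through DefaultHandler2) rather than the LexicalEventListener named above. It assumes a SAX2 parser that supports the lexical-handler property, as the JDK's built-in parser does. CdataEchoSketch and echoText are hypothetical names, and only character data is echoed, not tags.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.ext.DefaultHandler2;

final class CdataEchoSketch {
    // Echoes the document's character data, keeping CDATA sections verbatim
    // and re-escaping & and < everywhere else.
    static String echoText(String xml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler2 handler = new DefaultHandler2() {
            private boolean inCdata = false;

            @Override
            public void startCDATA() { inCdata = true; out.append("<![CDATA["); }

            @Override
            public void endCDATA() { inCdata = false; out.append("]]>"); }

            @Override
            public void characters(char[] ch, int start, int length) {
                String chunk = new String(ch, start, length);
                // Replace & before < so the ampersand in "&lt;" is not escaped twice.
                out.append(inCdata ? chunk
                                   : chunk.replace("&", "&amp;").replace("<", "&lt;"));
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Lexical events are delivered only if the handler is registered here.
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b <![CDATA[<1> & <2>]]>
        System.out.println(echoText("<item>a &lt; b <![CDATA[<1> & <2>]]></item>"));
    }
}
```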