XML Lecture

XML Primer

Creating a Fictitious Menu Markup Language

As you've seen and studied, HTML is a specific language for creating general purpose Web documents. HTML 4 was completed as a specification in December 1997. XML, the eXtensible Markup Language, has continued development as a technology to implement specific purpose Web documents. XML has been designed specifically with information management in mind. Like HTML, XML has tags, attributes, and values. XML lets you create other languages by standardizing the data structures you want to share on the Web.

Although we will implement project 2 by connecting directly to databases for data retrieval, I want you to consider XML as a technology that could be used to hold data for use on the Web. An application running on our Web server could easily take the information from our databases and reformat it as an XML document. Templates following the XSLT specification (chapter 10 in your XML textbook) could then insert that document into HTML. I am betting that XML will become a dominant technology in at least some important industries (medicine and chemistry come to mind).

The first thing to notice is that XML is much stricter than HTML in terms of capitalization and punctuation. The stricter the rules for writing a valid language, the easier it is to build tools around the language. The tools require less conditional programming as there are fewer valid cases to process. So, pay careful attention to the XML examples I have put on the class Web page.

The word schema refers to the rules embedded in your XML document that allow for a custom markup language. If you want a particular markup language to handle chemical formulae better on the Web, you could define a Chemical Equation Markup Language whereby you use XML to create the schema for your language. You would explicitly define your markup rules in a DTD (Document Type Definition) or the XML Schema language (for our class, we'll just consider the DTD).

So, we'll look at XML, DTDs, and XSLT to prepare for the short in-class exam on May 8th. You might want to read appendix A in your book also to look at the XHTML approach to intermediate support of XML in today's Web browsers.

XML

Like HTML, we can write valid XML with a simple text editor. XML documents should be stored with an .xml extension.

Like HTML, the basic building block of the language is the element. In HTML, we discussed IMG elements, A elements, P elements, etc. With XML, we can create any element we want. If we want to be able to save food item data for use on the Web, we can create a food_item element. If we want to nest other information in the food_item element, we can just define a pair of tags such as <food_item> </food_item> whereby our schema would define which other tags should or must be inside of the food_item element. If we want to associate a name with each food item, we could create a name element such as <name>Grilled Cheese Sandwich</name>. We could then nest the name element inside of the food_item element:

<food_item>
<name>Grilled Cheese Sandwich</name>
</food_item>

Add a long list of sophisticated rules to be more specific about the elements we create, and you have the gist of XML.

Feel free to look at the chapters that follow chapter 4 to get a fuller appreciation for the level of sophistication of XML, but focus on the basics in chapters 1-4 and 10 for the test.

You can add white space (spaces, line feeds, and carriage returns) around your XML code to make it easier for humans to read. Extra white space between tags does not change the structure of the document, and most applications ignore it.

There are some simple rules I want you to remember when considering XML documents:

Every XML document should begin with an XML declaration such as <?xml version="1.0" ?>. Note that there is no closing tag associated with the declaration.
Outside of this first line, all other lines in your XML document must be inside of a single element (like the role the HTML element plays in an HTML document). You, the XML author, decide what that element should be called. In our class example, I've created an all-encompassing element named menu in which all other elements reside. A menu markup language could standardize how all menus are stored on the Web.
You close all XML elements with a closing tag even if the element holds nothing between its tags. If you don't need to nest anything inside an element, you can close it with /> (forward slash immediately followed by the greater than symbol) at the end of the opening tag as a shortcut. A space before the slash is optional, as in <picture filename="meatloaf.jpg" />.
Like valid HTML documents, all XML tags must be nested properly inside of other tags. Don't cross closing tags.
XML is case sensitive. Menu, menu, and MENU are all considered different potential elements in XML.
All XML attribute values (whether one word or more) MUST be enclosed in quotation marks. Remember this is not true of HTML attribute values.
For an XML document to be valid against your schema, all the elements you want to include in it MUST be declared in the schema ahead of time (in the DTD for class purposes).
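
As a quick illustration of the rules above, here is a minimal sketch of a well-formed document. It reuses element names from the class menu example; the comments point out which rule each line follows, and the last comment lists a few mistakes that would break the document.

```xml
<?xml version="1.0" ?>
<!-- Exactly one root element wraps everything after the declaration. -->
<menu>
  <meal>
    <!-- Attribute values are always quoted, even when they are one word. -->
    <name language="English">Meatloaf</name>
    <!-- Tags close in the reverse order in which they were opened. -->
    <food_items><food_item>Beef</food_item></food_items>
    <!-- An element with nothing inside it can use the shortcut close. -->
    <picture filename="meatloaf.jpg" />
  </meal>
</menu>
<!-- Not well-formed: language=English (no quotes), closing menu with
     /Menu (wrong case), or closing food_items before food_item. -->
```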

Let's build a valid XML document to store menu information.

The first line should be:
<?xml version="1.0" ?>

followed by our <menu></menu> tags.

Then, we nest the specifics for what we want to store about a menu inside the menu element:

<menu>
<meal>
<name language="English">Meatloaf</name>
<name language="Spanish">Loaf de Carne</name>
<food_items>
<food_item>Beef</food_item>
<food_item>Potatoes</food_item>
<food_item>String Beans</food_item>
</food_items>
<description>A tasty treat</description>
<cost>12.95</cost>
<prep_time>20 minutes</prep_time>
<picture filename="meatloaf.jpg" x="200" y="197" />
<category>American</category>
<calories>900</calories>
</meal>
<meal>
<name language="English">Black Bean Soup</name>
<name language="Spanish">Sopa de Frijoles Negros</name>
<food_items>
<food_item>Black Beans</food_item>
<food_item>Onions</food_item>
<food_item>Sour Cream</food_item>
</food_items>
<description>A dark tasty treat</description>
<cost>3.45</cost>
<prep_time>10 minutes</prep_time>
<picture filename="blackbeansoup.jpg" x="200" y="197" />
<category>Caribbean</category>
<calories>400</calories>
</meal>
</menu>

So, in this case, we are storing meal items inside the menu. You can see the hierarchical structure that most XML documents contain. A menu has multiple meal elements (but the meals share the same menu element). Each meal element has a food_items element that holds one or more food_item elements (but a food_item is nested in only one meal element). The hierarchy could continue to contain additional levels (and usually does for more complete schemas). The example above has only two meal items but the structure is in place for thousands of them should we expand our restaurant menu offerings.

DTD

The DTD (Document Type Definition) document explicitly captures the schema you want to use in your related XML documents. There are two ways to tell your XML document about the DTD. You can declare it internally or externally (the benefit of an external DTD is you can refer to it from multiple XML documents). To declare your DTD internally in the XML document, place a container into your document like:

<!DOCTYPE menu [
]>

after the opening <?xml version="1.0" ?> line of text.

To declare your DTD externally in the XML document, place a tag like:

<!DOCTYPE menu SYSTEM "URL">

where "URL" stands for a valid absolute or relative URL to your DTD document (which should be stored with a .dtd filename extension). Like the internal declaration, this tag goes after the opening <?xml version="1.0" ?> line.
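
As a sketch of how the external form fits into a document, assume the DTD is saved next to the XML file under the hypothetical name menu.dtd:

```xml
<?xml version="1.0" ?>
<!-- "menu.dtd" is an assumed filename, given here as a relative URL. -->
<!DOCTYPE menu SYSTEM "menu.dtd">
<menu>
  <!-- The meal elements from the class example go here. -->
</menu>
```

The name after <!DOCTYPE (menu here) must match the document's single outermost element.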

Take a look at an example DTD for the menu XML page above.

Note there is a <!ELEMENT element_name (childlist)> tag for each element type. These lines define the element for use in the XML document. As mentioned previously, our menu element has three additional levels of nesting within it: meal -> food_items -> food_item so menu, meal, and food_items all have child elements to declare. All other elements do NOT have children. Elements that take regular typed text (whether numeric, alpha, or alphanumeric) have the keyword #PCDATA in parentheses where the child list would normally appear. If an element has children of more than one element type, you separate the element names with commas, and the children must then appear in that order. The picture element takes the word EMPTY in place of the parentheses and child list because it never has anything between its opening and closing tags and thus has no children.

Elements that accept attributes within their opening tags declare them with:

<!ATTLIST element_name
attribute_name CDATA #IMPLIED
>

Every attribute declaration ends with a default setting. Type #IMPLIED after the CDATA keyword if the attribute is optional, or #REQUIRED if the attribute is required.
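
Putting the element and attribute declarations together, a DTD for the menu document shown earlier might look like the following. This is a sketch written to match that XML document, not necessarily the DTD posted on the class Web site. The + after a name means "one or more", and the choice of which attributes are required is an assumption.

```xml
<!-- Elements with children list them in the order they must appear. -->
<!ELEMENT menu (meal+)>
<!ELEMENT meal (name+, food_items, description, cost, prep_time, picture, category, calories)>
<!ELEMENT food_items (food_item+)>

<!-- Elements that hold plain text use #PCDATA. -->
<!ELEMENT name (#PCDATA)>
<!ELEMENT food_item (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT cost (#PCDATA)>
<!ELEMENT prep_time (#PCDATA)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT calories (#PCDATA)>

<!-- picture never has content, so it is declared EMPTY. -->
<!ELEMENT picture EMPTY>

<!-- Attributes: which ones are required is assumed for illustration. -->
<!ATTLIST name language CDATA #REQUIRED>
<!ATTLIST picture
  filename CDATA #REQUIRED
  x CDATA #IMPLIED
  y CDATA #IMPLIED
>
```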

Please refer to chapter 3 of your XML textbook to study the details of the DTD specification. I'll expect you to know the most important ones for exam purposes with an emphasis on those in the example DTD on our class Web site.

Chapter 4 in the book explains entities and notations in DTDs. Entities are a way of packaging details in the DTD and then referencing the details efficiently in your XML document. Entities work similarly to the Character Entities we looked at in HTML. You type entities into your XML document exactly as they are defined. The XML parser interprets anything in the XML document that begins with an ampersand (&) and ends with a semicolon (;) as an entity reference. To make the entity available to your XML document, you declare it in your DTD document. An example is the line:

<!ENTITY bbs_recipe SYSTEM "bbs.ent">

where the entity &bbs_recipe; would expand to the contents of the bbs.ent (a relative URL) file. You can include the entity details within the XML document itself by leaving out the SYSTEM keyword and putting the details directly within quotation marks.
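
A minimal sketch of the internal form follows. The entity name treat and the cut-down DTD are made up for illustration; the parser would replace &treat; with the quoted text.

```xml
<?xml version="1.0" ?>
<!DOCTYPE menu [
  <!ELEMENT menu (description)>
  <!ELEMENT description (#PCDATA)>
  <!-- No SYSTEM keyword: the replacement text is given directly. -->
  <!ENTITY treat "A tasty treat">
]>
<menu>
  <!-- After expansion this reads "A tasty treat, served hot". -->
  <description>&treat;, served hot</description>
</menu>
```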

Entities that pull external chunks of DTD into your DTD are called external parameter entities. You reference them with a percent sign (%) instead of the ampersand. A parameter entity reference also ends with a semicolon, with the entity name between the % and the ;.
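
As a rough sketch, the two lines below could appear inside a .dtd file. The entity name common and the file common.dtd are hypothetical.

```xml
<!-- Declare the parameter entity and say where its DTD text lives. -->
<!ENTITY % common SYSTEM "common.dtd">
<!-- Reference it: the declarations in common.dtd are included here. -->
%common;
```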

You embed non-text or non-XML content into an XML document with an unparsed entity. An unparsed entity could contain anything your browser is capable of displaying (with or without a plug-in application). An example is adding a picture of a meal to the XML document. Set up the entity in the DTD like:

<!ENTITY bbs_pic SYSTEM "bbs.jpg" NDATA jpg>

where bbs_pic is the name of your entity, so you'll use bbs_pic as the value of an XML attribute (for example <meal_pic source="bbs_pic"/>). You do NOT need to place a preceding & or trailing ; when using an unparsed entity as an attribute value. The SYSTEM keyword is followed by a URL where your content resides. The NDATA keyword names a notation, which in this case tells the browser what type of content you want loaded. You would have to explicitly declare the notation in your DTD for the browser to be able to interpret the NDATA jpg. A proper example is:

<!NOTATION jpg SYSTEM "image/jpeg">
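
To see how the pieces fit together, here is a small self-contained sketch. It assumes a stand-alone meal_pic element with a source attribute, as in the example above; the attribute must be declared with type ENTITY so that its value can be the name of an unparsed entity.

```xml
<?xml version="1.0" ?>
<!DOCTYPE meal_pic [
  <!-- The notation names the content type of the external file. -->
  <!NOTATION jpg SYSTEM "image/jpeg">
  <!-- The unparsed entity points at the picture and names its notation. -->
  <!ENTITY bbs_pic SYSTEM "bbs.jpg" NDATA jpg>
  <!ELEMENT meal_pic EMPTY>
  <!-- Type ENTITY lets the attribute hold an entity name. -->
  <!ATTLIST meal_pic source ENTITY #REQUIRED>
]>
<meal_pic source="bbs_pic"/>
```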

Remember that there are problems with the XML implementations in the Web browsers. Focus on the significance of data storage on the Web and the need for standards. We'll use SQL instead but I'm hopeful that XML will catch on (or something will come along that does).

XSLT

XSLT, the specification for eXtensible Stylesheet Language Transformations, can be used to incorporate XML documents into HTML presentations. We will work through an example here which you can then study for the XML exam on May 8th.

To actually make the XSLT work, you need an XSLT processor. You can download Instant Saxon and give it a try but it isn't necessary for our class. I'm more interested in you understanding how it works theoretically and why such a standard is needed.

An XSLT processor analyzes an XML document and converts it to a node tree (a hierarchical representation of the relationships between elements). See page 136, Figure 10.2, in your book for one interpretation of a node tree. It then looks to an XSLT style sheet for instructions on what to do with the nodes. The instructions are contained in rules called templates. Each template connects the appropriate XML node types with instructions on the transformations that should take place.

Take a look at the class example XSLT document code:

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
<html>
<head>
<title>Our Menu</title>
</head>
<body bgcolor="white">
<xsl:apply-templates select="menu" />
</body>
</html>
</xsl:template>
<xsl:template match="menu">
<!-- Open the table and print the header row once, before the loop. -->
<table width="100%" border="2">
<tr>
<th>Name</th>
<th>Category</th>
<th>Preparation Time</th>
<th>Calories</th>
<th>Cost</th>
<th></th>
</tr>
<!-- One data row per meal, sorted by category and then by name. -->
<xsl:for-each select="meal">
<xsl:sort select="category" data-type="text" />
<xsl:sort select="name" data-type="text" />
<tr>
<td>
<font size="+1">
<xsl:value-of select="name" />
</font>
</td>
<td>
<xsl:value-of select="category" />
</td>
<td>
<xsl:value-of select="prep_time" />
</td>
<td>
<xsl:value-of select="calories" />
</td>
<td>
<xsl:value-of select="cost" />
</td>
<td>
<!-- picture is an empty element, so print its filename attribute. -->
<xsl:value-of select="picture/@filename" />
</td>
</tr>
</xsl:for-each>
<tr>
<td align="right">
<b>Total:</b>
</td>
<td>
<xsl:value-of select="count(meal)" />
meal(s)
</td>
<td>
<br />
</td>
<td>
<xsl:value-of select="sum(meal/calories)" />
</td>
<td>
<br />
</td>
<td>
<br />
</td>
</tr>
</table>
</xsl:template>
</xsl:stylesheet>

Note the XSLT document is an XML document (the first line identifies an XML document). The second line opens the xsl:stylesheet element, which declares the XSLT namespace and version so the processor knows that every tag beginning with xsl: is an XSLT instruction. XSLT style sheets have two components: instructions and literals. The literals are copied to the HTML output as written, and the instructions describe how the XML will be transformed to fill in the HTML.

The third line appears in nearly all XSLT documents as it matches the root of the node tree, which is where processing starts. Remember all node trees in XML have a single parent of all other nodes (elements). The next five lines are perfect examples of XSLT literals as they are directly from the HTML specification (and thus, the Web browser will treat them as HTML instructions). The line:

<xsl:apply-templates select="menu" />

is a critical XSLT instruction that asks the processor to find the menu element in the node tree and apply the template that matches it (the <xsl:template match="menu"> template in our example). That template can then print the values for the element as the data is contained in the XML. Other XSLT instructions (which are the most interesting part of the XSLT specification) simplify the process of styling the content and inserting it into the flow of the HTML document that is being assembled by the XSLT processor.
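
To make the hand-off between templates concrete, here is a stripped-down sketch that is separate from the class example. It assumes the menu XML document shown earlier and prints each meal name as a list item, using one template per element type instead of a loop.

```xml
<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Start at the root node and hand the menu element to its template. -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="menu" />
      </body>
    </html>
  </xsl:template>
  <!-- Runs for the menu element; passes each meal child along. -->
  <xsl:template match="menu">
    <ul>
      <xsl:apply-templates select="meal" />
    </ul>
  </xsl:template>
  <!-- Runs once per meal; value-of prints the first name child. -->
  <xsl:template match="meal">
    <li><xsl:value-of select="name" /></li>
  </xsl:template>
</xsl:stylesheet>
```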

One interesting instruction in the example above is the

<xsl:for-each select="meal">

instruction where the XSLT processor will actually loop through the following code (the loop ends with the </xsl:for-each> instruction) once for each meal element, extracting the actual values for presentation.

Another interesting pre-processing XSLT instruction is:

<xsl:sort select="category" data-type="text" />

where node (element) values are actually sorted before following the presentation instructions.
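
Sorting is not limited to text. The sketch below, which assumes the menu XML document shown earlier, uses the data-type="number" and order="descending" settings to list meals from most to fewest calories. Note that xsl:sort has to come first inside xsl:for-each.

```xml
<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/menu">
    <ol>
      <xsl:for-each select="meal">
        <!-- Compare calories as numbers, largest first. -->
        <xsl:sort select="calories" data-type="number" order="descending" />
        <li>
          <xsl:value-of select="name" />
          <xsl:text>: </xsl:text>
          <xsl:value-of select="calories" />
        </li>
      </xsl:for-each>
    </ol>
  </xsl:template>
</xsl:stylesheet>
```

With the two sample meals, Meatloaf (900) would be listed before Black Bean Soup (400).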

The <xsl:value-of select="calories" /> instruction actually prints the value of the element named in the select attribute (calories in this case) out in the flow of the HTML document.

And, some XSLT instructions do aggregation and calculation like the <xsl:value-of select="sum(meal/calories)" /> instruction that sums all the calories for the output meal items.
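
The calculations can also be combined. As a sketch against the menu XML document shown earlier, the following template prints a count, a sum, and an average (div is the XSLT division operator).

```xml
<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/menu">
    <!-- count() returns the number of selected nodes. -->
    <p>Meals: <xsl:value-of select="count(meal)" /></p>
    <!-- sum() adds the numeric values of the selected nodes. -->
    <p>Total calories: <xsl:value-of select="sum(meal/calories)" /></p>
    <!-- Dividing the sum by the count gives the average. -->
    <p>Average calories: <xsl:value-of select="sum(meal/calories) div count(meal)" /></p>
  </xsl:template>
</xsl:stylesheet>
```

For the two sample meals (900 and 400 calories) this gives 2 meals, 1300 total calories, and an average of 650.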

If you have ever dabbled in spreadsheet software, you can see the features within XSLT contain the same kinds of processing abilities as the basics in a spreadsheet program.