XML vs. Relational

Acknowledgement: Some (not all) parts of this page have been taken from the web site http://www.rpbourret.com/xml/XMLAndDatabases.htm#datavdocs, including the author's conceptual mistakes, from which we can learn.

 

Main Idea:

Use XML for document-centric and complex data. Use a relational database for data-centric, simple, transaction-type data.

Simple, uniform data is usually stored in a relational database, and irregular, document-type data is usually stored in an XML database.

To transport data from one application to another (web services), XML is used in both cases: for document-centric and for data-centric data.

Data-Centric Documents for Transport Between Software

Data-centric documents are documents that use XML as a data transport between remote software, i.e., web services. They are designed for machine consumption and the fact that XML is used at all is usually superfluous. That is, it is not important to the application or the database that the data is, for some length of time, stored in an XML document. Examples of data-centric documents are sales orders, flight schedules, scientific data, and stock quotes.

For example, the following sales order document is data-centric:

   <SalesOrder SONumber="12345">
      <Customer CustNumber="543">
         <CustName>ABC Industries</CustName>
         <Street>123 Main St.</Street>
         <City>Chicago</City>
         <State>IL</State>
         <PostCode>60609</PostCode>
      </Customer>
      <OrderDate>981215</OrderDate>
      <Item ItemNumber="1">
         <Part PartNumber="123">
            <Description>
               <p><b>Turkey wrench:</b><br />
               Stainless steel, one-piece construction,
               lifetime guarantee.</p>
            </Description>
            <Price>9.95</Price>
         </Part>
         <Quantity>10</Quantity>
      </Item>
      <Item ItemNumber="2">
         <Part PartNumber="456">
            <Description>
               <p><b>Stuffing separator:</b><br />
               Aluminum, one-year guarantee.</p>
            </Description>
            <Price>13.27</Price>
         </Part>
         <Quantity>5</Quantity>
      </Item>
   </SalesOrder>

Professor’s comment:

Can you see why this is a bad design to keep track of orders?

It’s a bad design because if a customer places many orders, we record the customer’s address many times. If the address has to be corrected, the correction must be made in every order. That is a sure way to create an inconsistent database.
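The inconsistency can be shown with a small sketch. It uses Python's standard xml.etree.ElementTree module and a cut-down, hypothetical pair of orders for customer 543; the second order number, the new street, and the helper name streets_by_customer are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Two hypothetical orders from the same customer, each embedding the address.
ORDERS = """
<Orders>
  <SalesOrder SONumber="12345">
    <Customer CustNumber="543"><Street>123 Main St.</Street></Customer>
  </SalesOrder>
  <SalesOrder SONumber="12346">
    <Customer CustNumber="543"><Street>123 Main St.</Street></Customer>
  </SalesOrder>
</Orders>
"""


def streets_by_customer(root):
    """Collect the distinct street values recorded for each customer number."""
    seen = {}
    for cust in root.iter("Customer"):
        seen.setdefault(cust.get("CustNumber"), set()).add(cust.findtext("Street"))
    return seen


root = ET.fromstring(ORDERS)

# Correct the address in the first order only, as a careless update might.
root.find("SalesOrder/Customer/Street").text = "500 Oak Ave."

# Any customer with more than one recorded street is now inconsistent.
inconsistent = {c: s for c, s in streets_by_customer(root).items() if len(s) > 1}
print(inconsistent)
```

After the partial update, customer 543 has two different streets on file, which is exactly the anomaly that normalization is meant to prevent.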

The relational model, with its normalization and integrity rules, can help us design better data-centric XML databases.

What would be a better design?

Assignment: Create a good relational schema, then map it into a better XML schema than the one above.

In addition to such obviously data-centric documents as the sales order shown above, many prose-rich documents are also data-centric. For example, consider a page on Amazon.com that displays information about a book. Although the page is largely text, the structure of that text is highly regular, much of it is common to all pages describing books, and each piece of page-specific text is limited in size. Thus, the page could be built from a simple, data-centric XML document that contains the information about a single book and is retrieved from the database, and an XSL stylesheet that adds the boilerplate text. In general, any Web site that dynamically constructs HTML documents today by filling a template with database data can probably be replaced by a series of data-centric XML documents and one or more XSL stylesheets.

For example, consider the following document describing a flight:

   <FlightInfo>
      <Airline>ABC Airways</Airline> provides <Count>three</Count>
      non-stop flights daily from <Origin>Dallas</Origin> to
      <Destination>Fort Worth</Destination>. Departure times are
      <Departure>09:15</Departure>, <Departure>11:15</Departure>,
      and <Departure>13:15</Departure>. Arrival times are minutes later.
   </FlightInfo>

This could be built from the following XML document and a simple stylesheet:

   <Flights>
      <Airline>ABC Airways</Airline>
      <Origin>Dallas</Origin>
      <Destination>Fort Worth</Destination>
      <Flight>
         <Departure>09:15</Departure>
         <Arrival>09:16</Arrival>
      </Flight>
      <Flight>
         <Departure>11:15</Departure>
         <Arrival>11:16</Arrival>
      </Flight>
      <Flight>
         <Departure>13:15</Departure>
         <Arrival>13:16</Arrival>
      </Flight>
   </Flights>
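As a rough illustration of the template idea, the sketch below rebuilds the sentence from the data-centric Flights document. It uses Python's standard xml.etree.ElementTree module as a stand-in for an XSL stylesheet (the Python standard library has no XSLT processor). The function name render_flight_info is hypothetical, and the count is printed as a digit rather than a word.

```python
import xml.etree.ElementTree as ET

FLIGHTS = """<Flights>
   <Airline>ABC Airways</Airline>
   <Origin>Dallas</Origin>
   <Destination>Fort Worth</Destination>
   <Flight><Departure>09:15</Departure><Arrival>09:16</Arrival></Flight>
   <Flight><Departure>11:15</Departure><Arrival>11:16</Arrival></Flight>
   <Flight><Departure>13:15</Departure><Arrival>13:16</Arrival></Flight>
</Flights>"""


def render_flight_info(xml_text):
    """Fill a fixed sentence template with values from a Flights document."""
    root = ET.fromstring(xml_text)
    departures = [f.findtext("Departure") for f in root.findall("Flight")]
    if len(departures) > 1:
        times = ", ".join(departures[:-1]) + ", and " + departures[-1]
    else:
        times = "".join(departures)
    # The boilerplate text lives here, not in the data document.
    return (
        f"{root.findtext('Airline')} provides {len(departures)} non-stop flights "
        f"daily from {root.findtext('Origin')} to {root.findtext('Destination')}. "
        f"Departure times are {times}."
    )


print(render_flight_info(FLIGHTS))
```

The data document carries only the facts; all the prose is supplied by the template, which is why such pages count as data-centric.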

Document-Centric Documents

Document-centric documents are (usually) documents that are designed for human consumption. Examples are books, email, advertisements, and almost any hand-written XHTML document. They are characterized by less regular or irregular structure, larger-grained data (that is, the smallest independent unit of data might be at the level of an element with mixed content or the entire document itself), and lots of mixed content. The order in which sibling elements and PCDATA occur is almost always significant.

Document-centric documents are usually written by hand in XML or some other format, such as RTF, PDF, or SGML, which is then converted to XML. Unlike data-centric documents, they usually do not originate in the database. (Documents built from data inserted into a template are data-centric, as described in the previous section.) For information on software you can use to convert various formats to XML, see the links to various lists of XML software.

For example, the following product description is document-centric:

   <Product>
 
   <Intro>
   The <ProductName>Turkey Wrench</ProductName> from <Developer>Full
   Fabrication Labs, Inc.</Developer> is <Summary>like a monkey wrench,
   but not as big.</Summary>
   </Intro>
 
   <Description>
 
   <Para>The turkey wrench, which comes in <i>both right- and left-
   handed versions (skyhook optional)</i>, is made of the <b>finest
   stainless steel</b>. The Readi-grip rubberized handle quickly adapts
   to your hands, even in the greasiest situations. Adjustment is
   possible through a variety of custom dials.</Para>
   
   <Para>You can:</Para>
 
   <List>
   <Item><Link URL="Order.html">Order your own turkey wrench</Link></Item>
   <Item><Link URL="Wrenches.htm">Read more about wrenches</Link></Item>
   <Item><Link URL="Catalog.zip">Download the catalog</Link></Item>
   </List>
   
   <Para>The turkey wrench costs <b>just $19.99</b> and, if you
   order now, comes with a <b>hand-crafted shrimp hammer</b> as a
   bonus gift.</Para>
   
   </Description>
   
   </Product>
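To see what mixed content means in practice, here is a minimal sketch that parses the Intro element from the product description with Python's standard xml.etree.ElementTree module. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

INTRO = (
    "<Intro>The <ProductName>Turkey Wrench</ProductName> from "
    "<Developer>Full Fabrication Labs, Inc.</Developer> is "
    "<Summary>like a monkey wrench, but not as big.</Summary></Intro>"
)

intro = ET.fromstring(INTRO)

# Mixed content: character data sits between child elements. ElementTree
# keeps it in .text (before the first child) and .tail (after each child).
pieces = [("text", intro.text)]
for child in intro:
    pieces.append((child.tag, child.text))
    pieces.append(("tail", child.tail))

# Reading every text fragment in document order reproduces the sentence.
sentence = "".join(intro.itertext())

for label, value in pieces:
    print(label, repr(value))
print(sentence)
```

The sentence only makes sense if the elements and the text between them are kept in their original order, which is why order is significant in document-centric XML and hard to capture in unordered table rows.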

 

Complex Data

The reason XML is so good at modeling complex data is that the same building blocks for narrative documents—elements and attributes—can apply to any composition of objects and properties. Just as a book breaks down into chapters, sections, blocks, and inlines, many abstract ideas can be deconstructed into discrete and hierarchical components. Vector graphics, for example, are composed of a finite set of shapes with associated properties. You can represent each shape as an element and use attributes to hammer down the details.

SVG is a good example of how to represent objects as elements. Take a gander at the simple SVG document in the example below. Here we have three different shapes represented by as many elements: a common rectangle, an ordinary circle, and an exciting polygon. Attributes in each element customize the shape, setting color and spatial dimensions.

An SVG document

<?xml version="1.0"?>
<svg>
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>

Vector graphics are scalable, meaning you can stretch the image vertically or horizontally without any loss of sharpness. The image processor just recalculates the coordinates for you, leaving you to concentrate on higher concepts like composition, color, and grouping.

SVG adds other benefits too. Being an XML application, it can be tested for well-formedness, can be edited in any generic XML editor, and is easy to write software for. DTDs and Schema are available to check for missing information, and they provide an easy way to distinguish between versions.
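A minimal sketch of both points, using Python's standard xml.etree.ElementTree module: a well-formedness test, and a few lines of generic code that list the shapes. The names is_well_formed, shapes, and BROKEN are hypothetical, and the XML declaration is left out of the test string for brevity.

```python
import xml.etree.ElementTree as ET

SVG_DOC = """<svg>
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>"""

# A damaged copy: the rect element is opened but never closed.
BROKEN = '<svg><rect fill="green" x="1cm"></svg>'


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def shapes(xml_text):
    """List (element name, fill color) for every shape in the document."""
    root = ET.fromstring(xml_text)
    return [(el.tag, el.get("fill")) for el in root if el.tag != "desc"]


print(is_well_formed(SVG_DOC), is_well_formed(BROKEN))
print(shapes(SVG_DOC))
```

Note that this checks well-formedness only. Validating against a DTD or schema needs a validating parser, which this sketch does not attempt.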

Are there limitations? Of course. XML is not so good when it comes to raster graphics. This category of graphics formats, which includes TIFF, GIF and JPEG, renders an image based on pixel data. Instead of a conceptual representation based on shapes, it's a big fat array of numbers. You could store this in XML, certainly, but the benefits of markup are irrelevant since elements would only increase the document's size without organizing the data well. Furthermore, these formats typically use compression to force the huge amount of data into more manageable sizes, something markup would only complicate. (Video presents similar, larger, problems.)

What other concepts are ideally suited to XML representation? How about chemicals? Every molecule has a unique blueprint consisting of some combination of atoms and bonds. Languages like the Chemical Markup Language (CML) and Molecular Dynamics Language (MoDL) follow a similar strategy to encode molecules.

The example below shows how a water molecule would be coded in MoDL. Notice the separation of head and body that is reminiscent of HTML. The head is where we define atomic types, giving them size and color properties for rendering. The body is where we assemble the molecule using definitions from the head.

A molecule definition in MoDL

<?xml version="1.0"?>
<modl>
   <head>
     <meta name="title" content="Water" />
     <DEFINE name="Hydrogen">
       <atom radius="0.2" color="1 1 0" />
     </DEFINE>
     <DEFINE name="Oxygen">
       <atom radius="0.5" color="1 0 1" />
     </DEFINE>
   </head>
   <body>
      <atom id="H0" type="Hydrogen" position="1 0 0" />
      <atom id="H1" type="Hydrogen" position="0 0 1" />
      <atom id="O" type="Oxygen" position="0 1 0" />
      <bond atom1="O" atom2="H0" color="0 0 1" />
      <bond atom1="O" atom2="H1" color="0 0 1" />
   </body>
</modl>

For each atom instance, there is an element describing its type, position, and a unique identifier (e.g., "H0" for the first hydrogen atom). Each bond between atoms also has its own element, specifying color and the two atoms it joins. Notice the interplay between atoms and bonds. The unique identifiers in the first group are the "hooks" for the second group, which use attributes that refer to them. Unique identifiers are another invaluable technique in expressing relationships between concepts.
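The way bonds "hook" onto atoms resembles a foreign key pointing at a primary key. The sketch below resolves those references for the body of the water molecule, using Python's standard xml.etree.ElementTree module. The function name resolve_bonds and the error handling are assumptions for illustration, not part of MoDL or its tools.

```python
import xml.etree.ElementTree as ET

BODY = """<body>
   <atom id="H0" type="Hydrogen" position="1 0 0" />
   <atom id="H1" type="Hydrogen" position="0 0 1" />
   <atom id="O" type="Oxygen" position="0 1 0" />
   <bond atom1="O" atom2="H0" color="0 0 1" />
   <bond atom1="O" atom2="H1" color="0 0 1" />
</body>"""


def resolve_bonds(body):
    """Replace each bond's atom identifiers with the types of the atoms they name."""
    # Index the atoms by their unique identifier, like a primary key.
    atoms = {a.get("id"): a for a in body.findall("atom")}
    resolved = []
    for bond in body.findall("bond"):
        ends = (bond.get("atom1"), bond.get("atom2"))
        missing = [e for e in ends if e not in atoms]
        if missing:
            # A dangling reference is the XML analogue of a broken foreign key.
            raise ValueError(f"bond refers to unknown atom(s): {missing}")
        resolved.append(tuple(atoms[e].get("type") for e in ends))
    return resolved


print(resolve_bonds(ET.fromstring(BODY)))
```

Nothing in a plain well-formedness check catches a bond that names a nonexistent atom; that kind of referential integrity has to be enforced by a schema or by application code such as the check above.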

MoDL is a project by Swami Manohar and Vijay Chandru of the Indian Institute of Science. The goal is not just to model molecules, but to model their interactions. The language contains elements to express motion as well as the static initial positions. Elements can represent actions applied to molecules, including translate and rotate.

Software developed for this purpose converts MoDL documents into a temporal-spatial format called Virtual Reality Modeling Language (VRML). When viewed in a VRML reader, molecules dance around and bump into each other! Read more about MoDL at http://violet.csa.iisc.ernet.in/~modl/ and VRML at http://www.web3d.org.

Again, there are limitations. Movies, just like graphics, can be vector-based or rasterized. Formats like MPEG and MOV are compressed sequences of bitmaps, a huge amount of pixel information that XML would not be good at organizing. Simple shapes bouncing around in space are one thing, but complex scenes involving faces and puppy dogs are probably never going to involve XML.

Presentation Versus Conceptual Encoding

Moving up in complexity is mathematics. The Mathematical Markup Language (MathML) attacks this difficult area with two different modes of markup: presentational and conceptual. An expression can be described in two ways: one could say "the product of A and B" or write the more compact "A × B" on a chalkboard, both conveying the same idea. MathML allows you to use either style and mix them together in a document.

Consider the mathematical expression in the figure below. This example was generated with MathML and displayed with Mozilla, which recognizes MathML as of version 1.1.

Figure: A complex fraction, 1 - 1/(1 - 1/(1 - 1/x))

The example below is the MathML document used to generate this figure.

Example: Presentation encoding in MathML

<?xml version="1.0"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mn>1</mn><mo>-</mo>
  <mfrac><mrow><mn>1</mn></mrow>
    <mrow><mn>1</mn><mo>-</mo>
      <mfrac><mrow><mn>1</mn></mrow>
        <mrow><mn>1</mn><mo>-</mo>
          <mfrac><mrow><mn>1</mn></mrow>
            <mrow><mi>x</mi></mrow>
          </mfrac>
        </mrow>
      </mfrac>
    </mrow>
  </mfrac>
</math>

mfrac, as you may have guessed, sets up a fraction. It contains two elements called mrow, one each for the top and bottom. Notice how the denominator can itself contain a fraction. Take this recursively as far as you wish and it's perfectly legal in MathML. At the atomic level of expression are numbers, variables, and operators, which are marked up with the simple elements mn (number), mi (identifier), and mo (operator).

Conceptual encoding (also known as content encoding) is the name given to the other mode of MathML. It resembles functional programming, notably LISP, in that every sum, fraction, and product is represented as an operator followed by arguments, all wrapped up in an apply element. The example below shows how the expression (2a + b)^3 looks in MathML's content mode.

Example: Content encoding in MathML

<?xml version="1.0"?>
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><power/>
    <apply><plus/>
      <apply><times/>
        <cn>2</cn>
        <ci>a</ci>
      </apply>
      <ci>b</ci>
    </apply>
    <cn>3</cn>
  </apply>
</math>

Why the two modes of MathML? One reason is flexibility of authoring. But a more important reason is that each lends itself to a different means of processing. Presentational encoding is easier to render visually, and so is better supported in browsers and such. Content encoding, because it's more regular and closer to the meaning of the expression, is easier to process by calculator-type programs.
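To make the "calculator-type program" point concrete, here is a minimal sketch of an evaluator for the content-encoded expression shown earlier. It uses Python's standard xml.etree.ElementTree module and handles only the three operators that appear in that example (plus, times, power). The function name evaluate and the variable values passed in are assumptions for illustration.

```python
import math
import xml.etree.ElementTree as ET

# ElementTree prefixes every tag with the namespace in braces.
NS = "{http://www.w3.org/1998/Math/MathML}"

CONTENT = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><power/>
    <apply><plus/>
      <apply><times/><cn>2</cn><ci>a</ci></apply>
      <ci>b</ci>
    </apply>
    <cn>3</cn>
  </apply>
</math>"""


def evaluate(node, env):
    """Recursively evaluate a content MathML node; env maps identifiers to numbers."""
    tag = node.tag.replace(NS, "")
    if tag == "math":
        return evaluate(node[0], env)
    if tag == "cn":  # numeric literal
        return float(node.text)
    if tag == "ci":  # identifier, looked up in the environment
        return env[node.text.strip()]
    if tag == "apply":  # first child is the operator, the rest are arguments
        op = node[0].tag.replace(NS, "")
        args = [evaluate(child, env) for child in node[1:]]
        if op == "plus":
            return sum(args)
        if op == "times":
            return math.prod(args)
        if op == "power":
            return args[0] ** args[1]
        raise ValueError(f"unsupported operator: {op}")
    raise ValueError(f"unsupported element: {tag}")


# With a = 1 and b = 2 the expression is (2*1 + 2)^3.
print(evaluate(ET.fromstring(CONTENT), {"a": 1, "b": 2}))
```

The evaluator is a direct walk over the element tree because the markup already mirrors the structure of the computation. Doing the same from the presentation encoding would first require guessing the meaning from the layout.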

With support for MathML in browsers and other programs increasing, its popularity is growing. For more information, read A Gentle Introduction to MathML by Robert Miner and Jeff Schaeffer at http://www.dessci.com/en/support/tutorials/mathml/default.htm.

 

Comparing XML and Relational Databases

XML was not designed with databases in mind. XML's origins lie in document meta tagging, and the XML language was developed to infuse structure and meaning into the vast amount of presentation-oriented content.

Now that it has evolved into a core application development technology, it is being used for a variety of sophisticated data representation and transportation purposes.

XML documents and relational databases represent and structure data in very different ways. This draws an invisible border between the two environments, and getting these data platforms to cooperate efficiently can be challenging.

 Comparing XML and relational databases

We are not making this comparison to provide a choice between the two platforms. We are only assessing the features of each to gain an understanding of their differences.

 Data representation

These two platforms face significant integration challenges because XML document hierarchies are difficult to recreate within relational databases, and relational data models are difficult to represent within XML documents.
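One way to see the difficulty is to store a small hierarchy in a table. The sketch below is only one possible mapping, and every name in it (the node table, its columns, and the shred function) is an assumption. It uses Python's standard sqlite3 and xml.etree.ElementTree modules, stores one row per element, and ignores attributes to stay short.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = """<SalesOrder SONumber="12345">
  <Item ItemNumber="1"><Quantity>10</Quantity></Item>
  <Item ItemNumber="2"><Quantity>5</Quantity></Item>
</SalesOrder>"""


def shred(xml_text, conn):
    """Store each element as one row; nesting and order become extra columns."""
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER, "
        "position INTEGER, tag TEXT, text TEXT)"
    )

    def walk(el, parent_id, position):
        cur = conn.execute(
            "INSERT INTO node (parent_id, position, tag, text) VALUES (?, ?, ?, ?)",
            (parent_id, position, el.tag, (el.text or "").strip()),
        )
        node_id = cur.lastrowid
        for i, child in enumerate(el):
            walk(child, node_id, i)

    walk(ET.fromstring(xml_text), None, 0)


conn = sqlite3.connect(":memory:")
shred(DOC, conn)
rows = conn.execute(
    "SELECT id, parent_id, position, tag, text FROM node ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The nesting that XML expresses for free has to be spelled out with a parent_id column, and sibling order with a position column. Rebuilding the document then takes a recursive query or repeated joins.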

Data representation comparison

Data model
   Databases: Relational data model, consisting of tabular data entities (tables), with rows and columns.
   XML: Hierarchical data model, composed of document structures with element and attribute nodes.

Data types
   Databases: A wide variety of data types typically are provided, including support for binary data.
   XML: XSD schemas are equipped with a comparable set of data types.

Data element relationships
   Databases: Column definitions can interrelate within and between tables, according to DDL rules.
   XML: References can be explicitly or intrinsically defined between elements.