Learn XML Document Type Definitions

As I explained at the end of the last article, you create a Document Type Definition (DTD) to define the rules that your XML markup must follow. A DTD defines which elements are required, which attributes can be set, and what values those attributes can take.

Every XML document that uses a DTD should start with the line <?xml version="1.0" encoding="UTF-8" standalone="no"?>. This tells the processor that this is an XML version 1.0 file, that the character encoding is UTF-8, and that the document depends on declarations in an external file, such as a DTD.

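Also make sure that when you are creating DTDs, keywords such as !ELEMENT, !ATTLIST, #REQUIRED, #PCDATA and CDATA are all written in uppercase. XML is case-sensitive, so lowercase versions will not be recognized. Note that CDATA does not take a # in front of it.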

There, you are all up to speed. Now on to DTD Prologs…

DTD Prolog

In the DTD Prolog you describe your DTD in the following ways:

  • With comments you give a description of what the document contains. You start a comment with <!-- and end it with -->
  • You tell the interpreter the version of XML, the encoding, and whether you require any outside files, with a statement like this: <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  • You tell the interpreter the location of any required outside files with this statement: <!DOCTYPE customers SYSTEM "customerData.dtd">

Note: If you don't require any outside files, standalone would have the value "yes". Also, a bare file name like the one in the last example only works if the DTD resides in the same folder as the XML document. You could also replace the file name "customerData.dtd" with a relative path or a URL.
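
To put those three pieces together, here is a minimal sketch of what the top of an XML document might look like. I'm assuming the root element is called customers, to match the !DOCTYPE statement above. What is allowed inside it depends on customerData.dtd, which isn't shown here.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Customer records. The rules for this document live in customerData.dtd -->
<!DOCTYPE customers SYSTEM "customerData.dtd">
<customers>
  <!-- The content allowed here is whatever customerData.dtd defines -->
</customers>
```

Notice that the XML declaration comes first, and that the name after !DOCTYPE matches the root element of the document.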


An opening tag, a closing tag and everything that lies between them is referred to as an element, just like on an HTML page. So, <p>Paragraph of text…</p> is an element. You can use a DTD to define a great many rules that all the elements in the document must follow.

!ELEMENT is used to define what data can be placed in an element. It also defines how many times that data can be placed and in what order.

Here is a sample !ELEMENT definition:

<!ELEMENT customer (customerID, customerName, suffix?, products+, visits*)>

  • This statement affects the data that is placed in the customer element
  • It demands that the child elements be entered in the same order they are listed
  • By not giving customerID or customerName an Occurrence Indicator, I'm stating that each of them must appear exactly once
  • With the ? Occurrence Indicator I'm stating that suffix can either be defined once or not at all
  • With the + Occurrence Indicator I'm stating that there must be one or more products
  • With the * Occurrence Indicator I'm stating that visits can be defined zero or more times
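
Here is a small sketch of a document that follows that rule. To keep it self-contained I placed the DTD inside the document, and I'm assuming every child element simply holds text. The sample values are made up.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE customer [
  <!ELEMENT customer (customerID, customerName, suffix?, products+, visits*)>
  <!-- Assumption: each child element contains only text -->
  <!ELEMENT customerID (#PCDATA)>
  <!ELEMENT customerName (#PCDATA)>
  <!ELEMENT suffix (#PCDATA)>
  <!ELEMENT products (#PCDATA)>
  <!ELEMENT visits (#PCDATA)>
]>
<customer>
  <customerID>1001</customerID>
  <customerName>Sally Smith</customerName>
  <!-- suffix is left out, which ? allows -->
  <products>Widget</products>
  <products>Gadget</products>
  <!-- no visits at all, which * allows -->
</customer>
```

If you removed both products elements, or moved customerName above customerID, a validating parser would reject the document.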

Here are the ways you can define what types of data can be placed in an element:

  • ANY: Allows the user to place any type of content in the element. Ex: <!ELEMENT Name ANY>
  • Child Elements: Allows the user to place only other elements in the element. You don't define this rule with the words Child Element, but instead like this: <!ELEMENT Name (ChildElement1,ChildElement2)>
  • EMPTY: Doesn't allow any content to be placed in an element. Ex: <!ELEMENT Name EMPTY>
  • OR: Allows a choice between different types of content, separated by the | symbol. In this example we are saying it's ok to add any mix of character data (#PCDATA) and child elements. Ex: <!ELEMENT Name (#PCDATA|ChildElement1)*> This is known as a Mixed Content Element. #PCDATA must be listed first.
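
To see these content types side by side, here is a short sketch. All of the element names (note, title, lineBreak, body, emphasis) are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE note [
  <!-- Child elements only, in this order -->
  <!ELEMENT note (title, lineBreak, body)>
  <!-- Text only -->
  <!ELEMENT title (#PCDATA)>
  <!-- EMPTY: no content allowed -->
  <!ELEMENT lineBreak EMPTY>
  <!-- Mixed content: text and emphasis elements in any order -->
  <!ELEMENT body (#PCDATA | emphasis)*>
  <!ELEMENT emphasis (#PCDATA)>
]>
<note>
  <title>Reminder</title>
  <lineBreak/>
  <body>Call the customer <emphasis>today</emphasis> about the order.</body>
</note>
```

I left ANY out of the sketch. It works, but it turns off checking for that element, so it is usually only handy while you are still drafting a DTD.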

Special Note:

#PCDATA is known as parsed character data. This is text that the XML parser reads through and decodes, so any special characters in it must be written as codes. Parsed character data would contain the codes &amp;, &lt; and &gt; instead of &, <, or >.

CDATA on the other hand is just plain character data. Note that it is written without a #, and in a DTD it is used as an attribute data type (see below) rather than in an !ELEMENT definition.


You define what attributes can be assigned to your elements with !ATTLIST. You also use it to define each attribute's data type and any default value. The standard format is:

<!ATTLIST elementName attributeName dataType defaultValue>

A real world example might look like this:

<!ATTLIST customer firstName CDATA #REQUIRED>

Attribute Data Types

The first two items used in the above !ATTLIST definition are self explanatory, so I'll define the data types available:

  • CDATA: Accepts any character data, but & and < must be escaped, and so must the quote character that surrounds the value. The escaped versions are &amp; for &, &quot; for ", &lt; for <, and &gt; for >.
  • ID: A value that uniquely identifies an element. Each ID value must be unique within the document and must be a valid XML name, so it cannot start with a digit.
  • NOTATION: The name of a format for non-XML data that was specified with a Notation Declaration. More on this in a second!
  • ENTITY: The name of an unparsed entity that was declared in the DTD. More on this in a second!

You also can use an enumerated list of values instead of a data type. An Enumerated List is an all inclusive list of every possible value. Here is an example:

<!ATTLIST customer suffix (BA | MA | Ph.D.) #IMPLIED>

Finally, you define the default value for the attribute. The possible settings are:

  • #REQUIRED: Means exactly what it says, there must be a value.
  • #FIXED: Means the attribute is optional, but it must have the assigned value if used. <!ATTLIST customer newCust CDATA #FIXED "new"> states that if a customer has the newCust attribute, it must have the value "new".
  • #IMPLIED: Means the attribute is totally optional.
  • Value: Defines a specific default value that is used when the attribute is left out. <!ATTLIST customer newCust CDATA "new">
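
Here is a sketch that uses several data types and default settings on one element. You can declare many attributes in a single !ATTLIST. The attribute names custID, status and dtdVersion are my own inventions for this example.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE customer [
  <!ELEMENT customer (#PCDATA)>
  <!ATTLIST customer
    custID     ID                 #REQUIRED
    firstName  CDATA              #REQUIRED
    suffix     (BA | MA | Ph.D.)  #IMPLIED
    status     CDATA              "active"
    dtdVersion CDATA              #FIXED "1.0">
]>
<!-- status and dtdVersion are left out, so the parser supplies "active" and "1.0" -->
<customer custID="c1001" firstName="Sally" suffix="MA">Sally Smith</customer>
```

A validating parser would reject suffix="PhD" because that value isn't in the enumerated list, and it would reject custID="1001" because an ID can't start with a digit.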


With an entity you can declare a kind of variable that stands for a block of text. Here is an example of an entity:

<!ENTITY bizAddress "123 Main St, Irwin, PA 15147">

With this defined I can now place the address anywhere by just using the code &bizAddress;. This is known as an Internal Entity, because its value is defined right there in the DTD.
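
Here is a short sketch of the bizAddress entity in use. The customer, customerName and address elements are assumptions I made so the example is complete.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE customer [
  <!ELEMENT customer (customerName, address)>
  <!ELEMENT customerName (#PCDATA)>
  <!ELEMENT address (#PCDATA)>
  <!ENTITY bizAddress "123 Main St, Irwin, PA 15147">
]>
<customer>
  <customerName>Sally Smith</customerName>
  <!-- The parser swaps &bizAddress; for the text defined above -->
  <address>&bizAddress;</address>
</customer>
```

If the address ever changes, you only have to update the !ENTITY line.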

An External Entity gets its content from a separate file. You would define it with this sample format:

<!ENTITY entityName SYSTEM "urlOfData">

One of the great things about External Entities is that they can reference images and other non-XML data.

Just so I'm completely clear, you reference the Entity by typing a &, followed by the Entity name and then a semi-colon (;). Also, you cannot reference an entity until you have already defined it.

External Entities are either Parsed or Unparsed:

  • Parsed: Data that is of either type text or XML. This data is interpreted by the XML parser.
  • Unparsed: Any data that is not of type text or XML. It can be pretty much any type of data. The XML parser does not try to interpret this data, because it doesn't understand it.

Please note that Unparsed External Entities are defined differently, so that they are passed on to the proper helper. You could define an image with this code:

<!ENTITY imageName SYSTEM "urlOfImage" NDATA jpeg>

  • We define the name of the Entity, followed by the word SYSTEM.
  • Then we include the URL address of the image.
  • NDATA tells the parser that it needs outside help to work with this image.
  • jpeg is a Notation. I’m covering these next.


The Notation Declaration describes the format of non-XML data within an XML document. The basic format of a Notation follows:

<!NOTATION notationName SYSTEM "typeContent">

or for the image I was talking about above:

<!NOTATION jpeg SYSTEM "image/jpeg">
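
Here is a sketch of how the Notation, the Unparsed Entity and the ENTITY attribute type fit together. The names customerPhoto, photo and source, and the file path, are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE customer [
  <!-- The notation names the format of the image -->
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!-- The unparsed entity points at the image and says it uses that notation -->
  <!ENTITY customerPhoto SYSTEM "photos/sally.jpg" NDATA jpeg>
  <!ELEMENT customer (customerName, photo)>
  <!ELEMENT customerName (#PCDATA)>
  <!ELEMENT photo EMPTY>
  <!-- An attribute of type ENTITY holds the name of an unparsed entity -->
  <!ATTLIST photo source ENTITY #REQUIRED>
]>
<customer>
  <customerName>Sally Smith</customerName>
  <photo source="customerPhoto"/>
</customer>
```

Notice that you don't write &customerPhoto; here. An Unparsed Entity is referenced by putting its name in an attribute of type ENTITY, and it is then up to the application to fetch and display the image.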

That’s All Folks

That is pretty much all there is to know about Document Type Definitions (DTDs). If you have any questions leave them below. Next up I'll talk about Schemas.

Till Next Time

– Think Tank
