CS 683 Emerging Technologies, Spring Semester, 2003: XML Introduction
© 2003, All Rights Reserved, SDSU & Roger Whitney, San Diego State University -- This page last updated 09-Feb-03
Introduction
XML References
Sun's XML site
History
In the beginning was Gutenberg
But computers changed how to produce documents
How to represent a document?
Embed commands or tags in the text
What should the commands do?
<bold><center>See the cat run</center></bold>
<ChapterHeader>See the cat run</ChapterHeader>
<header>Short History of Tags</header>
GenCode
HTML
Markup language for WWW
Widespread use
Fixed set of tags
Some tags are presentational
<CENTER> <B>
Web Browsers permit poorly formed HTML
<A NAME="WhichOne"></a> <b><center>Hello World</b></CENTER><Br> <A NAME="WhichOne"></A>
These problems with HTML restrict Web functionality
XML
XML creators wanted
The Main Point of XML
XML is about
<?xml version="1.0" ?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <COUNTRY>UK</COUNTRY>
    <COMPANY>CBS Records</COMPANY>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>
Developing XML
Defining the XML tags
Creating documents using XML tags
Displaying or processing the documents
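The third step, processing a document, can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration, not part of the original notes: it reuses a trimmed copy of the CD catalog shown earlier, and the function name cheapest_cd is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the catalog example (COUNTRY and COMPANY left out).
CATALOG = """<?xml version="1.0" ?>
<CATALOG>
  <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
  </CD>
  <CD>
    <TITLE>Hide your heart</TITLE>
    <ARTIST>Bonnie Tyler</ARTIST>
    <PRICE>9.90</PRICE>
    <YEAR>1988</YEAR>
  </CD>
</CATALOG>"""


def cheapest_cd(xml_text):
    """Return (title, price) of the lowest-priced CD in a catalog document."""
    root = ET.fromstring(xml_text)
    cds = [(cd.findtext("TITLE"), float(cd.findtext("PRICE")))
           for cd in root.findall("CD")]
    return min(cds, key=lambda pair: pair[1])


print(cheapest_cd(CATALOG))
```

Because the tags name the data (TITLE, PRICE) rather than its appearance, the program can pick out the fields it needs without guessing.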
Creating XML tags requires thinking about
Document Structure Example
Which is better?
<Paragraph> This is a short paragraph. It will be used as an XML example </Paragraph>
<Paragraph> <Sentence> This is a short paragraph. </Sentence> <Sentence> It will be used as an XML example </Sentence> </Paragraph>
<Paragraph> <Author>Roger Whitney</Author> <DateCreated>July 20, 2001</DateCreated> <Title>XML Example</Title> <Sentence> This is a short paragraph. </Sentence> <Sentence> It will be used as an XML example </Sentence> </Paragraph>
Data Example
Which is better?
<CD> <TITLE>Empire Burlesque</TITLE> <ARTIST>Bob Dylan</ARTIST> <COUNTRY>USA</COUNTRY> <COMPANY>Columbia</COMPANY> <PRICE>10.90</PRICE> <YEAR>1985</YEAR> </CD>
<CD>
  <TITLE>Empire Burlesque</TITLE>
  <ARTIST>Bob Dylan</ARTIST>
  <COUNTRY>USA</COUNTRY>
  <COMPANY>Columbia</COMPANY>
  <PRICE>10.90</PRICE>
  <YEAR>1985</YEAR>
  <DURATION>89</DURATION>
  <TYPE>Rock</TYPE>
  <TRACKS>12</TRACKS>
  <INSTRUMENT>Guitar</INSTRUMENT>
  <INSTRUMENT>Drum</INSTRUMENT>
  <INSTRUMENT>Banjo</INSTRUMENT>
</CD>
The XML Universe
Basic Syntax
XML 1.0 spec, XML 1.1
XLinks
The XML Universe
Document Modeling
Document Type Definitions (DTDs)
The XML Universe
Parsing, Programming
Document Object Model (DOM)
Simple API for XML (SAX)
XML Information Set
XML Fragment Interchange
Network Protocols
XML-RPC
Simple Object Access Protocol (SOAP)
XML Syntax
Example
<!-- A simple XML document with comment --> <greetings> Hello World! </greetings>
XML Terms
Tag
A piece of text that describes a unit of data
Tags are surrounded by angle brackets (< and >)
Tags are case sensitive
<GREETINGS> <greetings> <Greetings>
<slide title="XML Slide"> <slide title="Who's on First"> <name position='First'>
XML Terms
Element
Unit of XML data, delimited by tags
<greetings>Hello World!</greetings>
<name> <firstName>John</firstName> <lastName>Fowler</lastName> </name>
Elements can be nested inside other elements
Markup
Tags and comments in an XML document
Content
Anything that is not markup in a document
Document
An XML structure in which one or more elements contain text intermixed with subelements
XML Document
<greetings>
  <from>
    <nnammee>
      <firstName>Roger</firstName>
      <lastName>Whitney</lastName>
    </nnammee>
  </from>
  <to>
    <name>
      <firstName>John</firstName>
      <lastName>Fowler</lastName>
    </name>
  </to>
  <message> How are you? </message>
</greetings>
Issues
Is that a typo or a legal tag?
How would we know?
Levels of XML
Well-formed
XML document that satisfies basic XML structure
Valid
XML document that is well-formed and conforms to its DTD
Well-Formed XML Documents
Basic Structure
Optional Prolog
Root Element
<?xml version="1.0" ?> <!-- A simple XML document with comment --> <greetings> Hello World! </greetings>
Prolog
For well-formed documents the prolog is optional
<?xml version="1.0" encoding='US-ASCII' standalone='yes' ?>
<?xml version="1.0" encoding='iso-8859-1' standalone='no' ?>
Comments
Comments can be placed nearly anywhere outside of tags
Comments cannot come before the XML declaration <?xml version="1.0" ?>
<?xml version="1.0" ?> <!-- Another comment --> <greetings> <from>Roger<!-- Legal comment --></from> <to>John</to> <message>Hi</message> </greetings> <!-- Comments at the end -->
Root Element
Each XML Document has a single root element
Legal XML Document
<?xml version="1.0" ?> <greetings> <from>Roger</from> <to>John</to> <message>Hi</message> </greetings>
Illegal XML Document
<?xml version="1.0" ?> <from>Roger</from> <to>John</to> <message>Hi</message>
XML Document as a Tree
<?xml version="1.0" ?> <greetings> <from>Roger</from> <to>John</to> <message>Hi</message> </greetings>
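The tree view can be made concrete with a short sketch using Python's standard xml.etree.ElementTree module. The helper name outline is hypothetical; it simply walks the parsed document and indents each element by its depth in the tree.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" ?>
<greetings>
  <from>Roger</from>
  <to>John</to>
  <message>Hi</message>
</greetings>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


# The root element is the top of the tree; its subelements are its children.
print("\n".join(outline(ET.fromstring(DOC))))
```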
Basic Rules for Well-Formed Documents
Legal: <greetings> <from>Roger</from> <to>John</to> <message>Hi</message> </greetings>
Illegal: <body> <p>Hello world <p>How are you? </body>
<greetings></greetings>
<greetings/>
Basic Rules For Well-Formed Documents
Legal: <tag name='sam'>
Illegal: <tag name=sam>
Illegal: <b><center>Bad XML, but ok HTML</b></center>
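As a rough check of these rules, the legal and illegal fragments can be fed to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper is_well_formed is a made-up name, and the single tags from the slides are completed with a closing tag so that each string is a whole document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Each fragment is paired with the verdict the rules above predict.
EXAMPLES = {
    "<greetings><from>Roger</from></greetings>": True,
    "<greetings></greetings>": True,
    "<greetings/>": True,
    "<tag name='sam'></tag>": True,
    "<body><p>Hello world<p>How are you?</body>": False,  # <p> is never closed
    "<tag name=sam></tag>": False,                         # attribute value not quoted
    "<b><center>Bad XML</b></center>": False,              # elements overlap
    "<from>Roger</from><to>John</to>": False,              # more than one root element
}

for text, expected in EXAMPLES.items():
    print(is_well_formed(text), "(expected", str(expected) + ")", text)
```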
White Space
Are the following the same?
<greetings> Hello World! </greetings>
<greetings>Hello World!</greetings>
White Space
For some applications white space may be important
XML parsers must pass white space to applications
The application decides if the white space is important
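A small sketch of this division of labor, assuming Python's standard xml.etree.ElementTree module: the parser hands the white space over unchanged, and the application (here, a call to strip) decides whether it matters.

```python
import xml.etree.ElementTree as ET

padded = ET.fromstring("<greetings> Hello World! </greetings>")
tight = ET.fromstring("<greetings>Hello World!</greetings>")

# The parser keeps the leading and trailing spaces in the element content.
print(repr(padded.text))
print(padded.text == tight.text)

# An application that does not care about the padding removes it itself.
print(padded.text.strip() == tight.text)
```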
Special Characters
What happens if we need to use < inside an element?
This is illegal XML
<paragraph> Everyone knows that 5 < 10 & 1 > 0. </paragraph>
Need to encode the < and & symbols
<paragraph> Everyone knows that 5 &lt; 10 &amp; 1 > 0. </paragraph>
You can also use a CDATA section
<paragraph><![CDATA[ Everyone knows that 5 < 10 & 1 > 0. ]]> </paragraph>
Standard element content cannot contain: < ]]> &
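Both approaches can be tried out with Python's standard library. This is a sketch rather than a recommended workflow: xml.sax.saxutils.escape does the encoding (it also encodes >, which is allowed but not required), and the parser returns the original text in either case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Everyone knows that 5 < 10 & 1 > 0."

# Option 1: encode the special characters as entity references.
encoded = escape(raw)
print(encoded)
from_entities = ET.fromstring("<paragraph>" + encoded + "</paragraph>")

# Option 2: wrap the raw text in a CDATA section.
from_cdata = ET.fromstring("<paragraph><![CDATA[" + raw + "]]></paragraph>")

# Either way the application sees the original characters.
print(from_entities.text == raw, from_cdata.text == raw)
```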
Entities
Predefined
Entity | Character
&lt; | <
&gt; | >
&amp; | &
&apos; | '
&quot; | "
Valid XML Documents
XML document that is well-formed and conforms to its DTD
<?xml version="1.0" ?> <!DOCTYPE greetings [ <!ELEMENT greetings (#PCDATA)>]> <greetings> Hello World! </greetings>
Why use DTDs & Valid XML?
Valid XML helps ensure the XML is correct
<greetings> <from> <nnammee>Roger</nnammee> </from> <to> <name>World</name> </to> <message> Hi </message> </greetings>
In the above example, humans can see that the tag <nnammee> is an error. However, computer programs consume XML, so how would a program know this is a mistake? If the XML is specified by a DTD, a validating XML parser would catch the mistake.
Greetings Example
<?xml version="1.0" ?>
<!DOCTYPE greetings [
  <!ELEMENT greetings ( from, to, message, date?)>
  <!ELEMENT from ( name )>
  <!ELEMENT to ( name )>
  <!ELEMENT message ( #PCDATA )>
  <!ELEMENT date ( #PCDATA )>
  <!ELEMENT name ( #PCDATA )>
]>
<greetings>
  <from> <name>Roger</name> </from>
  <to> <name>World</name> </to>
  <message> Hi </message>
</greetings>
Greetings Example Explained
Basic DTD structure
<!DOCTYPE rootElementName [ definitions ]>
<!DOCTYPE greetings [
What happens here?
<?xml version="1.0" ?>
<!DOCTYPE greetings [
  <!ELEMENT greetings ( from, to, message, date?)>
  <!ELEMENT from ( name )>
  <!ELEMENT to ( name )>
  <!ELEMENT message ( #PCDATA )>
  <!ELEMENT date ( #PCDATA )>
  <!ELEMENT name ( #PCDATA )>
]>
<greetings>
  <from> <nnammee>Roger</nnammee> </from>
  <to> <name>World</name> </to>
  <message> Hi </message>
</greetings>
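The answer depends on the parser. A validating parser should reject this document, because the DTD requires <from> to contain <name>, not <nnammee>. A non-validating parser only checks well-formedness and accepts it. The sketch below uses Python's standard xml.etree.ElementTree, which is built on the non-validating expat parser; the DECLARED set is copied by hand from the DTD for illustration and is not something the parser provides.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" ?>
<!DOCTYPE greetings [
  <!ELEMENT greetings (from, to, message, date?)>
  <!ELEMENT from (name)>
  <!ELEMENT to (name)>
  <!ELEMENT message (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
]>
<greetings>
  <from><nnammee>Roger</nnammee></from>
  <to><name>World</name></to>
  <message>Hi</message>
</greetings>"""

# Element names declared in the DTD above (copied by hand, an assumption
# of this sketch; a validating parser would read them from the DTD).
DECLARED = {"greetings", "from", "to", "message", "date", "name"}

# No exception is raised: the document is well-formed, and this parser
# does not check it against the DTD.
root = ET.fromstring(DOC)

# A crude manual check that spots the undeclared tag.
undeclared = sorted({element.tag for element in root.iter()} - DECLARED)
print(undeclared)
```

The manual check only compares names; it does not enforce the order and count rules of the content models, which is what a real validating parser does.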
Validating and Non-validating Parsers
Validating XML Parsers
DTD Declarations
What can be declared in a DTD?
Elements
Attributes
Entities
Processing Instructions (PI)
Element Declarations
Empty Element
<!ELEMENT emptyExample EMPTY>
<emptyExample></emptyExample>
<emptyExample/>
Element with no restrictions
<!ELEMENT free ANY>
<free></free>
<free>Anything goes here</free>
<free>Some text <name>Roger</name> </free>
Element Declarations
Elements containing only character data
<!ELEMENT name ( #PCDATA )>
#PCDATA stands for parsed-character data
Entities are expanded in parsed-character data
Elements containing only other elements
<!ELEMENT greetings ( from, to, message, date?)>
Elements containing mixed content
<!ELEMENT sample ( #PCDATA | name)* >
<sample>hi</sample> <sample> <name>Roger</name> Some text <name>Pete</name> <name>Carman</name> More Text </sample>
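Mixed content is where text and subelements interleave, so a program has to keep track of both. A minimal sketch with Python's standard xml.etree.ElementTree follows; the helper name pieces is hypothetical. ElementTree stores text before the first child in .text and text after each child in that child's .tail.

```python
import xml.etree.ElementTree as ET

sample = ET.fromstring(
    "<sample><name>Roger</name> Some text <name>Pete</name>"
    "<name>Carman</name> More Text </sample>"
)


def pieces(element):
    """Return the text runs and child elements of element in document order."""
    parts = []
    if element.text:
        parts.append(("text", element.text))
    for child in element:
        parts.append((child.tag, child.text))
        if child.tail:
            parts.append(("text", child.tail))
    return parts


for kind, value in pieces(sample):
    print(kind, repr(value))
```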
Element Content Operators
And
<!ELEMENT andSample ( A , B)>
"andSample" must contain "A" followed by "B"
Or
<!ELEMENT orSample ( A | B)>
"orSample" must contain either "A" or "B"
Element Content Operators
Optional
<!ELEMENT optionalSample ( A?)>
"optionalSample" may contain zero or one "A"
One or more
<!ELEMENT onePlusSample ( A+)>
Zero or more
<!ELEMENT anyNumberSample ( A*)>
Example
<!ELEMENT article (title, abstract?, author*, (paragraph | table | list )+, reference*) >
What does this mean?
Must start with a title
Then there may be one abstract
Then a list of zero or more authors
Followed by any number of paragraphs, tables and lists
They can be in any order
At least one of them must occur
At the end there can be zero or more references
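The content operators behave much like regular-expression operators, which suggests a way to experiment with the article model. The sketch below is an analogy only, not how XML parsers are implemented: it rewrites the content model by hand as a Python regular expression over the sequence of child element names. The names ARTICLE_MODEL and matches_article are made up for this example.

```python
import re

# Each child element name becomes one token followed by a space, so the
# DTD operators (sequence, ?, *, +, |) map directly onto regex operators.
ARTICLE_MODEL = re.compile(
    r"title "
    r"(abstract )?"
    r"(author )*"
    r"((paragraph|table|list) )+"
    r"(reference )*"
)


def matches_article(children):
    """Return True if the list of child element names fits the content model."""
    sequence = "".join(name + " " for name in children)
    return ARTICLE_MODEL.fullmatch(sequence) is not None


print(matches_article(["title", "paragraph"]))
print(matches_article(["title", "abstract", "author", "list", "table", "reference"]))
print(matches_article(["title", "abstract"]))
```

The last call fails the model because at least one paragraph, table or list is required.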
Attribute List Declarations
General Format
<!ATTLIST elementName attributeName Type Modifier>
Example
<!DOCTYPE student [
  <!ELEMENT student ( #PCDATA )>
  <!ATTLIST student
    name CDATA #REQUIRED
    hometown CDATA "none specified"
    college (arts | agr | law | vet| ilr | engr) #REQUIRED >
]>
<student name='Roger' college='arts'> A Good student </student>
Example Parsed
<student name="Roger" college="arts" hometown="none specified"> A Good student </student>
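The defaulting of hometown can be observed with Python's standard xml.parsers.expat module, which reports attributes taken from declarations unless its specified_attributes option is set. This is a sketch under that assumption; the helper name attributes_seen is hypothetical. Note that expat is non-validating, so it would not complain if a #REQUIRED attribute were missing.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE student [
  <!ELEMENT student (#PCDATA)>
  <!ATTLIST student
    name     CDATA #REQUIRED
    hometown CDATA "none specified"
    college  (arts | agr | law | vet | ilr | engr) #REQUIRED>
]>
<student name='Roger' college='arts'>A Good student</student>"""


def attributes_seen(xml_text):
    """Parse xml_text and return {element name: attributes} as reported by expat."""
    seen = {}

    def start(name, attrs):
        # attrs includes defaults from the ATTLIST declaration.
        seen[name] = dict(attrs)

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return seen


print(attributes_seen(DOC)["student"])
```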
Common Attribute Types
Attribute types determine the type of values an attribute can have
CDATA
Character data - any sequence of characters
<!ATTLIST student name CDATA #REQUIRED> <student name="Roger &amp; Whitney"></student>
NMTOKEN
String of name characters; unlike an element name, it may start with a digit
Can contain digits, letters and some punctuation (. - _ :)
<!ATTLIST student name NMTOKEN #REQUIRED> <student name="Roger1Whitney"></student>
NMTOKENS
List of tokens
<!ATTLIST student name NMTOKENS #REQUIRED> <student name="Roger Whitney"></student>
Common Attribute Types
Enumerated List
List of all possible values the attribute can have
<!ATTLIST student college (arts | agr | law | vet| ilr | engr) #REQUIRED >
<student college="arts"></student>
Attribute Modifiers
#REQUIRED
The attribute must be in the tag
<!ATTLIST student name NMTOKENS #REQUIRED> <student name="Roger Whitney"></student>
Default value
The default value of the attribute
<!ATTLIST student hometown CDATA "none specified"> <!ATTLIST student college (arts | agr | law | engr) "engr" >
#IMPLIED
The attribute is optional
<!ATTLIST student hairColor CDATA #IMPLIED>
Attributes versus Subelements
The following XML documents carry the same information
<?xml version="1.0" ?> <greetings> <from>Roger</from> <to>John</to> <message>Hi</message> </greetings>
<?xml version="1.0" ?> <greetings from="Roger" to="John" message="Hi"/>
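From a program's point of view the two forms differ mainly in how the values are read. A brief sketch with Python's standard xml.etree.ElementTree, where subelement text is fetched with findtext and attribute values with get:

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    "<greetings><from>Roger</from><to>John</to><message>Hi</message></greetings>"
)
as_attributes = ET.fromstring('<greetings from="Roger" to="John" message="Hi"/>')

FIELDS = ("from", "to", "message")

# Subelements are looked up by tag name, attributes by attribute name.
from_elements = {field: as_elements.findtext(field) for field in FIELDS}
from_attributes = {field: as_attributes.get(field) for field in FIELDS}

print(from_elements == from_attributes)
```

One practical difference is that a subelement can repeat or contain further structure, while an attribute name can appear only once per tag and holds a plain string.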
Attributes versus Subelements
Some guidelines [1]
Use an element when:
Entity Declaration
Entities are like macros that are expanded by the parser
<!DOCTYPE roger [ <!ELEMENT roger ( #PCDATA )> <!ENTITY address "1234 Maple Street, San Diego, CA"> ]> <roger> He lives at: &address; </roger>
XML after being parsed
<roger> He lives at: 1234 Maple Street, San Diego, CA </roger>
PCDATA Versus CDATA
XML parsers expand entities in PCDATA content
XML parsers do not expand entities in CDATA sections
CDATA Example
<!DOCTYPE roger [ <!ELEMENT roger ( #PCDATA )> <!ENTITY address "1234 Maple Street, San Diego, CA"> ]> <roger><![CDATA[ He lives at: &address;]]> </roger>
XML after being parsed
<roger> He lives at: &address; </roger>
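The difference can be checked with Python's standard xml.etree.ElementTree, which expands entities declared in the internal DTD subset. This is a sketch using the address entity from the examples above, with the surrounding white space trimmed for readability.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE roger [
  <!ELEMENT roger (#PCDATA)>
  <!ENTITY address "1234 Maple Street, San Diego, CA">
]>
"""

# In ordinary (parsed) character data the entity reference is expanded.
pcdata = ET.fromstring(DTD + "<roger>He lives at: &address;</roger>")

# Inside a CDATA section the same characters are passed through untouched.
cdata = ET.fromstring(DTD + "<roger><![CDATA[He lives at: &address;]]></roger>")

print(pcdata.text)
print(cdata.text)
```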
[1] Ray, pp. 59-61
Copyright ©, All rights reserved.
2003 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA.
OpenContent license defines the copyright on this document.