Posted by : Unknown
Thursday, April 25, 2013
INTRODUCTION:
XML, the Extensible Markup Language, is best described as a means of structuring
data. XML was developed by the XML Working Group, which started out as the
"SGML Editorial Review Board", formed by the W3C (World Wide Web Consortium)
in 1996 and chaired by Jon Bosak. XML is not an extension of HTML, but it can
work with HTML for data display and presentation. XML is a markup language that
can be used on any platform and in any operating system environment, and it is
designed to give developers a mechanism to better describe their content. XML
was developed to provide a structured environment in which developers can
create DTDs (Document Type Definitions) for content that doesn't fit into the
HTML mold. Both Internet Explorer 4 and 5, as well as Netscape Navigator 5,
have support for XML.
ABOUT MARK-UP LANGUAGES
Ø Markup is commonly used to change the look of text by adding formatting, such as bold or italic fonts, text indents, font sizes and font weights.
Ø Markup languages are used to define the structure and meaning of a document and to modify the look and formatting of text.
Ø In addition to formatting text, markup can also be used to determine the structure and meaning of textual elements.
Ø Markup languages use tags embedded directly into the text to describe the various pieces and parts of the text.
Ø Two types of markup languages are in use today:
   - Specific markup
   - Generalized markup
SPECIFIC MARK-UP LANGUAGES:
Ø Specific markup languages are used to generate code that is specific to a particular application.
Ø HTML and RTF were each developed for a specific purpose.
Some limitations are common to specific markup languages:
1. Authors are limited to a particular set of tags.
2. A document might not be portable to other applications, as the data is not self-describing.
3. The language probably has a proprietary way of marking up text that is not compatible with other markup languages.
GENERALISED MARK-UP LANGUAGE:
· A generalized mark-up language (GML) describes the structure and meaning of the text in a document, but it doesn't define how the text is used.
· GMLs are used to generate code that is portable to most applications.
· In the 1970s, Dr. C. F. Goldfarb and two of his colleagues proposed a method of describing text that was not specific to an application. The method rests on two basic principles:
   - The markup should describe the structure of a document and not its formatting or style characteristics.
   - The syntax of the markup should be strictly enforced so that the code can be read clearly by a software program or by a human being.
· The result of these suggestions was GML, developed for IBM.
SGML:
GML was the precursor to the Standard Generalized Markup Language (SGML), which
was adopted as a standard by ISO in 1986. SGML is a meta-language that
facilitates the creation of other languages. SGML is intended to be absolutely
independent of any application. The structure of an SGML document is defined in
a DTD (Document Type Definition). SGML is extensible, which means that it
allows an author to define a particular structure. Almost all languages that
have been created to manipulate documents can trace at least a portion of their
roots back to SGML. HTML is an application of SGML. HTML is not extensible,
which means that HTML cannot be used to create another markup language with its
own rules and purpose.
XML Vs HTML
Ø XML was designed to describe data; HTML was designed to display data.
Ø XML is extensible, i.e. users can define their own tags for markup, whereas HTML uses predefined tags.
Ø In XML, the names in the start tag and end tag of an element must match exactly, including the case of each character; in HTML, tag names may be in either case.
Ø Unlike an HTML document, an XML document opened in Internet Explorer is not rendered as a page; it is displayed as a tree-like structure.
Ø XML parsers check that a document is well-formed; HTML does not require this.
Ø XML relies on XSL and XLL to present its data in the browser the way HTML does.
Ø XML has namespaces to prevent ambiguity between element names; HTML does not need them.
Components of XML:
The classes of markup in XML described here are:
1. Elements.
2. Contents.
3. Attributes.
4. Comments.
Elements:
XML elements describe the meaning of the text they contain. Elements typically
occur in pairs, with a start tag and an end tag that enclose the text they mark
up. Inside the start tag, a keyword indicates the meaning of the markup. The
end tag contains the same keyword with '/' in front of it.
Ex:
<Title>My Title</Title>
Here Title is an element in HTML, which is used to display the title in the
window of a Web browser. In XML you can create your own set of elements.
Ex:
<empname>John Rambo</empname>
Here empname is the element name, which appears in the opening tag (<empname>)
and in the closing tag (</empname>). The two tags together with the data
between them make up an XML element.
Some elements do not come in pairs of tags. An element with no content can be
written as a single tag ending in '/>'; such elements are called empty elements.
Contents:
The information represented by an element in an XML document is called the
content of that element. In the example
<Title>Lion King</Title>
Lion King is the content represented by the Title element.
Attributes:
Attributes provide additional information about elements. Each attribute has a
name and a value. The value can be a number, a word, or a URL, and it is always
written in quotes.
In HTML, you can use an attribute such as color for the font element, as given
below.
<font color="red">Displayed in red</font>
Here color is the attribute name and red is the attribute value.
Comments:
Comments are used to add notes to an XML document. Browsers and XML processors
ignore comments.
To create a comment, type a less-than sign (<) followed by an exclamation point
and two dashes (<!--). Type the text you want to use as the comment. Then end
the comment with two dashes followed by a greater-than sign (-->). For example:
<!-- Writing Comments in XML Documents -->
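The four components described above can be seen from the point of view of a program that reads the document. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the employee document, its id attribute and the retired element are made up for illustration and do not come from any real system.

```python
import xml.etree.ElementTree as ET

# Illustrative document: one attribute, one comment, one element with
# content, and one empty element.
source = """<employee id="42">
  <!-- this note is skipped by the parser -->
  <empname>John Rambo</empname>
  <retired/>
</employee>"""

root = ET.fromstring(source)

print(root.tag)                        # element name
print(root.attrib["id"])               # attribute value
print(root.find("empname").text)       # element content
print(root.find("retired").text)       # an empty element has no content
print([child.tag for child in root])   # the comment is not in the tree
```

With its default settings this parser drops comments while building the tree, which matches the statement that XML processors ignore them.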
XML Documents
XML is a text-based format, similar to HTML in many respects, designed
specifically to store and transmit data. An XML source is made up of XML
elements, each of which consists of a start tag (<title>), an end tag
(</title>), and the information between the two tags (referred to as the
content). Like HTML, an XML document holds text annotated by tags. However,
unlike HTML, XML allows an unlimited set of tags, each indicating not how
something should look, but what something means. For example, an XML element
might be tagged as a price, an order number, or a name. It is up to each
document's author to determine what kind of data to use and which tag names fit
best. XML documents are easy to create. If you are familiar with HTML, you can
quickly learn to author in XML.
In this example, XML is used to describe an entry in an address book. The file
can be saved with the extension .xml, for example ADDRESS.XML.
<?xml version="1.0"?>
<addressBook>
  <person>
    <name>
      <first>Ranjith</first>
      <last>Kumar</last>
    </name>
    <e-mail>prasad_minna@hotmail.com</e-mail>
  </person>
</addressBook>
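To show how an application might read such an address book, here is a short sketch using Python's standard xml.etree.ElementTree module. The function name list_people is hypothetical, and the document is repeated inside the script, with the person element properly closed, so that the example is self-contained.

```python
import xml.etree.ElementTree as ET

address_book = """<?xml version="1.0"?>
<addressBook>
  <person>
    <name>
      <first>Ranjith</first>
      <last>Kumar</last>
    </name>
    <e-mail>prasad_minna@hotmail.com</e-mail>
  </person>
</addressBook>"""


def list_people(xml_text):
    """Return (first, last, e-mail) for every person element."""
    root = ET.fromstring(xml_text)
    people = []
    for person in root.findall("person"):
        # strip() removes whitespace that surrounds the text in the file
        first = person.findtext("name/first").strip()
        last = person.findtext("name/last").strip()
        email = person.findtext("e-mail").strip()
        people.append((first, last, email))
    return people


for first, last, email in list_people(address_book):
    print(f"{first} {last} <{email}>")
```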
Namespaces
XML namespaces let developers qualify element names in a recognizable manner to
avoid conflicts between elements with the same name. Elements referenced in one
document, such as a purchase order, can be defined in different schemas on the
Web. For instance, in a bookstore purchase order, one "title" element could
contain a book title, and another "title" element could contain the author's
title. With namespaces, both elements can exist in the same XML document
instance while referring back to two different schemas, which uniquely
qualifies their semantics. Tags from multiple namespaces can be mixed, which is
essential when data comes from multiple sources across the Web. Namespaces
ensure that element names do not conflict and clarify their origins, but they
do not determine how to process elements. Parsers must still know what elements
mean and how to process them.
The W3C has released XML namespaces as a
recommendation, allowing elements to be subordinate to a URI. This ensures that
names remain unambiguous even if chosen by multiple authors. Just as anyone can
publish their own Web page or view those of others, the namespace facility
allows users to define private dictionaries of terms, or use a public namespace
of common terms.
<orders xmlns:person="http://www.schemas.org/people"
xmlns:dsig="http://dsig.org">
<order>
<sold-to>
<person:name>
<person:last-name>Layman</person:last-name>
<person:first-name>Andrew</person:first-name>
</person:name>
</sold-to>
<sold-on>1997-03-17</sold-on>
<dsig:digital-signature>1234567890</dsig:digital-signature>
</order>
</orders>
This code tells any reader that if a name begins with "dsig:", its meaning is
defined by whoever owns the "http://dsig.org" namespace. Similarly, elements
beginning with the "person:" prefix have meanings defined by the
"http://www.schemas.org/people" namespace.
Namespaces ensure that element names do not conflict, and clarify who defined
which term. They do not give instructions on how to process the elements.
Readers still need to know what the elements mean and decide how to process
them. Namespaces simply keep the names straight.
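As an illustration of how a program sees namespace-qualified names, the sketch below parses the purchase order with Python's standard xml.etree.ElementTree module. The prefixes "p" and "sig" in the lookup table are arbitrary choices made for this example; only the URIs have to match the ones declared in the document.

```python
import xml.etree.ElementTree as ET

orders_xml = """<orders xmlns:person="http://www.schemas.org/people"
        xmlns:dsig="http://dsig.org">
  <order>
    <sold-to>
      <person:name>
        <person:last-name>Layman</person:last-name>
        <person:first-name>Andrew</person:first-name>
      </person:name>
    </sold-to>
    <sold-on>1997-03-17</sold-on>
    <dsig:digital-signature>1234567890</dsig:digital-signature>
  </order>
</orders>"""

# Prefix-to-URI table used only for the queries in this script.
ns = {"p": "http://www.schemas.org/people", "sig": "http://dsig.org"}

root = ET.fromstring(orders_xml)
last_name = root.findtext("order/sold-to/p:name/p:last-name", namespaces=ns)
signature = root.findtext("order/sig:digital-signature", namespaces=ns)

# Internally the parser stores the full URI with the local name.
qualified_tag = root.find(".//p:name", ns).tag

print(last_name, signature)
print(qualified_tag)
```

The parser only keeps the names apart; deciding what a digital-signature element means is still left to the application.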
An author can specify an element's data type (a number, a date, and so on) and
the format of the string's contents. One can use the dt attribute from the data
types namespace at "urn:schemas-microsoft-com:datatypes" for this purpose.
<sold-on dt:dt="date" xmlns:dt="urn:schemas-microsoft-com:datatypes">1997-03-17</sold-on>
Here, "date" specifies that the "sold-on" element's contents are a date in the standard format specified by the data types namespace. As with element names, authors will eventually be able to design their own data types, and also use types shared publicly. Microsoft is working with the W3C to define a set of standard types, and has provided an initial list as part of XML Schema support in Internet Explorer 5.
Document Type Definition
Document Type Definitions (DTDs) can accompany a document, essentially defining
the rules of the document, such as which elements are present and the
structural relationship between the elements. DTDs help to validate the data
when the receiving application does not have a built-in description of the
incoming data. With XML, however, DTDs are optional.
Data that is sent along with a DTD and conforms to it is known as valid XML. In
this case, an XML parser can check incoming data against the rules defined in
the DTD to make sure the data is structured correctly.
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Toe</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don’t forget me this weekend</body>
</note>
Well-Formed XML document
An XML document is said to be well-formed if:
Ø It contains one or more elements.
Ø It has one root element, which contains all the other elements.
Ø The names used in element start tags and end tags match exactly.
Ø Attribute values are enclosed in quotes.
Ø Entities are declared before they are used.
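These rules are what a parser checks before it hands any data to an application. As a rough sketch, the helper below (is_well_formed is a hypothetical name) uses Python's standard xml.etree.ElementTree parser, which is non-validating: it reports well-formedness errors but does not check a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = {
    "matching tags": "<note><to>Toe</to></note>",
    "case mismatch in end tag": "<note><Body>text</body></note>",
    "unquoted attribute value": "<note id=1></note>",
    "two root elements": "<a/><b/>",
    "undeclared entity": "<p>&undeclared;</p>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```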
Schemas:
A schema is a formal specification of the rules of an XML document: it names
the elements and indicates which elements are allowed in a document and in what
combinations.
Schemas are the successors of DTDs.
· XML Schema was originally proposed by Microsoft, but is now a W3C Recommendation.
· XML Schemas are used in Web applications as a replacement for DTDs.
Here are the reasons why:
· XML Schemas are easier to learn than DTDs
· XML Schemas are extensible to future additions
· XML Schemas are richer and more useful than DTDs
· XML Schemas are written in XML
· XML Schemas support data types
· XML Schemas support namespaces
XML Parsers
The XML parser in Internet Explorer 5
can read a string of XML data, process it, generate a structured tree, and
expose all data elements as objects using the DOM. The parser displays this
data using a CSS or XSL style sheet, or makes the data available for further
manipulation by script or hands it off to other applications or objects for
further processing. Namespaces, data types, queries, and XSL transformations
are supported with extended methods available in the DOM.
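· XML parsers typically implement one or both of two basic APIs:
- SAX (Simple API for XML), a de facto standard developed by the XML developer community rather than by the W3C
- DOM (Document Object Model), specified by the W3C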
SAX
· An event-based API such as SAX uses callbacks to report parsing events to the application.
· Events include the start and end of elements, and character data.
· Applications that do not require complex manipulations of the XML structure will find the SAX interfaces very useful.
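The callback style described above can be sketched with Python's standard xml.sax module. The EventLogger class is a hypothetical handler written for this example; it simply records each event it is told about.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Collects the parsing events reported by the SAX parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Ignore whitespace-only character data between elements.
        if content.strip():
            self.events.append(("chars", content.strip()))


handler = EventLogger()
xml.sax.parseString(b"<note><to>Toe</to><from>Jani</from></note>", handler)

for event in handler.events:
    print(event)
```

No tree is built here: the application sees each event once, in document order, and keeps only what it needs.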
DOM
· A tree-based API builds an in-memory tree representation of the XML document.
· It provides classes and methods for an application to navigate and process the tree.
· In general, the DOM interface is most useful for structural manipulations of the XML tree, such as reordering elements, adding or deleting elements and attributes, renaming elements, and so on.
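A few of these structural manipulations are sketched below with xml.dom.minidom, the small DOM implementation in Python's standard library. The priority attribute is invented for the example.

```python
from xml.dom import minidom

doc = minidom.parseString("<note><to>Toe</to><from>Jani</from></note>")
note = doc.documentElement

# Add a new element with text content at the end of <note>.
heading = doc.createElement("heading")
heading.appendChild(doc.createTextNode("Reminder"))
note.appendChild(heading)

# Reorder: move <from> in front of <to>.
to_node = note.getElementsByTagName("to")[0]
from_node = note.getElementsByTagName("from")[0]
note.insertBefore(from_node, to_node)

# Add an attribute, then delete the <to> element.
note.setAttribute("priority", "high")
note.removeChild(to_node)

result = note.toxml()
print(result)
```

Because the whole document is held in memory as a tree, each change is a local operation on nodes, at the cost of loading the full document first.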
Linking in XML
XML document linking is done in two ways.
1. XLink.
2. XPointer.
XLink:
XLink defines two basic types of links.
Simple links:
Ø Simple links effectively emulate the hypertext linking of HTML.
Ø Simple links go in one direction only.
Extended links:
Ø A group of possible destinations from a single source can be defined using these links.
Ø Extended link groups provide a way to manage related information by setting up elements that contain lists of related documents.
XPointer:
XPointer supports addressing into the internal structure of XML documents.
An XPointer can refer to elements, character strings and other parts of a
document.
XSL (eXtensible Stylesheet Language)
An XSL style sheet contains instructions for
how to pull information out of an XML document and transform it into another
format, such as HTML. The transformation of XML into formats, such as HTML, is
done in a declarative way, making it often easier and more accessible than
through scripting. In addition, XSL uses XML as its syntax, freeing XML authors
from having to learn another markup language.
CSS can still be used for simply structured XML data, and in such situations it
will be useful. However, CSS does not provide a display structure that deviates
from the structure of the data source. With XSL, it is possible to generate
presentation structures (in HTML for instance) that are very different from the
original XML data structures.
BENEFITS OF XML
Ø XML promises to simplify and lower the cost of data interchange and publishing in a Web environment.
Ø XML is a text-based syntax that is readable by both computers and humans.
Ø XML offers data portability and reusability across different platforms and devices.
Ø It is also flexible and extensible, allowing new tags to be added without breaking an existing document structure. Based on Unicode, XML provides global language support.
Ø By using XML we can create a new language, e.g. WML, IML, and XHTML.
APPLICATIONS OF XML
XML is poised to play a prominent role as a data interchange
format in business-to-business Web applications such as e-commerce,
supply-chain management, workflow, and application integration. Another use of
XML is for structured information management, including information from
databases. XML also supports media-independent publishing, allowing documents
to be written once and published in multiple media formats and devices. On the
client, XML can be used to create customized views into data.
META CONTENT
"Meta content" refers to information
about contents, such as title, author, file size, creation date, revision
history, keywords, and so on. Meta content can be used for searching, filtering
information, managing documents, and so on.
To show the usefulness of explicit meta
content, let's assume that we want to search for documents that were written by
Bill Clinton. We will get thousands of hits that contain "Bill
Clinton" if we input "Bill Clinton" as the search keyword for
the current search engines. Most of the hits merely mention "Bill
Clinton" in the body of the article and are not written by Bill Clinton
himself. The search would be much more productive if we could express the
search query as "find documents whose Author element contains 'Bill
Clinton'".
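The kind of query described above becomes straightforward once the author is marked up as an element of its own. The sketch below uses Python's standard xml.etree.ElementTree module; the catalog, its element names and the function titles_by_author are all hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical meta content for two documents.
catalog = """<documents>
  <document>
    <Title>Speech transcript</Title>
    <Author>Bill Clinton</Author>
  </document>
  <document>
    <Title>News article mentioning Bill Clinton</Title>
    <Author>Jane Doe</Author>
  </document>
</documents>"""


def titles_by_author(xml_text, author):
    """Return titles of documents whose Author element contains the name."""
    root = ET.fromstring(xml_text)
    return [
        doc.findtext("Title")
        for doc in root.findall("document")
        if author in (doc.findtext("Author") or "")
    ]


print(titles_by_author(catalog, "Bill Clinton"))
```

A plain keyword search would match both entries, because the second title also contains the name; restricting the match to the Author element returns only the first.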
HTML extensions, such as tags that carry meta content inside a page, do not
solve all the problems with meta content. Other resources such as image files,
audio and video files, and other content types may require meta content as
well.
Also, since these tags are inside an HTML document, search engines cannot refer
to the information without downloading the entire HTML file. It is not
efficient to download a 100-KB HTML file just to check whether the TITLE tag
contains a certain character string, particularly when there are hundreds of
such files available from a Web site. However, consider the improvement in
search performance if we put the meta information of all the documents
available on the site into a single meta file.
Because of these reasons, an external meta
content description has received a lot of attention. Because of its
extensibility, flexibility, and readability, XML is considered to be the best
formalism to define a meta content syntax.
Rich document
HTML is powerful enough for formatting Web
pages, but may not be powerful enough for creating large documents that are to
be printed on paper. For example, automatic numbering of chapters and sections
is not supported. And we cannot control page breaking. Further, HTML's
formatting capability for mathematical formulas is severely limited.
With XML it is expected that richer document markup languages
can be defined. With a properly designed style sheet, it should be relatively
easy to design your own markup language.
One trick that is often used is to define
such a language by using the existing HTML tags. Since both XML and HTML
originated from the general document markup language SGML, they have many
things in common. Given an HTML document, it is usually possible to convert it
into a well-formed XML document by adding appropriate end tags. Because normal
browsers ignore any non-HTML tags, they are capable of displaying XML documents
that contain some of the HTML tags such as <h1> and <p>. Such
documents can be processed properly with special-purpose word processors and
can also be displayed on a normal browser with certain fidelity. Microsoft and
Lotus are planning to use an "XML document with mixed HTML and non-HTML
tags" as the native document format for their word processors.
Many three-tier applications extract data from back-end database systems. Usually the results are transformed into the <table> tag of HTML and displayed on the screen. If data is delivered as an XML document that preserves the original information, such as column names and data types, the data can be used by the client for purposes other than just displaying on the screen. For example, it might be possible to load the data into a spreadsheet and do some computations such as calculating sums and averages.
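As a sketch of what the client gains, the example below receives a query result as XML and computes a sum and an average itself instead of only displaying a table. It uses Python's standard xml.etree.ElementTree module; the table layout, the column names and the function column_stats are assumptions made for the example, not the format of any particular database product.

```python
import xml.etree.ElementTree as ET

# Hypothetical query result that keeps the column names and a data type.
result_set = """<table name="sales">
  <row><region>North</region><amount type="int">120</amount></row>
  <row><region>South</region><amount type="int">80</amount></row>
  <row><region>West</region><amount type="int">100</amount></row>
</table>"""


def column_stats(xml_text, column):
    """Return (sum, average) of a numeric column across all rows."""
    root = ET.fromstring(xml_text)
    values = [int(row.findtext(column)) for row in root.findall("row")]
    total = sum(values)
    return total, total / len(values)


total, average = column_stats(result_set, "amount")
print(total, average)
```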
Oracle provides a set of XML parsers for Java,
C, C++, and PL/SQL. Each of these parsers is a stand-alone XML component that
parses an XML document (or a standalone DTD) so that it can be processed by an
application. The parsers support the DOM (Document Object Model) and SAX
(Simple API for XML) interfaces, XML Namespaces, validating and non-validating
modes, and XSL transformations. The parsers are available on all Oracle
platforms.
The v2 versions of the XML Parsers include an
integrated XSL Transformation (XSLT) Processor for transforming XML data using
XSL style sheets. Using the XSLT processor, you can transform XML documents
from XML to XML, HTML, or virtually any other text-based format.
Messaging:
The hottest application area of XML is
messaging. "Messaging" refers to message exchange between
organizations, or between application systems within an organization.
Messaging among companies is represented by
Electronic Data Interchange (EDI). EDI has been widely used in industries like
finance and manufacturing since the 1970s.
EDI has greatly contributed to automating business-to-business (B2B)
transactions. However, even though virtually every corporation is now connected
to a single network and can send messages to anybody else, not all of them are
using EDI. They use traditional means such as fax and telephone to send and
receive orders and invoices. Many small companies cannot participate in the EDI
world because of the cost associated with building and operating an EDI system.
First, for connectivity, many EDI systems require a value-added network (VAN),
not the ubiquitous Internet. There are many reasons why a VAN is more desirable
than the Internet, such as security, reliability, and availability, but, at the
same time, a VAN costs more than the Internet, and the Internet is required by
any business for Web and e-mail anyway. Second, an EDI system is not like
shrink-wrapped software that you can buy at a nearby mall, install on your PC,
and be ready to conduct business. You need to contract with a skilled vendor to
build the EDI system.
It is natural that a small company that already has an Internet connection
wants to do B2B messaging with its partners using inexpensive off-the-shelf
software. Even for a large company that already has an EDI system, B2B on the
Internet should be a good opportunity to invite new small partners to join and
connect to its own infrastructure. Thus, B2B messaging on the Internet,
sometimes termed Internet EDI, is getting attention nowadays.
There are two major technical bottlenecks associated with B2B messaging on the
Internet. One is the security problem. The Internet is a public network, and
there is no built-in protection against attacks such as eavesdropping and
forgery. If messages are stolen or modified during transmission, B2B messaging
will be almost useless. Fortunately, the advancement of public-key based
cryptography has remedied most of the security problems in communication. Using
modern cryptographic protocols such as SSL and S/MIME, communication over the
Internet can be made as secure as over other networks, including VANs and
intranets.
The other technical problem is message format. Here, XML can play its role. As
we discussed, building an EDI system requires high-level skills. Among the
skills needed in building an EDI system is a good understanding of the X12 or
EDIFACT message formats. By at least an order of magnitude, fewer people are
knowledgeable in these formats than know HTML. Since XML and HTML are closely
related, it should be easy for people who are familiar with the basic HTML
syntax to understand the gist of XML. A DTD together with a few message
examples should give a good enough understanding of the message format to begin
building a prototype implementation. So with XML, the "threshold" for
participating in a message-exchanging community can be quite low.
Why XML in Web applications?
We have seen that several areas are
candidates for applying the XML technology for different goals. Admittedly, XML
is not the only way or even the most efficient way to achieve these goals. For
example, why don't we use a binary format instead of a long character string
such as <CurrTemp>70</CurrTemp> to express numeric data? Or, why
don't we use Remote Procedure Call (RPC) or IIOP instead of HTTP+XML?
Apparently, these sophisticated communication methods are much more efficient
in terms of both communication bandwidth and computation power. Then what is
the benefit of using XML for the things described in the previous section?
Simplicity
The largest benefit of XML compared to other
binary formats is simplicity. If a message is defined in a binary format, such
as "the value of parameter X is represented as a four-byte integer in the
network byte order, beginning at the 12th octet from the top of the
message," we would need to look at a hexadecimal dump to understand the
message.
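The contrast can be made concrete with a small sketch. It encodes the same value once as a four-byte integer in network byte order and once as an XML element, using Python's standard struct and xml.etree.ElementTree modules. The 16-byte message layout with the value at offset 12 is an assumption made up for this example, loosely following the wording above.

```python
import struct
import xml.etree.ElementTree as ET

# Binary message: 12 header bytes (zeros here), then a four-byte signed
# integer in network (big-endian) byte order.
binary_message = bytes(12) + struct.pack("!i", 70)

# XML message carrying the same value.
xml_message = "<CurrTemp>70</CurrTemp>"

# Reading the binary form requires knowing the offset and the encoding.
(value_from_binary,) = struct.unpack_from("!i", binary_message, 12)

# Reading the XML form only requires knowing the element name.
value_from_xml = int(ET.fromstring(xml_message).text)

print(binary_message.hex())   # what a person sees in a hexadecimal dump
print(xml_message)            # what a person sees in the XML message
```

The binary form is shorter, but nothing in it says what the value means; the XML form is longer and self-describing.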
XML can represent tree-structured data.
Abstract Syntax Notation 1 (ASN.1) is a binary protocol representation scheme
that has an equivalent data structure, or tree structure. ASN.1 is widely used;
for example, the X.509 digital certificate is defined using ASN.1. It is
carefully designed to minimize data size. Combined with a good ASN.1 parser
implementation, a very efficient protocol engine can be implemented. The
problem is, since ASN.1 is optimized for efficiency, it has complicated bit
arrangements that make it difficult to understand. You can use automatic
protocol generation tools to save some of the effort involved in understanding
ASN.1, but good tools cost thousands of dollars.
In the Internet world, the rule of the game is openness. Even if you have an
unparalleled technology, it will not win in the market without receiving wide
support from the majority of the population. Less efficient but open and
easy-to-understand technologies have a better chance to win in the Internet
world. Everybody agrees that HTTP is more primitive and less efficient than
CORBA IIOP as a request/response protocol. However, there are many more
HTTP-based products available than IIOP-based products.
XML is simple. At least it is human-readable. That is, it can be read, created,
and modified with common text editors. Tags can be named with understandable
strings. Suppose that we have a Web application that we want to promote. As a
Web application, it receives an HTTP request and returns an XML document as a
response rather than relying on more efficient remote procedure call mechanisms
such as IIOP. Since we provide our data in XML, we also publish the DTD that
describes the syntax of the data. Assume that one potential partner becomes
interested in doing business with us. They access our Web site and find that we
provide an XML-based Web application for automatic business-to-business
messaging. They study our DTD and several sample messages where tags are named
in plain English such as <City> and <Date>. Their engineers can jump-start
developing "glue" code that connects their own business application to our Web
application using off-the-shelf XML and Web application tools.
Richness of data structure
XML is simple yet at the same time powerful enough to express complex data
structures. E-mail or HTTP headers have a very simple format, as in:
variable name: value
But they cannot be directly used for representing structured data. How complex
data should be expressed with XML is arguable. Currently an XML document has
essentially a rooted tree structure. Other possible data structures that may be
of interest are tables and graphs.
The table is the logical data structure of
relational databases. In XML, a table must be mapped into a tree; for example,
one row can be represented as one sub tree and a table as a sequence of such
subtrees.
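One possible mapping of a table to a tree is sketched below with Python's standard xml.etree.ElementTree module. The function table_to_xml, the element names table and row, and the sample data are all illustrative choices; other mappings are equally valid.

```python
import xml.etree.ElementTree as ET


def table_to_xml(name, columns, rows):
    """Map a table to a tree: one <row> subtree per table row."""
    table = ET.Element("table", {"name": name})
    for values in rows:
        row = ET.SubElement(table, "row")
        # Each column becomes a child element named after the column.
        for column, value in zip(columns, values):
            ET.SubElement(row, column).text = str(value)
    return table


tree = table_to_xml(
    "employee",
    ["empname", "city"],
    [("John Rambo", "Hope"), ("Jane Doe", "Paris")],
)
serialized = ET.tostring(tree, encoding="unicode")
print(serialized)
```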
We can think of different levels of graphs (for example, higher-order graphs whose node may contain another graph), but more complex structure requires more computation. RDF has a graph-based data model.
For many applications, a tree is general and
powerful enough for expressing fairly complex data but still the notion and
process are simple. It strikes a good balance between expressive power and
simplicity.
International character handling
One substantial benefit of using XML that
should not be underestimated is its capability of handling international
character sets. Even if you are designing a very simple message format, this
point alone is a compelling reason to adopt XML.
Today's businesses are rapidly becoming
international, especially when considering Web applications: the Internet is
obscuring country borders. It is only natural that business transactions
contain street names in Chinese or person names in Arabic. The XML 1.0
specification is defined based on the ISO-10646 (Unicode) character set, so
virtually all the characters that are used today all over the world are legal
characters. (Note: Be careful not to confuse character sets with character
encoding. A character set defines a set of characters regardless of how they
are represented in a binary computer. A character encoding specifies a mapping
between a character set and particular binary representation of the characters.
Therefore, one character set may have many different encodings.)
In addition, XML 1.0 requires that all conformant XML processors must support at least two encodings, UTF-8 and UTF-16. Some XML processors, including the IBM XML for Java Parser, support conversion to/from other locale-specific encodings that are used in different parts of the world. For example, the XML for Java Parser supports 19 encodings including European, Japanese, Chinese, and Korean encodings.
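The difference between a character set and an encoding can be seen by writing the same element in the two required encodings and reading both back. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the street name is just sample text.

```python
import xml.etree.ElementTree as ET

street = ET.Element("street")
street.text = "南京东路"  # sample street name in Chinese

# Same characters, two different byte representations.
as_utf8 = ET.tostring(street, encoding="utf-8")
as_utf16 = ET.tostring(street, encoding="utf-16")

# A conformant parser recovers the same characters from either one.
text_from_utf8 = ET.fromstring(as_utf8).text
text_from_utf16 = ET.fromstring(as_utf16).text

print(len(as_utf8), len(as_utf16))          # byte lengths differ
print(text_from_utf8 == text_from_utf16)    # the characters do not
```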
CONCLUSION:
By comparing XML with other markup languages, we can conclude that XML has
many additional features that other markup languages lack.
The main drawback of HTML is that user-defined tags cannot be used. XML
overcomes this: in XML we can define our own tags.
XML is a useful tool for managing data. XML is an ideal way to send data from
one database to another. We can also use XML when large quantities of data need
to be stored but access to the data is infrequent.
By observing the qualities of XML, we can conclude that XML compares well with
other markup languages because it avoids many of their drawbacks.