XML

Posted by : Unknown Thursday, April 25, 2013

INTRODUCTION:

XML, the Extensible Markup Language, is best described as a means of structuring data. XML was developed by the XML Working Group, which started out as the “SGML Editorial Review Board”, formed under the W3C (World Wide Web Consortium) in 1996 and chaired by Jon Bosak. XML is definitely not an extension of HTML, but it can work with HTML for data display and presentation. XML is a markup language that can be used on any platform or operating system and is designed to give developers a mechanism to better describe their content. It was developed to provide a structured environment in which developers can create DTDs (Document Type Definitions) for content that doesn’t fit into the HTML mold. Both Internet Explorer 4 and 5, as well as Netscape Navigator 5, have support for XML.

ABOUT MARK-UP LANGUAGES

Ø Markup is commonly used to change the look of text by adding formatting, such as bold or italic fonts, text indents, font sizes and font weights.

Ø Markup languages are used to define the structure and meaning of a document and to modify the look and formatting of text.

Ø In addition to formatting text, markup can also be used to determine the structure and meaning of textual elements.

Ø Markup languages use tags embedded directly into the text to describe the various pieces and parts of the text.

Ø Two types of markup languages are in use today:

ª Specific markup

ª Generalized markup

SPECIFIC MARK-UP LANGUAGES:

Ø Specific markup languages are used to generate code that is specific to a particular application.

Ø HTML and RTF were each developed for a specific purpose.

Some limitations are common to specific markup languages:

1. Authors are limited to a particular set of tags.

2. A document might not be portable to other applications, as the data is not self-describing.

3. The language probably has a proprietary way of marking up text that is not compatible with other markup languages.

GENERALIZED MARK-UP LANGUAGES:

· Generalized mark-up language describes the structure and meaning of the text in a document, but it doesn’t define the usage of text.

· GMLs are used to generate code that is portable to most applications.

· In the 1970s, Dr. C. F. Goldfarb and two of his colleagues proposed a method of describing text that was not specific to an application. The method rested on two basic principles:

· The markup should describe the structure of a document and not its formatting or style characteristics.

· The syntax of the markup should be strictly enforced so that the code can be read clearly by a software program or by a human being. The result of these suggestions was GML, developed for IBM.

SGML:

GML was the precursor to the Standard Generalized Markup Language (SGML), which was adopted as a standard by ISO in 1986. SGML is a meta-language that facilitates the creation of other languages, and it is intended to be absolutely independent of any application. Each such language is defined by an SGML DTD (Document Type Definition). SGML is extensible, which means that it allows an author to define a particular structure. Almost all languages that have been created to manipulate documents can trace at least a portion of their roots back to SGML. HTML is an application of SGML. HTML itself is not extensible, which means that it cannot be used to create another markup language with its own rules and purpose.

XML Vs HTML

Ø XML was designed to describe data and HTML was designed to display data.

Ø XML is extensible, i.e. users can define their own tags for markup, whereas HTML uses predefined tags.

Ø In XML, the names in the start and end tags of an element must match exactly, including the case of each character; in HTML, tag names may be written in either case.

Ø An XML document is not rendered the way an HTML document is; Internet Explorer displays it as a tree-like structure.

Ø XML uses parsers to check that a document is well-formed; HTML does not require this.

Ø XML relies on XSL and XLL to display data in the browser the way HTML does.

Ø XML has namespaces to prevent ambiguity in element names; HTML does not need them.

Components of XML:

There are four classes of markup in XML:

1. Elements.

2. Contents.

3. Attributes.

4. Comments.

Elements:

XML elements describe the meaning of the text they contain. Elements typically occur in pairs, with a start tag and an end tag that enclose the text they mark up. Inside the start tag, a keyword indicates the meaning of the markup. The end tag contains the same keyword with ‘/’ in front of it.

Ex:

<Title> My Title </Title>

Here Title is an element in HTML, which is used to display the title in the window of a Web Browser. In XML you can create your own set of elements.

Ex:

<empname>John Rambo</empname>

Here empname is the element name; the content is enclosed between the opening tag (<empname>) and the closing tag (</empname>). The tags together with the data between them make up an XML element.

There are some tags that do not come in pairs. Such tags are called empty elements and are written with a trailing slash, for example <br/>.

Contents:

The information represented by the elements in an XML document is called the content of that element. In the example <Title>Lion King</Title>, Lion King is the content represented by the Title element.

Attributes:

Attributes provide additional information about the elements. Each attribute has a name and a value. The value can be a number, a word, or a URL.

In HTML, you can use an attribute such as color for the font element, as given below.

<Font color="red">Displayed in red</Font>

Here color is the attribute name and red is the attribute value.
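The three ideas covered so far (element, content, attribute) can be tried out with a few lines of Python. This is only a small sketch using the standard library's xml.etree.ElementTree module; the empname element comes from the text above, while the dept attribute is invented for illustration.

```python
import xml.etree.ElementTree as ET

# "empname" is taken from the example above; the "dept" attribute is hypothetical.
element = ET.fromstring('<empname dept="sales">John Rambo</empname>')

name = element.tag          # the element name
content = element.text      # the content between the start and end tags
dept = element.get("dept")  # the value of the "dept" attribute

print(name, content, dept)
```

The parser hands back the element name, its content and its attribute as three separate pieces of data, which is what makes the markup self-describing.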

Comments:

Comments are used to add notes to an XML document. The browser and XML processors ignore comments.

To create a comment, type a less-than sign (<) followed by an exclamation point and two dashes (<!--). Type the text you want to use as the comment. Then end the comment with two dashes followed by a greater-than sign (-->). For example:

<!-- Writing Comments in XML Documents -->

XML Documents

XML is a text-based format, similar to HTML in many respects, designed specifically to store and transmit data. An XML source is made up of XML elements, each of which consists of a start tag (<title>), an end tag (</title>), and the information between the two tags (referred to as the content). Like HTML, an XML document holds text annotated by tags. However, unlike HTML, XML allows an unlimited set of tags, each indicating not how something should look, but what something means. For example, an XML element might be tagged as a price, an order number, or a name. It is up to each document's author to determine what kind of data to use and which tag names fit best. XML documents are easy to create. If you are familiar with HTML, you can quickly learn to author in XML. In this example, XML is used to describe an address book. The file can be saved with the extension .XML, for example ADDRESS.XML.

<?xml version="1.0"?>

<addressBook>

<person>

<name>

<first> Ranjith </first>

<last> Kumar </last>

</name>

<e-mail>prasad_minna@hotmail.com</e-mail>

</person>

</addressBook>
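As a rough sketch of how an application might read such a file, the Python snippet below parses an address book with the standard xml.etree.ElementTree module. The document is embedded as a string, with every element (including person) properly closed so that it parses; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The address book from the text, with all elements closed.
ADDRESS_BOOK = """<?xml version="1.0"?>
<addressBook>
  <person>
    <name>
      <first> Ranjith </first>
      <last> Kumar </last>
    </name>
    <e-mail>prasad_minna@hotmail.com</e-mail>
  </person>
</addressBook>"""

root = ET.fromstring(ADDRESS_BOOK)

people = []
for person in root.findall("person"):
    # findtext follows a simple path of child element names.
    first = person.findtext("name/first").strip()
    last = person.findtext("name/last").strip()
    email = person.findtext("e-mail")
    people.append((first, last, email))

print(people)
```

Because the tags say what each piece of data means, the program asks for "name/first" or "e-mail" by name instead of relying on position or layout.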

Namespaces

XML namespaces let developers qualify element names in a recognizable manner to avoid conflicts between elements with the same name. Elements referenced in one document, such as a purchase order, can be defined in different schemas on the Web. For instance, in a bookstore purchase order, one "title" element could contain a book title, and another "title" element could contain the author's title. With namespaces, both elements can exist in the same XML document instance while referring back to two different schemas, which uniquely qualifies their semantics. Tags from multiple namespaces can be mixed, which is essential when data comes from multiple sources across the Web.

The W3C has released XML namespaces as a recommendation, allowing elements to be subordinate to a URI. This ensures that names remain unambiguous even if chosen by multiple authors. Just as anyone can publish their own Web page or view those of others, the namespace facility allows users to define private dictionaries of terms, or use a public namespace of common terms.

<orders xmlns:person="http://www.schemas.org/people"

        xmlns:dsig="http://dsig.org">

  <order>

    <sold-to>

      <person:name>

        <person:last-name>Layman</person:last-name>

        <person:first-name>Andrew</person:first-name>

      </person:name>

    </sold-to>

    <sold-on>1997-03-17</sold-on>

    <dsig:digital-signature>1234567890</dsig:digital-signature>

  </order>

</orders>

This code tells any reader that if a name begins with "dsig:" its meaning is defined by whoever owns the "http://dsig.org" namespace. Similarly, elements beginning with the "person:" prefix have meanings defined by the "http://www.schemas.org/people" namespace.

Namespaces ensure that element names do not conflict, and clarify who defined which term. They do not give instructions on how to process the elements. Readers still need to know what the elements mean and decide how to process them. Namespaces simply keep the names straight.
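The following Python sketch shows how a program could look up namespace-qualified elements in the orders document using the standard xml.etree.ElementTree module. The short prefixes "p" and "sig" in the lookup table are arbitrary names chosen for this script; only the namespace URIs have to match the document.

```python
import xml.etree.ElementTree as ET

ORDERS = """<orders xmlns:person="http://www.schemas.org/people"
        xmlns:dsig="http://dsig.org">
  <order>
    <sold-to>
      <person:name>
        <person:last-name>Layman</person:last-name>
        <person:first-name>Andrew</person:first-name>
      </person:name>
    </sold-to>
    <sold-on>1997-03-17</sold-on>
    <dsig:digital-signature>1234567890</dsig:digital-signature>
  </order>
</orders>"""

root = ET.fromstring(ORDERS)

# Prefixes here are local to this script; the URIs identify the namespaces.
ns = {"p": "http://www.schemas.org/people", "sig": "http://dsig.org"}

last_name = root.findtext("order/sold-to/p:name/p:last-name", namespaces=ns)
signature = root.find("order/sig:digital-signature", ns)

print(last_name)
print(signature.tag)  # the parser stores the name together with its namespace URI
```

Internally the parser keeps each qualified name as the URI plus the local name, which is why two "title" or "name" elements from different namespaces never collide.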

An author can specify an element's data type (a number, a date, and so on) and the format of the string's contents. One can use the dt attribute from the data types namespace at "urn:schemas-microsoft-com:datatypes" for this purpose.

<sold-on dt:dt="date"

         xmlns:dt="urn:schemas-microsoft-com:datatypes">1997-03-17</sold-on>

Here, "date" specifies that the "sold-on" element's contents are a date in the standard format specified by the data types namespace. As with element names, authors will eventually be able to design their own data types, and also use types shared publicly. Microsoft is working with the W3C to define a set of standard types, and has provided an initial list as part of XML Schema support in Internet Explorer 5.

Document Type Definition

Document Type Definitions (DTDs) can accompany a document, essentially defining the rules of the document, such as which elements are present and the structural relationship between the elements. DTDs help to validate the data when the receiving application does not have a built-in description of the incoming data. With XML, however, DTDs are optional.

Data that is sent along with a DTD and conforms to it is known as valid XML. In this case, an XML parser can check incoming data against the rules defined in the DTD to make sure the data is structured correctly.

<?xml version="1.0"?>

<!DOCTYPE note [

<!ELEMENT note (to,from,heading,body)>

<!ELEMENT to (#PCDATA)>

<!ELEMENT from (#PCDATA)>

<!ELEMENT heading (#PCDATA)>

<!ELEMENT body (#PCDATA)>

]>

<note>

<to>Toe</to>

<from>Jani</from>

<heading>Reminder</heading>

<body>Don’t forget me this weekend</body>

</note>

Well-Formed XML document

An XML document is said to be well-formed if:

Ø It contains one or more elements.

Ø It has one root element which contains all the other elements.

Ø The names used in element start tags and end tags match exactly.

Ø Attribute values are enclosed in quotes.

Ø Entities are declared before they are used.
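The rules above can be checked by simply handing a document to a parser. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are invented for illustration. It tests well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples, one per rule in the list above.
samples = {
    "good": "<note><to>Toe</to><body>Reminder</body></note>",
    "two roots": "<note/><note/>",
    "tag case mismatch": "<note><Body>Reminder</body></note>",
    "unquoted attribute": "<note lang=en/>",
    "undeclared entity": "<note>&unknown;</note>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
print(results)
```

Only the first sample is accepted; each of the others breaks one of the rules and is rejected by the parser before the application ever sees the data.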

Schemas:

A schema is a formal specification of the rules of an XML document: it names the elements and indicates which elements are allowed in a document and in what combinations.

Schemas are the successors of DTDs.

· XML Schema was originally proposed by Microsoft, and became a W3C Recommendation in 2001.

· XML Schemas are used in Web applications as a replacement for DTDs.

Here are the reasons why:

Ø XML Schemas are easier to learn than DTDs.

Ø XML Schemas are extensible to future additions.

Ø XML Schemas are richer and more useful than DTDs.

Ø XML Schemas are written in XML.

Ø XML Schemas support data types.

Ø XML Schemas support namespaces.

XML Parsers

The XML parser in Internet Explorer 5 can read a string of XML data, process it, generate a structured tree, and expose all data elements as objects using the DOM. The parser displays this data using a CSS or XSL style sheet, or makes the data available for further manipulation by script or hands it off to other applications or objects for further processing. Namespaces, data types, queries, and XSL transformations are supported with extended methods available in the DOM.

· XML parsers typically implement one or both of two basic APIs (DOM is specified by the W3C; SAX is a de facto standard developed by the XML-DEV community):

- SAX (Simple API for XML)

- DOM (Document Object Model)

SAX

· An event-based API such as SAX uses callbacks to report parsing events to the application.

· Events include the start and end of elements and character data.

· Applications that do not require complex manipulations of the XML structure will find the SAX interfaces very useful.
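The callback style described above can be sketched with Python's standard xml.sax module. The EventRecorder class and the tiny note document are made up for this illustration; a real handler would act on the events instead of just recording them.

```python
import xml.sax

class EventRecorder(xml.sax.ContentHandler):
    """Collects the parsing events reported by the SAX parser."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Character data may arrive in several pieces; skip whitespace-only ones.
        if content.strip():
            self.events.append(("chars", content.strip()))

handler = EventRecorder()
xml.sax.parseString(b"<note><to>Toe</to><from>Jani</from></note>", handler)

for event in handler.events:
    print(event)
```

The parser never builds a tree: it walks through the document once and calls the handler as it goes, which keeps memory use low for large documents.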

DOM

· A tree-based API builds an in-memory tree representation of the XML document.

· It provides classes and methods for an application to navigate and process the tree.

· In general, the DOM interface is most useful for structural manipulations of the XML tree, such as reordering elements, adding or deleting elements and attributes, renaming elements, and so on.
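For comparison, here is a small sketch of the tree-based style using Python's standard xml.dom.minidom module. The note document and the priority attribute are invented for the example; it adds an element with an attribute, reorders two elements, and deletes one.

```python
from xml.dom import minidom

doc = minidom.parseString("<note><to>Toe</to><from>Jani</from></note>")
note = doc.documentElement

# Add a new element with an attribute and some text.
heading = doc.createElement("heading")
heading.setAttribute("priority", "high")
heading.appendChild(doc.createTextNode("Reminder"))
note.appendChild(heading)

# Reorder: move <from> in front of <to>.
to_el = note.getElementsByTagName("to")[0]
from_el = note.getElementsByTagName("from")[0]
note.insertBefore(from_el, to_el)

# Delete the <to> element.
note.removeChild(to_el)

result = note.toxml()
print(result)
```

Since the whole document is held in memory as a tree, any node can be reached and changed in any order, at the cost of more memory than an event-based parser needs.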

Linking in XML

XML document linking is done in two ways:

1. XLink.

2. XPointer.

XLink:

XLink defines two basic types of links.

Simple Links:

Ø Simple links effectively emulate the hypertext linking of HTML.

Ø Simple links are only in one direction.

Extended links:

Ø A group of possible destinations from a single source can be defined using these links.

Ø Extended link groups provide a way to manage related information by setting up elements that contain lists of related documents.

XPointer:

XPointer supports addressing into the internal structure of XML documents. It can reference elements, character strings, and other parts of a document.

XSL (eXtensible Style sheet Language)

An XSL style sheet contains instructions for how to pull information out of an XML document and transform it into another format, such as HTML. The transformation of XML into formats, such as HTML, is done in a declarative way, making it often easier and more accessible than through scripting. In addition, XSL uses XML as its syntax, freeing XML authors from having to learn another markup language.

CSS can still be used for simply structured XML data, and in such situations it will be useful. However, CSS does not provide a display structure that deviates from the structure of the data source. With XSL, it is possible to generate presentation structures (in HTML for instance) that are very different from the original XML data structures.

BENEFITS OF XML

Ø XML promises to simplify and lower the cost of data interchange and publishing in a Web environment.

Ø XML is a text-based syntax that is readable by both computers and humans.

Ø XML offers data portability and reusability across different platforms and devices.

Ø It is also flexible and extensible, allowing new tags to be added without breaking an existing document structure. Based on Unicode, XML provides global language support.

Ø By using XML we can create new languages, e.g. WML, IML, and XHTML.

APPLICATIONS OF XML

XML is poised to play a prominent role as a data interchange format in business-to-business Web applications such as e-commerce, supply-chain management, workflow, and application integration. Another use of XML is for structured information management, including information from databases. XML also supports media-independent publishing, allowing documents to be written once and published in multiple media formats and devices. On the client, XML can be used to create customized views into data.

META CONTENT

"Meta content" refers to information about contents, such as title, author, file size, creation date, revision history, keywords, and so on. Meta content can be used for searching, filtering information, managing documents, and so on.

To show the usefulness of explicit meta content, let's assume that we want to search for documents that were written by Bill Clinton. We will get thousands of hits that contain "Bill Clinton" if we input "Bill Clinton" as the search keyword for the current search engines. Most of the hits merely mention "Bill Clinton" in the body of the article and are not written by Bill Clinton himself. The search would be much more productive if we could express the search query as "find documents whose Author element contains 'Bill Clinton'".
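The difference between the two kinds of search can be sketched in Python with the standard xml.etree.ElementTree module. The catalogue below, including the element names doc, Title, Author and Body and the second record, is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue of documents with explicit meta content.
DOCS = """<docs>
  <doc>
    <Title>My Life</Title>
    <Author>Bill Clinton</Author>
    <Body>A memoir.</Body>
  </doc>
  <doc>
    <Title>Nineties Politics</Title>
    <Author>Jane Doe</Author>
    <Body>Bill Clinton won the election in 1992.</Body>
  </doc>
</docs>"""

root = ET.fromstring(DOCS)
query = "Bill Clinton"

# Keyword search: matches the string anywhere in the document.
keyword_hits = [d.findtext("Title") for d in root
                if query in "".join(d.itertext())]

# Structured search: matches only inside the Author element.
author_hits = [d.findtext("Title") for d in root
               if query in (d.findtext("Author") or "")]

print(keyword_hits)
print(author_hits)
```

The keyword search returns both documents, while the structured query returns only the one actually written by the author in question.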

HTML extensions do not solve all the problems with meta content. Other resources such as image files, audio and video files, and other content types may require meta content extensions as well.

Since meta tags are embedded inside an HTML document, search engines cannot refer to the information without downloading the entire HTML file. It is not efficient to download a 100-KB HTML file just to check whether the TITLE tag contains a certain character string, particularly when there are hundreds of such files available from a Web site. However, consider the improvement in search performance if we put the meta information of all the documents available on the site into a single meta file.

Because of these reasons, an external meta content description has received a lot of attention. Because of its extensibility, flexibility, and readability, XML is considered to be the best formalism to define a meta content syntax.

Rich document

HTML is powerful enough for formatting Web pages, but may not be powerful enough for creating large documents that are to be printed on paper. For example, automatic numbering of chapters and sections is not supported. And we cannot control page breaking. Further, HTML's formatting capability for mathematical formulas is severely limited.

With XML it is expected that richer document markup languages can be defined. With a properly designed style sheet, it should be relatively easy to design your own markup language.

One trick that is often used is to define such a language by using the existing HTML tags. Since both XML and HTML originated from the general document markup language SGML, they have many things in common. Given an HTML document, it is usually possible to convert it into a well-formed XML document by adding appropriate end tags. Because normal browsers ignore any non-HTML tags, they are capable of displaying XML documents that contain some of the HTML tags such as <h1> and <p>. Such documents can be processed properly with special-purpose word processors and can also be displayed on a normal browser with certain fidelity. Microsoft and Lotus are planning to use an "XML document with mixed HTML and non-HTML tags" as the native document format for their word processors.

Many three-tier applications extract data from back-end database systems. Usually the results are transformed into the <table> tag of HTML and displayed on the screen. If data is delivered as an XML document that preserves the original information, such as column names and data types, the data can be used by the client for purposes other than just displaying on the screen. For example, it might be possible to load the data into a spreadsheet and do some computations such as calculating sums and averages.
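As a rough illustration of this idea, the Python sketch below takes a query result delivered as XML and computes a sum and an average on the client side. The table, row, region and amount element names and the figures are made up; a real application would use whatever names its back end defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical query result that keeps column names and a type hint.
RESULT = """<table name="sales">
  <row><region>North</region><amount type="int">120</amount></row>
  <row><region>South</region><amount type="int">80</amount></row>
  <row><region>East</region><amount type="int">100</amount></row>
</table>"""

root = ET.fromstring(RESULT)

# Because the column name survives in the markup, the client can pick it out.
amounts = [int(row.findtext("amount")) for row in root.findall("row")]
total = sum(amounts)
average = total / len(amounts)

print(total, average)
```

Had the same data arrived only as an HTML table, the client would have to guess which cell holds the amount before doing any arithmetic.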

Oracle provides a set of XML parsers for Java, C, C++, and PL/SQL. Each of these parsers is a stand-alone XML component that parses an XML document (or a standalone DTD) so that it can be processed by an application. The parsers support the DOM (Document Object Model) and SAX (Simple API for XML) interfaces, XML Namespaces, validating and non-validating modes, and XSL transformations. The parsers are available on all Oracle platforms.

The v2 versions of the XML Parsers include an integrated XSL Transformation (XSLT) Processor for transforming XML data using XSL style sheets. Using the XSLT processor, you can transform XML documents from XML to XML, HTML, or virtually any other text-based format.

Messaging:

The hottest application area of XML is messaging. "Messaging" refers to message exchange between organizations, or between application systems within an organization.

Messaging among companies is represented by Electronic Data Interchange (EDI). EDI has been widely used in industries like finance and manufacturing since the 1970s.

EDI has greatly contributed to automating business-to-business (B2B) transactions. However, even though virtually every corporation is now connected to a single network and can send messages to anybody else, not all of them are using EDI. Many still use traditional means such as fax and telephone to send and receive orders and invoices. Many small companies cannot participate in the EDI world because of the cost associated with building and operating an EDI system. First, for connectivity, many EDI systems require a value-added network (VAN), not the ubiquitous Internet. There are many reasons why a VAN is more desirable than the Internet, such as security, reliability, and availability, but a VAN costs more than the Internet, and any business needs an Internet connection for Web and e-mail anyway. Second, an EDI system is not like shrink-wrapped software that you can buy at a nearby mall, install on your PC, and use to conduct business right away. You need to contract with a skilled vendor to build the EDI system.

It is natural that a small company that already has an Internet connection wants to do B2B messaging with its partners using inexpensive off-the-shelf software. Even for a large company that already has an EDI system, B2B on the Internet is a good opportunity to invite new small partners to connect to its infrastructure. Thus, B2B messaging on the Internet, sometimes termed Internet EDI, is getting attention nowadays.

There are two major technical bottlenecks associated with B2B messaging on the Internet. One is the security problem. The Internet is a public network, and there is no protection against attacks such as eavesdropping and forgery. If messages are stolen or modified during transmission, B2B messaging will be almost useless. Fortunately, the recent advancement of public-key based cryptography has remedied most of the security problems in communication. Using modern cryptographic protocols such as SSL and S/MIME, the Internet became as secure as any other network, including VANs and intranets.

The other technical problem is message format. Here, XML can play its role. As we discussed, building an EDI system requires high-level skills. Among the skills needed is a good understanding of the X12 or EDIFACT message formats. Far fewer people, by at least an order of magnitude, are knowledgeable in these terse, delimiter-based formats than know HTML. Since XML and HTML are closely related, it should be easy for people who are familiar with the basic HTML syntax to understand the gist of XML. A DTD together with a few message examples should give a good enough understanding of the message format to begin building a prototype implementation. So with XML, the "threshold" for participating in a message-exchanging community can be quite low.

Why XML in Web applications?

We have seen that several areas are candidates for applying the XML technology for different goals. Admittedly, XML is not the only way or even the most efficient way to achieve these goals. For example, why don't we use a binary format instead of a long character string such as <CurrTemp>70</CurrTemp> to express numeric data? Or, why don't we use Remote Procedure Call (RPC) or IIOP instead of HTTP+XML? Apparently, these sophisticated communication methods are much more efficient in terms of both communication bandwidth and computation power. Then what is the benefit of using XML for the things described in the previous section?

Simplicity

The largest benefit of XML compared to binary formats is simplicity. If a message is defined in a binary format, such as "the value of parameter X is represented as a four-byte integer in the network byte order, beginning at the 12th octet from the top of the message," we would need to look at a hexadecimal dump to understand the message.
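To make the contrast concrete, the Python sketch below builds the same value both ways. The binary layout is hypothetical: it assumes twelve header bytes (all zero here) followed by X as a four-byte integer in network byte order, with "the 12th octet" read as offset 12 counting from zero.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical binary message: 12 header bytes, then X as a four-byte
# integer in network (big-endian) byte order.
binary_message = bytes(12) + struct.pack("!i", 70)
hex_dump = binary_message.hex()

# Reading X back requires knowing the offset and the encoding in advance.
(x_binary,) = struct.unpack_from("!i", binary_message, 12)

# The XML form carries its own label.
xml_message = "<CurrTemp>70</CurrTemp>"
x_xml = int(ET.fromstring(xml_message).text)

print(hex_dump)
print(x_binary, x_xml)
```

The binary message is 16 bytes against 23 characters of XML, but nothing in the hexadecimal dump says which bytes are the temperature, whereas the XML message can be understood at a glance.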

XML can represent tree-structured data. Abstract Syntax Notation 1 (ASN.1) is a binary protocol representation scheme that has an equivalent data structure, or tree structure. ASN.1 is widely used; for example, the X.509 digital certificate is defined using ASN.1. It is carefully designed to minimize data size. Combined with a good ASN.1 parser implementation, a very efficient protocol engine can be implemented. The problem is, since ASN.1 is optimized for efficiency, it has complicated bit arrangements that make it difficult to understand. You can use automatic protocol generation tools to save some of the effort involved in understanding ASN.1, but good tools cost thousands of dollars.

In the Internet world, the rule of the game is openness. Even if you have an unparalleled technology, it will not win in the market without receiving wide support from the majority of the population. Less efficient but open and easy-to-understand technologies have a better chance to win in the Internet world.

Everybody agrees that HTTP is more primitive and less efficient than CORBA IIOP as a request/response protocol. However, there are many more HTTP-based products available than IIOP-based products.

XML is simple. At least it is human-readable: it can be read, created, and modified with common text editors, and tags can be named with understandable strings. Suppose that we have a Web application that we want to promote. It receives an HTTP request and returns an XML document as a response rather than relying on a more efficient remote procedure call mechanism such as IIOP. Since we provide our data in XML, we also publish the DTD that describes the syntax of the data. Assume that one potential partner becomes interested in doing business with us. They access our Web site and find that we provide an XML-based Web application for automatic business-to-business messaging. They study our DTD and several sample messages where tags are named in plain English such as <City> and <Date>. Their engineers can jump-start developing "glue" code that connects their own business application to our Web application using off-the-shelf XML and Web application tools.

Richness of data structure

XML is simple yet at the same time powerful enough to express complex data structures. E-mail or HTTP headers have a very simple format as in:

variable name: value

But they cannot be directly used for representing structured data. How complex data should be expressed with XML is arguable. Currently an XML document has essentially a rooted tree structure. Other possible data structures that may be of interest are tables and graphs.

The table is the logical data structure of relational databases. In XML, a table must be mapped into a tree; for example, one row can be represented as one subtree and a table as a sequence of such subtrees.
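One possible mapping from a table to a tree is sketched below in Python with the standard xml.etree.ElementTree module. The column names, the sample rows and the table/row element names are all invented for illustration; other mappings (for example, columns as attributes) are equally possible.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational table: column names plus two rows.
columns = ["id", "name", "city"]
rows = [(1, "Ranjith", "Chennai"), (2, "Andrew", "Seattle")]

# One <row> subtree per table row, one child element per column.
table = ET.Element("table")
for values in rows:
    row = ET.SubElement(table, "row")
    for column, value in zip(columns, values):
        ET.SubElement(row, column).text = str(value)

xml_text = ET.tostring(table, encoding="unicode")
print(xml_text)
```

Each row becomes a subtree under the table element, so the table as a whole is just a sequence of identically shaped subtrees.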

We can think of different levels of graphs (for example, higher-order graphs whose nodes may contain another graph), but a more complex structure requires more computation. RDF has a graph-based data model.

For many applications, a tree is general and powerful enough for expressing fairly complex data but still the notion and process are simple. It strikes a good balance between expressive power and simplicity.

International character handling

One substantial benefit of using XML that should not be underestimated is its capability of handling international character sets. Even if you are designing a very simple message format, this point alone is a compelling reason to adopt XML.

Today's businesses are rapidly becoming international, especially when considering Web applications: the Internet is obscuring country borders. It is only natural that business transactions contain street names in Chinese or person names in Arabic. The XML 1.0 specification is defined based on the ISO-10646 (Unicode) character set, so virtually all the characters that are used today all over the world are legal characters. (Note: Be careful not to confuse character sets with character encoding. A character set defines a set of characters regardless of how they are represented in a binary computer. A character encoding specifies a mapping between a character set and particular binary representation of the characters. Therefore, one character set may have many different encodings.)

In addition, XML 1.0 requires that all conformant XML processors must support at least two encodings, UTF-8 and UTF-16. Some XML processors, including the IBM XML for Java Parser, support conversion to/from other locale-specific encodings that are used in different parts of the world. For example, the XML for Java Parser supports 19 encodings including European, Japanese, Chinese, and Korean encodings.
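The point about character sets versus encodings can be seen in a short Python sketch. It assumes the standard xml.etree.ElementTree module; the address document, the Chinese street name and the Arabic personal name are made-up sample data. The same characters are serialized once as UTF-8 and once as UTF-16, and both byte streams parse back to identical text.

```python
import xml.etree.ElementTree as ET

street = "长安街"                 # a street name in Chinese
person = "\u0639\u0644\u064a"     # a personal name in Arabic, written with escapes

template = ('<?xml version="1.0" encoding="{enc}"?>'
            "<address><street>{s}</street><name>{n}</name></address>")

# One character set (Unicode), two different encodings of it.
utf8_bytes = template.format(enc="UTF-8", s=street, n=person).encode("utf-8")
utf16_bytes = template.format(enc="UTF-16", s=street, n=person).encode("utf-16")

from_utf8 = ET.fromstring(utf8_bytes)
from_utf16 = ET.fromstring(utf16_bytes)

same_street = from_utf8.findtext("street") == from_utf16.findtext("street")
same_name = from_utf8.findtext("name") == from_utf16.findtext("name")
print(same_street, same_name, len(utf8_bytes), len(utf16_bytes))
```

The two byte sequences differ in content and length, yet the application sees exactly the same characters after parsing.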

CONCLUSION:

By comparing XML with other markup languages, we can conclude that XML has many additional features that other markup languages lack.

The main drawback of HTML is that user-defined tags cannot be used. XML overcomes this: in XML we can define our own tags.

XML is a useful tool for managing data. It is an ideal way to send data from one database to another. We can also use XML when large quantities of data need to be stored but access to the data is infrequent.

From these qualities we can conclude that XML compares favorably with other markup languages because it avoids many of their drawbacks.
