Data Mining
Posted by: Unknown
Sunday, June 30, 2013
Introduction:
Modern organizations view information as one of their most valuable assets, because they want to respond quickly to changes in the market. To do this they need rapid access to all kinds of information before they can make sound decisions. To make the right choices for the organization, it is essential to be able to study the past and identify relevant trends. Trend analysis of this kind requires access to all the supporting information, and this information is mainly stored in very large operational databases, which are designed neither to store historical data nor to answer decision-support queries. The easiest way to gain access to this data and facilitate effective decision making is to set up a data warehouse, designed especially for decision-support queries; only the data that is needed for decision support is extracted from the operational systems and stored in the warehouse.
Data mining and KDD (knowledge discovery in databases) techniques can then be applied to the existing data warehouse, and the part of the information that is of interest is extracted for trend analysis. Mining operational data directly is almost impossible: it mixes different kinds of attributes and data types and holds no historical data. With a data warehouse this problem does not exist, because the relevant information has already been transferred from the operational databases into the warehouse.
Data mining, the extraction of hidden predictive information from large databases, is a powerful technology with great potential to help companies focus on the most important information in their data warehouses. It is, however, a tedious process, because a data warehouse contains large quantities of data that may be noisy, incomplete and heterogeneous. Despite this complexity, industry surveys indicated that over 80% of Fortune 500 companies viewed data mining as a critical factor for business success by the year 2000, which shows the importance attached to it.
Definition:
It is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
Speaking informally, Data Mining is the "automatic" extraction of patterns of information from historical data, enabling companies to focus on the most important aspects of their business -- telling them what they did not know and had not even thought of asking.
Knowledge Discovery in Databases (KDD):
There is some confusion about the exact meaning of the terms ‘data mining’ and ‘KDD’, with many authors regarding them as synonyms. At the first international KDD conference in Montreal in 1995, it was proposed that the term KDD be used to describe the whole process of extracting knowledge from data.
The official definition of KDD is:
‘The nontrivial extraction of implicit, previously unknown and potentially useful knowledge from data’ – the knowledge must be new, not obvious, and one must be able to use it.
The KDD process:
In principle, the knowledge discovery process consists of six stages:
# Data selection
# Cleaning
# Enrichment
# Coding
# Data mining
# Reporting
The fifth stage, data mining, is the phase of real discovery. At every stage the data miner can step back one or more phases. For instance, while in the coding or data mining phase, the data miner might realize that the cleaning phase is incomplete, or might discover new data and use it to enrich the existing data sets.
Explanation:
The KDD process and data mining are explained here using an example that deals with the database of a magazine publisher.
The publisher sells five types of magazines – on cars, houses, sports, music and comics. The aim of the data mining process is to find new, interesting clusters of clients in order to set up a marketing campaign. We are therefore interested in questions such as ‘What is the typical profile of a reader of a car magazine?’ and ‘Is there any correlation between an interest in cars and an interest in comics?’. This is where the KDD process comes into play.
Data selection:
We start with a rough database containing records of subscription data for the magazines. It is a selection of operational data from the publisher’s invoicing system and contains information about people who have subscribed to a magazine. The records consist of: client number, name, address, date of subscription, and type of magazine. In order to facilitate the KDD process, a copy of this operational data is drawn and stored in a separate database.
Client number | Name    | Address          | Date purchased | Magazine purchased
23003         | Johnson | 1 downing street | …              | Car
23003         | Johnson | 1 downing street | …              | Music
23003         | Johnson | 1 downing street | …              | Comic
23009         | King    | 2 boulevard      | …              | Comic
23013         | Jonson  | 3 high road      | …              | Sports
23019         | …       | 1 downing street | …              | House

1. original data
Cleaning:
There are several types of cleaning process, some of which can be executed in advance, while others are invoked only after pollution is detected at the coding or discovery stage.
A very important element of the cleaning operation is the de-duplication of records. In a normal client database some clients will be represented by several records; in many cases this is the result of negligence, such as typing errors, or of clients moving from one place to another without notifying a change of address. Although data mining and data cleaning are two different disciplines, they have a lot in common, and pattern recognition algorithms can be applied in cleaning data.
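To make this concrete, here is a minimal de-duplication sketch in Python with pandas. The client numbers, names and addresses come from the example tables; the column names and the merging rule (records with the same normalized name and address receive the lowest client number) are assumptions for illustration, and a misspelling such as ‘Jonson’ would need fuzzier matching than this.

import pandas as pd

# De-duplication sketch: records with the same normalised name and address are
# treated as one client and all receive the lowest client number in the group.
df = pd.DataFrame({
    "client_no": [23003, 23009, 23013, 23019],
    "name":      ["Johnson", "King", "Jonson", "Johnson"],
    "address":   ["1 downing street", "2 boulevard", "3 high road", "1 downing street"],
})

key = df["name"].str.lower() + "|" + df["address"].str.lower()
df["client_no"] = df.groupby(key)["client_no"].transform("min")
print(df)   # 23019 collapses onto 23003; "Jonson" stays separate without fuzzy matching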
Client number | Name    | Address          | Date purchased | Magazine purchased
23003         | Johnson | 1 downing street | …              | Car
23003         | Johnson | 1 downing street | …              | Music
23003         | Johnson | 1 downing street | …              | Comic
23009         | King    | 2 boulevard      | …              | Comic
23013         | jonson  | 3 high road      | …              | Sports
23003         | Johnson | 1 downing street | …              | House

2. de-duplication
The second type of pollution that frequently occurs is lack of domain consistency. Note that in our original table two records are dated 1 January 1901, although the company probably did not even exist at that time.
Client number | Name    | Address          | Date purchased | Magazine purchased
23003         | Johnson | 1 downing street | …              | Car
23003         | Johnson | 1 downing street | …              | Music
23003         | Johnson | 1 downing street | …              | Comic
23009         | King    | 2 boulevard      | null           | Comic
23013         | Johnson | 3 high road      | …              | Sports
23003         | Johnson | 1 downing street | …              | House

3. domain consistency
Enrichment:
We will suppose that we have purchased extra information about our clients consisting of date of birth, income, amount of credit and whether or not an individual owns a car or a house.
Client name | Date of birth | Income   | Credit   | Car owner | House owner
Johnson     | …             | $ 18,500 | $ 17,800 | No        | No
King        | …             | $ 36,000 | $ 26,000 | Yes       | No

4. enrichment
This is more realistic than it may initially seem, since it is quite possible to buy demographic data on average incomes for a certain neighborhood, and car and house ownership can also be traced fairly easily. Alternatively, we can interview small sub-samples of the client database, which will give us very detailed information on the customers’ behavior. Note that this new information can easily be joined to the existing client records.
Client no | Name    | Dob  | Income  | Credit  | Car owner | House owner | Address | Date purchased | Magazine purchased
23003     | Johnson | …    | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Car
23003     | Johnson | …    | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Music
23003     | Johnson | …    | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Comic
23009     | King    | …    | $36,000 | $26,600 | Yes       | No          | 2 B.V   | null           | Comic
23013     | Johnson | null | null    | null    | null      | null        | 3 H.R   | …              | Sports
23003     | Johnson | …    | $17,800 | $17,800 | No        | No          | 1 D.S   | …              | House

5. enriched table
Coding:
Here we select only those records that have enough information to be of value. Although it is difficult to give detailed rules for this kind of operation, the situation occurs frequently in practice. In most tables collected from operational data, a lot of desirable data is missing and much of it is impossible to gather, so one has to make a deliberate decision either to overlook it or to delete the incomplete records. A general rule states that any deletion of data must be a conscious decision, taken after a thorough analysis of the possible consequences. In some cases, especially fraud detection, lack of information can itself be a valuable indication of interesting patterns.
In the present example we lack vital data about Mr. King, so we choose to exclude this record from the final sample. Of course, this decision may be questionable, because there may be a causal connection between the lack of information and certain purchasing behavior of Mr. King. Next we carry out a projection of the records. In this example we are not interested in the clients’ names, since we just want to identify certain types of clients, so the names are removed from the sample database.
Client no | Name    | Dob | Income  | Credit  | Car owner | House owner | Address | Date purchased | Magazine purchased
23003     | Johnson | …   | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Car
23003     | Johnson | …   | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Music
23003     | Johnson | …   | $18,500 | $17,800 | No        | No          | 1 D.S   | …              | Comic
23009     | King    | …   | $36,000 | $26,600 | Yes       | No          | 2 B.V   | null           | Comic
23003     | Johnson | …   | $17,800 | $17,800 | No        | No          | 1 D.S   | …              | House

6. table with column and row removed
The way in which we code the information will, to a great extent, determine the type of patterns we find. Coding is therefore a creative activity that has to be performed repeatedly in order to get the best results. Take, for example, the subscription date: it is much too detailed to be of any value as such, but there are various ways to recode such dates so that they yield valuable patterns. One solution is to transform purchase dates into month numbers, starting from 1990. In this way we can find patterns in the time series of our customers’ transactions. We could find dependencies similar to the following rule:
A customer with credit > 13,000 and aged between 22 and 31, who has subscribed to comics at time T, is very likely to subscribe to a car magazine five years later.
Or we might identify trends such as:
The number of house magazines sold to customers with credit between 12,000 and 31,000 living in region 4 is increasing.
Any number of coding steps may be applied, each directed at different potential patterns. These coding steps are applied when the available information is much too detailed. In our example we have applied the following coding steps:
1. Birth date to age classes of 10 years
2. Divide income by 1000
3. Divide credit by 1000
4. Convert car and house ownership from yes/no to 1/0
5. Convert address to region codes
6. Convert purchase date to month numbers starting from 1990
By applying the above coding steps, the resultant table is:
Client no | Age | Income | Credit | Car owner | House owner | Region | Month of purchase | Magazine purchased
23003     | 20  | 18.5   | 17.8   | 0         | 0           | 1      | 52                | Car
23003     | 20  | 18.5   | 17.8   | 0         | 0           | 1      | 42                | Music
23003     | 20  | 18.5   | 17.8   | 0         | 0           | 1      | 29                | Comic
23009     | 25  | 36.0   | 26.6   | 1         | 0           | 1      | null              | Comic
23003     | 20  | 18.5   | 17.8   | 0         | 0           | 1      | 48                | House

7. an intermediate coding stage
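The six coding steps could be scripted roughly as follows. This is only a sketch in Python with pandas: the birth years and purchase dates are illustrative (the real ones are not shown above), the column names are assumptions, and the address-to-region mapping is left out.

import pandas as pd

# Coding sketch for the enriched records (birth years and purchase dates are
# illustrative; the region mapping is omitted).
df = pd.DataFrame({
    "client_no":      [23003, 23003, 23003, 23009, 23003],
    "birth_year":     [1974, 1974, 1974, 1969, 1974],
    "income":         [18500, 18500, 18500, 36000, 17800],
    "credit":         [17800, 17800, 17800, 26600, 17800],
    "car_owner":      ["no", "no", "no", "yes", "no"],
    "house_owner":    ["no", "no", "no", "no", "no"],
    "purchase_year":  [1994, 1993, 1992, None, 1993],
    "purchase_month": [4, 6, 5, None, 12],
})

reference_year = 1994                                      # "today" for the age calculation
df["age"] = reference_year - df["birth_year"]              # step 1: age (bin with (age // 10) * 10 for 10-year classes)
df["income"] = df["income"] / 1000                         # step 2: divide income by 1000
df["credit"] = df["credit"] / 1000                         # step 3: divide credit by 1000
df["car_owner"] = (df["car_owner"] == "yes").astype(int)   # step 4: yes/no -> 1/0
df["house_owner"] = (df["house_owner"] == "yes").astype(int)
df["region"] = 1                                           # step 5: address -> region code (mapping omitted)
df["month_of_purchase"] = (df["purchase_year"] - 1990) * 12 + df["purchase_month"]   # step 6
print(df)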
However, this table is still not very helpful if one wants to find relationships between the different magazines, so we perform a final transformation and create just one record for each reader. Instead of one attribute ‘magazine’ with five possible values, we create five binary attributes, one for every magazine: if the value of the attribute is 1 the reader is a subscriber, otherwise it is 0. Such an operation is called flattening – an attribute with cardinality n is replaced by n binary attributes. This is a coding operation that occurs frequently in a KDD context.
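A rough flattening sketch in pandas, under the same assumptions about column names; the subscription rows are taken from the example.

import pandas as pd

# Flattening sketch: one row per subscription becomes one row per client,
# with a binary column for every magazine.
subs = pd.DataFrame({
    "client_no": [23003, 23003, 23003, 23003, 23009],
    "magazine":  ["car", "music", "comic", "house", "comic"],
})

flags = pd.crosstab(subs["client_no"], subs["magazine"]).clip(upper=1)
flags.columns = [name + "_magazine" for name in flags.columns]
print(flags.reset_index())
# client 23003 gets 1 for car, comic, house and music; client 23009 only for comic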
Now we have coded our data set in such a way that each record contains: client number, age, income, credit, information about car and house ownership, a region code, and five bits indicating to which magazines the customer has subscribed. This is a good basis from which to start the actual mining process.
Client no | Age | Income | Credit | Car owner | House owner | Region | Car mag | House mag | Sports mag | Music mag | Comic mag
23003     | 20  | 18.5   | 17.8   | 0         | 0           | 1      | 1       | 1         | 0          | 1         | 1
23009     | 25  | 36.0   | 26.6   | 1         | 0           | 1      | 0       | 0         | 0          | 0         | 1

8. the final table
Data mining:
Data mining is not so much a single technique as the idea that there is more knowledge hidden in the data than shows itself on the surface. From this point of view, data mining is rarely an ‘anything goes’ affair: any technique that helps extract more out of the data is useful, so data mining techniques form quite a heterogeneous group.
Data mining Techniques:
Although various techniques are used for different purposes, those of interest in the present context are:
# Query tools
# Statistical techniques
# Visualization
# Clustering
# On-line analytical processing (OLAP)
# Case-based learning (k-nearest neighbor)
# Decision trees
# Association rules
# Neural networks
Preliminary analysis of the data set using traditional query tools:
The first step in a data mining project should always be a rough analysis using traditional query tools. Just by applying simple SQL to a data set, you can obtain a wealth of information. With SQL we can only uncover shallow data, which is information that is easily accessible from the data set; yet although we cannot find hidden data this way, roughly 80% of the interesting information can be extracted from a database using SQL. The remaining 20% of hidden information requires more advanced techniques, and for large marketing-driven organizations this 20% can prove of vital importance.
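For instance, assuming the coded table has been loaded into a SQLite database under the hypothetical name ‘clients’ with the column names used earlier, a rough first pass might look like this.

import sqlite3

# Shallow analysis with plain SQL (database file, table and column names are assumptions).
con = sqlite3.connect("magazines.db")
cur = con.cursor()

# How many car-magazine readers also take the comic magazine?
cur.execute("""
    SELECT COUNT(*)
    FROM clients
    WHERE car_magazine = 1 AND comic_magazine = 1
""")
print(cur.fetchone()[0])

# Average age and income of house-magazine readers.
cur.execute("""
    SELECT AVG(age), AVG(income)
    FROM clients
    WHERE house_magazine = 1
""")
print(cur.fetchone())

con.close()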
Statistical techniques:
A good way to start is to extract some simple statistical information from the data set; averages are an important example in this respect.
Attribute        | Average
Age              | 46.9
Income           | 20.8
Credit           | 34.9
Car owner        | 0.59
House owner      | 0.59
Car magazine     | 0.329
House magazine   | 0.702
Sports magazine  | 0.447
Music magazine   | 0.146
Comic magazine   | 0.081

9. averages
A trivial result that is obtained by an extremely simple method is called a naïve prediction, and an algorithm that claims to learn anything must always do better than the naïve prediction.
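As a small illustration, the naïve prediction simply guesses the majority outcome for every client, so its accuracy is whichever is larger of p and 1 - p; the a priori probabilities below are the ones reported in the table that follows.

# Naive prediction sketch: always predict the majority outcome per magazine.
a_priori = {"car": 0.329, "house": 0.702, "sports": 0.447,
            "music": 0.146, "comic": 0.081}

for magazine, p in a_priori.items():
    naive_accuracy = max(p, 1 - p)   # accuracy of always guessing the majority class
    print(f"{magazine:7s}  naive prediction accuracy = {naive_accuracy:.1%}")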
Magazine | A priori probability that client buys magazine | Naïve prediction accuracy
Car      | 32.9%                                           | 67.1%
House    | 70.2%                                           | 70.2%
Sports   | 44.7%                                           | 55.3%
Music    | 14.6%                                           | 85.4%
Comic    | 8.1%                                            | 91.9%

10. naïve predictions
Magazine | Average age | Average income | Average credit | Car owner | House owner
Car      | 29.3        | 17.1           | 27.3           | 0.48      | 0.53
House    | 48.1        | 21.1           | 35.5           | 0.58      | 0.76
Sports   | 42.2        | 24.3           | 31.4           | 0.70      | 0.60
Music    | 24.6        | 12.8           | 24.6           | 0.30      | 0.45
Comic    | 21.4        | 25.5           | 26.3           | 0.62      | 0.60

11. results of applying naïve prediction
Visualization techniques:
Visualization techniques are a very useful method of discovering patterns in data sets, and may be used at the beginning of a data mining process.
An elementary technique that can be of great value is the so-called scatter diagram: information on two attributes is displayed in a Cartesian space. Scatter diagrams can be used to identify interesting subsets of the data set so that the rest of the data mining process can focus on them. There is a whole field of research dedicated to the search for interesting projections of data sets – this is called projection pursuit.
In our example, we have made a projection along two dimensions: income and age. We see that on average young people with a low income tend to read the music magazine.
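A sketch of such a scatter diagram with matplotlib, assuming the coded table has been exported to a hypothetical coded_clients.csv with the column names used earlier.

import matplotlib.pyplot as plt
import pandas as pd

# Scatter diagram: project the coded records onto the age and income dimensions
# and highlight the music-magazine readers.
df = pd.read_csv("coded_clients.csv")        # assumed export of the coded table

readers = df[df["music_magazine"] == 1]
others = df[df["music_magazine"] == 0]

plt.scatter(others["age"], others["income"], c="lightgrey", label="other clients")
plt.scatter(readers["age"], readers["income"], c="tab:blue", label="music readers")
plt.xlabel("age")
plt.ylabel("income (x 1000)")
plt.legend()
plt.show()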
Clustering (likelihood and distance):
There are other reasons to conceive records as points in a multi-dimensional data space. We can determine the distance between two records in this data space:
“records that are close to each other are very alike, and records that are very far removed from each other represent individuals that have little in common”.
Here the advantage of good coding comes to light: in order to compare values fairly, we must normalize the attributes. Age, for example, ranges from 1 to about 100 years, while income ranges from 0 to approximately 100,000 dollars a month. If we use this data without correction, income will of course be a much more distinctive attribute than age when records are treated as points in a space determined by their attributes and the distance between them is measured.
This is not what we want, so we divide income by 1000 in order to obtain a measure that is of the same order of magnitude as age, and we do the same for the credit attribute. If we scale the attributes to the same order of magnitude, we obtain a reliable distance measure between the different records. In our example, using the Euclidean distance measure, the distance between customer 1 and customer 2 is 15.
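A minimal distance sketch, using two example records with scaled income and credit; which attributes are included in the sum is a choice, so the figure printed here does not have to match the 15 quoted above.

import math

# Euclidean distance between two records whose income and credit have been
# divided by 1000 (attribute choice and values are illustrative).
def distance(a, b):
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

customer1 = {"age": 20, "income": 18.5, "credit": 17.8, "car_owner": 0, "house_owner": 0}
customer2 = {"age": 25, "income": 36.0, "credit": 26.6, "car_owner": 1, "house_owner": 0}
print(round(distance(customer1, customer2), 1))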
In this way records become points in a multi-dimensional data space. For data spaces with low dimensionality it is easy to visualize the data clouds and to find interesting clusters merely by visual inspection. Sometimes it is possible to identify a visual cluster of potential customers that are very likely to buy a certain product. In our sample data set age, income, and credit form an ideal three-dimensional space in which to do this kind of clustering analysis.
The idea of dimensionality can be expanded: a table with n independent attributes can be seen as an n-dimensional space. Managers generally ask questions that presuppose a multi-dimensional analysis: they want to know, for instance, what types of magazines are sold in a designated area, in which month, and to which age group. Information of this nature is called multi-dimensional, and such relationships cannot easily be analyzed when the table has the standard two-dimensional representation.

We need to explore the relationships between several dimensions, and standard relational databases are not very good at this. They identify records using keys, but there is a limit to the number of keys that can be defined effectively for a given table. There is, however, almost no end to the type of questions managers can formulate: one minute a manager might want sales data ordered by area, age and income; the next minute the same data ordered by credit and age, and preferably online, on large data sets. OLAP tools were developed to solve these problems. These tools store their data in a special multi-dimensional format, often in memory, and a manager can ask almost any question, although the data cannot be updated.

OLAP can be an important stage in a data mining process, but there is a fundamental difference between OLAP and data mining tools: OLAP tools do not learn, they create no new knowledge, and they cannot search for new solutions. There is thus a real difference between multi-dimensional knowledge and the type of knowledge one can extract from a database via data mining; in this sense data mining is more powerful than OLAP. Another advantage is that data mining algorithms do not need a special form of storage, since they can work directly on data stored in a relational database.
Case-based learning (k-nearest neighbor):
When we interpret records as points in a data space, we can define the concept of neighborhood:
“records that are close to each other live in each other’s neighborhood”.
Suppose we want to predict the behavior of a set of customers and we have a database with records describing these customers. The basic hypothesis needed to make such a prediction is that customers of the same type will show the same behavior. In terms of the metaphor of our multi-dimensional data space, a type is nothing more than a region in this space. In other words, records of the same type will be close to each other in the data space; they will live in each other’s neighborhood.
The basic philosophy of k-nearest neighbor is ‘do as your neighbors do’. To predict the behavior of a particular individual, we look at, say, the ten individuals closest to him in the database, calculate the average of their behavior, and use this average as the prediction for our individual. The letter k in k-nearest neighbor stands for the number of neighbors that are investigated.

Simple k-nearest neighbor is not really a learning technique but more a search method, because the data set itself is used as the reference. Its complexity is also a drawback: if we want to make a prediction for every element in a data set containing n records, we have to compare each record with every other record, which takes on the order of n² comparisons. Hence we use the simple k-nearest neighbor technique mainly on sub-samples or on data sets of limited size.
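One possible way to run such an analysis is with scikit-learn’s k-nearest neighbor classifier; the CSV file and column names below are assumptions carried over from the coding stage, and the library choice is just one option.

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# 'Do as your neighbours do': predict car-magazine subscription from the ten
# most similar clients in the coded table.
df = pd.read_csv("coded_clients.csv")         # assumed export of the coded table

features = ["age", "income", "credit", "car_owner", "house_owner", "region"]
X = df[features]
y = df["car_magazine"]                        # 1 if the client subscribes

model = KNeighborsClassifier(n_neighbors=10)  # k = 10 neighbours
model.fit(X, y)

new_client = pd.DataFrame([[25, 36.0, 26.6, 1, 0, 1]], columns=features)
print(model.predict(new_client))              # predicted subscription (0 or 1)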
Magazine | Accuracy of prediction
Car      | 89% correct
House    | 60% correct
Sports   | 74% correct
Music    | 93% correct
Comic    | 92% correct

12. results of the k-nearest neighbor process
The above figure shows the results of applying a k-nearest neighbor process to the magazine database. We can see that for the car, sports, and music magazines, k-nearest neighbor does considerably better than the naïve prediction. The prediction for the comic magazine is the same as the naïve prediction, so in this case k-nearest neighbor does not help very much. It is interesting to note that for the house magazine k-nearest neighbor does less well than the naïve prediction. It may well be that the readers of this magazine are randomly distributed over the data set, so that there are no patterns to be discovered; we would then have to investigate the readers of house magazines by other methods.
Decision trees:
An attempt to predict whether a certain customer will show a certain type of behavior in fact implies an assumption that the customer belongs to a certain group of customers and will therefore show that group’s behavior.
Let us suppose our database contains attributes such as age, income and credit. If we want to predict a certain kind of customer behavior, we might ask which of these attributes gives us the most information. If we want to predict who will buy a car magazine, what would help us more – the age or the income of the person? It could be that age is more important, which implies that on the basis of the age of an individual alone we are able to predict whether or not he or she will buy a car magazine.
If this is the case, the next thing to do is split on this attribute; that is, we must investigate whether there is a certain age threshold that separates car-magazine buyers from non-buyers. In this way we start with the first attribute, find a threshold, go on to the next attribute, find a threshold there, and repeat this process until we have correctly classified our customers, thus creating a decision tree for our database.
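A short sketch with scikit-learn’s decision tree, under the same assumptions about the coded table; the printed tree shows which attribute is split first and at which thresholds.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Let the algorithm pick the most informative attributes and the thresholds
# that separate car-magazine buyers from the rest.
df = pd.read_csv("coded_clients.csv")         # assumed export of the coded table

X = df[["age", "income", "credit"]]
y = df["car_magazine"]

tree = DecisionTreeClassifier(max_depth=3)    # keep the tree small and readable
tree.fit(X, y)
print(export_text(tree, feature_names=["age", "income", "credit"]))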
Association rules:
Marketing managers are fond of rules like: ‘90% of women with red sports cars and small dogs wear Chanel No. 5’. These kinds of descriptions give them clear customer profiles on which to target their marketing actions. One might wonder whether it is possible to find these sorts of rules with data mining tools – the answer is ‘yes’, and in data mining this type of relationship is called an ‘association rule’.
An association rule tells us about the association between two or more items. For example: In 80% of the cases when people buy bread, they also buy milk. This tells us of the association between bread and milk. We represent it as -
bread => milk | 80%
This should be read as - "Bread means or implies milk, 80% of the time." Here 80% is the "confidence factor" of the rule.
Association rules can be between more than 2 items. For example -
bread, milk => jam | 60%
bread => milk, jam | 40%
Given any rule, we can easily find its confidence. For example, for the rule
bread, milk => jam
We count the number, say n1, of records that contain bread and milk. Of these, how many contain jam as well? Let this be n2. Then the required confidence is n2/n1.
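The confidence calculation is easy to sketch over a list of baskets; the five baskets below are made up purely for illustration.

# Confidence of the rule "bread, milk => jam" over a small, made-up set of baskets.
baskets = [
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "milk", "jam"},
]

n1 = sum(1 for b in baskets if {"bread", "milk"} <= b)         # records with bread and milk
n2 = sum(1 for b in baskets if {"bread", "milk", "jam"} <= b)  # ... that also contain jam
print(f"confidence(bread, milk => jam) = {n2 / n1:.0%}")       # 2 out of 3, i.e. 67%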
This means that the user has to guess which rule is interesting and ask for its confidence. But our goal was to "automatically" find all interesting rules. This is going to be difficult because the database is bound to be very large. We might have to go through the entire database many times to find all interesting rules.
Neural networks:
It is interesting to see that many machine learning techniques are derived from paradigms related to totally different areas of research. Neural networks are modeled on the human brain, which consists of a very large number of neurons (about 10^11) connected to each other via an enormous number of so-called synapses; a single neuron is connected to other neurons by several thousand of them. Although individual neurons can be described as simple building blocks, the brain as a whole can handle very complex tasks despite this relative simplicity. This analogy offers an interesting model for the creation of more complex learning machines and has led to so-called artificial neural networks. Such networks can be built using special hardware, but most are simply software programs that run on ordinary computers.
Typically a neural network consists of a set of nodes: input nodes receive the input signals, output nodes give the output signals and a potentially unlimited number of intermediate layers contain the intermediate nodes.
There are two stages involved in using a neural network:
1. the encoding stage, in which the network is trained to perform a certain task, and
2. the decoding stage, in which the network is used to classify examples, make predictions, or execute whatever learning task is involved.
The input nodes are fully interconnected with the hidden nodes, and the hidden nodes are fully interconnected with the output nodes. In an untrained network the branches between the nodes have equal weights. During the training stage the network receives examples of input-output pairs corresponding to records in the database, and adapts the weights of the different branches until all the inputs match the corresponding outputs.
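As one possible sketch, scikit-learn’s multi-layer perceptron can play the role of such a network: fitting it is the encoding stage, using it for predictions is the decoding stage. The CSV file and column names are the same assumptions as before.

import pandas as pd
from sklearn.neural_network import MLPClassifier

# Input nodes take the coded attributes, one small hidden layer of intermediate
# nodes, and the output predicts house-magazine subscription.
df = pd.read_csv("coded_clients.csv")                   # assumed export of the coded table

X = df[["age", "income", "credit", "car_owner", "house_owner", "region"]]
y = df["house_magazine"]

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)
net.fit(X, y)                                           # encoding stage: weights are adapted
print(net.predict(X[:5]))                               # decoding stage: classify examples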
Reporting:
Reporting the result of data mining can take many forms. In general one can use any report writer or a graphical tool to make the results of the process accessible.
Advantages:
· Extracts relevant, accurate and useful information from large databases.
· Plays a vital role in machine learning and knowledge discovery.
· Supports better decision making.
· Enables trend analysis.
Disadvantages:
· Results are only predictions, not certainties.
· High implementation costs.
· The algorithm that best matches the problem must be chosen carefully.
Conclusion:
Thus data mining is emerging as a key technology for enterprises that wish to improve the quality of their decision making and their competitive advantage by exploiting operational and other available data. The number of areas in which there is a serious need for data mining, and the number of cases in which data mining techniques operate usefully and successfully, is growing all the time. We believe that the future will see the development of many new applications based on data warehouses and using data mining techniques. Data mining allows for the creation of a self-learning organization, and it is this that convinces us that we are at the beginning of a very promising era.