Knowledge Discovery and Data Mining
Knowledge discovery in databases (KDD) and
data mining attempt to extract useful information from
databases, heterogeneous data sources and
very large databases (VLDB).
To be distinguished from:
- information retrieval: search for known items
- data analysis: usually just traditional statistics
- most traditional statistics: prove or disprove hypotheses;
calculate a fixed set of statistical indicators (such as mean, variance);
correlate a few variables; etc.
- machine learning: automated computer learning process,
usually based on a fixed, carefully constructed database
Tasks:
- identify new relationships and global patterns in the data
- identify hidden knowledge
- identify trends that can be used for prediction
- identify influential variables (main factors)
- classification or clustering of items into meaningful groups
- identify association rules, i.e. attributes or factors that are
associated with each other
Association Rules
Example: "People who buy beer also buy newspapers."
This is not a logical "inference" because it does not hold for every
person. It describes only a statistical tendency across people in general.
Rule: beer -> newspapers
Rule: X -> Y
support: number of people who buy X and Y / all people
confidence: number of people who buy X and Y /
number of people who buy X
For example: if out of 1000 people there are 2 people who buy beer
and these 2 also buy newspapers, then the support for
beer->newspapers is low (0.2%) but the confidence is high (100%).
On the other hand, if out of 1000 people 100 buy beer and newspapers
and another 500 buy beer but no newspapers, then the
support for beer->newspapers is higher (10%) but the confidence is
lower (100/600, about 17%).
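The two measures can be recomputed from raw purchase data. The following is a minimal sketch in Python, not part of the original notes: the function name rule_measures, the representation of each person's purchases as a set of item names, and the filler item "bread" are all assumptions made for illustration.

```python
def rule_measures(baskets, x, y):
    """Return (support, confidence) for the rule x -> y.

    baskets: list of sets, one set of purchased items per person
             (an assumed representation, chosen for illustration).
    """
    n_total = len(baskets)
    n_x = sum(1 for b in baskets if x in b)                  # people who buy X
    n_both = sum(1 for b in baskets if x in b and y in b)    # people who buy X and Y

    support = n_both / n_total
    # Confidence is undefined when nobody buys X; None marks that case here.
    confidence = n_both / n_x if n_x else None
    return support, confidence


# First scenario: 2 of 1000 people buy beer, and both also buy newspapers.
few_buyers = [{"beer", "newspapers"}] * 2 + [{"bread"}] * 998

# Second scenario: 100 people buy beer and newspapers,
# another 500 buy beer but no newspapers, the remaining 400 buy neither.
many_buyers = (
    [{"beer", "newspapers"}] * 100
    + [{"beer"}] * 500
    + [{"bread"}] * 400
)

for label, baskets in [("few buyers", few_buyers), ("many buyers", many_buyers)]:
    s, c = rule_measures(baskets, "beer", "newspapers")
    print(f"{label}: support = {s:.1%}, confidence = {c:.1%}")
```

Note that the denominator of the confidence is everyone who buys X, including those who also buy Y, so in the second scenario it is 600 rather than 500.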
Applications
- human genome project and similar applications in bioinformatics
- economy, business intelligence
- pharmaceutical industry: search for new drugs
- global systems: e.g. environmental issues, weather data
- market analyses
Major Challenge
It is fairly easy to discover new information. The question is
which discovered rules, trends and factors are
novel, interesting, plausible and
understandable.