data mining
Data mining is a recently coined name, and the trend it describes is just as recent. A few years ago, information about clients' preferences, genes or national economies was hard to get. But a few years of computers everywhere have changed that. A biologist today can know when each of the 10000 genes of a Drosophila is being expressed. A businessman can check the price of a commodity all around the world, in real time. Policy makers have access to data on many national economies, at a great level of detail.
Suddenly we realize that, when answering our questions, we have too much information to digest. A biologist focused on conservation might care about only a few of the 10000 genes reported. A temporary employment agency (uitzendbureau) in the Netherlands might not care about the detailed life histories of its employees. Nor might an environmental lobbyist care about the price of Shell's stock. Or do they need to know it after all?
Data mining is a set of techniques for finding useful information within large amounts of data. In recent years statisticians have gained access to information that lies in different places, has been collected under different protocols and contains the answers that a client seeks. The tools that have been developed are similar in genomics, exploratory economics and sociology. A standard data mining study describes, discriminates, models and forecasts a data set.
Describe
  Goal: analyze single variables, find relations between variables, find outliers.
  Tools: probability plots, clusters, correlations and other exploratory techniques.
Discriminate
  Goal: assign probabilities to the relations and outliers defined earlier.
  Tools: probability calculation, multivariate discriminant and canonical analysis.
Model
  Goal: build a predictive expression with the relevant variables.
  Tools: generalized linear and nonlinear models, programming languages.
Forecast
  Goal: use the previous model to predict patterns in the data.
  Tools: response surfaces, time series analyses, diverse regressions.
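
As a rough illustration of the four stages (describe, discriminate, model, forecast), the following Python sketch walks through them on a small made-up data set. Everything in it is an assumption chosen for illustration: the data, the helper names (pearson, fit_line, tail_probability), the use of a straight-line model, the normal approximation for residuals and the 0.05 cut-off. A real study would pick techniques suited to its own data.

```python
import math
import statistics

# Made-up data (assumption): y is roughly 2*x, with one planted outlier at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 25.0, 12.1, 13.9, 16.2, 17.8, 20.0]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def fit_line(a, b):
    """Ordinary least-squares line: b is approximated by slope * a + intercept."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sxy = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sxx = sum((u - ma) ** 2 for u in a)
    slope = sxy / sxx
    return slope, mb - slope * ma


def tail_probability(z):
    """Two-sided tail probability P(|Z| > |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


# 1. Describe: summarize a single variable and measure a relation.
summary = {
    "mean": statistics.fmean(y),
    "median": statistics.median(y),
    "stdev": statistics.stdev(y),
}
r = pearson(x, y)

# 2. Discriminate: attach a probability to each residual of a first fit
#    and flag points that are unlikely under a normal approximation.
slope0, intercept0 = fit_line(x, y)
residuals = [v - (slope0 * u + intercept0) for u, v in zip(x, y)]
scale = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
p_values = [tail_probability(e / scale) for e in residuals]
flagged = [i for i, p in enumerate(p_values) if p < 0.05]  # 0.05 is an arbitrary cut-off

# 3. Model: refit the predictive expression without the flagged points.
kept = [i for i in range(len(x)) if i not in flagged]
slope, intercept = fit_line([x[i] for i in kept], [y[i] for i in kept])

# 4. Forecast: apply the model to values of x not seen so far.
new_x = [11.0, 12.0]
forecasts = [slope * u + intercept for u in new_x]

print("summary:", summary)
print("correlation:", round(r, 3))
print("flagged x values:", [x[i] for i in flagged])
print("model: y = %.3f * x + %.3f" % (slope, intercept))
print("forecasts:", [round(f, 2) for f in forecasts])
```

Dropping flagged points before refitting is only one possible choice. In practice an outlier may be exactly the information the client is looking for, so it should be examined before it is removed.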