Data Mining within the Predictive Analytics Process

Strictly speaking, data mining means data discovery: finding previously unknown patterns in data. In common usage, however, the term is stretched to cover data collection, extraction, warehousing, analysis, statistics, artificial intelligence, machine learning, and business intelligence. Statistics supplies the tools for data analysis, while machine learning deals with the different learning methodologies. Before any data can be mined, it needs to be cleansed, which removes errors and ensures consistency. Data mining methods include generalisation, characterisation, classification, clustering, association, evolution, pattern matching, data visualisation, and meta-rule-guided mining.
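As a rough illustration of the cleansing step, the sketch below drops incomplete records, normalises formats, and removes duplicates using only the Python standard library. The record layout (a "name" and an "age" field) and the function name cleanse are assumptions made for this example and are not tied to any particular tool.

```python
def cleanse(records):
    """Return records with errors removed and formats made consistent.

    Assumes each record is a dict with hypothetical "name" and "age" fields.
    """
    cleaned = []
    seen = set()
    for rec in records:
        # Normalise whitespace and capitalisation so "ALICE" and " alice " match.
        name = (rec.get("name") or "").strip().title()
        raw_age = str(rec.get("age", "")).strip()

        # Treat a missing name or a non-numeric age as an error and skip the row.
        if not name or not raw_age.isdigit():
            continue

        row = (name, int(raw_age))
        if row in seen:
            continue  # duplicate once the formats are consistent
        seen.add(row)
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


raw = [
    {"name": " alice ", "age": "34"},
    {"name": "ALICE", "age": 34},      # same person, different formatting
    {"name": "bob", "age": "n/a"},     # unusable age
    {"name": "", "age": "51"},         # missing name
    {"name": "carol", "age": "29"},
]
clean_rows = cleanse(raw)
print(clean_rows)
```

Real cleansing rules depend on the data set. Here a bad row is simply discarded, whereas in practice it might be repaired or flagged for review instead.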
The SAS Institute's overall plan for data mining is known as SEMMA. It has five steps: sample, explore, modify, model, and assess.
Step 1: Sample
Extract a portion of a large data set big enough to contain the significant information yet small enough to manipulate quickly.
Step 2: Explore
Search speculatively for unanticipated trends and anomalies so as to gain understanding and ideas.
Step 3: Modify
Create, select, and transform the variables to focus the model construction process.
Step 4: Model
Search automatically for a variable combination that reliably predicts a desired outcome.
Step 5: Assess
Evaluate the usefulness and reliability of findings from the data mining process.
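The five steps can be walked through on a toy problem. The sketch below is only an illustration of the sequence, not SAS code: the synthetic data set, the choice of a least-squares line as the model, and the 140/60 split between training and holdout rows are all assumptions made for this example.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical "large" data set: y is roughly 2x + 1 plus random noise.
population = []
for _ in range(10_000):
    x = rng.uniform(0, 100)
    population.append((x, 2 * x + 1 + rng.gauss(0, 1)))

# Step 1: Sample - take a portion small enough to manipulate quickly.
sample = rng.sample(population, 200)

# Step 2: Explore - look at simple summaries to get a feel for the data.
xs = [x for x, _ in sample]
summary = {"mean_x": statistics.mean(xs), "stdev_x": statistics.stdev(xs)}

# Step 3: Modify - transform a variable (here: centre x on its mean).
modified = [(x - summary["mean_x"], y) for x, y in sample]

# Step 4: Model - fit a least-squares line on the training portion.
train, holdout = modified[:140], modified[140:]


def fit_line(points):
    """Return (slope, intercept) of the least-squares line through points."""
    mx = statistics.mean(p[0] for p in points)
    my = statistics.mean(p[1] for p in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(train)

# Step 5: Assess - measure the error on rows the model has not seen.
mae = statistics.mean(abs(y - (slope * x + intercept)) for x, y in holdout)

print("summary:", summary)
print("slope:", slope, "mean absolute error on holdout:", mae)
```

Keeping a holdout portion aside for the assess step is one common design choice. It gives an estimate of reliability that does not reuse the rows the model was fitted on.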
Data mining tasks involve the automatic or semi-automatic analysis of large volumes of data to extract previously unknown, interesting patterns. Examples are groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves database techniques such as spatial indices. The patterns identified provide a summary of the input data and can be used in further analysis or in machine learning and predictive analytics. Applying data mining methods to samples of data that are too small to support reliable statistical inference is known as data dredging, data fishing, or data snooping. Mining techniques are employed in different kinds of databases, including relational, transaction, object-oriented, spatial, and active databases.
Data mining involves six common classes of tasks:
1. Anomaly detection - Also called outlier or deviation detection. This is the identification of unusual data records, or of data errors, that require further investigation.
2. Association rule learning - Also called dependency modelling. This searches for relationships between variables and is also known as market basket analysis.
3. Clustering - The task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.
4. Classification - The task of generalising known structure to apply to new data.
5. Regression - The task of finding a function that models the data with the least error.
6. Summarisation - The task of providing a more compact representation of the data set, including visualisation and report generation.
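Two of the task classes above are small enough to sketch directly. The example below flags outliers with a simple z-score rule and computes support and confidence for one association rule over a handful of shopping baskets. The function names, the threshold of 2.0, and the sample data are assumptions chosen for illustration; production systems typically use more robust methods.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Anomaly detection: return values more than `threshold` population
    standard deviations away from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


def count_containing(transactions, items):
    """Count the transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def rule_metrics(transactions, antecedent, consequent):
    """Association rule learning: support and confidence of the rule
    antecedent -> consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    ante = count_containing(transactions, antecedent)
    support = both / len(transactions)          # share of baskets with both
    confidence = both / ante if ante else 0.0   # share of antecedent baskets
    return support, confidence


# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = zscore_outliers(readings)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
support, confidence = rule_metrics(baskets, {"bread"}, {"butter"})

print("outliers:", outliers)
print("bread -> butter: support", support, "confidence", confidence)
```

Here three of the five baskets contain both bread and butter, and three of the four baskets with bread also contain butter, so the rule has a support of 3/5 and a confidence of 3/4.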
