Knowledge Discovery in Data and the Steps Involved in KDD

January 18 2025

KDD, which stands for Knowledge Discovery in Data (more commonly expanded as Knowledge Discovery in Databases), is the process of identifying useful, previously unknown, and potentially actionable information in large data sets. It is a multidisciplinary field that combines techniques from statistics, machine learning, and database systems, and it covers a wide range of activities: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation.

In a data warehouse, KDD is applied to the large and complex data sets that the warehouse stores. Warehouse data is usually organized in a multidimensional model, with measures described by dimensions such as time, product, and region, which makes it easier to aggregate, analyze, and extract insights. The goal of KDD in a data warehouse is to identify patterns and trends that can be used to improve decision-making and support business objectives.
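
To make the multidimensional idea concrete, here is a minimal sketch of a roll-up: summing one measure along chosen dimensions. The `sales` rows, the dimension names, and the `roll_up` helper are all hypothetical and exist only for illustration; a real warehouse would run this kind of aggregation in its query engine rather than in application code.

```python
from collections import defaultdict

# Hypothetical fact table: one row per sale, with three dimensions
# (region, product, quarter) and one measure (revenue).
sales = [
    {"region": "North", "product": "Widget", "quarter": "Q1", "revenue": 120.0},
    {"region": "North", "product": "Gadget", "quarter": "Q1", "revenue": 80.0},
    {"region": "South", "product": "Widget", "quarter": "Q1", "revenue": 200.0},
    {"region": "North", "product": "Widget", "quarter": "Q2", "revenue": 150.0},
    {"region": "South", "product": "Gadget", "quarter": "Q2", "revenue": 50.0},
]

def roll_up(rows, dimensions, measure):
    """Sum `measure` over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# Coarse view: revenue per region.
by_region = roll_up(sales, ["region"], "revenue")

# Finer view: revenue per region and quarter.
by_region_quarter = roll_up(sales, ["region", "quarter"], "revenue")
```

Choosing fewer dimensions gives a coarser summary and choosing more gives a finer one, which is what makes trends along a dimension such as time easy to inspect.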

Data mining is the step of KDD in which patterns and relationships in large data sets are actually discovered. The goal of data mining is to extract useful information from data and transform it into an understandable structure for further use. Typical tasks are classification, clustering, and prediction, and they are applied in a wide range of fields such as marketing, finance, and healthcare.
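
As one illustration of a mining task, the sketch below clusters numbers with a basic one-dimensional k-means loop. The `kmeans_1d` function, the `spend` values, and the starting centers are assumptions made up for this example; k-means is only one of many clustering techniques and is not prescribed by KDD itself.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around `centers` with a basic k-means loop."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88, 13, 92]
centers, clusters = kmeans_1d(spend, centers=[0, 100])
```

Here the values separate into a low-spend group and a high-spend group, the kind of structure a marketing team might then examine further.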

In short, KDD is the overall process of extracting knowledge from large data sets. In data warehousing it is applied to the large and complex data sets stored in the warehouse, and data mining is the step within it that discovers patterns and relationships.

Steps Involved in KDD

The KDD process typically consists of the following steps:

  1. Data Cleaning: This step involves cleaning and preparing the data for KDD. This includes tasks such as removing errors, inconsistencies, and outliers from the data. Data cleaning can also include handling missing values, correcting inconsistent data formats, and removing duplicate records.
     
  2. Data Integration: This step involves combining data from different sources into a consistent format. This can include integrating data from different databases, files, or external sources, and resolving any conflicts or inconsistencies between the data sources.
     
  3. Data Selection: This step involves selecting the relevant data from the large data sets that are available. This includes identifying which data sources to draw on, which data types are involved, and how much data to use. The goal is to select the data that is most likely to contain the information needed for the specific KDD task.
     
  4. Data Transformation: This step involves converting the data into a format that is suitable for KDD. This can include tasks such as data normalization, feature extraction, and dimensionality reduction. Data transformation can also include creating new variables or attributes, and deriving new information from the data.

  5. Data Mining: This step involves applying data mining techniques to the prepared data to extract patterns. Typical tasks are classification (assigning records to known categories), clustering (grouping similar records), and prediction (estimating unknown or future values).
     
  6. Pattern Evaluation: This step involves evaluating the patterns and relationships discovered in the data mining step. This includes assessing how interesting and useful each pattern is, and determining which patterns should be kept as new knowledge.
     
  7. Knowledge Representation: This step involves representing the knowledge discovered in the previous steps in a way that is easily understandable and useful. This includes creating visualizations, reports, and other forms of representation that can be used to communicate the knowledge to others.
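
The seven steps above can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the two purchase sources (`store_a`, `store_b`), their layouts, the function names, and the choice of counting co-purchased item pairs as the mining technique. It is a rough sketch of how the steps hand data to one another, not a definitive implementation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw purchase records from two sources with different layouts.
store_a = [
    {"order": 1, "items": "bread, milk"},
    {"order": 2, "items": "Bread,butter , milk"},
    {"order": 2, "items": "Bread,butter , milk"},  # duplicate row
    {"order": 3, "items": ""},                     # missing value
]
store_b = [
    (101, ["milk", "bread", "eggs"]),
    (102, ["butter", "bread"]),
    (103, ["gift card"]),
]

def clean(rows):
    """Step 1: drop duplicate rows and rows with a missing item list."""
    seen, kept = set(), []
    for row in rows:
        key = (row["order"], row["items"])
        if row["items"].strip() and key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def integrate(a_rows, b_rows):
    """Step 2: bring both sources into one format, a list of item lists."""
    baskets = [[i.strip().lower() for i in r["items"].split(",")] for r in a_rows]
    baskets += [[i.strip().lower() for i in items] for _, items in b_rows]
    return baskets

def select(baskets, relevant):
    """Step 3: keep only task-relevant items and drop baskets left empty."""
    picked = [[i for i in b if i in relevant] for b in baskets]
    return [b for b in picked if b]

def transform(baskets):
    """Step 4: encode each basket as a sorted tuple of unique items."""
    return [tuple(sorted(set(b))) for b in baskets]

def mine(baskets):
    """Step 5: count how often each pair of items appears together."""
    pairs = Counter()
    for b in baskets:
        pairs.update(combinations(b, 2))
    return pairs

def evaluate(pairs, n_baskets, min_support):
    """Step 6: keep pairs whose support (share of baskets) meets the threshold."""
    return {p: c / n_baskets for p, c in pairs.items() if c / n_baskets >= min_support}

def represent(patterns):
    """Step 7: render the surviving patterns as readable report lines."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} + {b}: bought together in {s:.0%} of baskets" for (a, b), s in ranked]

groceries = {"bread", "milk", "butter", "eggs"}
baskets = transform(select(integrate(clean(store_a), store_b), groceries))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
report = represent(patterns)
```

Each function takes the previous step's output as its input, so a change in an early step, such as a stricter cleaning rule, flows through to the final report.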

The steps in the KDD process are iterative: each step may be revisited and refined as needed, for example when pattern evaluation shows that the data needs further cleaning or a different transformation. The specific steps and techniques used also vary depending on the data and the KDD task at hand.
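
One way to picture this iteration is a loop that repeats the mining and evaluation steps with a looser threshold until enough patterns survive. This is only a sketch: the `frequent_pairs` and `refine` helpers, the list of thresholds, and the `wanted` count are hypothetical choices, and in practice a revisit might change the cleaning or transformation step instead.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Mine item pairs and keep those that meet the support threshold."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(set(basket)), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def refine(baskets, thresholds=(0.9, 0.7, 0.5, 0.3), wanted=2):
    """Re-run mining with looser thresholds until `wanted` patterns survive."""
    history = []
    patterns = {}
    for support in thresholds:
        patterns = frequent_pairs(baskets, support)
        # Record each attempt so the refinement path can be reviewed later.
        history.append((support, len(patterns)))
        if len(patterns) >= wanted:
            break
    return patterns, history

# Hypothetical baskets, already cleaned and transformed.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "eggs", "milk"],
    ["bread", "butter"],
]
patterns, history = refine(baskets)
```

The `history` list records how many patterns each threshold produced, which makes the refinement itself something an analyst can inspect.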