
Discretization in Data Mining: Top-Down Mapping, Bottom-Up Mapping, and Types of Discretization Methods

Discretization in data mining

Data discretization is the process of converting a large number of continuous data values into a smaller set of discrete values, so that the data becomes easier to evaluate and manage. In other words, it converts the values of a continuous attribute into a finite set of intervals with minimal loss of information. There are two forms of data discretization: supervised and unsupervised. Supervised discretization makes use of class (label) information, while unsupervised discretization does not; it instead depends on how the operation proceeds, typically following either a top-down splitting strategy or a bottom-up merging strategy.

We can understand this concept with the help of an example.

Suppose we have an attribute, Age, with a set of given values. Discretization would group these values into a small number of intervals rather than treating each age separately.

Another example is web analytics, where we gather statistics about website visitors. For instance, all visitors who access the site from an Indian IP address are grouped together at the country level as India.
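To make the idea concrete, here is a minimal sketch in Python that discretizes a hypothetical list of age values into labeled intervals. The specific ages and cut-points are assumed for illustration, since the original post's table of values is not shown:

```python
# Hypothetical age values (assumed for illustration).
ages = [5, 12, 18, 25, 33, 41, 56, 67, 74, 89]

def discretize_age(age):
    """Map a continuous age to a labeled interval (assumed cut-points)."""
    if age < 18:
        return "child"
    elif age < 60:
        return "adult"
    else:
        return "senior"

labels = [discretize_age(a) for a in ages]
print(labels)
# ['child', 'child', 'adult', 'adult', 'adult', 'adult', 'adult',
#  'senior', 'senior', 'senior']
```

After this step, ten distinct numeric values have been reduced to just three categories, which is exactly the simplification discretization aims for.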

Some well-known techniques of data discretization

1. Histogram analysis

2. Binning

3. Cluster analysis

4. Data discretization using decision tree analysis

5. Data discretization using correlation analysis

Data discretization and concept hierarchy generation

The term hierarchy refers to an organizational structure or mapping in which items are ranked according to their level of importance. A concept hierarchy is a sequence of mappings from a set of low-level (specific) concepts to higher-level (more general) concepts. There are many hierarchical systems in computer science; for example, a document stored in a folder in Windows, at a specific place in the directory tree, is a familiar instance of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.

Let's understand this concept hierarchy for the dimension location with the help of an example.

A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India, and India can be mapped to Asia.
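This location hierarchy can be sketched as a pair of lookup tables, one per level. This is a minimal illustration; the city names beyond those in the text are assumed:

```python
# Two levels of a concept hierarchy for the dimension "location".
city_to_country = {"New Delhi": "India", "Tokyo": "Japan"}
country_to_continent = {"India": "Asia", "Japan": "Asia"}

def generalize(city):
    """Walk the hierarchy from a low-level concept (city) upward."""
    country = city_to_country[city]
    continent = country_to_continent[country]
    return city, country, continent

print(generalize("New Delhi"))  # ('New Delhi', 'India', 'Asia')
```

Climbing the hierarchy (city to country to continent) corresponds to bottom-up generalization; descending it is top-down specialization.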

Top-down mapping

Top-down mapping starts at the top with general information and works down to the specialized information at the bottom.

Bottom-up mapping

Bottom-up mapping starts at the bottom with specialized information and works up to the generalized information at the top.


Explanation:

Discretization in Data Mining

Discretization is an important preprocessing technique in data mining that involves converting continuous attributes or numerical data into discrete intervals or categories. This transformation simplifies data, making it easier to analyze, interpret, and apply various data mining algorithms, especially those that work better with categorical data, such as decision trees, rule-based classifiers, and association rule mining.

Purpose of Discretization

  1. Simplifies Data Analysis:
    Continuous data often contains a wide range of values. Discretization reduces this complexity by grouping values into intervals, making patterns easier to detect.

  2. Improves Algorithm Efficiency:
    Many algorithms perform better with categorical data since it reduces computational complexity and memory usage.

  3. Enhances Interpretability:
    Discrete intervals, such as “low,” “medium,” and “high,” make results more understandable for decision-makers.

  4. Handles Noisy Data:
    Grouping continuous values into intervals can reduce the effect of noise and minor fluctuations in the data.

Types of Discretization Methods

  1. Equal-Width Binning (Interval Binning):
    Divides the range of a continuous attribute into equal-sized intervals. For example, ages 0–100 can be divided into intervals of 0–25, 26–50, 51–75, and 76–100.
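A minimal equal-width binner can be written in a few lines of plain Python. The sample values are assumed for illustration:

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals over its range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:                       # all values identical
        return [0] * len(values)
    # Clamp the maximum value into the last bin instead of bin k.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [3, 20, 27, 45, 60, 88]
print(equal_width_bins(ages, 4))  # [0, 0, 1, 1, 2, 3]
```

Note that equal-width bins are sensitive to outliers: one extreme value stretches the range and can leave most data crowded into a single bin.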

  2. Equal-Frequency Binning (Quantile Binning):
    Divides data so that each interval contains approximately the same number of data points. This method adapts to the distribution of the data.
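Equal-frequency binning can be sketched by ranking the values and dividing the ranks evenly among the k bins. The sample data is assumed:

```python
def equal_freq_bins(values, k):
    """Assign bins so each holds roughly the same number of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n          # rank position decides the bin
    return bins

data = [1, 7, 8, 9, 12, 15, 20, 50]
print(equal_freq_bins(data, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Unlike equal-width binning, the outlier 50 here simply lands in the last bin without distorting the others, which is why this method adapts well to skewed distributions.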

  3. Cluster-Based Discretization:
    Uses clustering algorithms like K-Means to group similar continuous values into clusters, which are then treated as discrete intervals.
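As a sketch of the idea, here is a tiny hand-rolled one-dimensional K-Means (not a production implementation; assumes k >= 2 and uses a simple range-based initialization):

```python
def kmeans_1d(values, k, iters=50):
    """Cluster 1-D values with k-means; returns a cluster index per value."""
    lo, hi = min(values), max(values)
    # Spread initial centroids evenly across the range (simple heuristic).
    centroids = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[j].append(v)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]

data = [1, 2, 3, 20, 21, 22, 40, 41]
print(kmeans_1d(data, 3))  # [0, 0, 0, 1, 1, 1, 2, 2]
```

Each resulting cluster index can then serve as a discrete category for the continuous attribute; in practice a library such as scikit-learn's KMeans would be used instead.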

  4. Entropy-Based or Information-Theoretic Discretization:
    Uses the information gain criterion to create intervals that maximize the distinction between different target classes, improving classification accuracy.
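The core of this method, choosing one binary cut-point by information gain, can be sketched as follows (the data and class labels are assumed; a full method would apply this recursively with a stopping criterion such as MDL):

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut-point that maximizes information gain for one split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    n = len(labels)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint candidate
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) \
                    - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut

vals = [1, 2, 3, 10, 11, 12]
labs = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(vals, labs))  # 6.5 -- cleanly separates the two classes
```

Because the cut is chosen to separate the target classes, the resulting intervals are directly useful to classifiers, which is the advantage over class-blind methods such as equal-width binning.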

  5. Manual Discretization:
    Experts define intervals based on domain knowledge, which can be effective when there is prior understanding of data behavior.

Applications of Discretization

  • Classification: Helps decision trees and rule-based classifiers handle continuous features efficiently.

  • Association Rule Mining: Converts numerical data into categorical form for frequent pattern discovery.

  • Medical Diagnosis: Converts patient test values into categories like “normal” or “high-risk” for easy interpretation.

  • Market Analysis: Groups customer spending or age into meaningful intervals for segmentation.


