Data Preprocessing for Machine Learning



What is Data Pre-Processing?

Data Pre-Processing is a technique used by Data Scientists to clean raw data so that it becomes fit for a machine to process and to feed into the models best suited to the problem. Here, the raw data is transformed into clean data using data manipulation techniques, so it becomes easier for the machine to understand the data in question.

Why Data Pre-Processing?

Think of it like this: for you to qualify as a Data Scientist by industry standards, you have to be trained not just in the theory of Data Science but also in its real-world applications. Raw data is no different. Before it can be used for client data or business problems that cannot be taken lightly, it has to be cleaned and prepared, because a model trained on messy data will produce unreliable results.

Tools for Data Pre-Processing

The tools for data pre-processing depend on the type of data at our disposal. If you have got the hang of the CRISP-DM model, you will know the back and forth it recommends: a Data Scientist should constantly check whether the data is ready to be fed into the machine learning algorithms or not. To do so, it is essential to clean the data at its very source, that is, when it is collected. This can be done manually or by a third party. Either way, here is a list of the tools required to make these changes.

1. Data Understanding – This can be done using data visualisation tools that produce bar plots, scatter plots, histograms, or heat maps, where one can spot discrepancies with the naked eye.

2. Data Manipulation – For missing values in a data set, we can conduct a missing-values treatment depending on the type of data and the relation between its features, which requires a univariate, bivariate, or multivariate analysis (a short sketch of both steps follows this list).
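To make these two steps concrete, here is a minimal sketch in Python with pandas and matplotlib. The file name sales.csv and the revenue column are hypothetical stand-ins for your own data set.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw data (hypothetical file and column names).
df = pd.read_csv("sales.csv")

# Data understanding: a histogram makes skew and extreme values
# easy to spot with the naked eye.
df["revenue"].plot(kind="hist", bins=30, title="Revenue distribution")
plt.show()

# Data manipulation: a per-column count of missing values tells us
# which treatment each feature needs.
print(df.isna().sum())
```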

Before we dive deep into Data Cleaning and Data Transformation, we have to understand these various types of analyses, since they tell us how the data set is affected when corrections are made to the raw data.

Univariate Analysis: This analysis takes one variable or a single data point, which is why it is called univariate analysis. The functions performed during such an analysis are usually used to obtain information about the variable itself: its mean, median, mode, variance, range, interquartile range, or standard deviation. For example, an array of numerical values tells us about that array alone, without being compared with another data set.

Bivariate Analysis: This features the analysis of two data points or sets, which can be compared to draw conclusions or insights with reference to each other. Normally, we perform this wherever we suspect that one data set may have an influence over the other.

Multivariate Analysis: Multivariate analysis is the same as bivariate analysis, except that it considers more than two data points or sets. All in all, in the pre-processing of data, these analyses give us a clear indication of which treatment or method should be used for cleaning and preparing the data for modeling.
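Here is a small illustration of the three kinds of analysis; the DataFrame and its columns are made up purely for demonstration.

```python
import pandas as pd

# A toy data set with three numeric features (hypothetical values).
df = pd.DataFrame({
    "revenue":  [120, 135, 150, 500, 142, 138],
    "ad_spend": [10, 12, 14, 60, 13, 12],
    "visits":   [200, 220, 240, 900, 230, 215],
})

# Univariate: summary statistics of a single variable.
print(df["revenue"].describe())   # mean, std, quartiles, ...

# Bivariate: how two variables move with respect to each other.
print(df["revenue"].corr(df["ad_spend"]))

# Multivariate: pairwise correlations across all features at once.
print(df.corr())
```

Now let's take a look at some of these methods.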

A. Missing Value Treatment:

In order to prepare the data, we either conduct a transformation on the data set or assign new values to the missing data. During multivariate analysis, if we have a large number of missing values, it may be suitable to drop the null values. In the case of univariate analysis, however, dropping the missing values entirely can introduce a bias in the way we look at the data set as a whole.

<1% Missing Values: Usually, we consider the percentage of missing values relative to the entire data set. When the missing values are less than 1%, we can safely drop the affected rows, as it makes little difference to the overall picture. For example, the sales of a single day are negligible compared with the sales made throughout the year.

<7% Missing Values: An industry standard for transforming null values into finite figures is when the missing values are under 7%. In this case, we replace the nulls with the mean, median, or mode observed in the data set or a sample of it.

>30% Missing Values: Another major criterion to look out for is when the missing values exceed 30%. In this case, we drop the feature altogether, as imputing that many values could bias the way the entire data set shapes up, leading to false conclusions. Hence, we avoid it.
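The sketch below turns these rules of thumb into code; the 1%, 7%, and 30% cut-offs come straight from the guidelines above, and the unstated zone between 7% and 30% is left as a case-by-case decision.

```python
import pandas as pd

def treat_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the threshold-based missing-value policy described above."""
    for col in list(df.columns):
        pct_missing = df[col].isna().mean()
        if pct_missing == 0:
            continue
        if pct_missing < 0.01:
            # <1%: dropping the affected rows barely changes the data.
            df = df.dropna(subset=[col])
        elif pct_missing < 0.07:
            # <7%: impute with a central value (median shown here;
            # mean or mode work too, depending on the feature).
            df[col] = df[col].fillna(df[col].median())
        elif pct_missing > 0.30:
            # >30%: imputation would dominate the feature, so drop it.
            df = df.drop(columns=[col])
        # Between 7% and 30%, the right treatment is decided case by case.
    return df
```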

B. Outlier Treatment:

Outliers become a constant threat to the data set when they are not taken care of. A simple problem that arises is that they take on extreme values that distort the distribution. The decision of how to treat them, however, is key to the Business Team, and a Data Scientist does not usually take it without their approval. To identify outliers, a box plot is typically used. The changes are then made in consultation with the business team, which decides the value they are to be replaced with, usually the mean of the data set.
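As a minimal sketch, here is the box-plot (IQR) rule for flagging outliers, with the flagged values replaced by the mean of the remaining ones, mirroring the business-approved treatment described above.

```python
import pandas as pd

# A toy series in which 5000 is an obvious outlier.
s = pd.Series([120, 135, 150, 142, 138, 5000])

# Box-plot whiskers: 1.5 * IQR beyond the first and third quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the whiskers ...
outliers = (s < lower) | (s > upper)

# ... and replace them with the mean of the non-outlier values.
s_clean = s.mask(outliers, s[~outliers].mean())
print(s_clean)
```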

C. Bad Encoding (text):

Let’s consider the example of Alexa, which understands only a select few languages. If we were to give instructions to this machine in a native language it does not recognize, it is not going to work. When this happens in the context of text, we call it bad encoding. For such text, we turn the categorical data into numerical data through a process called one-hot encoding, where we perform numerical encoding of textual data. Here we create dummy variables and assign 1 to the column where the data appears to be true, and 0 everywhere else. If the scheme instead marks the true category with a 0, it is called one-cold.
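A minimal one-hot encoding sketch with pandas follows; the city column is a hypothetical categorical feature.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Kolkata"]})

# Each category becomes its own dummy column, with 1 marking the
# row's true category and 0 everywhere else.
encoded = pd.get_dummies(df, columns=["city"])
print(encoded)
```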

Conclusions: Data Preprocessing is a key stage in Data Analysis where the data is shaped, scaled, or standardized using the various tools mentioned above. This process makes the data fit for Machine Learning and keeps it aligned with the business understanding that lies at the core of any Data Science project.

If you’re interested in Data Science as a career, one of the hottest skills in the market today, you can enroll in the course with Skillslash. At Skillslash, you get a unique opportunity to gain real work experience at top MNCs upon completion of the course. To find out more, get in touch with one of our counselors today by visiting https://skillslash.com/data-science-course-training-kolkata

 

 
