Here’s a great interview question for a data scientist position. How much time do you allocate to data preprocessing vs modeling?
While building models is exciting, high-performing models require high-quality data as input. If we feed our model poor-quality data, we get faulty output. As the saying goes: garbage in, garbage out. Even state-of-the-art machine learning methods perform poorly when trained on the wrong data, because models learn exclusively from the training data (and not the data we wish they had). So we need to take our time to ensure that the training data is consistent and error-free.
The generally accepted answer to our interview question is 80% for data preprocessing and 20% for modeling. Improving the data upstream brings benefits for all downstream steps.
CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a widely used standard process for data science projects. Data preprocessing is the third of six phases in CRISP-DM. Here is an outline of the phases:
1. Business understanding involves understanding the objectives, creating a project plan and defining performance metrics.
2. Data understanding includes data collection and data exploration, as well as checking that we have high-quality data.
3. Data preparation is what we call data preprocessing, and it is the topic of this blog post and its follow-up.
4. Modeling. Once data has been preprocessed in a suitable format for the machine learning task, it is used to train models and tune parameters. This phase can also include model assessment by a domain expert.
5. Evaluation reviews the model according to the business objectives. For instance, we might understand that the model doesn’t cover some edge cases and we need to collect additional data to investigate further.
6. Deployment presents the outcomes in a convenient and accessible format to the users.
We can group data preprocessing into two steps: data cleaning and feature engineering. The former transforms the raw data into consistent data and is done just once. The latter (which we’ll look at in the next blog post) transforms consistent data into the specific format required by each machine learning method. So if we are going to apply five different methods, then we need to build five different feature engineering pipelines.
We can use a toy dataset to explain data preprocessing concepts. The dataset contains information on customers of a hypothetical e-commerce website (HEW). The independent variables are the username (name), age, city, salary, number of visited pages (pages), number of unique sessions (sessions), number of visited products, and whether the member clicked on an advertisement for a specific product (click). The dependent variable is whether the member purchased the product currently on promotion (purchased).
The dataset is the following:
Last year, the HEW marketing department showed the advertisement to all its customers, but this year it wants to target only those likely to buy that product. As data scientists, our job is to predict which customers are most likely to buy.
Throughout this and the following blog post, we preprocess this toy dataset for the predictive task.
Edwin de Jonge and Mark van der Loo describe data cleaning in terms of three main stages of the data:
• raw data, that is, our input data.
• technically correct data, or raw data in an organized tabular format in which each value belongs to the domain of its variable. In our salary variable, “50000” (numeric) and “60000” (numeric) belong to the variable domain, whereas “high” (string) does not. The HEW toy dataset is already at this stage.
• consistent data, or technically correct data that has been transformed into a format suitable for machine learning.
This is the resulting data cleaning pipeline overview:
raw data → technically correct data → consistent data
The first two steps are highly dependent on the programming language we use. We may read data in a tabular format using `read.table` in R or `read_table` from the pandas library in Python. Then, we convert variables according to their type (for example, the values of the age and salary variables are converted to numeric values using the `as.numeric` function in R). We focus the remainder of this section on the third step, that is, building consistent data from technically correct data, and we briefly touch on two topics: handling missing values and handling inconsistencies.
Real-world datasets often have several variables with missing values. A common default in machine learning tools is to remove any entry with missing values. Unfortunately, this can throw away a lot of useful data. Understanding the nature of the missing data may help us keep most of the information.
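As a minimal sketch of these first two steps in Python, the snippet below reads a small tab-separated text with pandas and then converts columns to numeric types. The inline rows are made-up placeholders rather than the HEW dataset, and using `errors="coerce"` to turn out-of-domain values such as "high" into missing values is one possible choice, not the only one.

```python
import io

import pandas as pd

# Made-up rows for illustration only; this is not the HEW dataset.
raw_text = "name\tage\tsalary\nuserA\t25\t50000\nuserB\t40\thigh\n"

# Step 1: read the raw text into a table, keeping every value as a string.
raw = pd.read_table(io.StringIO(raw_text), dtype=str)

# Step 2: convert each variable to its intended type.
# errors="coerce" replaces values outside the numeric domain (e.g. "high") with NaN.
technically_correct = raw.copy()
technically_correct["age"] = pd.to_numeric(raw["age"], errors="coerce")
technically_correct["salary"] = pd.to_numeric(raw["salary"], errors="coerce")

print(technically_correct.dtypes)
```

Coercing to NaN keeps the row in the table and defers the decision about the bad value to the missing-value handling step.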
There are four main mechanisms by which a variable may have missing values:
• missing completely at random (MCAR), where every value has the same probability of being missing, regardless of any other variable. The missing salary values then look, on average, like the observed ones.
• missing at random (MAR), where the probability that a value is missing depends on the observed data, but not on the missing value itself. That is, the salary value is missing because of the values in other recorded variables.
• missingness that depends on unobserved predictors, where values are missing because of information that was not recorded (unobserved).
• missingness that depends on the missing value itself, where some values are more likely to be missing than others. For example, we don’t see salaries above 500000 in the data because no answer is given instead.
The last two categories are often grouped together under the name of missing not at random (MNAR).
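To make the difference between these mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the data is randomly generated, and the probabilities, the age cut-off and the 70000 salary threshold are arbitrary choices, not properties of the HEW dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 1000

# Simulated customers; none of these values come from the HEW dataset.
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "salary": rng.normal(50_000, 15_000, size=n).round(),
})

# MCAR: every salary has the same 10% chance of being missing.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance of a missing salary depends only on the observed age.
mar_mask = rng.random(n) < np.where(df["age"] < 30, 0.30, 0.05)

# MNAR (depends on the value itself): high salaries are never reported.
mnar_mask = (df["salary"] > 70_000).to_numpy()

# Series.mask replaces the entries where the condition is True with NaN.
salary_mcar = df["salary"].mask(mcar_mask)
salary_mar = df["salary"].mask(mar_mask)
salary_mnar = df["salary"].mask(mnar_mask)

print(salary_mcar.mean(), salary_mar.mean(), salary_mnar.mean())
```

In the MNAR case the observed salaries can never exceed the threshold, so any statistic computed only from the observed values is biased downwards, which is why simply dropping such rows can be misleading.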
Proper handling of missing values will be covered in a future blog post. For the moment we are going to remove user3 and user8 in the HEW dataset because they have one missing value each.
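Removing entries with missing values can be sketched with pandas as follows. The table is a hypothetical stand-in with invented names and values, since only the idea of dropping incomplete rows matters here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows; the names and values are invented for illustration.
customers = pd.DataFrame({
    "name": ["userA", "userB", "userC", "userD"],
    "age": [34, 28, np.nan, 45],
    "salary": [50000, 60000, 55000, np.nan],
})

# Count missing values per column before deciding what to do.
print(customers.isna().sum())

# Drop every row that has at least one missing value.
complete = customers.dropna(axis=0, how="any").reset_index(drop=True)
print(complete)
```

Here two of the four rows are lost, which illustrates how quickly this simple strategy can shrink a small dataset.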
Some reported values may not belong to the variable domain, for example a “-2” (numeric) in the age variable. Other inconsistencies may be related to rules among variables, for example when someone’s age is 2 and their driving licence is Yes. In that case there has been a mistake in recording the information, and the inconsistency must be addressed.
Understanding the meaning of the variables, with the help of a domain expert, helps tackle this issue: the expert’s knowledge can be written down as a set of rules that check the data for inconsistencies.
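Such a set of rules could be sketched in pandas as boolean checks, one per rule. The records, the `licence` column name, the age bounds and the minimum driving age below are all assumptions made for illustration; in practice a domain expert would supply them.

```python
import pandas as pd

# Invented records for illustration; "licence" is a hypothetical column name.
people = pd.DataFrame({
    "age": [35, -2, 2],
    "licence": ["Yes", "No", "Yes"],
})

# Assumed thresholds; a domain expert would set the real ones.
MIN_AGE, MAX_AGE = 0, 120
MIN_DRIVING_AGE = 16

# Each column is one rule and holds True where the record satisfies it.
rules = pd.DataFrame({
    "age_in_domain": people["age"].between(MIN_AGE, MAX_AGE),
    "licence_requires_min_age": (people["licence"] != "Yes")
    | (people["age"] >= MIN_DRIVING_AGE),
})

# Keep only the records that break at least one rule.
violations = rules.loc[~rules.all(axis=1)]
print(violations)
```

Keeping one column per rule makes it easy to see which rule each record violates before deciding how to correct it.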
In the HEW dataset, user4 has age set to 1. This is clearly a mistake in the data retrieval, and we decide to replace this value with the average age of the other users. As a result, user4 now has age 33.
We introduced data preprocessing in the context of the CRISP-DM cycle and delved into data cleaning to deal with missing values and inconsistencies. What remains is feature engineering, which ensures that the data complies with the specific format required by the machine learning model of interest. We cover that in the next blog post.
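A correction of this kind could be sketched as below. The ages are invented and are not the HEW values, and the plausible range of 18 to 100 is an assumption used only to flag the suspicious entry.

```python
import pandas as pd

# Invented ages for illustration; they are not the HEW values.
ages = pd.Series(
    [30, 40, 1, 35],
    index=["userA", "userB", "userC", "userD"],
    name="age",
)

# Flag values outside an assumed plausible range for a customer.
implausible = ~ages.between(18, 100)

# Average of the remaining plausible ages, rounded to a whole year.
replacement = int(round(ages[~implausible].mean()))

# Replace only the flagged entries and leave the rest untouched.
corrected = ages.mask(implausible, replacement)
print(corrected)
```

Computing the average from the plausible values only keeps the erroneous entry from distorting its own replacement.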