Machine learning algorithms learn from data. It is critical that you feed them the right data for the problem you want to solve. Even if you have good data, you need to make sure that it is in a useful scale and format, and that meaningful features are included. In this post you will learn how to prepare data for a machine learning algorithm. This is a big topic, so this post covers the essentials.
Data Preparation Process
The more disciplined you are in your handling of data, the more consistent and better the results you are likely to achieve. The process for getting data ready for a machine learning algorithm can be summarized in three steps:
Step 1: Select Data
Step 2: Preprocess Data
Step 3: Transform Data
You can follow this process in a linear manner, but it is very likely to be iterative, with many loops.
Step 1: Select Data
This step is concerned with selecting the subset of all available data that you will be working with. There is always a strong temptation to include all the data that is available, in the belief that the maxim “more is better” will hold. This may or may not be true. You need to consider what data you actually need to address the question or problem you are working on. Make some assumptions about the data you require and be careful to record those assumptions so that you can test them later if needed. Below are some questions to help you think through this process:
Only in small problems, such as competition or toy datasets, has the data already been selected for you.
Step 2: Preprocess Data
After you have selected the data, you need to consider how you are going to use it. This preprocessing step is about getting the selected data into a form that you can work with. Three common data preprocessing steps are formatting, cleaning and sampling.
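As a rough illustration of formatting, cleaning and sampling, here is a minimal sketch in Python using only the standard library. The CSV contents, the column names (`age`, `income`, `label`) and the helper function names are all hypothetical, and dropping rows with missing values is just one possible cleaning choice, not the only one.

```python
import csv
import io
import random

# Hypothetical raw data: a small CSV with missing values and inconsistent labels.
RAW_CSV = """age,income,label
34,52000,yes
,48000,no
45,61000,YES
29,39000,no
51,,yes
38,57000,no
"""


def format_rows(text):
    """Formatting: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Cleaning: drop rows with missing values, then fix types and label case."""
    cleaned = []
    for row in rows:
        # Skip any row where a field is empty or absent.
        if any(not (value or "").strip() for value in row.values()):
            continue
        cleaned.append({
            "age": int(row["age"]),
            "income": float(row["income"]),
            "label": row["label"].strip().lower(),
        })
    return cleaned


def sample_rows(rows, k, seed=0):
    """Sampling: draw a reproducible random sample of at most k rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


rows = format_rows(RAW_CSV)
cleaned = clean_rows(rows)
sampled = sample_rows(cleaned, k=3)
print(len(rows), len(cleaned), len(sampled))
```

Working on a small sample like this can make early experiments much faster; you can rerun the same steps on the full dataset once the approach looks promising.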
It is very likely that the machine learning tools you use on the data will influence the preprocessing you are required to perform. You will likely revisit this step.
Step 3: Transform Data
The final step is to transform the preprocessed data. The specific algorithm you are working with and your knowledge of the problem domain will influence this step, and you will very likely have to revisit different transformations of your preprocessed data as you work on your problem. Three common data transformations are scaling, attribute decomposition and attribute aggregation. This step is also referred to as feature engineering.
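The three transformations can be sketched in a few lines of Python. This is only an illustration under assumed data: the records, the field names (`timestamp`, `income`, `logins`) and the helper functions are hypothetical, and min-max scaling is just one of several common ways to scale an attribute.

```python
from datetime import datetime

# Hypothetical preprocessed records.
records = [
    {"timestamp": "2024-01-15 08:30:00", "income": 39000.0, "logins": [1, 0, 3]},
    {"timestamp": "2024-01-16 14:05:00", "income": 52000.0, "logins": [2, 2, 2]},
    {"timestamp": "2024-01-20 21:45:00", "income": 61000.0, "logins": [0, 0, 1]},
]


def min_max_scale(values):
    """Scaling: map values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def decompose_timestamp(ts):
    """Decomposition: split a timestamp into simpler component attributes."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"hour": dt.hour, "weekday": dt.weekday()}


def aggregate_logins(logins):
    """Aggregation: combine several login counts into a single total."""
    return sum(logins)


scaled = min_max_scale([r["income"] for r in records])

features = [
    {
        "income_scaled": s,
        **decompose_timestamp(r["timestamp"]),
        "total_logins": aggregate_logins(r["logins"]),
    }
    for r, s in zip(records, scaled)
]

for row in features:
    print(row)
```

Whether the hour of day or a login total is actually meaningful depends on the problem; the point is that each derived attribute is a candidate feature you can test.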
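You can spend a lot of time engineering features from your data, and it can be very beneficial to the performance of an algorithm. Start small and build on the skills you learn.
Summary
In this post you learned the essence of data preparation for machine learning. You discovered a three-step framework for data preparation and tactics in each step:
Step 1: Select Data (choose the subset of available data you need and record your assumptions)
Step 2: Preprocess Data (formatting, cleaning and sampling)
Step 3: Transform Data (scaling, attribute decomposition and attribute aggregation)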
Data preparation is a large subject that can involve a lot of iteration, exploration and analysis. Getting good at data preparation will go a long way toward making you a master of machine learning. For now, just consider the questions raised in this post when preparing data, and always look for clearer ways of representing the problem you are trying to solve.
Resources
If you are looking to dive deeper into this subject, you can learn more in the resources below.
Do you have any data preparation tips and tricks? Please leave a comment and share your experiences.