Data Preparation: Understanding Data for AI Product Managers

Elen Gabrielyan
5 min read · Mar 15, 2022

Data preparation is a significant chunk of the workload when developing a Machine Learning (ML) algorithm. It is not a one-time operation but an ongoing process, and it significantly affects both the ML model and the model evaluation results. If you do not pay enough attention to data quality, you can quickly end up with biased results. For example, you may get 99% accuracy on your test data but poor results in real life. That can happen when testing is biased, e.g., when your test data comes from the same dataset as your training data and therefore shares its biases.
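As a minimal sketch of how to guard against that pitfall, you can split by data source rather than by row, so that recordings from one source never land in both sets. The example below uses scikit-learn's GroupShuffleSplit on a hypothetical recordings table (the column names and values are illustrative):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata: one row per recording, grouped by speaker.
recordings = pd.DataFrame({
    "file":    ["a.wav", "b.wav", "c.wav", "d.wav", "e.wav", "f.wav"],
    "speaker": ["s1",    "s1",    "s2",    "s2",    "s3",    "s3"],
    "label":   [0,       1,       0,       1,       1,       0],
})

# Split by speaker, not by row: each speaker (with their recording
# conditions) ends up entirely in train OR test, never both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=42)
train_idx, test_idx = next(
    splitter.split(recordings, groups=recordings["speaker"])
)

train, test = recordings.iloc[train_idx], recordings.iloc[test_idx]
assert set(train["speaker"]).isdisjoint(set(test["speaker"]))
```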

At Krisp, we work on data throughout the whole ML project life-cycle. We carefully select data for training and then improve it continuously. We also collect a separate, independent set of data for testing, which evolves with the algorithm development stage.

Let’s dive deeper into the three main stages that constitute the data preparation process:

Data acquisition (collection)

Data acquisition is, as the term suggests, the process of collecting data. At this stage, it is essential to identify the data relevant to your ML project and then find ways to get it. Paid sources provide good-quality data for certain problems, and there are also many freely available datasets that you can use right away for research purposes. Keep in mind that such free data may need filtering before being fed to a neural network.

At Krisp, we often record our own data for final model training. The reasons are manifold, ranging from the licensing policies of existing datasets to the fact that for some unique problems there is simply no appropriate data, even from paid sources.

Ask yourself the following questions before starting data acquisition:

  • What kind of data is needed, and what are the goals: is it test data or train data?
  • Do you have data that can be improved?
  • Does the data have objective, non-biased labels?
  • Is there a way to get data externally?
  • What’s the quantity of data needed? What’s the fastest and optimal way to get it?
  • Do you have any data generation methods?

At this stage, you need to consider the following steps:

Identification

We identify the type of data needed, which requires understanding the specific use cases of the problem. This is where product managers help: they determine the use cases based on analysis of similar technologies on the market and the final vision of how the technology will be used. That way we collect more relevant data that is diverse, unbiased, and balanced (meaning no elements in the dataset are more strongly weighted or represented than others).
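As a quick sanity check for balance, you can count samples per class in the collected metadata; a minimal pandas sketch (the label values are illustrative):

```python
import pandas as pd

# Hypothetical metadata for a freshly collected dataset.
df = pd.DataFrame({"label": ["happy", "happy", "sad", "happy", "angry"]})

# Share of each class in the dataset: a heavily skewed distribution
# signals that some classes are over-represented, i.e., the data is biased.
print(df["label"].value_counts(normalize=True))
```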

Gathering

Once we know what data is needed, we can start gathering it. The process depends directly on the data type: in some cases collecting the data is easy and free (open-source datasets or free download options), while in project-specific cases we may need to find companies to buy data from or record it on demand.

Labeling

For supervised learning, we need labeled data. Labeling means adding one or more labels specifying properties, characteristics, classifications, or objects contained in the data. This process is critically important: the model learns from the labels, so its quality is directly tied to theirs. For example, suppose you have a supervised machine learning model that should identify a person’s emotions from their voice. For this model, we would need audio files labeled with specific emotion classes. The classes should be specified beforehand, as there are many possible emotion classifications.
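A small sketch of what this can look like in practice (the emotion taxonomy and file names below are illustrative, not an actual labeling scheme): each audio file gets exactly one label from a fixed, predefined set, and anything outside that set is rejected:

```python
# Illustrative emotion taxonomy, agreed on before labeling begins.
EMOTIONS = {"happy", "sad", "angry", "neutral"}

labeled_samples = [
    {"file": "call_001.wav", "emotion": "happy"},
    {"file": "call_002.wav", "emotion": "neutral"},
]

# Reject labels outside the agreed taxonomy: inconsistent labels hurt
# the model just as much as low-quality audio does.
for sample in labeled_samples:
    assert sample["emotion"] in EMOTIONS, f"unknown label: {sample['emotion']}"
```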

Data pre-processing

Data quality is essential: poor-quality data can negatively affect the ML model. We constantly check data quality and filter the data appropriately before starting training. This is where the pre-processing stage comes in.

Some of the questions that need to be addressed at the stage of data pre-processing are:

  • Is the data “noisy” (full of incorrect values)? If so, consider other data sources.
  • Does the dataset have missing values? If so, how can they be replaced or fixed? (See the sketch after this list.)
  • How can datasets be combined without losing important data?
  • Where is the data stored?
  • Have the security and availability of the data been considered?
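For the missing-values question, one common option is imputation, i.e., filling gaps with a statistic of the column. A minimal scikit-learn sketch (the feature values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Feature matrix with gaps: NaN marks missing measurements.
X = np.array([[1.0,    2.0],
              [np.nan, 3.0],
              [7.0,    np.nan]])

# Replace each missing value with its column mean; the median or a
# constant may fit better, depending on the data's distribution.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```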

Data pre-processing includes:

Cleaning

When conducting data acquisition, we should always consider that the data might not be ready for machine learning training. There might be incorrect values, empty fields, outliers, etc., which should be fixed beforehand. The data might also contain people’s personal information, which should be replaced with IDs to anonymize it.
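A minimal pandas sketch of these cleaning steps, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "user_name":  ["alice", "alice", "bob", None],
    "duration_s": [12.0,    12.0,    -3.0,  8.0],  # -3.0 is impossible
})

df = df.drop_duplicates()             # remove exact duplicate rows
df = df.dropna(subset=["user_name"])  # drop rows missing key fields
df = df[df["duration_s"] > 0]         # filter out invalid values

# Anonymize: replace personal names with opaque numeric IDs.
df["user_id"] = pd.factorize(df["user_name"])[0]
df = df.drop(columns=["user_name"])
```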

Transformation

At this stage, the data format may be changed by methods such as normalization, which rescales numerical data points into the 0–1 range.
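A minimal NumPy sketch of min-max normalization:

```python
import numpy as np

x = np.array([3.0, 7.0, 11.0, 5.0])

# Min-max normalization: (x - min) / (max - min) maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)  # [0.   0.5  1.   0.25]
```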

Augmentation

Augmentation is the process of adding data to scale the dataset and make it artificially richer. It is usually done synthetically, by adding artificially modified copies of existing samples. For example, for Krisp’s Noise Cancellation models, we may vary the volume and pitch, apply filters, etc.
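A minimal sketch of two such augmentations on raw audio with NumPy (gain and additive noise; a pitch shift would need a dedicated DSP library, so it is omitted here):

```python
import numpy as np

def augment_gain(audio: np.ndarray, gain_db: float) -> np.ndarray:
    """Scale the volume by gain_db decibels."""
    return audio * (10.0 ** (gain_db / 20.0))

def augment_noise(audio: np.ndarray, noise_level: float) -> np.ndarray:
    """Mix in low-level Gaussian noise to mimic varied recording conditions."""
    rng = np.random.default_rng(0)
    return audio + noise_level * rng.standard_normal(audio.shape)

signal = np.sin(np.linspace(0, 2 * np.pi * 440, 16000))  # 1 s test tone
louder = augment_gain(signal, gain_db=6.0)   # roughly 2x amplitude
noisy  = augment_noise(signal, noise_level=0.01)
```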

Sampling

In simple words, sampling means selecting a smaller part of the whole dataset for training. This is done because training machine learning models is quite expensive, requiring immense computational power and time; hence using a smaller dataset is more efficient.
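A sketch of one way to do this, stratified subsampling with pandas, which shrinks the dataset while preserving its class proportions (the file names and labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "file":  [f"clip_{i}.wav" for i in range(1000)],
    "label": ["speech"] * 800 + ["noise"] * 200,
})

# Take 10% of each class, so the subsample keeps the 80/20 class ratio.
subsample = df.groupby("label").sample(frac=0.1, random_state=42)
print(subsample["label"].value_counts())  # speech: 80, noise: 20
```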

Feature engineering

At this point, we already have pre-processed, ready-to-use data for our ML models. At the feature engineering stage, ML engineers and scientists define which features (variables) to extract from the data and use.

Some of the questions that need to be addressed at the stage of feature engineering are:

  • What’s the hypothesis behind feature selection?
  • What feature engineering techniques are going to be used?

Below are the two steps included in this stage:

Feature selection

Feature selection is the process of choosing the most essential attributes/properties over others; at this stage, the most important features are kept and the rest are removed.
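A minimal scikit-learn sketch of one common technique, univariate selection, which scores every feature against the labels and keeps only the top scorers (the toy matrix is made up):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: 4 features, only some of which track the labels.
X = np.array([[1.0, 0.1, 5.0, 0.0],
              [2.0, 0.2, 5.1, 1.0],
              [8.0, 0.1, 4.9, 0.0],
              [9.0, 0.3, 5.0, 1.0]])
y = np.array([0, 0, 1, 1])

# Keep the 2 features with the strongest ANOVA F-score against y.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())  # boolean mask of the kept features
```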

Feature extraction

At this stage, the selected features are extracted from the raw data, which allows the machine learning model to be trained on a smaller dataset.
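For audio problems, a typical instance is replacing raw waveforms with a compact spectral representation such as MFCCs. A sketch using the librosa library (the synthetic tone and parameter values are illustrative):

```python
import numpy as np
import librosa

# One second of a synthetic 440 Hz tone stands in for a real recording.
sr = 16000
y = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr).astype(np.float32)

# 13 MFCCs per frame: a few hundred coefficients replace 16,000 raw
# samples, which is the "smaller dataset" benefit mentioned above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```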

Conclusion

If you have insufficient data, you cannot have a good machine learning model; it’s as simple as that. So everything mentioned above, and anything else that can affect your data quality, is essential to consider when conducting a machine learning project. Though there are many challenges in data preparation, it is still better to invest enough time and resources at this stage than to come back to it after achieving mediocre results with the model.
