How to deal with AI model data preparation

This article gives you an idea of how data scientists and AI teams deal with data preparation.

Data preparation is the very first step in building AI

It encompasses all the steps in processing raw data (text, video, image, numbers, audio) into a machine-readable format. It also prepares datasets for further analysis. Therefore, before feeding our data into a machine learning model, it’s crucial to ensure that the data is neat and consistent. Dorian Pyle, author of “Data Preparation for Data Mining”, wrote that data prep clears the water and removes the murk so that the fish are clearly seen and easily attracted. The fish are, of course, the value we hope to extract from the sea of data.
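To make "neat and consistent" a bit more concrete, here is a minimal sketch of a cleaning pass over raw text records. It is only an illustration, not a method reported by the survey respondents: the function name `clean_records` and the sample data are hypothetical, and real pipelines usually need many more steps.

```python
import unicodedata


def clean_records(records):
    """Normalize raw text records (hypothetical helper for illustration).

    Steps: normalize Unicode, collapse whitespace, drop empty entries,
    and drop duplicates that differ only in letter case.
    """
    seen = set()
    cleaned = []
    for raw in records:
        if raw is None:
            continue  # missing value, nothing to clean
        text = unicodedata.normalize("NFKC", raw)
        text = " ".join(text.split())  # collapse runs of spaces, tabs, newlines
        if not text:
            continue  # empty after trimming
        key = text.casefold()
        if key in seen:
            continue  # case-insensitive duplicate
        seen.add(key)
        cleaned.append(text)
    return cleaned


# Made-up sample input, not survey data
raw = ["  Great   product! ", "great product!", "", None, "Slow\tdelivery\n"]
print(clean_records(raw))  # ['Great product!', 'Slow delivery']
```

Keeping the first occurrence of each duplicate is an arbitrary choice made here for simplicity; a real project would decide this per dataset.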

Data wrangling is a time-consuming and error-prone task during the AI building cycle

Piotr Migdał, Ph.D., a deep learning consultant, told us that data scientists report spending roughly 80 to 90% of their time prepping data. At AIDoubler we wanted to get into the details of the whole process. Our goal was to make sure we understand the needs of the AI community while working on services and products dedicated to it.

To approach this methodically, we decided to run a survey. At the beginning of January, we asked Polish AI professionals to fill it out. Our aim was to learn how they work with datasets and to understand their main pain points.

We hope that the insights from this research will help everyone in the field better understand the current state of AI building. One especially interesting question we are trying to address is: how can optimizing the annotation process enhance the daily work of machine learning engineers?

AI crowd – respondents and questions

The results come from almost 60 Polish AI professionals. Over 45% of them represented academia and almost 18% AI businesses, followed by employees of IT corporations, software houses, media/marketing agencies, retail companies and others. You will find out how to get the exact numbers in the PS at the bottom.

Figure: AI at universities, AI software houses, AI in corporations, AI in media


What did we ask about?

  • field of work and type of organization
  • type of data our respondents mostly deal with
  • crucial factors in dataset preparation
  • types of APIs they use
  • use of human labeling
  • current challenges in their work

The biggest need in AI model building is…

No surprise here: it is all about AI data labeling. The overwhelming majority of the participants pointed to annotation work as the biggest need in building AI models. This insight confirms the well-known market tendency that data labeling for model building ranks high in the AI hierarchy of needs. Using information-based intelligence to build reports is another need companies have. As Monica Rogati, data science advisor, puts it: you need a solid foundation for your data before you can be effective with AI and machine learning.

Figure: data cleaning, preparation

What type of data do AI professionals mostly work with?

Text was the most popular type of material to deal with (nearly 60%), only slightly ahead of traditional numeric data.

Text-based NLP (Natural Language Processing) has proved to be extremely valuable for businesses. In the area of customer development, it can be used to understand customer needs, behaviors and opinions. Typical NLP tasks include metadata extraction and sentiment analysis.

Images are not far behind text and numerical data: 48% of our respondents deal with images in their daily work. The holy grail in business intelligence is, of course, image recognition and computer vision. We want to teach machines how to “see” things that humans sometimes cannot recognize. Applications of image recognition range from MedTech and e-commerce to self-driving cars. However, its key applications are still to be discovered.

Figure: Cumulative global AI revenue (source: Statista)

Interestingly enough, we observe more and more AI analysts using video, voice, and geodata.

Go to the PS if you want to obtain the exact numbers 😉

Input data precision is KING

Now that we know what kind of data is most in demand, let’s dive into the core of our survey. Asked about the key considerations when working with data, participants stated that… precision is a crucial factor in building machine learning models.

Aleksandra Przegalińska, Ph.D., who works on AI at MIT in Boston, recently told us that any AI team should put data precision first when choosing an annotation supplier.

The other two factors listed were speed and price. However, the combination of precision and speed proved to be the secret ingredient.
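One way to put a number on annotation precision is to compare a supplier's labels with a small, trusted "gold" sample. The sketch below is only an assumption about how such a check could look: the survey did not define precision formally, and the function name `label_precision` and the sample labels are hypothetical. For each label, it computes the share of items given that label that truly carry it.

```python
from collections import Counter


def label_precision(gold, annotated):
    """Per-label precision of annotations against gold labels.

    precision(label) = correctly assigned / all items assigned that label
    (hypothetical helper for illustration).
    """
    if len(gold) != len(annotated):
        raise ValueError("gold and annotated must have the same length")
    assigned = Counter()  # how often each label was assigned by the annotator
    correct = Counter()   # how often that assignment matched the gold label
    for g, a in zip(gold, annotated):
        assigned[a] += 1
        if g == a:
            correct[a] += 1
    return {label: correct[label] / assigned[label] for label in assigned}


# Made-up example: five items checked against a trusted sample
gold = ["cat", "dog", "cat", "dog", "cat"]
annotated = ["cat", "dog", "dog", "dog", "cat"]

for label, value in sorted(label_precision(gold, annotated).items()):
    print(f"{label}: {value:.2f}")
# cat: 1.00
# dog: 0.67
```

A check like this only covers the labels the annotator actually used, and it says nothing about speed or price, the other two factors respondents named.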

At AIDoubler we take these findings into consideration and apply them to our AI business solutions to provide outstanding service to those seeking precise annotations.

API vs Human Data Labeling?

We know that machine learning is expensive and time-consuming. Data scientists say that they spend the majority of their time on cleaning and data wrangling. Therefore, freeing up experts’ time yields significant savings! This includes the time to launch an AI model, a proof of concept or subsequent iterations of the final models. In other words, it may significantly lower the costs of building AI solutions and save the time needed to train and evaluate AI models.

Nevertheless, only a small fraction of organizations use APIs at all! Businesses based on AI solutions, scholars and software houses create their own internal datasets. Roughly one in five respondents makes use of any APIs. Those who do use cloud services preferred on-demand APIs from Google and Microsoft.

Go to the PS if you want to obtain the list of all APIs named in the survey

Following this lead, we asked our participants which ‘human’ labeling option they use. Unexpectedly, circa 40% of them employ and manage their own annotator teams instead of choosing one from the market. The few percent who do outsource data prep tasks named Amazon Turk and Upwork, among others. The reasons for giving up those solutions were dissatisfaction with precision, micromanagement issues and, often, data security.

The most painful points in AI building are…

Concluding the survey, we asked about the main pain points experienced by the AI community in Poland.

The answers highlighted the time needed, the lack of scalability and concerns regarding quality, that is, precision. Other important factors were trust and privacy issues. One of our participants stated: “Data is confidential and cannot be labeled by third parties unless there is a strict NDA in place.” BTW, at AIDoubler we sign a non-disclosure agreement with our analysts, who are vetted by our Business Partners.

But over and over again, the professionals we asked pointed to input preparation as the massive pain point in building solutions based on artificial intelligence. “Data requires plenty of cleaning before it can be processed,” said Piotr Rybak, a Data Scientist with over 10 years of experience in analytics. Another concern came from a university scholar working with large volumes of data, who stated: “Human annotations do not scale, especially with more advanced/specialist needs (such as domain-specific data, privacy issues, etc.).” We tag those kinds of problems as #annotation process, as the problem is the lack of a proper process combined with limited labeling capacity. More pain points can be found, clustered, in the report available via the PS.

So, what can we conclude?

The lesson from the survey is that the AI market in Poland is not addressing annotation in a satisfactory manner. From the client’s perspective, the major problems are data privacy and precision. The good news is that those issues can be addressed by optimizing the annotation process.

Our mission is fast, expert annotation with supreme accuracy and secure data, thanks to optimal annotation processes. We want your business or AI project to grow faster. Curious? Request a free trial.

Please comment to leave your feedback or message me directly if you prefer. I am curious about your experiences 🤔


PS. If you care about the exact numbers, leave your email below and allow us to send you the detailed results from the survey.