How much manual work is behind an AI model? How data annotation works

vitantoniosantoro
24 Feb 2022

Updated: 28 Feb 2022

Rows of women, mostly young, sit elbow to elbow in cheap chairs in this five-story Soviet-style building on the outskirts of Beijing, staring at their computer screens. Some bring their own pillows to support their backs. During their shifts, they examine images of everyday life and "label" them with dots, lines, and descriptions.

Welcome to the world of artificial intelligence, the arrival of which has been dubbed the fourth industrial revolution, with the promise of freeing humans from repetitive and boring tasks. But, before this utopian promise can be realized, a lot of monotonous work - by humans - must be completed.

Since Frank Rosenblatt invented the "perceptron" in 1958, the first algorithm capable of classifying objects shown to it, not a week has passed without triumphant news of new discoveries in artificial intelligence and of algorithms capable of completing tasks previously reserved for humans. There's a lot of talk these days about algorithms completing complex tasks with seemingly minimal human involvement. However, imagining artificial intelligence as a self-learning entity is a mistake: just because we can't see the humans behind AI doesn't mean there aren't any. This is true not only of the engineers who create these algorithms. In fact, a new class of blue-collar workers is in charge of preparing the sample data that AI algorithms use to "learn."

When people talk about AI these days, what they really mean is usually Machine Learning (ML). Most ML algorithms are essentially statistical models that "learn" how to perform a particular task by analyzing large samples of data ("training data") that have previously been processed by flesh-and-blood people, so as to provide the algorithm with "examples." But what exactly is "training data"? Assume you own a blueberry muffin factory, but every now and then a stray dog from a nearby animal shelter jumps onto the conveyor belt, and your AI-powered packaging robot must differentiate between muffins and dogs so that no dog ends up on a grocery store shelf. To be able to do this, the robot must be fed a large number of images of muffins and dogs, and these images must be manually labeled by so-called taggers, who decide, image by image, whether the picture shows a dog or a muffin.
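To make the idea of labeled examples concrete, here is a minimal sketch in Python. It is an illustration only: the two numeric features (roundness and furriness), their values, and the function names are all invented for this example, and the nearest-centroid rule stands in for whatever model a real packaging robot would use. A real system would learn from image pixels rather than two hand-picked numbers.

```python
from collections import defaultdict
from math import dist

# Hypothetical hand-labeled examples: (features, label).
# Each feature pair is (roundness, furriness) on a 0..1 scale,
# invented purely for illustration.
LABELED_EXAMPLES = [
    ((0.90, 0.10), "muffin"),
    ((0.80, 0.20), "muffin"),
    ((0.85, 0.05), "muffin"),
    ((0.30, 0.90), "dog"),
    ((0.40, 0.80), "dog"),
    ((0.20, 0.95), "dog"),
]


def train(examples):
    """Compute one average feature vector (centroid) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(axis) / len(axis) for axis in zip(*vectors))
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(LABELED_EXAMPLES)
print(predict(centroids, (0.88, 0.12)))
print(predict(centroids, (0.25, 0.90)))
```

Without the human-supplied labels in the second position of each pair, the train function would have nothing to average over, which is the point the muffin factory example makes.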

The same principle applies to self-driving cars (which, among other things, must be able to recognize a stop sign) and most other AI applications.

Data labeling has no boundaries. Some employees may tag data to assist floor-cleaning robots in recognizing furniture and other obstacles. Others may collect and label various ways of saying "25 degrees Celsius" in order to assist smart air conditioners in understanding user commands. Aside from smart appliances, data must be labeled before it can be used to train algorithms in fields such as autonomous driving, in which cars must recognize and interpret objects in complex, real-time conditions and respond accordingly.

This poses a problem for companies: How can they get labeled or annotated data?

Even if they obtain large amounts of data, such as photos (for image recognition algorithms), voice recordings (for speech recognition), or written text (for sentiment analysis), labeling all of this data is a time-consuming task that must be done by humans. The most time-consuming aspect of training an Artificial Intelligence model is manually annotating the data.

How to label the data?

Data can be labeled in a variety of ways. Some businesses/individuals label their data themselves; however, this can be costly, as hiring people solely for these tasks costs businesses both money and flexibility.

Other companies, such as Google, figure out how to get their data labeled for free. Have you ever wondered why Google's reCAPTCHA keeps prompting you to identify street signs in blurry photos? (Hint: Google also owns Waymo, a company that specializes in self-driving cars.)

In most cases, however, paid workers tag data, and several outsourcing companies have sprung up around the world employing thousands of workers ("the invisible workers who power artificial intelligence") to meet this new market need.

Just as Western companies began outsourcing manufacturing jobs to developing countries in the 1960s and 1970s, technology companies are outsourcing data labeling to foreign companies that operate "data label factories." And, as in the past, these jobs are being relocated to areas with lower wages and conditions more favorable to businesses. There, in former warehouses and large open-plan offices, hordes of workers sit in front of computers and spend their days labeling data.

Another way to outsource data labeling is through online crowdworking platforms, which divide up different tasks among thousands of workers around the world and manage to annotate huge volumes of data in an extremely flexible way, at competitive costs, and in record time, significantly speeding up the process of creating an AI algorithm.
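As a rough sketch of how a platform might divide labeling tasks among workers, the Python snippet below hands out tasks in round-robin order. The file names, worker names, and the distribute function are hypothetical, and a real crowdworking platform would also handle worker availability, quality checks, and payment, none of which is modeled here.

```python
from itertools import cycle


def distribute(tasks, workers):
    """Assign each task to a worker in round-robin order."""
    assignments = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignments[worker].append(task)
    return assignments


# Hypothetical task and worker identifiers, for illustration only.
tasks = [f"image_{i:03d}.jpg" for i in range(7)]
workers = ["worker_a", "worker_b", "worker_c"]

batches = distribute(tasks, workers)
for worker, batch in batches.items():
    print(worker, batch)
```

Round-robin keeps batch sizes within one task of each other, which is a simple way to spread the load when every task takes about the same effort.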

Roboticly.ai employs over 1,000 collaborators who specialize in text analysis in Italian, French, German, Spanish, English, Russian, and Chinese, as well as in the labeling of video and image datasets. They manually analyze and annotate digital content in order to keep "training" the algorithms and improve their accuracy.

To support third-party companies efficiently and cost-effectively in the creation of AI models, two things are required: a team of multilingual collaborators with knowledge of various domains (e.g., health, business, law) and a tool to distribute tagging tasks among different collaborators. There are no companies in Europe that specialize in these tasks, and Roboticly's experience in data labeling activities for AI models is growing.

The blue-collar work of the AI era

Taking a step back, it is clear that a new type of low-skilled laborer has emerged to meet technology's demand for labeled data. Unlike physical assembly-line labor in the industrial economy, this new class of worker is now part of a "digitized data supply chain." Of course, not all of these jobs are low-skilled – for example, an algorithm that detects cancer in CT images must be trained by experienced radiologists.

As a result, it is critical to ensure that this new type of work becomes a source of economic security for workers rather than a source of exploitation.

So, while AI has the potential to create more creative, value-added jobs for humans in the future, for the time being at least it is also creating a new wave of manual labor that many people are willing to accept.

"There were no iPhone workers or Foxconn workers ten years ago, and I believe that while some jobs are being replaced, new jobs will always be created."

Roboticly.ai can support your company in creating and labeling datasets for AI model training and in ensuring their quality, as well as in building the AI algorithms themselves.