Find a Job


Back to list


Data Platform for Real-Time Analytics

Appen is an AI data collection, aggregation, and cleaning company that provides high-quality training data for organizations that build effective AI systems. In 2020, Appen integrated Figure Eight, a human-in-the-loop AI/ML-powered platform for data transformation, to boost the efficiency of transforming text, images, audio, and video data into customized high-quality training data.


Appen’s mission is to assist organizations in advancing their Artificial Intelligence (AI) and Machine Learning (ML) efforts by providing them with quality training data and automating business processes with easy-to-deploy ML models and integrated human-in-the-loop workflows. To further expand their data transformation services, Appen implemented a cloud-native data platform to generate near real-time reports on data and serve as a foundational service for AI. This platform was intended to help Appen address any architecture and technology limitations of their legacy monolithic solution, which was developed in the early days of the company, and to enhance the quality of data and scalability of their data collection, processing, and reporting services.

The envisioned data platform had to be built as:

  • A highly scalable solution capable of supporting up to 1K parallel workflows for data collection, processing, and reporting;
  • An efficient analytics solution that enables data annotation teams and business units to generate data reports in real time;
  • A solution with advanced architecture, with the support of microservices and specific ML models as part of its data processing workflows.

Appen joined forces with Provectus to create their solution on AWS. The teams decided to use Provectus’ NextGen Data Platform as a base for the Appen solution, and further refine it through Appen-specific customizations.


Provectus began designing and constructing a data platform for Appen by holding a series of discovery workshops. These workshops were used to evaluate Appen’s current processes, infrastructure, and architecture, as well as to pinpoint business key performance indicators for the platform. During this stage, Provectus identified several architectural and infrastructural bottlenecks in Appen’s existing monolithic legacy system, which hindered the data annotation and analytics teams’ ability to process and analyse data on a larger scale, as well as to generate real-time reports. These limitations increased overhead and inefficiencies and raised the total cost of ownership (including Appen’s clients’ service costs), as well as slowed Appen’s data expansion in the AI market.

In response to this, Provectus recommended building a new data platform on AWS, which would incorporate a data lake and have functionality for near real-time analytics and report generation. The architectural diagram for this platform is provided below.

The engineering team at Provectus designed and built data pipelines on Apache Kafka, Amazon S3, and Amazon ECS in order to migrate Appen’s non-relational data streams and workloads. To create a cloud-native data lake for Appen, Provectus used AWS Glue, Amazon Athena, and Amazon S3, taking advantage of the flexibility, scalability, and efficiency of AWS services.

In order to meet Appen’s scalability requirements, their monolithic data workflows were split into specific microservices, powered by Kafka Streams. Apache Kafka was used as a central hub for communication between the various microservices and databases. The migration strategy also included using PostgreSQL’s Change Data Capture (CDC) mechanism as the initial data feed for consuming microservices.

Testing of the proposed solution showed linear scalability and significant performance improvements.


Appen had the potential to expand its business, but its data services were powered by outdated technologies. As a result, the company was able to provide high-quality training data to its customers, but it did not have the necessary tools to increase productivity and expand its operations.

Provectus intervened to help Appen overcome these technological barriers, allowing them to reduce costs and increase their operational efficiency, which resulted in faster business growth. Drawing on our knowledge and experience in creating cutting-edge streaming data platforms, we created a new cloud-native data platform that included a data lake and advanced real-time analytics capabilities for Appen.

With the implementation of Provectus’ new data platform, Appen has seen significant improvements in their end-to-end data pipelines. Performance has improved five-fold, with the throughput of data annotation jobs and data rows increasing by twenty times. Generating a report on processed data has been reduced from nearly half an hour to less than a minute. This has been made possible due to the platform’s advanced architecture, which provides linear scalability and improved performance.

The platform allows Appen to process any type of data quickly and efficiently. It gives them the foundation they need to grow and continue to expand their AI-related data services.

How much did you like this article?

You might like


Lower Risk of Worker Injuries Through AI-driven Computer Vision and Image Analysis

Lower risk of worker injuries through AI-driven computer vision and image analysis The Provectus Worker…

Learn more arrow


ML-powered Demand Forecasting Solution

Learn more arrow


Data Platform for Real-Time Analytics

Learn more arrow

Back to list

Find your next job at Provectus

If you dream of working with cutting-edge technologies that drive positive change on a global scale and want to contribute to open source projects, we'd love to hear from you!