Data Platform for Real-Time Analytics

Tech news
January 5, 2023

Appen is an AI data collection, aggregation, and cleaning company that provides high-quality training data for organizations that build effective AI systems. In 2020, Appen integrated Figure Eight, a human-in-the-loop AI/ML-powered platform for data transformation, to boost the efficiency of transforming text, images, audio, and video data into customized high-quality training data.

Challenge

Appen’s mission is to assist organizations in advancing their Artificial Intelligence (AI) and Machine Learning (ML) efforts by providing them with quality training data and automating business processes with easy-to-deploy ML models and integrated human-in-the-loop workflows. To further expand their data transformation services, Appen implemented a cloud-native data platform to generate near real-time reports on data and serve as a foundational service for AI. This platform was intended to help Appen address any architecture and technology limitations of their legacy monolithic solution, which was developed in the early days of the company, and to enhance the quality of data and scalability of their data collection, processing, and reporting services.

The envisioned data platform had to be built as:

A highly scalable solution capable of supporting up to 1K parallel workflows for data collection, processing, and reporting;
An efficient analytics solution that enables data annotation teams and business units to generate data reports in real time;
A solution with advanced architecture, with the support of microservices and specific ML models as part of its data processing workflows.

Appen joined forces with Provectus to create their solution on AWS. The teams decided to use Provectus’ NextGen Data Platform as a base for the Appen solution, and further refine it through Appen-specific customizations.

Solution

Provectus began designing and constructing a data platform for Appen by holding a series of discovery workshops. These workshops were used to evaluate Appen’s current processes, infrastructure, and architecture, as well as to pinpoint business key performance indicators for the platform. During this stage, Provectus identified several architectural and infrastructural bottlenecks in Appen’s existing monolithic legacy system, which hindered the data annotation and analytics teams’ ability to process and analyse data on a larger scale, as well as to generate real-time reports. These limitations increased overhead and inefficiencies and raised the total cost of ownership (including Appen’s clients’ service costs), as well as slowed Appen’s data expansion in the AI market.

In response to this, Provectus recommended building a new data platform on AWS, which would incorporate a data lake and have functionality for near real-time analytics and report generation. The architectural diagram for this platform is provided below.

The engineering team at Provectus designed and built data pipelines on Apache Kafka, Amazon S3, and Amazon ECS in order to migrate Appen’s non-relational data streams and workloads. To create a cloud-native data lake for Appen, Provectus used AWS Glue, Amazon Athena, and Amazon S3, taking advantage of the flexibility, scalability, and efficiency of AWS services.

In order to meet Appen’s scalability requirements, their monolithic data workflows were split into specific microservices, powered by Kafka Streams. Apache Kafka was used as a central hub for communication between the various microservices and databases. The migration strategy also included using PostgreSQL’s Change Data Capture (CDC) mechanism as the initial data feed for consuming microservices.

Testing of the proposed solution showed linear scalability and significant performance improvements.

Outcome

Appen had the potential to expand its business, but its data services were powered by outdated technologies. As a result, the company was able to provide high-quality training data to its customers, but it did not have the necessary tools to increase productivity and expand its operations.

Provectus intervened to help Appen overcome these technological barriers, allowing them to reduce costs and increase their operational efficiency, which resulted in faster business growth. Drawing on our knowledge and experience in creating cutting-edge streaming data platforms, we created a new cloud-native data platform that included a data lake and advanced real-time analytics capabilities for Appen.

With the implementation of Provectus’ new data platform, Appen has seen significant improvements in their end-to-end data pipelines. Performance has improved five-fold, with the throughput of data annotation jobs and data rows increasing by twenty times. Generating a report on processed data has been reduced from nearly half an hour to less than a minute. This has been made possible due to the platform’s advanced architecture, which provides linear scalability and improved performance.

The platform allows Appen to process any type of data quickly and efficiently. It gives them the foundation they need to grow and continue to expand their AI-related data services.

#Case Study #Data #ML #Our Solutions

Source: provectus.com

How much did you like this article?

Array

773 Ratings

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Provectus

Provectus

Data Platform for Real-Time Analytics

Challenge

Solution

Outcome

How much did you like this article?

Find your next job at Provectus

Recruitment Team:

PR Team:

Provectus

Provectus

Data Platform for Real-Time Analytics

Challenge

Solution

Outcome

How much did you like this article?

You might like

IDP Platform for BI & Analytics

HIPAA-compliant Cloud Infrastructure

Automated Fraud Detection Using AI

Find your next job at Provectus

Recruitment Team:

PR Team: