Increasing automation and scalability of operational data flows for DataCamp
Both phases resulted in more data-intensive analytical and operational data flows, from which business insights can be gained more easily:
Data scientists have easy access to data from various sources and can perform data exploration and preparation following an automated and scalable approach
Data engineers have a solid foundation to build and deploy operational flows at any scale
As an organization, DataCamp can move faster and make better-informed decisions because its data backbone has been adapted to support its business growth.
Why Data Minded?
Both DataCamp and Data Minded are located in the city of Leuven. Both organizations have a strong startup culture and are active in the data industry. This created a natural foundation for collaboration.
DataCamp, a leader in online learning in the data science space, reached out to Data Minded for guidance on the design and implementation of their analytical and operational data flows.
In the past, DataCamp drove these flows from its internal databases. DataCamp's growth, and the resulting accumulation of data, put more and more strain on this solution. An early initiative revealed that external expertise was needed to support the company's steep business growth.
In a first phase, we implemented a single high-value operational data flow that works end to end: ingesting data from various sources into a data lake, combining and transforming these data, and making the results accessible to internal business actors. The initial solution consisted of the following components:
AWS as cloud provider
Terraform for infrastructure as code
Apache Airflow for orchestration
Apache Spark on EMR for processing
AWS Athena as a serving layer
CircleCI for CI/CD.
This focused approach allowed Data Minded to better understand the DataCamp ecosystem, identify future challenges, and put some core components in place early on.
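To make the shape of such an end-to-end flow concrete, here is a minimal Python sketch of the three stages described above (ingest, combine and transform, serve) run in dependency order. It is an illustration only, not the DataCamp implementation: plain dictionaries stand in for the data lake and the serving layer, the standard-library `graphlib` module stands in for an orchestrator such as Airflow, and all source names, records, and function names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory stand-ins for the sources, the data lake,
# and the serving layer (none of these come from the case study).
SOURCES = {
    "users": [{"user_id": 1, "country": "BE"}, {"user_id": 2, "country": "US"}],
    "course_events": [
        {"user_id": 1, "course": "intro-to-python"},
        {"user_id": 1, "course": "intro-to-sql"},
        {"user_id": 2, "course": "intro-to-python"},
    ],
}
data_lake = {}
serving_layer = {}


def ingest():
    """Copy raw records from every source into the lake, unchanged."""
    for name, rows in SOURCES.items():
        data_lake[f"raw/{name}"] = list(rows)


def transform():
    """Join events to users and count course starts per country."""
    country_by_user = {u["user_id"]: u["country"] for u in data_lake["raw/users"]}
    counts = {}
    for event in data_lake["raw/course_events"]:
        country = country_by_user[event["user_id"]]
        counts[country] = counts.get(country, 0) + 1
    data_lake["clean/starts_per_country"] = counts


def serve():
    """Publish the transformed result where business users can read it."""
    serving_layer["starts_per_country"] = dict(data_lake["clean/starts_per_country"])


TASKS = {"ingest": ingest, "transform": transform, "serve": serve}
# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {"ingest": set(), "transform": {"ingest"}, "serve": {"transform"}}


def run_flow():
    """Run every task once, respecting the declared dependencies."""
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for task_name in order:
        TASKS[task_name]()
    return order
```

Calling `run_flow()` executes the tasks in the order ingest, transform, serve. Declaring dependencies separately from the task bodies is what lets an orchestrator schedule, retry, and monitor each step on its own.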
In a second iteration, based on the learnings and feedback from DataCamp, Data Minded automated large parts of the ingestion into the data lake. This enabled DataCamp to make more data available for internal use. In addition, our specialists provided a SQL interface for data scientists to write and run their own ETL flows. They also changed the serving layer from Athena to Redshift and Redshift Spectrum to respond flexibly and efficiently to various needs within the organization.
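The case study does not describe how the SQL interface works internally. As a rough illustration of the idea of an ETL step defined purely in SQL, the Python sketch below runs a single SQL statement against an in-memory SQLite database. SQLite is only a stand-in for the actual engine, and the table names, columns, and the `run_sql_etl` helper are hypothetical.

```python
import sqlite3

# Hypothetical ETL step: the data scientist supplies only this SQL.
ETL_SQL = """
CREATE TABLE completions_per_course AS
SELECT course, COUNT(*) AS completions
FROM raw_completions
GROUP BY course;
"""


def run_sql_etl(conn, sql):
    """Execute a SQL-defined ETL step and persist its result table."""
    conn.executescript(sql)
    conn.commit()


# Assumed sample data standing in for an ingested raw table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_completions (user_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO raw_completions VALUES (?, ?)",
    [(1, "intro-to-python"), (2, "intro-to-python"), (1, "intro-to-sql")],
)

run_sql_etl(conn, ETL_SQL)
result = conn.execute(
    "SELECT course, completions FROM completions_per_course ORDER BY course"
).fetchall()
```

The point of this design is that the author of the flow writes only the SQL, while the surrounding platform takes care of connections, scheduling, and storage of the result.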