Senior Data Engineer

Building scalable and reliable data pipelines, automating data extraction processes, and enabling machine learning operations (MLOps).

Onfido

Overview

At Onfido, I was responsible for building efficient and scalable data pipelines, optimizing data ingestion and transformation processes using AWS services, and enabling MLOps to enhance machine learning model delivery.

Role Highlights

  • Data Pipelines: Built efficient and reliable data pipelines, automating the extraction, transformation, and loading (ETL) processes using AWS services such as S3, Glue, Redshift, and Athena (see the first sketch after this list).
  • Real-Time Data Processing: Led the migration of the Extraction Layer from batch processing to near real-time data processing using Terraform, AWS DMS CDC, and AWS Glue, reducing the overall data delivery time from 1 hour to 5 minutes (see the second sketch after this list).
  • Machine Learning Operations: Collaborated closely with the Machine Learning team to streamline MLOps processes, improving the deployment of machine learning models.
  • Developer Support: Built data pipeline services that gave the wider development organization a unified analytics platform, and ran onboarding sessions, pair programming, and presentations to help new team members become productive quickly.
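
To illustrate the ETL work in the first bullet, here is a minimal sketch of an AWS Glue PySpark job: it reads raw data registered in the Glue Data Catalog, trims and renames the columns the analytics layer needs, and writes partitioned Parquet back to S3 for Athena to query. The database, table, column, and bucket names are hypothetical examples, not the actual Onfido setup.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw events registered in the Glue Data Catalog (hypothetical database/table names)
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_events",
    table_name="document_checks",
)

# Keep and rename only the columns the analytics layer needs
mapped = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("id", "string", "check_id", "string"),
        ("created_at", "string", "created_at", "timestamp"),
        ("status", "string", "status", "string"),
    ],
)

# Write partitioned Parquet to S3 so Athena (or Redshift Spectrum) can query it
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={
        "path": "s3://analytics-bucket/curated/document_checks/",
        "partitionKeys": ["status"],
    },
    format="parquet",
)

job.commit()
```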

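The batch-to-CDC migration in the second bullet relied on AWS DMS change data capture feeding AWS Glue. The PySpark sketch below shows one common way to apply DMS change files landed in S3: keep the latest change per primary key, drop deletes, and fold the rest into the curated snapshot. The column names (Op, commit_timestamp, id) and S3 paths are assumptions rather than the production schema; DMS adds a timestamp column only if the task is configured to do so.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dms-cdc-apply").getOrCreate()

# Change files written by DMS CDC to S3 (hypothetical path); "Op" marks I(nsert), U(pdate), D(elete)
changes = spark.read.parquet("s3://cdc-bucket/landing/applicants/")

# Keep only the most recent change per primary key, assuming a commit_timestamp column
window = Window.partitionBy("id").orderBy(F.col("commit_timestamp").desc())
latest = (
    changes
    .withColumn("rn", F.row_number().over(window))
    .filter(F.col("rn") == 1)
    .drop("rn")
)

# Inserts and updates to apply; deleted rows are simply not re-added below
upserts = latest.filter(F.col("Op") != "D").drop("Op", "commit_timestamp")

# Current curated snapshot (hypothetical path)
current = spark.read.parquet("s3://analytics-bucket/curated/applicants/")

# Remove every row whose key changed (updated or deleted), then add back inserts and updates
merged = (
    current.join(latest.select("id"), on="id", how="left_anti")
           .unionByName(upserts)
)

# Write to a staging prefix; overwriting the path being read would corrupt the source
merged.write.mode("overwrite").parquet("s3://analytics-bucket/curated/applicants_staging/")
```
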
Results

  • Reduced data processing times from 1 hour to 5 minutes, enabling near real-time insights and faster decision-making.
  • Enhanced data reliability and consistency by automating the ETL processes.
  • Improved machine learning model performance by optimizing the data pipeline to deliver high-quality input data.

Challenges

  • Real-Time Data Processing: Migrating the Extraction Layer from batch to near real-time processing without sacrificing data accuracy at the much lower latency.
  • MLOps Integration: Balancing the need for rapid machine learning model deployment with the requirements for scalable and efficient data pipelines (one possible approach is sketched below).
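
One way the second challenge could be addressed, letting the ML team pull training data from the same analytics platform instead of building a parallel pipeline, is sketched below using boto3 and Athena. The database, table, query, and output bucket are hypothetical examples, not the actual Onfido configuration.

```python
import time

import boto3

# Hypothetical names: database, table, and output location are illustrative only
ATHENA_DATABASE = "analytics"
OUTPUT_LOCATION = "s3://ml-training-bucket/extracts/document_checks/"

QUERY = """
SELECT check_id, status, created_at
FROM curated_document_checks
WHERE created_at >= date_add('day', -30, current_date)
"""

athena = boto3.client("athena")

# Start the query; results land in the output location for the ML pipeline to pick up
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": ATHENA_DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
query_id = execution["QueryExecutionId"]

# Poll until Athena reports a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print(f"Athena query {query_id} finished with state {state}")
```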

Conclusion

My work at Onfido enabled the company to transition from batch to near real-time data processing, optimizing its data workflows and improving machine learning model delivery.