Onfido
Senior Data Engineer
September 2020 – August 2021 • London, UK
Building scalable and reliable data pipelines, automating data extraction processes, and enabling machine learning operations (MLOps).
Overview
At Onfido, I was responsible for building efficient and scalable data pipelines, optimizing data ingestion and transformation processes using AWS services, and enabling MLOps to enhance machine learning model delivery. Onfido is a leading identity verification platform that helps businesses verify their customers' identities using AI and machine learning.
Key Accomplishments
Role Highlights
- Data Pipelines: Built efficient and reliable data pipelines, automating the extraction, transformation, and loading (ETL) processes using AWS services such as S3, Glue, Redshift, and Athena.
- Real-Time Data Processing: Led the migration of the Extraction Layer from batch processing to near real-time processing using AWS DMS change data capture (CDC) and AWS Glue, with infrastructure provisioned through Terraform, cutting overall data delivery time from 1 hour to 5 minutes.
- Machine Learning Operations: Collaborated closely with the Machine Learning team to streamline MLOps processes, improving the deployment of machine learning models.
- Developer Support: Built data pipeline services for the wider development organization, providing a unified analytics platform, and conducted onboarding sessions for new team members through pair programming and presentations.
Technical Implementation
Data Pipeline Architecture
- AWS S3 for data storage
- AWS Glue for ETL processing
- Amazon Redshift for data warehousing
- Amazon Athena for ad-hoc queries
Real-Time Processing
- AWS DMS for change data capture
- Terraform for infrastructure
- Near real-time data flows
- Automated monitoring systems
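To illustrate the change data capture idea behind the list above, here is a minimal sketch of how ordered change records can be replayed onto a keyed snapshot. The `I`/`U`/`D` operation codes follow the convention AWS DMS uses for CDC output, but the record layout, the `id` key, the sample rows, and the `apply_cdc` helper are all assumptions made for illustration and do not describe the actual Onfido implementation.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def apply_cdc(
    snapshot: Dict[object, Row],
    changes: Iterable[Tuple[str, Row]],
    key: str = "id",  # hypothetical primary-key column
) -> Dict[object, Row]:
    """Replay ordered (operation, row) change records onto a keyed snapshot.

    Returns a new dict; the input snapshot is left untouched.
    """
    result = dict(snapshot)
    for op, row in changes:
        pk = row[key]
        if op in ("I", "U"):   # insert or update: the latest row wins
            result[pk] = row
        elif op == "D":        # delete: drop the row if it is present
            result.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


# Illustrative data only.
snapshot = {1: {"id": 1, "status": "pending"}}
changes = [
    ("I", {"id": 2, "status": "pending"}),
    ("U", {"id": 1, "status": "clear"}),
    ("D", {"id": 2, "status": "pending"}),
]
current = apply_cdc(snapshot, changes)
```

Because only the changed rows are shipped and replayed, the target can be refreshed every few minutes instead of waiting for a full batch extract.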
MLOps Integration
- Model deployment pipelines
- Data quality validation
- Performance monitoring
- Automated retraining workflows
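As an example of what a data quality validation step in front of model training might look like, the sketch below rejects a batch when a required column has too many missing values. The `validate_batch` helper, the column names, and the 5% threshold are hypothetical choices for illustration, not details of the workflows described above.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[object]]


def validate_batch(
    rows: Sequence[Row],
    required: Sequence[str],
    max_null_rate: float = 0.05,  # hypothetical tolerance
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    problems = []
    for column in required:
        # A missing key and an explicit None both count as null.
        nulls = sum(1 for row in rows if row.get(column) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(
                f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
            )
    return problems


# Illustrative data only.
batch = [
    {"id": 1, "document_type": "passport"},
    {"id": 2, "document_type": None},
    {"id": 3, "document_type": "driving_licence"},
    {"id": 4, "document_type": "passport"},
]
problems = validate_batch(batch, required=["id", "document_type"])
```

A pipeline could stop before training or deployment whenever `problems` is non-empty, so that models only see input that meets the agreed thresholds.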
Business Impact
Enhanced Data Reliability
Improved data reliability and consistency through automation of ETL processes, reducing manual errors and ensuring data quality.
Real-Time Decision Making
Enabled near real-time data insights, with data delivered in about 5 minutes instead of 1 hour, allowing the business to respond quickly to changing conditions.
ML Model Performance
Improved machine learning model performance by optimizing data pipelines to deliver high-quality input data consistently.
Challenges & Solutions
Real-Time Data Processing Migration
Challenge: Migrating from batch to near real-time data processing without compromising data accuracy.
Solution: Implemented AWS DMS CDC with careful testing and gradual rollout, ensuring data integrity throughout the migration process.
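One way to check data integrity during a gradual rollout is to compare the old and new targets with a row count plus an order-independent checksum. The sketch below shows that idea under stated assumptions: the `table_fingerprint` helper and the sample rows are hypothetical, and this is not necessarily the testing approach used in the migration itself.

```python
import hashlib
from typing import Dict, Iterable, Tuple


def table_fingerprint(rows: Iterable[Dict[str, object]]) -> Tuple[int, int]:
    """Return (row_count, checksum), where the checksum ignores row order.

    Each row is hashed in a canonical column order and the hashes are XOR-ed.
    XOR alone would let a duplicated pair of rows cancel out, which is why
    the row count is returned alongside it.
    """
    count = 0
    checksum = 0
    for row in rows:
        canonical = repr(sorted(row.items())).encode("utf-8")
        digest = hashlib.sha256(canonical).digest()
        checksum ^= int.from_bytes(digest[:8], "big")
        count += 1
    return count, checksum


# Illustrative data only: the same rows in a different order should match.
batch_output = [{"id": 1, "status": "clear"}, {"id": 2, "status": "pending"}]
cdc_output = [{"id": 2, "status": "pending"}, {"id": 1, "status": "clear"}]
drifted_output = [{"id": 1, "status": "clear"}, {"id": 2, "status": "clear"}]

matches = table_fingerprint(batch_output) == table_fingerprint(cdc_output)
drifted = table_fingerprint(batch_output) != table_fingerprint(drifted_output)
```

Running the batch and CDC paths side by side and comparing fingerprints per table gives a cheap signal before traffic is moved over fully.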
MLOps Integration
Challenge: Balancing rapid machine learning model deployment with scalable and efficient data pipeline requirements.
Solution: Developed integrated MLOps workflows that automated model deployment while maintaining data pipeline performance and reliability.
Team Development & Knowledge Sharing
I ran onboarding sessions for new team members through presentations and pair programming. This helped accelerate team productivity and maintain high engineering standards across the data engineering organization.