Moustafa Mahmoud

Field CTO

Onfido

Senior Data Engineer

September 2020 – August 2021 • London, UK

Building scalable and reliable data pipelines, automating data extraction processes, and enabling machine learning operations (MLOps).

Overview

At Onfido, I was responsible for building efficient and scalable data pipelines, optimizing data ingestion and transformation processes using AWS services, and enabling MLOps to enhance machine learning model delivery. Onfido is a leading identity verification platform that helps businesses verify their customers' identities using AI and machine learning.

Key Accomplishments

92%
Reduction in data delivery time (from 1 hour to 5 minutes)
100%
Migration success from batch to near real-time processing
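
For reference, the 92% figure follows from the two durations quoted above. The short check below is only an arithmetic illustration; the function name `percent_reduction` is a hypothetical helper, not part of any pipeline code.

```python
def percent_reduction(before_minutes: float, after_minutes: float) -> float:
    """Return the relative reduction as a percentage of the original duration."""
    if before_minutes <= 0:
        raise ValueError("before_minutes must be positive")
    return (before_minutes - after_minutes) / before_minutes * 100


# 1 hour (60 minutes) down to 5 minutes
reduction = percent_reduction(60, 5)
print(f"{reduction:.1f}%")  # prints 91.7%, which rounds to 92%
```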

Role Highlights

  • Data Pipelines: Built efficient and reliable data pipelines, automating the extraction, transformation, and loading (ETL) processes using AWS services such as S3, Glue, Redshift, and Athena.
  • Real-Time Data Processing: Led the migration of the Extraction Layer from batch processing to near real-time data processing using Terraform, AWS DMS CDC, and AWS Glue, reducing the overall data delivery time from 1 hour to 5 minutes.
  • Machine Learning Operations: Collaborated closely with the Machine Learning team to streamline MLOps processes, improving the deployment of machine learning models.
  • Developer Support: Built data pipeline services for the wider development organization, providing a unified analytics platform, and conducted onboarding sessions for new team members through pair programming and presentations.

Technical Implementation

Data Pipeline Architecture

  • AWS S3 for data storage
  • AWS Glue for ETL processing
  • Amazon Redshift for data warehousing
  • Amazon Athena for ad-hoc queries

Real-Time Processing

  • AWS DMS for change data capture (CDC)
  • Terraform for infrastructure provisioning
  • Near real-time data flows
  • Automated monitoring systems
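
To illustrate what change data capture means downstream, here is a minimal, self-contained sketch of applying an ordered stream of change records to a keyed table. It is not the production code: the `Op` field with `I`/`U`/`D` values is modeled loosely on the operation markers DMS can emit, and the function name, record layout, and sample rows are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def apply_cdc_changes(table: dict[int, Row], changes: list[Row], key: str = "id") -> dict[int, Row]:
    """Apply ordered change records to a copy of a table keyed by primary key."""
    result = dict(table)  # shallow copy so the input snapshot is left untouched
    for change in changes:
        op = change["Op"]
        # Everything except the operation marker is treated as row data.
        row = {k: v for k, v in change.items() if k != "Op"}
        pk = row[key]
        if op in ("I", "U"):
            # Inserts and updates both become an upsert of the latest row image.
            result[pk] = row
        elif op == "D":
            result.pop(pk, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return result


# Hypothetical sample data
snapshot = {1: {"id": 1, "status": "pending"}, 2: {"id": 2, "status": "clear"}}
changes = [
    {"Op": "U", "id": 1, "status": "clear"},
    {"Op": "D", "id": 2},
    {"Op": "I", "id": 3, "status": "pending"},
]
current = apply_cdc_changes(snapshot, changes)
print(current)
```

Treating inserts and updates as the same upsert keeps the logic tolerant of replayed records, which matters when a near real-time job restarts partway through a batch of changes.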

MLOps Integration

  • Model deployment pipelines
  • Data quality validation
  • Performance monitoring
  • Automated retraining workflows
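
As a rough sketch of what a data quality validation step can look like before data reaches a training or retraining job, the example below checks the share of missing values per required column. The names `ValidationReport` and `validate_rows`, the 5% threshold, and the sample rows are assumptions for illustration, not the checks actually used.

```python
from dataclasses import dataclass, field


@dataclass
class ValidationReport:
    total: int = 0
    errors: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.errors


def validate_rows(rows, required, max_null_ratio=0.05):
    """Flag required columns whose share of missing values exceeds the threshold."""
    report = ValidationReport(total=len(rows))
    if not rows:
        report.errors.append("dataset is empty")
        return report
    for column in required:
        nulls = sum(1 for row in rows if row.get(column) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            report.errors.append(
                f"{column}: {ratio:.0%} null exceeds {max_null_ratio:.0%}"
            )
    return report


# Hypothetical sample data
rows = [{"id": 1, "label": "genuine"}, {"id": 2, "label": None}]
report = validate_rows(rows, required=["id", "label"])
print(report.passed, report.errors)
```

A deployment or retraining workflow could then refuse to proceed when `report.passed` is false.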

Business Impact

Enhanced Data Reliability

Improved data reliability and consistency through automation of ETL processes, reducing manual errors and ensuring data quality.

Real-Time Decision Making

Enabled near real-time data insights and decision-making, allowing the business to respond quickly to changing conditions.

ML Model Performance

Improved machine learning model performance by optimizing data pipelines to deliver high-quality input data consistently.

Challenges & Solutions

Real-Time Data Processing Migration

Challenge: Migrating from batch to near real-time data processing while maintaining data accuracy and speed.

Solution: Implemented AWS DMS CDC with careful testing and gradual rollout, ensuring data integrity throughout the migration process.
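
One common way to check data integrity during a gradual rollout like this is to run the old and new paths side by side and compare their outputs. The sketch below shows that idea with an order-independent fingerprint (row count plus a SHA-256 digest); it is an assumed approach for illustration, and `table_fingerprint`, `outputs_match`, and the sample rows are hypothetical.

```python
import hashlib
import json


def table_fingerprint(rows, key="id"):
    """Return (row_count, digest) for rows, independent of row order."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=lambda r: r[key]):
        # sort_keys makes the serialization independent of column order.
        line = json.dumps(row, sort_keys=True, default=str)
        digest.update(line.encode("utf-8") + b"\n")
    return len(rows), digest.hexdigest()


def outputs_match(batch_rows, cdc_rows, key="id"):
    """True when both outputs contain the same rows with the same values."""
    return table_fingerprint(batch_rows, key) == table_fingerprint(cdc_rows, key)


# Hypothetical sample data: same rows, delivered in a different order
batch_rows = [{"id": 1, "status": "clear"}, {"id": 2, "status": "pending"}]
cdc_rows = [{"status": "pending", "id": 2}, {"id": 1, "status": "clear"}]
print(outputs_match(batch_rows, cdc_rows))
```

Comparing fingerprints rather than full tables keeps the check cheap enough to run on every rollout stage before more traffic is moved to the new path.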

MLOps Integration

Challenge: Balancing rapid machine learning model deployment with scalable and efficient data pipeline requirements.

Solution: Developed integrated MLOps workflows that automated model deployment while maintaining data pipeline performance and reliability.

Team Development & Knowledge Sharing

I took an active part in team development and knowledge sharing, running onboarding sessions for new team members through presentations and pair programming. This helped accelerate team productivity and maintain high engineering standards across the data engineering organization.