Machine Learning Engineer – Data Pipelines
Cantina Labs · Singapore
Job description
About the role
Cantina Labs is building a social AI platform that brings characters to life through real‑time multimodal models. We are expanding our Singapore team and need a Machine Learning Engineer to design, build, and scale the data pipelines that feed our large‑scale video and image datasets into model training.
Key responsibilities
- Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes.
- Own workflow orchestration, job scheduling, monitoring, and failure recovery for large‑scale data processing jobs.
- Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems.
- Optimize cloud‑based data storage and movement across AWS, GCS, or Azure for cost, throughput, and operational efficiency.
- Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns.
- Design curation pipelines that select, filter, and retain video and image content for model training, including image‑text pair datasets.
- Build and improve VLM‑based captioning and metadata generation workflows at scale.
- Develop quality and aesthetic scoring models, CLIP‑based semantic filtering, and other signal‑extraction approaches for data selection.
- Build tooling for deduplication workflows over large video corpora.
- Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes.
Required profile
- Strong hands‑on experience building or scaling large‑scale data systems and pipelines for machine learning, including dataset curation, filtering, and quality improvement.
- Experience with distributed data processing frameworks such as PySpark or Ray, and orchestration tools such as Airflow or equivalent.
- Familiarity with containerization and container orchestration, including Docker and Kubernetes.
- Experience working with cloud‑based data storage and compute (AWS, GCS, and/or Azure), including an understanding of the trade‑offs around cost, throughput, and storage layout.
- Experience with VLM‑based captioning pipelines or similar multimodal processing workflows.
Required skills
- PySpark
- Ray
- Airflow
- Docker
- Kubernetes
- AWS
- Google Cloud Storage (GCS)
- Azure
- Vision‑Language Model (VLM) captioning pipelines
- CLIP‑based semantic filtering
Posted 15 hours ago
Expires in 1 month