Machine Learning Engineer – Data Pipelines

Cantina Labs · Singapore

Job description

About the role

Cantina Labs is building a social AI platform that brings characters to life through real‑time multimodal models. We are expanding our Singapore team and need a Machine Learning Engineer to design, build, and scale the data pipelines that feed our large‑scale video and image datasets into model training.

Key responsibilities

Design and scale distributed data pipelines for preprocessing, dataset generation, and repeated dataset refreshes.
Own workflow orchestration, job scheduling, monitoring, and failure recovery for large‑scale data processing jobs.
Implement and maintain containerized pipeline infrastructure using Kubernetes or equivalent orchestration systems.
Optimize cloud‑based data storage and movement across AWS, GCS, or Azure for cost, throughput, and operational efficiency.
Define and implement best practices for dataset storage layout, versioning, caching, retention, and access patterns.
Design curation pipelines that select, filter, and retain video and image content for model training, including image‑text pair datasets.
Build and improve VLM‑based captioning and metadata generation workflows at scale.
Develop quality and aesthetic scoring models, CLIP‑based semantic filtering, and other signal‑extraction approaches for data selection.
Build tooling for deduplication workflows over large video corpora.
Analyze dataset composition, identify quality issues, and iterate on curation logic to improve training outcomes.

Required profile

Strong hands‑on experience building or scaling large‑scale data systems and pipelines for machine learning, including dataset curation, filtering, and quality improvement.
Experience with distributed data processing frameworks such as PySpark or Ray, and orchestration tools such as Airflow or equivalent.
Familiarity with containerization and container orchestration, including Docker and Kubernetes.
Experience working with cloud‑based data storage and compute (AWS, GCS, and/or Azure), understanding trade‑offs around cost, throughput, and storage layout.
Experience with VLM‑based captioning pipelines or similar multimodal processing workflows.

Required skills

PySpark
Ray
Airflow
Docker
Kubernetes
AWS
Google Cloud Storage (GCS)
Azure
Vision‑Language Model (VLM) captioning pipelines
CLIP‑based semantic filtering

Frequently asked questions

The recruiter has not published the salary. You can apply and negotiate directly with Cantina Labs.

Posted 15 hours ago

Expires in 1 month
