Job Description

Key Responsibilities:

  • Data Pipeline Development: Build and optimize ETL/ELT workflows using Airflow, Prefect, or similar orchestration tools (a minimal orchestration sketch follows this list).
  • Data Integration & Transformation: Ingest data from multiple sources (transactional, behavioural, streaming) and prepare feature-ready datasets.
  • Application Integration: Connect pipelines to Salesforce, Gainsight, and other enterprise applications for real-time data sync.
  • BI Semantic Modelling: Design and maintain semantic layers for BI tools (Power BI, Looker, Tableau) to enable self-service analytics.
  • Version Control & Experiment Tracking: Implement DVC for dataset versioning and integrate MLflow for experiment reproducibility.
  • Distributed Data Processing: Use Spark, Ray, or Databricks for large-scale data transformations and feature engineering.
  • Cloud Infrastructure: Deploy and manage pipelines on Azure Data Factory, Azure Synapse, GCP Dataflow, and BigQuery; leverage cloud storage (ADLS, GCS) and compute clusters.
  • MLOps Enablement: Collaborate on CI/CD workflows for ML models, feature store integration, and monitoring pipelines.
  • Data Quality & Governance: Implement validation, lineage tracking, and compliance checks for secure and reliable data.
  • Performance Optimization: Profile and tune pipelines for cost efficiency and low latency.
  • Collaboration: Partner with Data Scientists to ensure timely delivery of clean, well-structured data for ML/DL models.
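
For illustration, the sketch below shows the kind of orchestrated ETL/ELT workflow described in the Data Pipeline Development responsibility, written with Airflow's TaskFlow API. The DAG name, schedule, and task logic are hypothetical placeholders, not part of the role's actual stack.

```python
"""Minimal Airflow DAG sketch -- illustrative only.

The pipeline name, schedule, and transform logic are hypothetical.
"""
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_elt_pipeline():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull rows from a transactional or behavioural source.
        return [{"account_id": 1, "mrr": 120.0}]

    @task
    def transform(rows: list[dict]) -> list[dict]:
        # Placeholder: derive a feature-ready column.
        return [{**r, "mrr_band": "low" if r["mrr"] < 500 else "high"} for r in rows]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: write to the warehouse (e.g. Synapse or BigQuery).
        print(f"loaded {len(rows)} rows")

    load(transform(extract()))


example_elt_pipeline()
```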

Required Skills (Technical):

  • Python (pandas, PySpark), SQL; familiarity with Scala is a plus.
  • Orchestration: Airflow, Prefect, DBT.
  • Distributed frameworks: Spark, Ray, Databricks.
  • Cloud platforms: Azure (Data Factory, Synapse, ADLS) and GCP (Dataflow, BigQuery, GCS).
  • BI Semantic Modelling: Power BI, Looker, Tableau.
  • Application Integration: Salesforce, Gainsight, REST APIs.
  • Containerization: Docker; basic Kubernetes.
  • Versioning: Git, DVC; experiment tracking: MLflow (a minimal tracking sketch follows this list).
  • Streaming: Kafka or Pub/Sub for real-time ingestion.
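
As a pointer to the experiment-tracking requirement above, the sketch below logs pipeline parameters and metrics with MLflow so a run can be reproduced later. The experiment name, parameter names, and metric values are hypothetical examples, not a prescribed setup.

```python
"""Minimal MLflow tracking sketch -- illustrative only.

Experiment name, parameters, and metric values are hypothetical.
"""
import mlflow

mlflow.set_experiment("churn-feature-pipeline")

with mlflow.start_run(run_name="baseline"):
    # Record the inputs that define this pipeline run.
    mlflow.log_param("source_table", "events_daily")
    mlflow.log_param("feature_window_days", 30)

    # Record data-quality and throughput metrics for the run.
    mlflow.log_metric("rows_processed", 125_000)
    mlflow.log_metric("null_rate", 0.002)
```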