Job Description
Key Responsibilities:
- Data Pipeline Development: Build and optimize ETL/ELT workflows using Airflow, Prefect, or similar orchestration tools (a minimal illustrative DAG sketch follows this list).
- Data Integration & Transformation: Ingest data from multiple sources (transactional, behavioural, streaming) and prepare feature-ready datasets.
- Application Integration: Connect pipelines to Salesforce, Gainsight, and other enterprise applications for real-time data sync.
- BI Semantic Modelling: Design and maintain semantic layers for BI tools (Power BI, Looker, Tableau) to enable self-service analytics.
- Version Control & Experiment Tracking: Implement DVC for dataset versioning and integrate MLflow for experiment reproducibility.
- Distributed Data Processing: Use Spark, Ray, or Databricks for large-scale data transformations and feature engineering.
- Cloud Infrastructure: Deploy and manage pipelines on Azure Data Factory, Azure Synapse, GCP Dataflow, and BigQuery; leverage cloud storage (ADLS, GCS) and managed compute clusters.
- MLOps Enablement: Collaborate on CI/CD workflows for ML models, feature store integration, and monitoring pipelines.
- Data Quality & Governance: Implement validation, lineage tracking, and compliance checks for secure and reliable data.
- Performance Optimization: Profile and tune pipelines for cost efficiency and low latency.
- Collaboration: Partner with Data Scientists to ensure timely delivery of clean, well-structured data for ML/DL models.
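To make the pipeline-development responsibility above concrete, here is a minimal Airflow DAG sketch (assuming Airflow 2.4+); the task logic, schedule, and data are placeholders and not part of the actual role or stack.

```python
# Minimal illustrative ETL DAG (assumes Airflow 2.4+); all task logic and
# names below are placeholders, not part of the actual role or stack.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull raw records from a source system (placeholder data).
    return [{"id": 1, "amount": 42.0}]


def transform(ti, **context):
    # Reshape the extracted records into a feature-ready form.
    rows = ti.xcom_pull(task_ids="extract")
    return [{**row, "amount_usd": row["amount"]} for row in rows]


def load(ti, **context):
    # Write the transformed records to the warehouse (stubbed out here).
    rows = ti.xcom_pull(task_ids="transform")
    print(f"Loading {len(rows)} rows")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```

In practice, the same extract-transform-load pattern maps onto the Azure/GCP services and Spark jobs listed in the skills below.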
Required Skills:
Technical:
- Python (pandas, PySpark), SQL; familiarity with Scala is a plus.
- Orchestration: Airflow, Prefect, dbt.
- Distributed frameworks: Spark, Ray, Databricks (see the PySpark sketch after this list).
- Cloud platforms: Azure (Data Factory, Synapse, ADLS) and GCP (Dataflow, BigQuery, GCS).
- BI Semantic Modelling: Power BI, Looker, Tableau.
- Application Integration: Salesforce, Gainsight, REST APIs.
- Containerization: Docker; basic Kubernetes.
- Versioning: Git, DVC; experiment tracking: MLflow (see the MLflow sketch after this list).
- Streaming: Kafka or Pub/Sub for real-time ingestion.
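As a rough illustration of the distributed-processing skill set, here is a small PySpark feature-engineering sketch; the bucket paths and column names are hypothetical.

```python
# Illustrative PySpark feature-engineering job; paths and column names are
# hypothetical stand-ins for the role's actual behavioural event data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature_prep").getOrCreate()

# Read raw behavioural events (placeholder path).
events = spark.read.parquet("gs://example-bucket/events/")

# Aggregate events into per-customer features for downstream ML models.
features = (
    events.groupBy("customer_id")
    .agg(
        F.count("*").alias("event_count"),
        F.avg("session_seconds").alias("avg_session_seconds"),
        F.max("event_ts").alias("last_seen_ts"),
    )
)

# Persist the feature table (placeholder path).
features.write.mode("overwrite").parquet("gs://example-bucket/features/")
```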
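Similarly, a minimal sketch of the experiment-tracking requirement, assuming an MLflow tracking server is already configured; the experiment name, parameters, metric value, and tag are purely illustrative.

```python
# Minimal MLflow tracking sketch; the experiment name, parameters, metric
# value, and DVC tag below are illustrative placeholders.
import mlflow

mlflow.set_experiment("example-feature-experiments")

with mlflow.start_run(run_name="baseline"):
    # Record what produced this run so results are reproducible.
    mlflow.log_param("feature_set", "v1")
    mlflow.log_param("training_rows", 120_000)

    # Record evaluation results for comparison across runs.
    mlflow.log_metric("auc", 0.87)

    # Tag the run with the DVC-tracked data version so the experiment can be
    # tied back to the exact dataset snapshot (tag value is a placeholder).
    mlflow.set_tag("dvc_data_version", "v1.2.0")
```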