Professional Experience
The following is a chronological overview of my work experience.
Data Scientist - H1 Insights - Oct 2023 to March 2024
Strategic Migration: Transitioning Reporting Pipelines from Databricks to AWS
Project Duration: Oct 2023 to March 2024
As the technical lead, orchestrated the migration and optimization of QA reporting pipelines from Databricks to AWS services. The initiative included a transition from Pandas to PySpark, with the aim of improving efficiency and scalability and achieving significant cost savings.
- Directed the project’s planning and execution, overseeing both the migration of the QA report pipelines and the change in technology. Coordinated with stakeholders to align the migration with organizational objectives. Used AWS EMR Studio and PySpark, and developed centralized Python and PySpark libraries to streamline processes.
- Increased data throughput by 50% and improved computational efficiency by 30%. Implemented advanced data validation protocols, reducing data anomalies and inconsistencies by 20%. Fostered collaboration between the data engineering and analytics teams, contributing to a 30% reduction in pipeline running costs and establishing new benchmarks in data management efficiency.
Advanced AI and LLM Model Development for QA Reporting with ChromaDB Integration
Project Duration: Oct 2023 to March 2024
Led the integration of AI and LLM technologies, incorporating ChromaDB, into QA reporting for medical publications. The project significantly improved report accuracy and processing efficiency.
- Directed the development of advanced document classification models, utilizing Python, the OpenAI API, and ChromaDB. Managed the transition to a scalable system for efficient data handling and querying, applying LLM techniques to streamline reporting processes.
- Achieved a 45% increase in report processing speed and a 30% improvement in keyword extraction accuracy, facilitated by the efficient use of ChromaDB. Reduced manual review time by 25%, setting new standards for internal reporting mechanisms and demonstrating strong proficiency in AI and predictive analytics.
Probabilistic Entity Deduplication in Large-Scale Organizational Database Using PySpark
Project Duration: Oct 2023 to March 2024
Directed a significant organizational database deduplication project, applying advanced probabilistic models and entity matching techniques. Managed over 1 million records, substantially enhancing data integrity and operational efficiency.
- Led the development of a probabilistic deduplication system in PySpark, based on the Fellegi-Sunter model with parameters estimated by the Expectation-Maximization (EM) algorithm. Optimized the PySpark scripts for efficient, large-scale data processing.
- Achieved a 40% reduction in duplication errors, a 30% increase in operational efficiency, and a 50% reduction in manual data review efforts. Established a new benchmark in large-scale data deduplication, enhancing the organization’s data-driven decision-making capabilities.
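For illustration, the Fellegi-Sunter approach with EM-estimated parameters can be sketched in plain Python as follows. This is a minimal, simplified sketch and not the project's PySpark code: the function names, the binary agree/disagree comparison vectors, the starting values, and the toy data are all assumptions made for the example.

```python
import math

EPS = 1e-6  # keeps probabilities away from 0 and 1 so ratios and logs stay finite


def _clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def estimate_m_u(vectors, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate Fellegi-Sunter parameters with EM.

    vectors: list of tuples of 0/1 flags, one flag per compared field
             (1 = the two records agree on that field).
    Returns (p, m, u): the estimated match proportion, and per-field
    agreement probabilities among matches (m) and non-matches (u).
    """
    n, k = len(vectors), len(vectors[0])
    m, u = [m_init] * k, [u_init] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match,
        # assuming fields are conditionally independent given match status.
        post = []
        for gamma in vectors:
            like_m, like_u = p, 1.0 - p
            for j, agree in enumerate(gamma):
                like_m *= m[j] if agree else 1.0 - m[j]
                like_u *= u[j] if agree else 1.0 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate p, m and u from the posterior weights.
        total_m = max(sum(post), EPS)
        total_u = max(n - sum(post), EPS)
        p = _clamp(total_m / n)
        for j in range(k):
            agree_m = sum(g * v[j] for g, v in zip(post, vectors))
            agree_u = sum((1.0 - g) * v[j] for g, v in zip(post, vectors))
            m[j] = _clamp(agree_m / total_m)
            u[j] = _clamp(agree_u / total_u)
    return p, m, u


def match_weight(gamma, m, u):
    """Sum of log2 likelihood ratios; higher means more likely a duplicate."""
    weight = 0.0
    for j, agree in enumerate(gamma):
        if agree:
            weight += math.log2(m[j] / u[j])
        else:
            weight += math.log2((1.0 - m[j]) / (1.0 - u[j]))
    return weight


# Toy comparison vectors over three hypothetical fields (name, address, phone).
pairs = (
    [(1, 1, 1)] * 30 + [(1, 1, 0)] * 5 + [(1, 0, 1)] * 5 + [(0, 1, 1)] * 5
    + [(0, 0, 0)] * 300 + [(1, 0, 0)] * 30 + [(0, 1, 0)] * 30 + [(0, 0, 1)] * 30
)
p, m, u = estimate_m_u(pairs)
print("estimated match proportion:", round(p, 3))
print("weight if all fields agree:", round(match_weight((1, 1, 1), m, u), 2))
print("weight if no fields agree:", round(match_weight((0, 0, 0), m, u), 2))
```

In a pipeline of this kind, pairs whose weight exceeds an upper threshold would typically be merged automatically, pairs below a lower threshold kept separate, and the band in between sent for manual review.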