About Me

I'm Aaryan Shah, a Data Scientist and Data Engineer based in Los Angeles, currently pursuing a Master's in Applied Data Science at USC (graduating May '25). I specialize in Data Mining, Analytics, Machine Learning, and NLP. My blend of academic knowledge and hands-on experience equips me to deliver data-driven insights and solutions for strategic decision-making.

In my work, I specialize in:

🧑🏻‍💻 Data Engineering: Developing robust data pipelines and integrating data from multiple sources to support seamless analytics.
🤖 Machine Learning & NLP: Building and optimizing machine learning models, including natural language processing (NLP) and large language models (LLMs), to drive insights and support data-informed decision-making.
📊 Data Analysis & Visualization: Translating raw data into meaningful patterns and visual insights using dashboarding tools like Power BI and Tableau, so that real-time, actionable insights are accessible across teams.
📈 Big Data & Cloud Technologies: Using big data tools like Apache Spark alongside AWS and Azure services to process and analyze large datasets efficiently, enabling comprehensive, data-driven strategies.

🌟 I believe in having fun, taking ownership of my work, and always trusting in my abilities. Known for my adaptability, precision, and strong work ethic, I bring a unique blend of creativity and analytical thinking to every project. Outside of work, I enjoy capturing the world through photography and travel, always seeking new perspectives.

Languages

My favorite languages for systems programming and software engineering.

Databases

My preferred databases for building scalable applications.

Libraries

My go-to libraries for machine learning and data analysis.

Dashboards

My preferred tools for creating interactive dashboards and data visualizations.

MLOps

My tools for deployment and infrastructure management.

Tools

My essential tools for development and collaboration.

Cloud

My cloud platforms for hosting and deployment.

Big Data

My preferred technologies for large-scale data processing and analytics.

Education

Aug 2023 - May 2025

Master of Science in Applied Data Science

University of Southern California

Los Angeles, CA

Relevant Courses:

Data Science at Scale • Machine Learning • Statistical Methods • Data Management • Natural Language Processing • Deep Learning • Data Visualization • Big Data Analytics

Aug 2019 - June 2023

Bachelor of Technology in Electronics and Telecommunication Engineering

University of Mumbai

Mumbai, India

Relevant Courses:

Data Structures & Algorithms • Database Management • Object Oriented Programming • Digital Signal Processing • Computer Networks • Software Engineering • Linear Algebra • Statistics

Work Experience

Dec 2023 - Present

NLP Researcher

USC Marshall School of Business

Currently leading a large-scale NLP project, processing over 250,000 data points to categorize strategic initiatives for major oil and gas companies, aimed at enhancing strategic decision-making. Working on boosting a BERT-based model’s accuracy by 20% through a custom neural network architecture that integrates domain-specific clustering with BERT embeddings.

ML • MLOps • BERT • NLP • LLM • AWS • LangChain

Data Engineer

USC Marshall School of Business

Built a scalable AWS and SQL data pipeline integrating 300+ sources, boosting analytics by 25% with real-time insights. Streamlined data processing for 200+ sources using Apache Spark, improving accuracy and efficiency by 25%. Leveraged serverless architecture for batch and event-driven workflows, handling a 50% data volume increase seamlessly. Optimized SQL queries, improving data retrieval speed by 35% for faster decision-making.

Data Pipelining • AWS • ETL • Analysis • Spark • SQL • Tableau

Feb 2022 - May 2022

Machine Learning Intern

Gustovalley Technovations

Led a team to build an Air Quality Prediction System using real-time API data, reducing insight delivery time by 30% through cloud-deployed ML models. Improved model accuracy by 25% with hyperparameter tuning and created 25+ Power BI dashboards for actionable insights. Enhanced air quality forecasts by 20% using advanced Markov modeling techniques.

Data Analysis • Tuning • ML • Regression • Real-time data

Jun 2021 - Aug 2021

Software Engineer Intern

Technocolabs Softwares

Reduced project costs by 15% through a PostgreSQL-based cost tracking system, enhancing financial analysis and uncovering key savings areas. Improved application performance by boosting page load speed by 40% through optimized MongoDB queries.

PostgreSQL • MongoDB • Financial Analysis • Python • Optimization

Featured Projects

Marketmind: Serverless Data Ingestion Pipeline for Real-Time Analytics

AWS • Python • ETL • Serverless • Analytics

Developed a serverless data ingestion pipeline to capture and process real-time retail transaction data using AWS (Lambda, Glue, S3, Athena, Kinesis), Python, and SQL. Enabled efficient data flow and transformed raw data into insights, visualized in Grafana for data-driven marketing and business strategies.

YelpRec: Scalable Hybrid Recommendation System

Python • PySpark • XGBoost • Machine Learning • Recommendation Systems

Built a hybrid recommendation system using item-based collaborative filtering and machine learning models (XGBoost, CatBoost) on the Yelp dataset. Built a scalable data pipeline with PySpark and Spark RDDs to process over 1 million records, enabling rapid experimentation and optimized model performance with an RMSE of 0.9798.
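As a rough illustration of the item-based collaborative filtering and RMSE evaluation mentioned above, here is a minimal pure-Python sketch. It is not the project's PySpark code: the function names, the toy ratings, the default rating of 3.0, and the `alpha` blending weight are all assumptions made for the example, and how the real system combines its collaborative-filtering and model scores is not stated here.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two items, each a {user: rating} dict."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_rating(user, item, item_ratings, default=3.0):
    """Similarity-weighted average of the user's ratings on other items."""
    target = item_ratings.get(item, {})
    num = den = 0.0
    for other, ratings in item_ratings.items():
        if other == item or user not in ratings:
            continue
        sim = cosine_similarity(target, ratings)
        if sim > 0:
            num += sim * ratings[user]
            den += sim
    # Fall back to a default (assumed) rating when no similar item exists.
    return num / den if den else default


def blend(cf_score, model_score, alpha=0.5):
    """Hypothetical hybrid step: weighted mix of CF and model predictions."""
    return alpha * cf_score + (1 - alpha) * model_score


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Toy data: item -> {user: rating}
item_ratings = {
    "cafe": {"u1": 5, "u2": 4},
    "diner": {"u1": 5, "u2": 4, "u3": 2},
    "bar": {"u3": 5},
}
cf = predict_rating("u3", "cafe", item_ratings)
print(round(cf, 2), round(blend(cf, 4.0, alpha=0.25), 2))
```

In this toy data, only "diner" shares raters with "cafe", so the prediction for u3 is driven entirely by u3's rating of "diner".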

DocuBot: Intelligent PDF Query System

LangChain • OpenAI GPT • NLP • Vector Database • FAISS

Developed a conversational PDF query system using LangChain, OpenAI GPT models, and FAISS for efficient document retrieval. Optimized document parsing and embedding with PyPDF2 and a chunk overlap strategy for better context retention. Integrated FAISS vector storage for accurate retrieval, enabling precise, context-aware responses to user queries.
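The chunk overlap strategy can be sketched in a few lines of plain Python. This is an illustrative, character-based version only: the function name `chunk_text` and the `chunk_size` and `overlap` values are assumptions, and the actual project may well rely on a LangChain text splitter rather than hand-written slicing.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks whose edges overlap.

    Each chunk repeats the last `overlap` characters of the previous one,
    so a sentence cut at a boundary still appears whole in one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Larger overlaps retain more context across boundaries at the cost of more chunks to embed and store.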

Advanced Martian Frost Detection Using Deep Learning

Python • PyTorch • CNN • Deep Learning • Image Classification

Developed a custom 3-layer CNN model to detect Martian frost using HiRISE data, achieving 82% validation accuracy. Enhanced model performance with data augmentation, regularization, and early stopping techniques. Experimented with VGG16, ResNet50, and EfficientNetB0 architectures, showcasing their effectiveness in Martian image classification tasks.
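Early stopping, one of the techniques named above, can be shown independently of any deep learning framework. The sketch below is a generic helper, not the project's PyTorch training loop: the class name, the `patience` and `min_delta` parameters, and the list of validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter.
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs run before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(val_losses)


# Hypothetical validation losses: improvement stalls after epoch 2.
print(epochs_run([0.9, 0.7, 0.71, 0.72, 0.73, 0.5], patience=3))
```

In a real training loop, the model weights from the best epoch would typically be saved and restored once the stop triggers.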

Dynamic Sales Performance and Forecasting Tool

Power BI • DAX • Excel • Data Visualization • Forecasting

Created an interactive Power BI dashboard to track $1.6M in sales across regions and customer segments in the United States, offering real-time insights for leadership. Developed a 15-day sales forecast model using historical data and DAX, accurately predicting demand to support inventory planning and promotional strategies.

EmojiQL: Streamlined Data Querying with Parallel Processing

Python • SQL • Parallel Processing • Django • Docker

Engineered a custom SQL language in Python with complete SQL functionality, delivered through an emoji-based interface to simplify complex queries for non-experts. Enhanced processing for large datasets up to 1TB with data chunking and parallel processing via joblib, significantly reducing query times and increasing data throughput.
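The chunk-then-parallelize pattern described above can be sketched with the standard library. Note the substitution: the project uses joblib, whereas this example uses `concurrent.futures.ThreadPoolExecutor` so that it runs without third-party packages. The function names, row layout, and chunk size are assumptions, and threads in CPython mainly help with I/O-bound work, so this shows the structure rather than the speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, chunk_size):
    """Slice a list of rows into consecutive chunks of at most chunk_size."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def partial_sum(chunk, column, predicate):
    """Per-chunk work: SUM(column) over the rows that pass the WHERE predicate."""
    return sum(row[column] for row in chunk if predicate(row))


def parallel_sum(rows, column, predicate, chunk_size=1000, workers=4):
    """Evaluate SUM(column) WHERE predicate by combining per-chunk partial sums."""
    chunks = split_into_chunks(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: partial_sum(c, column, predicate), chunks)
        return sum(partials)


# Hypothetical table: SELECT SUM(amount) WHERE amount is even
rows = [{"amount": i} for i in range(10)]
print(parallel_sum(rows, "amount", lambda r: r["amount"] % 2 == 0, chunk_size=3))
```

This works because a sum can be computed per chunk and then combined; operations such as joins or sorts need a more involved merge step.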

Learnera: A Course Recommendation System

Python • NLP • TF-IDF • A/B Testing • Recommendation Systems

Built an NLP-driven course recommendation engine with TF-IDF and content-based filtering, achieving high accuracy in matching courses to user preferences. Enhanced recommendations for over 10,000 profiles using collaborative filtering and improved course completion rates through A/B testing on recommendation models, delivering highly tailored content for diverse learning needs.
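To make the TF-IDF and content-based filtering idea concrete, here is a small pure-Python sketch that ranks courses by cosine similarity to a learner profile. The course catalog, profile text, function names, and the smoothed IDF formula are all assumptions chosen for illustration. It also takes a shortcut by including the profile when computing IDF, and a real system would more likely use a library vectorizer.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector ({term: weight}) for each document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(profile, courses, top_k=2):
    """Return the top_k course names most similar to the profile text."""
    names = list(courses)
    vectors = tfidf_vectors([courses[name] for name in names] + [profile])
    profile_vec = vectors[-1]
    scored = sorted(
        ((cosine(profile_vec, vec), name) for vec, name in zip(vectors, names)),
        reverse=True,
    )
    return [name for _, name in scored[:top_k]]


# Hypothetical catalog: course name -> description
courses = {
    "Intro to NLP": "text processing language models tokenization",
    "Deep Learning": "neural networks training images",
    "SQL Basics": "databases queries tables joins",
}
print(recommend("language models and text", courses, top_k=1))
```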

Publications

Prediction System Design for Monitoring the Health of Developing Infants from Cardiotocography Using Statistical Machine Learning

Design Engineering, Scopus International Journal, Volume 2021, Issue 07

This research introduces a machine learning-based prediction system designed to assist healthcare providers in monitoring the health of developing infants. By analyzing cardiotocographic data, the system can automatically classify fetal health as normal, suspicious, or pathological, achieving a high accuracy rate of 94% using the Random Forest algorithm. This approach supports physicians by reducing diagnostic errors and improving early detection of potential complications in late-stage pregnancies. The study demonstrates the system’s effectiveness in handling real-time variability in medical data, offering a powerful tool for better prenatal care.

Machine Learning • Classification • Random Forest • Imbalanced Data • Cardiotocography • Medical

Learnera: A Course Recommendation System

Techno Journal IETE-SF, DJ Spark 2022-23

This research tackles the overwhelming choice in online education by creating an intelligent recommendation system that connects students with the most relevant courses. Powered by advanced machine learning algorithms, the system personalizes recommendations based on each student’s unique academic background, goals, and learning preferences. By combining collaborative, content-based, and hybrid filtering, this solution empowers students to make informed choices, boosting their learning outcomes and driving academic success.

Python • PyTorch • NLP • Machine Learning • TF-IDF • A/B Testing • Web Development • Recommendation

Autonomous UV Sanitization System with Human and Object Detection

Techno Journal IETE-SF, DJ Spark 2021-22

This research introduces an autonomous UV sanitization system designed to protect frontline workers by reducing their exposure to contaminated areas. Using Raspberry Pi, OpenCV, and a human detection system, the robot safely and efficiently disinfects spaces by automatically turning off UV light when humans are detected nearby. Equipped with a versatile rocker-bogie mechanism, it can navigate complex terrains, including stairs, making it ideal for sanitizing hospitals, schools, and offices. This system aims to provide a portable, cost-effective solution that enhances safety in high-risk environments.

Python • Machine Learning • Deep Learning • Computer Vision • YOLOv5 • Object Detection • Human Detection • Autonomous • Raspberry Pi

Aaryan Shah Data Science • Data Engineer • Data Analysis • Machine Learning • Software Developer
