Professional Me

Aaryan Shah

Data Science • Data Engineer • Data Analysis • Machine Learning • Software Developer

About Me

I'm Aaryan Shah, a Data Scientist and Data Engineer based in Los Angeles, currently pursuing a Master's in Applied Data Science at USC (graduating May '25). I specialize in Data Mining, Analytics, Machine Learning, and NLP. My blend of academic knowledge and hands-on experience equips me to deliver data-driven insights and solutions for strategic decision-making.

🌟 I believe in having fun, taking ownership of my work, and always trusting in my abilities. Known for my adaptability, precision, and strong work ethic, I bring a unique blend of creativity and analytical thinking to every project. Outside of work, I enjoy capturing the world through photography and travel, always seeking new perspectives.

Languages

My favorite languages for systems programming and software engineering.

Databases

My preferred databases for building scalable applications.

Libraries

My go-to libraries for machine learning and data analysis.

Dashboards

My preferred tools for creating interactive dashboards and data visualizations.

MLOps

My tools for deployment and infrastructure management.

Tools

My essential tools for development and collaboration.

Cloud

My cloud platforms for hosting and deployment.

Big Data

My preferred technologies for large-scale data processing and analytics.

Education

Aug 2023 - May 2025

Master of Science in Applied Data Science

University of Southern California

Los Angeles, CA

Relevant Courses:

Data Science at Scale • Machine Learning • Statistical Methods • Data Management • Natural Language Processing • Deep Learning • Data Visualization • Big Data Analytics

Aug 2019 - June 2023

Bachelor of Technology in Electronics and Telecommunication Engineering

University of Mumbai

Mumbai, India

Relevant Courses:

Data Structures & Algorithms • Database Management • Object Oriented Programming • Digital Signal Processing • Computer Networks • Software Engineering • Linear Algebra • Statistics

Work Experience

Dec 2023 - Present

NLP Researcher

USC Marshall School of Business

Currently leading a large-scale NLP project that processes over 250,000 data points to categorize the strategic initiatives of major oil and gas companies and support strategic decision-making. Working to raise a BERT-based model's accuracy by 20% with a custom neural network architecture that combines domain-specific clustering with BERT embeddings.

ML • MLOps • BERT • NLP • LLM • AWS • LangChain

Data Engineer

USC Marshall School of Business

Built a scalable AWS and SQL data pipeline integrating 300+ sources, boosting analytics by 25% with real-time insights. Streamlined data processing for 200+ sources using Apache Spark, improving accuracy and efficiency by 25%. Leveraged serverless architecture for batch and event-driven workflows, handling a 50% data volume increase seamlessly. Optimized SQL queries, improving data retrieval speed by 35% for faster decision-making.

Data Pipelining • AWS • ETL • Analysis • Spark • SQL • Tableau

Feb 2022 - May 2022

Machine Learning Intern

Gustovalley Technovations

Led a team to build an Air Quality Prediction System using real-time API data, reducing insight delivery time by 30% through cloud-deployed ML models. Improved model accuracy by 25% with hyperparameter tuning and created 25+ Power BI dashboards for actionable insights. Enhanced air quality forecasts by 20% using advanced Markov modeling techniques.

Data Analysis • Tuning • ML • Regression • Real-time data

Jun 2021 - Aug 2021

Software Engineer Intern

Technocolabs Softwares

Reduced project costs by 15% through a PostgreSQL-based cost tracking system that enhanced financial analysis and uncovered key savings areas. Improved page load speed by 40% through optimized MongoDB queries.

PostgreSQL • MongoDB • Financial Analysis • Python • Optimization

Featured Projects

Marketmind: Serverless Data Ingestion Pipeline for Real-Time Analytics

AWS • Python • ETL • Serverless • Analytics

Developed a serverless data ingestion pipeline to capture and process real-time retail transaction data using AWS (Lambda, Glue, S3, Athena, Kinesis), Python, and SQL. Enabled efficient data flow and transformed raw data into insights, visualized in Grafana for data-driven marketing and business strategies.


YelpRec: Scalable Hybrid Recommendation System

Python • PySpark • XGBoost • Machine Learning • Recommendation Systems

Built a hybrid recommendation system on the Yelp dataset that combines item-based collaborative filtering with machine learning models (XGBoost, CatBoost). Implemented a scalable data pipeline with PySpark and Spark RDDs to process over 1 million records, enabling rapid experimentation and reaching an RMSE of 0.9798.
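
To illustrate the item-based collaborative filtering component and the RMSE metric mentioned above, here is a minimal pure-Python sketch. It is not the project's code: the toy ratings, the function names, and the choice of cosine similarity are assumptions made for illustration, and the XGBoost/CatBoost and PySpark parts are left out.

```python
from math import sqrt

# Toy ratings: user -> {item: stars}. Hypothetical data, not the Yelp dataset.
RATINGS = {
    "u1": {"a": 5.0, "b": 4.0, "c": 1.0},
    "u2": {"a": 4.0, "b": 5.0, "c": 2.0},
    "u3": {"a": 1.0, "b": 2.0, "c": 5.0},
    "u4": {"a": 5.0, "c": 1.0},
}


def item_vector(ratings, item):
    """Return {user: rating} for every user who rated the item."""
    return {user: r[item] for user, r in ratings.items() if item in r}


def cosine(v1, v2):
    """Cosine similarity computed over the users both items share."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    n1 = sqrt(sum(v1[u] ** 2 for u in common))
    n2 = sqrt(sum(v2[u] ** 2 for u in common))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def predict(ratings, user, item, default=3.0):
    """Similarity-weighted average of the user's ratings on other items."""
    target = item_vector(ratings, item)
    num = den = 0.0
    for other, rating in ratings[user].items():
        if other == item:
            continue
        sim = cosine(target, item_vector(ratings, other))
        num += sim * rating
        den += abs(sim)
    # Fall back to a default rating when no similar item is available.
    return num / den if den else default


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```

In a hybrid setup, a score like `predict(RATINGS, "u4", "b")` would typically be blended with, or fed as a feature into, the model-based predictor; how the project combines them is not stated here.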


DocuBot: Intelligent PDF Query System

LangChain • OpenAI GPT • NLP • Vector Database • FAISS

Developed a conversational PDF query system using LangChain, OpenAI GPT models, and FAISS for efficient document retrieval. Optimized document parsing and embedding with PyPDF2 and a chunk overlap strategy for better context retention. Integrated FAISS vector storage for accurate retrieval, enabling precise, context-aware responses to user queries.
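
The chunk overlap strategy can be sketched in a few lines of plain Python. This is an illustrative character-based splitter, not DocuBot's implementation: the function name `chunk_text` and the default sizes are assumptions, and the real system relies on LangChain and PyPDF2 rather than this helper.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one, so that context
    spanning a chunk boundary is not lost at retrieval time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=2)` yields `["abcd", "cdef", "efgh", "ghij"]`: every chunk shares two characters with its neighbor. Each chunk would then be embedded and stored in the vector index.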


Advanced Martian Frost Detection Using Deep Learning

Python • PyTorch • CNN • Deep Learning • Image Classification

Developed a custom 3-layer CNN model to detect Martian frost using HiRISE data, achieving 82% validation accuracy. Enhanced model performance with data augmentation, regularization, and early stopping techniques. Experimented with VGG16, ResNet50, and EfficientNetB0 architectures, showcasing their effectiveness in Martian image classification tasks.
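
As a rough illustration of the early stopping technique mentioned above, the sketch below tracks validation loss and signals when training should halt. It is framework-agnostic and hypothetical: the class name, the `patience` and `min_delta` parameters, and the sample loss values are assumptions, not taken from the project's PyTorch code.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return how many epochs would run for a given loss history."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

In a real training loop, `step` would be called once per epoch with the measured validation loss, and the best model weights would usually be checkpointed whenever the loss improves.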


Dynamic Sales Performance and Forecasting Tool

Power BI • DAX • Excel • Data Visualization • Forecasting

Created an interactive Power BI dashboard to track $1.6M in sales across regions and customer segments in the United States, offering real-time insights for leadership. Developed a 15-day sales forecast model using historical data and DAX, accurately predicting demand to support inventory planning and promotional strategies.


EmojiQL: Streamlined Data Querying with Parallel Processing

Python • SQL • Parallel Processing • Django • Docker

Engineered a custom SQL language in Python with complete SQL functionality, delivered through an emoji-based interface to simplify complex queries for non-experts. Enhanced processing for large datasets up to 1TB with data chunking and parallel processing via joblib, significantly reducing query times and increasing data throughput.
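
The chunk-then-combine pattern behind this can be sketched as follows. Note the substitution: the project uses joblib, while this example uses the standard library's `concurrent.futures` so it runs without extra dependencies. The row layout, function names, and the filtered-sum query are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def partial_sum(chunk, min_amount):
    """Evaluate the query on one chunk: sum amounts that pass the filter."""
    return sum(r["amount"] for r in chunk if r["amount"] >= min_amount)


def parallel_sum(rows, min_amount, chunk_size=1000, workers=4):
    """Roughly `SELECT SUM(amount) WHERE amount >= min_amount`,
    computed per chunk by a worker pool and then combined."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda chunk: partial_sum(chunk, min_amount),
            chunked(rows, chunk_size),
        )
        return sum(partials)
```

Threads in CPython mainly help with I/O-bound chunks; for CPU-bound work a process-based pool (which joblib can provide) is the usual choice. The structure stays the same either way: split, map a per-chunk function, then reduce.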


Learnera: A Course Recommendation System

Python • NLP • TF-IDF • A/B Testing • Recommendation Systems

Built an NLP-driven course recommendation engine with TF-IDF and content-based filtering, achieving high accuracy in matching courses to user preferences. Enhanced recommendations for over 10,000 profiles using collaborative filtering and improved course completion rates through A/B testing on recommendation models, delivering highly tailored content for diverse learning needs.
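
A minimal sketch of TF-IDF content-based filtering, assuming whitespace tokenization and a smoothed IDF. The course catalog, function names, and query below are made up for illustration and do not come from Learnera itself.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(query, courses, top_k=2):
    """Rank course titles by similarity between the query text and
    each course description."""
    titles = list(courses)
    vectors = tfidf_vectors([courses[t] for t in titles] + [query])
    query_vec = vectors[-1]
    ranked = sorted(
        zip(titles, vectors[:-1]),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:top_k]]


# Hypothetical catalog used only for this example.
COURSES = {
    "Intro to Machine Learning": "machine learning models regression classification",
    "Web Development Basics": "html css javascript web pages",
    "Deep Learning": "neural networks deep learning models",
}
```

Calling `recommend("machine learning classification", COURSES)` ranks "Intro to Machine Learning" first, since it shares the most distinctive terms with the query.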


Publications

Prediction System Design for Monitoring the Health of Developing Infants from Cardiotocography Using Statistical Machine Learning

Design Engineering, Scopus International Journal, Volume 2021, Issue 07

This research introduces a machine learning-based prediction system designed to assist healthcare providers in monitoring the health of developing infants. By analyzing cardiotocographic data, the system can automatically classify fetal health as normal, suspicious, or pathological, achieving a high accuracy rate of 94% using the Random Forest algorithm. This approach supports physicians by reducing diagnostic errors and improving early detection of potential complications in late-stage pregnancies. The study demonstrates the system’s effectiveness in handling real-time variability in medical data, offering a powerful tool for better prenatal care.

Machine Learning • Classification • Random Forest • Imbalanced Data • Cardiotocography • Medical

Learnera: A Course Recommendation System

Techno Journal IETE-SF, DJ Spark 2022-23

This research tackles the overwhelming choice in online education by creating an intelligent recommendation system that connects students with the most relevant courses. Powered by advanced machine learning algorithms, the system personalizes recommendations based on each student’s unique academic background, goals, and learning preferences. By combining collaborative, content-based, and hybrid filtering, this solution empowers students to make informed choices, boosting their learning outcomes and driving academic success.

Python • PyTorch • NLP • Machine Learning • TF-IDF • A/B Testing • Web Development • Recommendation

Autonomous UV Sanitization System with Human and Object Detection

Techno Journal IETE-SF, DJ Spark 2021-22

This research introduces an autonomous UV sanitization system designed to protect frontline workers by reducing their exposure to contaminated areas. Using Raspberry Pi, OpenCV, and a human detection system, the robot safely and efficiently disinfects spaces by automatically turning off UV light when humans are detected nearby. Equipped with a versatile rocker-bogie mechanism, it can navigate complex terrains, including stairs, making it ideal for sanitizing hospitals, schools, and offices. This system aims to provide a portable, cost-effective solution that enhances safety in high-risk environments.

Python • Machine Learning • Deep Learning • Computer Vision • YOLOv5 • Object Detection • Human Detection • Autonomous • Raspberry Pi

Get In Touch

I'm always interested in hearing about new opportunities, collaborations, or just having a chat about data science and technology. Feel free to reach out!