Divyanshi Kashyap

hey, you found me

divyanshi kashyap

fredericton, nb · caffeine level: critical · black cat era

warning: contains dal makhani opinions and 3am debugging stories

Open to Fall 2026 ML/AI internships

hey, i'm
divyanshi kashyap

3rd year CS at UNB building AI agents, ML pipelines, and the occasional weird side project. Self-described mass anxiety enjoyer; fueled by caffeine and dal makhani.

soft girl, hard problems
4
Production AI systems
Right now

What I'm up to

Summer 2026 is chaos in the best way. Here's the live feed.

Full-time

AI & Tech Team Lead

TravCan Technologies. Building production agents, wiring middleware, breaking things at 3am and fixing them by 4am.

Summer term

UNB Study Term

Yes, voluntarily taking classes in summer. No, I don't know why either. Something about "graduating on time."

Open source

GSSoC 2026

Contributing to repos in Go and Rust like it's a personality trait. Issue triage, PRs, the whole thing.

ML cohort

BuildersLab

Founding cohort member. Kaggle competitions at 2am, pretending I understand gradient boosting before coffee.

sleep schedule: non-existent · caffeine intake: concerning · vibes: immaculate

Selected work

Things I've actually shipped

From production AI agents to civic tech to C++ carbon measurement. Each one taught me something I couldn't learn from a course.

AI & Civic Tech

CalgaryPulse

AI civic intelligence platform tackling Calgary's 30.4% downtown vacancy crisis. MindFuel Tech Futures 2026 finalist (31 projects, 7 provinces); seed funding approved.

React · Three.js · FastAPI · PostGIS · CrewAI
ML & Data

InsightEngine

Production-grade ML pipeline for VLT game performance prediction. End-to-end from Snowflake ingestion to a deployed, explainable prediction system. CatBoost regressors reaching R² of up to 0.90 on profitability, ~130K processed rows, Dockerized on EC2 with weekly CI/CD.

CatBoost · Snowflake · Docker · SHAP · Streamlit
AI Agents

TravCan: AI Travel Platform

Production AI travel platform with 3 services (React 18, Rust/Axum, Python CrewAI), 17+ DB tables, 8 API integrations, 11-agent AI orchestration, 10-state flight booking FSM with Stripe payments, Redis caching (6 layers), 334+ automated tests, serving live users at travcan.ca.

React · Rust/Axum · CrewAI · Supabase · Redis · Stripe
AI Agents

Monitor Lizard

Autonomous co-op job tracking agent on OpenClaw. Nightly Noctis Mode scans 60+ portals with A to F scoring, critic-review pass, and ChromaDB vector memory. 172 tests.

Claude API · ChromaDB · OpenClaw · Python
Full Stack

NyxLink

Production-grade URL shortener with AI phishing detection (Google Safe Browsing), real-time bot classification, Redis caching (<2ms p50), and tiered rate limiting.

FastAPI · PostgreSQL · Redis · Docker
AI Agents

Kaashvi

Desktop-native ReAct agent, a personal AI chief of staff. Autonomously plans and executes across Google Calendar and Notion with a multi-step reasoning loop.

Electron · React · Google OAuth2 · Notion API
DevTools

NightShade

Adversarial red-teaming framework targeting the OWASP LLM Top 10: LLM01 prompt injection, LLM06 disclosure, LLM07 insecure plugins, LLM08 excessive agency. Applied to TravCan security hardening in production.

Python · OWASP LLM · Anthropic SDK
DevTools

CarbonLedger

Cross-platform C++17 library measuring real-time CPU & memory consumption, converting telemetry to CO₂ estimates via the Green Software Foundation SCI formula. 85% coverage CI gate.

C++17 · CMake · GCC/Clang/MSVC · Valgrind
Where I've been

A short resume in motion

From teaching CS labs to shipping production agents. The path so far.

May 2026 to Present

Open Source Contributor

GirlScript Summer of Code 2026
  • Contributing to open-source repositories in Go and Rust
  • Full contribution lifecycle: issue triage, feature implementation, code review, and PR collaboration
Open Source
Apr 2026 to Present

ML Engineer, Cohort Member

BuildersLab
  • Founding cohort member of the ML Engineer track, inaugural cohort of the program
  • Applied ML training via Kaggle competitions and end-to-end project-based problem solving
  • Building production-oriented skills across feature engineering, model selection, cross-validation, and evaluation
Apprenticeship
Jan 2026 to Present

AI & Tech Team Lead

TravCan Technologies
  • Architected and shipped a production AI travel platform from zero: 3 services (React 18, Rust/Axum, Python CrewAI), 17+ database tables, 8 third-party API integrations, and a custom 11-agent AI orchestration layer, serving live users at travcan.ca
  • Designed a 13-tool agentic AI system with parallel execution, Redis caching (6 layers), conversation history sanitization, and context budget management, reducing multi-tool response latency from 120s to ~45s across 40-message windows
  • Built a 10-state flight booking state machine with Stripe payment collection, Duffel order creation via balance, optimistic locking, webhook reconciliation, and PDF voucher generation. PCI-compliant zero-scope architecture
  • Engineered a multi-agent CrewAI sidecar (11 sequential agents with shared DNA injection, province-aware knowledge base, and structured itinerary output) delegating tool calls through the Rust backend for unified caching, auth, and rate limiting
  • Deployed production infrastructure on Render with Docker multi-stage builds, Supabase PostgreSQL + Auth (ES256 JWKS), automated calendar event ingestion via GitHub Actions, and bilingual (EN/FR) support across 13 i18n namespaces
  • Authored 334+ automated tests covering booking FSM transitions (62), API gateway serialization (39), DB concurrency (21), webhook validation (12), and conversation history healing (16), with zero critical regressions across 6 database migrations
Part-time
Jan to Apr 2026

AI/MLOps Engineer Intern

IGT × UNB · Co-op
  • Architected full Medallion ML pipeline: Bronze (Snowflake ingestion), Silver (MySQL/EC2, 22 game features), Gold (SageMaker training)
  • Three CatBoost models (game shape classification, profitability scoring, risk assessment) with dual-mode batch/real-time inference
  • Caught critical 4× revenue inflation across ~10.9M records before model training began
  • YAML config registry, GroupKFold CV with 30% holdout masking, MLflow tracking, GitHub Actions CI
Read full report
Co-op · Work to be published
2023 to Present

Teaching Assistant, Calculus

University of New Brunswick
  • Support undergraduate students through weekly office hours, assignment grading, and exam preparation
  • Translate abstract mathematical concepts into accessible explanations across multiple cohorts
On Campus
The vibes

A bit about me

Divyanshi

I'm a 3rd-year CS student at UNB who happens to spend an unreasonable amount of time thinking about agent architectures and data pipelines that don't lie to you. Outside of code: black cats, lo-fi anime soundtracks, and a savings plan for a motorbike.

Black cat era

They're literally humans with better boundaries.

17hrs/day in headphones

Naruto + BSD + MHA on rotation.

Caffeine + anxiety

The two pillars of my engineering practice.

Saving for a motorbike

Surviving on dal makhani and ramen until then.

Now playing
Blue Bird
Ikimono-gakari · Naruto Shippuden OP3
From the blog

Things I've written about

Writing about what I build, published on Dev.to.

InsightEngine: Building a Full-Stack ML Pipeline for VLT Game Prediction at IGT

How I designed and built a 41-model prediction system from scratch — ETL pipeline, Medallion architecture, similarity engine backed by 13 academic papers, 3-tier prediction, and production deployment on AWS EC2.

CatBoost · Snowflake · MLOps · AWS · SHAP
Read full report

Monitor Lizard: Using OpenClaw for the First Time, I Loved It

A Discord-based autonomous agent built on OpenClaw that scans 60+ company career portals daily, scores internship roles A to F across 10 dimensions, and automates the entire job search pipeline.

OpenClaw · AI · Productivity · Dev Challenge
Read article
Open to Fall 2026 ML/AI internships

Let's build something
weird together.

If you're hiring, collaborating, or just want to swap anime recs and infra horror stories, my inbox is open.

© 2026 Divyanshi Kashyap

welcome to my world

you found the secret realm. survival mode, peaceful difficulty

★ 101 REASONS TO LIVE ★

39 reasons across 5 chests so far. click one to crack it open

🗺️
Travel
8 reasons
Adventure
10 reasons
📖
Learn
9 reasons
🌸
Experiences
7 reasons
💛
Heart
5 reasons

☆ TRAVEL ☆

Bike road trip: Saint John to Moncton
India road/train trip to meet friends
Visit Japan for cherry blossoms
See the Northern Lights
Disneyland with family
Visit Switzerland
Sunny beaches (Cancun maybe)
Send postcards from Fredericton

☆ ADVENTURE ☆

Learn Martial Arts
Skating
Sunrise from a mountain peak
Hot air balloon ride
Surfing / Jet Ski
Learn horse riding
Skydiving / Bungee jumping
Scuba diving
Zero gravity experience
Dance in rain

☆ LEARN ☆

Join a drama club
Learn a sign language
Take a cooking class and make good food
Learn guitar
Learn makeup
Learn about taxes and finance
Learn Astrology
Ceramic class (make a mug)
Learn swimming properly

☆ EXPERIENCES ☆

Win a hackathon
Music festival / Concert
See fireflies in a forest + camping
Go to karaoke
Dance to item songs at a party
Complete my diary
Publicly slap someone (lovingly)

☆ HEART ☆

Volunteer at an adoption center / rescue
Volunteer at old age home / animal shelter
Sponsor a child's education
Send anonymous letters with well wishes
Visit secondary school to see old teachers

♡ VIBES & PINTEREST BOARD ♡

memes, mood, miscellaneous chaos

"i'm a software engineer not a software miracle worker"
3am thought:
what if the bug
is me
kushina energy.
black cat era.
naruto ✓
bsd ✓
mha ✓
jjk... maybe

📌 my pinterest

hello, traveler. you've made it to the end of my secret realm.
what would you like to do next?

Hilary Tachibana
anonymous writer // 2020 to 2022

HILARY TACHIBANA

writer. centrist. headphone addict.

16, stuck at home, world falling apart outside. I had a laptop, too many opinions, and no one to tell them to. So I made up a name and started writing about all the stuff you're not supposed to talk about. Politics, religion, the things that make people uncomfortable. Found strangers online who turned into family. Rode bikes I wasn't allowed to ride. Let music keep me sane when nothing else could. This is that chapter.

01

POLITICS TO LIFELONG FRIENDSHIP

COVID hit and suddenly I'm stuck at home through 11th and 12th grade with nothing to do except think too much. So I started writing. Political stuff, online, under a fake name. Never told a single person I knew. I just... needed somewhere to put all of it.

Over those two years my whole worldview flipped. I went in leaning one way and came out a centrist. Not because I gave up on having opinions. Because I actually started talking to people who disagreed with me. Right wing kids, left wing kids, atheists, religious kids, people with takes so extreme I'd just sit there blinking. All of us hiding behind usernames, all the same age, all trying to figure out what we believed.

Somehow those strangers became my people. We called each other "Boo." I know how that sounds. But honestly? That group was one of the best things that happened to me. We'd argue for hours and still show up the next day. No one got cancelled, no one left over a bad take. Just real conversations with real disagreements and somehow, real trust.

College pulled everyone apart. Different cities, time zones, lives. We text sometimes but it's that thing where you both know it's different now. I still think about those late-night voice chats though. That group got me through the worst stretch of being a teenager and I won't forget that.

02

RASH DRIVING

I love bikes. Like actually love them. Not in a cute Pinterest mood-board way. In a "I used to sneak out and ride on highways and my parents had zero idea" way. That kind of love.

There's nothing like it. The engine's loud enough to drown out your own brain. Wind hitting your face so hard your eyes water. You're going too fast and you know it and you just... don't care. Everything annoying about life disappears for a bit. It's the most free I've ever felt doing anything.

If my parents ever find this... sorry. Also not sorry. Those rides were some of the most beautiful memories I have.

I still don't have my own bike. The plan hasn't changed though: make money, walk into a dealership, ride out. Someday. Until then those highway nights are proof that I used to be fearless and a little dumb and completely alive. Worth it every single time.

03

MUSIC IS EVERYTHING

Look, most people just aren't worth the energy. That sounds cold but I mean it. People leave. People don't get it. People are too busy dealing with their own mess to hold space for yours. I stopped expecting much from them pretty early.

Music though? Music never left. 3am and you can't sleep and everything feels heavy, music just sits with you. Doesn't ask questions. Doesn't judge. Doesn't need you to be okay first.

I used to wear headphones like 12+ hours a day out of the 18 I was awake. Yeah I know how that sounds. But I genuinely think it's the best thing we've been given. That some arrangement of sounds can make you feel less broken, less alone, less like everything's falling apart. That's kind of a miracle if you think about it.

I owe more to whatever algorithm kept feeding me the right song at the right moment than I owe to most people I've met. And I'm okay with that.

That was then. This is now. But she's still in here somewhere.

A SECRET MESSAGE
✧ ✦ ✧

Yā Devī Sarvabhūteṣu Śakti-Rūpeṇa Saṁsthitā |

Namastasyai Namastasyai Namastasyai Namo Namaḥ ||

|| शक्ति ||

InsightEngine

VLT Performance Analytics & Prediction Platform

IGT × UNB VLT Lab · Feb–May 2026 (4 months) · Lead Developer
111K+
Lines of code
41
ML models
23+
DB tables
13
CI/CD workflows
9
Jurisdictions
205
Commits

Project Background

IGT operates Video Lottery Terminals (VLTs) across 9 lottery jurisdictions in 4 countries (Canada, USA, Sweden, Italy). With over 600 unique game titles generating varying levels of revenue, IGT needed a data-driven system to answer three core business questions:

  1. How is a game performing? — Normalize and compare game revenue across jurisdictions with different currencies, reporting frequencies, and market sizes.
  2. What trajectory will a new game follow? — Predict whether a game will show Growth, Decline, or Other patterns over time.
  3. How profitable and risky will a game be? — Forecast Net Terminal Income (NTI) and its volatility at 1, 3, 6, 9, and 12-month horizons.

What is InsightEngine?

InsightEngine is a full-stack data analytics platform that combines:

  • A quantitative pipeline (ETL + ML) that ingests raw VLT data, trains predictive models, and generates game-level forecasts.
  • A qualitative pipeline (NLP + embeddings) that processes player focus group feedback into searchable, structured insights.
  • An integration layer that bridges both using SHAP explainability — connecting what the model predicts with how players feel.
  • A Streamlit dashboard deployed on AWS EC2 for interactive exploration.

I was the primary developer for the entire quantitative side of InsightEngine. I designed and built the ETL pipeline, ML training and inference system, similarity engine, 3-tier prediction architecture, testing/validation suite, CI/CD automation, dashboard data layer, and production deployment infrastructure from scratch.

Technical Architecture

                      DATA SOURCES
    +-----------------------------------------------+
    |  Snowflake (9 region tables + characteristics)|
    |  SAP HANA (exchange rates)                    |
    |  Focus Group Reports (PDF/PPTX/audio)         |
    +-----------------------+-----------------------+
                            |
                    +-------v-------+
                    | ETL PIPELINE  | (7 stages)
                    | Bronze-Silver |
                    +-------+-------+
                            |
            +---------------+---------------+
            |               |               |
     +------v------+ +------v------+ +------v------+
     | ML PIPELINE | | SIMILARITY  | | QUALITATIVE |
     | 41 models   | | ENGINE      | | PIPELINE    |
     +------+------+ +------+------+ +------+------+
            |               |               |
    +-------v---------------v---------------v-------+
    |             DASHBOARD DATA LAYER              |
    |       Static JSON/YAML + On-Demand APIs       |
    +-----------------------+-----------------------+
                            |
    +-----------------------v-----------------------+
    |              STREAMLIT DASHBOARD              |
    |        Docker + Nginx - EC2 Production        |
    +-----------------------------------------------+

Technology Stack

Layer | Technologies
Languages | Python 3.11, SQL
Data Processing | pandas, NumPy, SciPy, scikit-learn
ML Frameworks | CatBoost, XGBoost, LightGBM, PyTorch
Explainability | SHAP (TreeSHAP via CatBoost)
Databases | MySQL, Snowflake, ChromaDB
Frontend | Streamlit, Plotly
Infrastructure | Docker, Nginx, AWS EC2, S3, Bedrock
CI/CD | GitHub Actions (13 workflows), self-hosted EC2 runner
Experiment Tracking | MLflow with S3 backend

Data Coverage

Metric | Value
Jurisdictions | 9 (AGLC, ALC, MBLL, SD, OSL, SEJQ, Sweden, WCLC, Italy)
Countries | 4 (Canada, USA, Sweden, Italy)
Currencies | 4 (CAD, USD, SEK, EUR), all normalized to USD
Unique Games | ~600
Weekly Performance Rows | ~130,000
Trained ML Models | 41
Prediction Rows | 780 (game × region combinations)

ETL Pipeline — Data Ingestion & Transformation

Built a 7-stage pipeline that ingests raw VLT performance data from 9 jurisdictions and transforms it into clean, normalized, ML-ready datasets.

Stage 1: Snowflake Extraction

Pulls data from Snowflake into 9 MySQL bronze tables plus a characteristics table. Built special handling for each jurisdiction's data format:

  • Italy: year+week temporal columns (not date), chunked loading at 200K rows per batch to handle ~10.9M rows.
  • AGLC: mixed granularity — monthly data pre-2024, daily data post-2024 with automatic detection.
  • SEJQ: dual source tables (historical + current) merged into one bronze table.
  • Sweden: year+week columns requiring date derivation.

Stage 2: Exchange Rate Loading

Pulls CAD/USD, SEK/USD, EUR/USD rates from SAP HANA. Rates stored with append-only strategy to preserve historical rates.

Stage 3: Data Quality & Currency Normalization

Implements 5 DQ rules per table: placeholder deletion, NULL field deletion, invalid entry removal, zero NTI deletion, and 99.9th percentile outlier removal. Built in-place currency normalization with double-conversion prevention using unique key constraints. Italy-specific: chunked UPDATEs (500K batches) with extended lock timeout to avoid MySQL lock waits.

Stage 4: Game Key Mapping

The most complex ETL stage. The same game appears under different names across characteristics data and 9 regional performance tables (e.g., "IGT - Buffalo Gold 0.01" vs "BUFFALO GOLD" vs "Buffalo Gold"). Solution: multi-step normalization — strip vendor prefixes, RTP percentages, denomination suffixes, version numbers; then fuzzy match using difflib with manual override CSV for edge cases. Created 5 linking tables and built silver_dim_game — the unified game dimension (~600 games).
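The normalize-then-fuzzy-match idea can be sketched as follows. This is illustrative only: the regex patterns, the 0.85 cutoff, and the names `normalize_name` and `match_game` are assumptions, not the pipeline's actual rules. Only the use of difflib and a manual override table comes from the description above.

```python
import re
from difflib import SequenceMatcher

# Hypothetical cleanup patterns; the real pipeline's rules are not shown in this report.
VENDOR_PREFIX = re.compile(r"^\s*igt\s*-\s*", re.IGNORECASE)
RTP_PERCENT = re.compile(r"\b\d+(?:\.\d+)?\s*%")
VERSION_TAG = re.compile(r"\bv\d+(?:\.\d+)*\b", re.IGNORECASE)
DENOM_SUFFIX = re.compile(r"\s+\d*\.\d+\s*$")  # trailing denomination such as "0.01"


def normalize_name(raw):
    """Strip vendor prefix, RTP, version, and denomination, then upper-case."""
    name = VENDOR_PREFIX.sub("", raw)
    name = RTP_PERCENT.sub("", name)
    name = VERSION_TAG.sub("", name)
    name = DENOM_SUFFIX.sub("", name)
    return re.sub(r"\s+", " ", name).strip().upper()


def match_game(raw, canonical_names, overrides=None, cutoff=0.85):
    """Return the best canonical name for `raw`, or None if nothing is close enough."""
    overrides = overrides or {}
    if raw in overrides:  # manual override wins over fuzzy matching
        return overrides[raw]
    target = normalize_name(raw)
    best_name, best_ratio = None, 0.0
    for candidate in canonical_names:
        ratio = SequenceMatcher(None, target, normalize_name(candidate)).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = candidate, ratio
    return best_name if best_ratio >= cutoff else None
```

With these assumed patterns, `match_game("IGT - Buffalo Gold 0.01", ["Buffalo Gold", "Wolf Run"])` resolves to `"Buffalo Gold"`, while an unrelated name falls below the cutoff and returns `None` so it can be routed to the override CSV.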

Stage 5: Weekly Processing

Builds silver_performance (~130K rows). Computed WPUPD (Win Per Unit Per Day) = NTI / VLT_DAYS for every row. Normalized monthly regions: NTI ÷ (days_in_month / 7.0) for weekly equivalents.
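The two formulas above translate directly into code. A minimal sketch, assuming hypothetical function names and that a zero or missing VLT_DAYS should yield no value rather than a division error:

```python
import calendar
from typing import Optional


def wpupd(nti: float, vlt_days: float) -> Optional[float]:
    """Win Per Unit Per Day = NTI / VLT_DAYS; None when VLT_DAYS is zero or missing."""
    if not vlt_days:
        return None
    return nti / vlt_days


def monthly_to_weekly(nti_month: float, year: int, month: int) -> float:
    """Weekly-equivalent NTI for a monthly-reporting region: NTI / (days_in_month / 7.0)."""
    days_in_month = calendar.monthrange(year, month)[1]
    return nti_month / (days_in_month / 7.0)
```

For example, a January total of 3,100 over 31 days corresponds to a weekly equivalent of about 700, which puts monthly regions on the same scale as weekly ones.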

Stage 6: Sequence Building

Time-series segmentation with gap detection: consecutive dates > 91 days (13 weeks) apart start a new segment — handles games removed from a market and later re-introduced. Created 3 tables with rolling-window statistics at 1/3/6/9/12 month horizons per segment.
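The gap rule can be sketched in a few lines. The function name `split_segments` is hypothetical; the 91-day threshold and the "strictly greater than" comparison come from the description above.

```python
from datetime import date

GAP_DAYS = 91  # 13 weeks


def split_segments(dates, gap_days=GAP_DAYS):
    """Group observation dates into segments; a gap > gap_days starts a new one."""
    segments = []
    for d in sorted(dates):
        if segments and (d - segments[-1][-1]).days <= gap_days:
            segments[-1].append(d)
        else:
            segments.append([d])
    return segments


observations = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 4, 8), date(2024, 8, 1)]
segments = split_segments(observations)
```

Here 8 January to 8 April 2024 is exactly 91 days, so those dates stay in one segment, while the jump to 1 August exceeds the threshold and opens a second segment.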

Stage 7: Verification

6 structural integrity checks (613 lines), 12 deep accuracy checks (920+ lines), and standalone SQL validation queries (676 lines). End-to-end bronze→silver validation including random game trace.

Database Schema

Designed the complete DDL defining 23+ tables across three layers: Bronze (11 tables: raw extracts), Mapping (5 tables: game name linking), Silver (6 tables: normalized analytics-ready data). Full Medallion Architecture with referential integrity.

Machine Learning Pipeline — 41 Trained Models

Preprocessing

Transforms silver tables into fixed-size tensors. Normalization chain: Raw NTI → signed log → winsorize (1st–99th percentile) → rolling mean (3-week window) → z-score → interpolation to T=32 fixed timesteps. Computed 20 multi-horizon targets per segment and built 23 game characteristic features + interaction features.
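The normalization chain can be illustrated with a dependency-free sketch. The exact definitions used in the project are not given here, so the following are assumptions: signed log as sign(x)·log(1+|x|), linear-interpolated percentiles, a trailing rolling window, population standard deviation for the z-score, and linear interpolation for resampling. All function names are hypothetical.

```python
import math
import statistics


def signed_log(x):
    # Assumed form: keeps the sign, compresses the magnitude.
    return math.copysign(math.log1p(abs(x)), x)


def percentile(values, q):
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def winsorize(values, lower=1.0, upper=99.0):
    lo, hi = percentile(values, lower), percentile(values, upper)
    return [min(max(v, lo), hi) for v in values]


def rolling_mean(values, window=3):
    # Trailing window; the first points average over whatever history exists.
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def zscore(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def resample(values, length=32):
    # Linear interpolation onto a fixed number of timesteps.
    if len(values) == 1:
        return [values[0]] * length
    out = []
    for i in range(length):
        pos = i * (len(values) - 1) / (length - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(values) - 1)
        out.append(values[lo] + (values[hi] - values[lo]) * (pos - lo))
    return out


def preprocess(nti_series, length=32):
    """Raw NTI -> signed log -> winsorize -> rolling mean -> z-score -> T=32."""
    x = [signed_log(v) for v in nti_series]
    x = winsorize(x)
    x = rolling_mean(x, window=3)
    x = zscore(x)
    return resample(x, length)
```

Whatever the segment length, `preprocess` returns exactly 32 values, which is what lets segments of different durations share one tensor shape. The sketch expects a non-empty series.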

Clustering

Implemented 5 clustering algorithms: K-Prototypes (default), K-Means, K-Shape, DTW K-Means, DBSCAN. Selected K-Prototypes for handling mixed numerical + categorical features. K=3 clusters, silhouette score 0.331.

Shape Classification

Ensemble-based trajectory labeling: classifies each segment into one of 15 fine-grained shapes reduced to 3 macro classes (Growth, Decline, Other). Uses 3 smoothing methods (Gaussian, Savitzky-Golay, EMA) with 2-of-3 consensus voting.
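The 2-of-3 consensus step on its own is small enough to sketch. The smoothing and per-method labeling are out of scope here; the function name and the fallback to "Other" when no two methods agree are assumptions.

```python
from collections import Counter


def consensus_label(labels, min_votes=2, fallback="Other"):
    """Return the label at least `min_votes` smoothers agree on, else `fallback`."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else fallback


# One label per smoothing method (Gaussian, Savitzky-Golay, EMA).
agreed = consensus_label(["Growth", "Growth", "Decline"])
split = consensus_label(["Growth", "Decline", "Other"])
```

Requiring two votes means a single smoother reacting to noise cannot decide the class by itself.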

Model Training

Model Category | Count | Features | Performance
Shape classifier | 1 | 22 char-only | Accuracy 58%, F1 0.53
Profitability (full features) | 10 | 34 (char + perf) | R² 0.77–0.90
Risk (full features) | 10 | 34 (char + perf) | R² 0.13–0.66
Profitability (char-only) | 10 | 18 char-only | R² 0.40–0.65
Risk (char-only) | 10 | 18 char-only | Intentionally lower

  • Cross-validation: 5-fold GroupKFold grouped by game_id (prevents data leakage).
  • Algorithm selection: tested CatBoost, XGBoost, LightGBM, and Random Forest; selected CatBoost.
  • SHAP pre-computation: CatBoost TreeSHAP computed during training and saved as parquet files.
  • Runtime SHAP for single-input concept predictions takes < 100ms.
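To show why grouping by game_id prevents leakage, here is a pure-Python stand-in for grouped K-fold splitting. It is a sketch, not scikit-learn's GroupKFold and not the project's code; the greedy balancing and the name `group_folds` are assumptions.

```python
from collections import Counter


def group_folds(game_ids, n_splits=5):
    """Yield (train_idx, test_idx) pairs where no game_id appears on both sides."""
    counts = Counter(game_ids)
    if len(counts) < n_splits:
        raise ValueError("need at least as many distinct groups as folds")
    fold_sizes = [0] * n_splits
    fold_of = {}
    # Largest groups first, each into the currently smallest fold.
    for gid, size in sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = fold_sizes.index(min(fold_sizes))
        fold_of[gid] = target
        fold_sizes[target] += size
    for fold in range(n_splits):
        test_idx = [i for i, g in enumerate(game_ids) if fold_of[g] == fold]
        train_idx = [i for i, g in enumerate(game_ids) if fold_of[g] != fold]
        yield train_idx, test_idx
```

Because every row of a game lands in the same fold, a model is never validated on weeks from a game it also trained on.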

Similarity Engine — Academically-Backed Game Matching

When a game has no performance history, find the most similar existing games and use their data as a proxy for prediction. Designed a weighted game matching system grounded in 13 academic papers spanning cold-start recommendation, case-based reasoning, and collaborative filtering.

Scoring Algorithm — Weighted Gower Coefficient

Based on Gower (1971, Biometrics): handles mixed feature types (categorical, boolean, numeric) without dimensionality inflation from one-hot encoding. Feature weights derived from CatBoost model importance: game_type (10.0), reel_setup (8.0), denom (5.0), jackpot (4.0), etc.

Adaptive Weighted KNN

Based on Sarwar et al. (2001, WWW): up to k_max=3 neighbors above 0.70 similarity threshold. Dilution control: only include neighbors scoring ≥ 85% of the best match score.
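The selection rule can be sketched as below. The parameter values (k_max=3, 0.70 threshold, 85% dilution cut) come from the text; the function names and the similarity-weighted average are illustrative assumptions.

```python
def select_neighbors(scores, k_max=3, threshold=0.70, dilution=0.85):
    """Pick up to k_max neighbors above the threshold and within 85% of the best score."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    eligible = [(game, s) for game, s in ranked if s >= threshold]
    if not eligible:
        return []
    best = eligible[0][1]
    return [(game, s) for game, s in eligible if s >= dilution * best][:k_max]


def weighted_proxy(neighbors, values):
    """Similarity-weighted average of a per-game value (for example NTI)."""
    total = sum(s for _, s in neighbors)
    return sum(values[game] * s for game, s in neighbors) / total


scores = {"A": 0.95, "B": 0.90, "C": 0.78, "D": 0.72, "E": 0.60}
neighbors = select_neighbors(scores)
```

In this example C and D clear the 0.70 threshold but fall below 85% of the best score (0.95), so only A and B contribute to the proxy. That is the dilution control at work.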

Key Design Decisions (Paper-Backed)

Decision | Academic Source
Cold-start content-based fallback | Schein et al. (2002), ACM SIGIR
Similarity imputation over regression | Razavi-Far et al. (2021), PeerJ CS
Gower coefficient for mixed types | Gower (1971), Biometrics
Feature-importance weighting | Wilson & Martinez (1997), JAIR
Mandatory match constraints | Richter & Weber (2013), Springer
Threshold-based neighbor selection | Anagnostopoulos et al. (2024), IJDSA
Adaptive K with dilution control | Desrosiers & Karypis (2011), Springer

Three-Tier Prediction System

Handles any game — from well-established titles with years of data to brand-new concepts with zero history.

Tier | When | Data Source | Confidence
Tier 1 (Exact Match) | Game exists in DB with perf data | Real NTI from silver tables | HIGH
Tier 2 (Similar Match) | Similar game(s) found ≥ 70% | Weighted proxy from KNN neighbors | MEDIUM
Tier 3 (No Match) | No similar game found | Characteristics only | LOW

Built a Concept Builder API (751 lines) that accepts game name or raw characteristics, finds similar games, queries NTI aggregates for neighbors, weights by similarity score, runs Model Registry predictions, and derives WPUPD from NTI using neighbor ratios. Also built a Historical Data Export API (388 lines) for on-demand time series for dashboard graphs.
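The tier routing reduces to a short decision function. A minimal sketch, assuming a hypothetical `choose_tier` helper that receives whether the game has performance data and the best similarity score found (or None):

```python
def choose_tier(has_performance_data, best_similarity, threshold=0.70):
    """Return (tier, confidence) following the three-tier rules."""
    if has_performance_data:
        return ("Tier 1", "HIGH")       # real NTI from the silver tables
    if best_similarity is not None and best_similarity >= threshold:
        return ("Tier 2", "MEDIUM")     # weighted proxy from similar games
    return ("Tier 3", "LOW")            # characteristics only
```

Returning the confidence label alongside the tier keeps the degradation explicit for whoever reads the prediction.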

Critical Bug Fix — KENO/BINGO Scoring

Problem: KENO games could never match (max score ~0.67 vs 0.70 threshold) and BINGO queries returned wrong game types.

Root Cause: compute_score() added ALL feature weights to the denominator unconditionally, but KENO games have NULL reel_setup (97%) and NULL bonus_volatility (100%). These weights (8.0 + 2.5 = 10.5) inflated the denominator while contributing 0 to the numerator.

Fix (3 targeted changes): Moved total_weight += weight inside each scoring branch — only features where BOTH sides have data count in the denominator. Added mandatory game_type filtering. Removed score=0 median fallback that returned random REEL games.
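The denominator problem is easy to reproduce in a reduced form. The sketch below is not the project's compute_score(): it uses only the five weights quoted in this report, treats every feature as an exact-match comparison (real Gower scoring scales numeric features by range), and adds a `skip_missing` flag purely to contrast the old and new behavior.

```python
WEIGHTS = {
    "game_type": 10.0,
    "reel_setup": 8.0,
    "denom": 5.0,
    "jackpot": 4.0,
    "bonus_volatility": 2.5,
}


def compute_score(query, candidate, weights=WEIGHTS, skip_missing=True):
    """Weighted similarity; with skip_missing, NULL features leave the denominator."""
    numerator = 0.0
    total_weight = 0.0
    for feature, weight in weights.items():
        a, b = query.get(feature), candidate.get(feature)
        if a is None or b is None:
            if not skip_missing:
                total_weight += weight  # old behavior: inflates the denominator
            continue
        numerator += weight * (1.0 if a == b else 0.0)
        total_weight += weight          # counted only when both sides have data
    return numerator / total_weight if total_weight else 0.0


def rank_candidates(query, candidates):
    """Score only candidates of the same game_type (mandatory match), best first."""
    scored = [
        (name, compute_score(query, game))
        for name, game in candidates.items()
        if game.get("game_type") == query.get("game_type")
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


keno_a = {"game_type": "KENO", "reel_setup": None, "denom": 0.01,
          "jackpot": True, "bonus_volatility": None}
keno_b = dict(keno_a)
old_score = compute_score(keno_a, keno_b, skip_missing=False)
new_score = compute_score(keno_a, keno_b)
```

With these five weights, two identical KENO games score 19 / 29.5 under the old rule, below the 0.70 threshold, and 1.0 once NULL features are excluded. The report's ~0.67 ceiling presumably reflects the full weight table, which is not reproduced here.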

Testing & Validation Framework

Unit Tests (60+ tests, 6 files)

  • Shape set disjointness, macro encoding, algorithm validity
  • GroupKFold no-leakage, DataPreparer per-fold isolation, metric ranges
  • Feature toggle parsing, mode-based column selection
  • R² thresholds (profitability > 0.5, risk > 0, shape F1 > 0.20), fold stability (std R² < 0.15)
  • End-to-end WPUPD computation chain verification (427+ lines)
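The metric thresholds listed above can be expressed as a small gate function. This is a sketch under assumptions: that the floors apply to the mean score across folds, that stability uses the population standard deviation, and that the stability check applies to the R²-based models only. The name `passes_quality_gate` is hypothetical.

```python
import statistics

# Floors from the test list: profitability R² > 0.5, risk R² > 0, shape F1 > 0.20.
FLOORS = {"profitability": 0.5, "risk": 0.0, "shape": 0.20}
MAX_R2_STD = 0.15


def passes_quality_gate(kind, fold_scores):
    """True when the mean fold score clears its floor and R² folds are stable."""
    if statistics.fmean(fold_scores) <= FLOORS[kind]:
        return False
    if kind != "shape" and statistics.pstdev(fold_scores) >= MAX_R2_STD:
        return False
    return True
```

A profitability model with folds of 0.9 and 0.3 alternating would clear the 0.5 floor on average yet fail the gate, because the spread across folds signals an unstable fit.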

Pipeline Verification (18 checks)

  • 6 structural checks: weekly aggregation ratios, monthly normalization, NTI spot-check, segment integrity, row counts, FK consistency.
  • 12 accuracy checks: bronze overview, DQ impact, currency conversion, game matching funnel, NTI bronze vs silver (all 9 regions), monthly granularity proof, WPUPD math (every row), sequence week counts, segment integrity, NTI aggregate recomputation, FK integrity, random game trace.

CI/CD Automation — 13 GitHub Actions Workflows

# | Workflow | Purpose
1 | extract.yml | Snowflake → MySQL extraction
2 | dq.yml | Data quality + exchange rates + currency normalization
3 | mapping.yml | Game key mapping + dimension export + CSV reports
4 | weekly.yml | Weekly processing + sequences + basic verification
5 | ml.yml | Full ML pipeline with dropdown selection
6 | predict_game.yml | Single-game 3-tier prediction
7 | predict_regional.yml | Cross-region comparison with job summary
8 | full-pipeline.yml | End-to-end chain, scheduled every Sunday 2 AM UTC
9 | verify.yml | 6 structural integrity checks
10 | validate.yml | 12 deep accuracy checks
11 | dashboard_data.yml | Build dashboard JSON/parquet artifacts + S3 upload
12 | deploy.yml | Docker build + deploy to EC2 + health check
13 | game_bridge.yml | Qual↔quant name matching with supervisor approval

Configured self-hosted EC2 runner for all workflows. Auto-commit workflow outputs with [skip ci] tags. Full secrets management for Snowflake, MySQL, and AWS credentials.

Production Deployment on AWS EC2

Docker Containerization

Python 3.11-slim base, multi-container setup (Streamlit + Nginx) on custom bridge network. Nginx reverse proxy with gzip compression, WebSocket upgrade headers, SSE support, 24-hour timeouts.

ChromaDB Persistence Challenge

GitHub Actions checkout@v4 cleans the repository working directory on every deploy, destroying the VectorDB store that lived inside it. Solution: persistent store at /home/ubuntu/chroma_store/ (outside the git checkout), Docker volume mount (read-only), post-deploy restore scripts. Took 4 iterations to get right.

S3 Integration

DashboardDataLoader class with 3 modes: local (dev), s3 (production), auto (S3 with local fallback). ModelRegistry auto-downloads .pkl models from S3 when missing locally.
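The three-mode behavior can be sketched without any AWS dependency by injecting the fetch functions. This is not the project's DashboardDataLoader: the class name, the `s3_fetch` and `local_fetch` hooks, and the broad exception handling in auto mode are all assumptions made for illustration.

```python
import json
from pathlib import Path


class DataLoaderSketch:
    """Illustrative loader with local, s3, and auto (S3 with local fallback) modes."""

    MODES = {"local", "s3", "auto"}

    def __init__(self, mode="auto", local_dir=".", s3_fetch=None, local_fetch=None):
        if mode not in self.MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        self.mode = mode
        self.local_dir = Path(local_dir)
        # Hypothetical hooks: callables taking a file name and returning bytes.
        self._s3_fetch = s3_fetch
        self._local_fetch = local_fetch or self._read_local

    def _read_local(self, name):
        return (self.local_dir / name).read_bytes()

    def _from_s3(self, name):
        if self._s3_fetch is None:
            raise RuntimeError("no S3 fetcher configured")
        return self._s3_fetch(name)

    def load_json(self, name):
        if self.mode == "local":
            raw = self._local_fetch(name)
        elif self.mode == "s3":
            raw = self._from_s3(name)
        else:
            # auto: prefer S3, fall back to the local copy on any failure.
            try:
                raw = self._from_s3(name)
            except Exception:
                raw = self._local_fetch(name)
        return json.loads(raw)
```

In a real deployment the `s3_fetch` hook would wrap an S3 client call; keeping it injectable also makes the fallback path easy to exercise in tests.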

Dashboard Data Layer & Integration

Built a dashboard data builder (327 lines) generating: predictions_enriched.json, game_catalog.json, shap_summary.json, model_quality.json, and 35 SHAP parquet files.

Built a game name bridge with supervisor approval (381 lines) — fuzzy-matches qualitative game names to VLT game names using rapidfuzz with a 3-file supervisor approval system.

Designed the SHAP-to-qualitative category mapping: feature importance aligned with player sentiment categories. Alignment detection: both positive = ALIGNED; mismatch = DIVERGENT (the valuable insight).

Key Technical Challenges

Cross-Jurisdiction Data Normalization

9 regions report data at different frequencies, in different currencies, with different naming conventions. Built multi-stage normalization: currency conversion to USD, monthly-to-weekly normalization, fuzzy game name matching across regions.

Game Name Disambiguation

Same game appears under different names across data sources. Multi-step normalization + difflib fuzzy matching + manual override CSV. Built traceable 5-table linking chain so every join can be audited.

Time Series Segmentation

Games can be removed and re-introduced. Gap detection at 91-day threshold — each continuous deployment period becomes a separate segment for clean ML training.

Cold-Start Prediction

New games have no performance history. 3-tier architecture with graceful degradation. Similarity engine transfers data from structurally similar games with clear confidence disclosure.

NULL Feature Scoring

KENO games have NULL values for key features, inflating similarity denominators. Implemented Gower's δ indicator — only features where both games have data contribute to the denominator.

Persistent VectorDB Across Deployments

GitHub Actions checkout wipes the working directory, destroying ChromaDB. External persistent storage with Docker volume mount and post-deploy restore scripts. Took 4 iterations.

Documentation Authored

Document | Lines | Content
PIPELINE_GUIDE.md | 1,787 | Complete system design: architecture, 23-table DB schema, file-by-file ETL reference
VLT_PIPELINE_GUIDE.md | 586 | Condensed technical guide: all stages, workflows, Docker, local running
EC2_DEPLOYMENT_GUIDE.md | 469 | SSH, Docker operations, dev mode, troubleshooting, AWS credentials
INTEGRATION_PLAN.md | 762 | 7-phase build plan covering IGT Sections A–I
XAI_INTEGRATION_GUIDE.md | 770 | API documentation, data schemas, join keys, error handling
similarity_academic_foundations.md | 368 | 13 academic papers mapped to design decisions and code locations

Total: 4,742 lines of documentation

Quantitative Summary

205 Total Commits
111,260+ Lines of Code
1,037 Files Touched
33 Active Dev Days
7 ETL Stages
23+ DB Tables
41 ML Models
13 CI/CD Workflows
60+ Unit Tests
18 Verification Checks
4,742 Lines of Docs
13 Papers Referenced

Report by Divyanshi Kashyap · May 2026