ML & Data · Co-op Work Report
InsightEngine
VLT Performance Analytics & Prediction Platform
IGT × UNB VLT Lab
Feb–May 2026 (4 months)
Lead Developer
Project Background
IGT operates Video Lottery Terminals (VLTs) across 9 lottery jurisdictions in 4 countries (Canada, USA, Sweden, Italy). With over 600 unique game titles generating varying levels of revenue, IGT needed a data-driven system to answer three core business questions:
- How is a game performing? — Normalize and compare game revenue across jurisdictions with different currencies, reporting frequencies, and market sizes.
- What trajectory will a new game follow? — Predict whether a game will show Growth, Decline, or Other patterns over time.
- How profitable and risky will a game be? — Forecast Net Terminal Income (NTI) and its volatility at 1, 3, 6, 9, and 12-month horizons.
What is InsightEngine?
InsightEngine is a full-stack data analytics platform that combines:
- A quantitative pipeline (ETL + ML) that ingests raw VLT data, trains predictive models, and generates game-level forecasts.
- A qualitative pipeline (NLP + embeddings) that processes player focus group feedback into searchable, structured insights.
- An integration layer that bridges both using SHAP explainability — connecting what the model predicts with how players feel.
- A Streamlit dashboard deployed on AWS EC2 for interactive exploration.
I was the primary developer for the entire quantitative side of InsightEngine. I designed and built the ETL pipeline, ML training and inference system, similarity engine, 3-tier prediction architecture, testing/validation suite, CI/CD automation, dashboard data layer, and production deployment infrastructure from scratch.
Technical Architecture
DATA SOURCES
+----------------------------------------------+
| Snowflake (9 region tables + characteristics)|
| SAP HANA (exchange rates) |
| Focus Group Reports (PDF/PPTX/audio) |
+----------------------+-----------------------+
|
+------v------+
| ETL PIPELINE| (7 stages)
| Bronze-Silver|
+------+------+
|
+------------+------------+
| | |
+-------v---+ +----v----+ +---v------+
| ML PIPELINE| |SIMILARITY| |QUALITATIVE|
| 41 models | | ENGINE | | PIPELINE |
+------+----+ +----+----+ +----+-----+
| | |
+------v------------v------------v------+
| DASHBOARD DATA LAYER |
| Static JSON/YAML + On-Demand APIs |
+------------------+--------------------+
|
+------------------v--------------------+
| STREAMLIT DASHBOARD |
| Docker + Nginx - EC2 Production |
+-----------------------------------------+
Technology Stack
| Layer | Technologies |
| Languages | Python 3.11, SQL |
| Data Processing | pandas, NumPy, SciPy, scikit-learn |
| ML Frameworks | CatBoost, XGBoost, LightGBM, PyTorch |
| Explainability | SHAP (TreeSHAP via CatBoost) |
| Databases | MySQL, Snowflake, ChromaDB |
| Frontend | Streamlit, Plotly |
| Infrastructure | Docker, Nginx, AWS EC2, S3, Bedrock |
| CI/CD | GitHub Actions (13 workflows), self-hosted EC2 runner |
| Experiment Tracking | MLflow with S3 backend |
Data Coverage
| Metric | Value |
| Jurisdictions | 9 (AGLC, ALC, MBLL, SD, OSL, SEJQ, Sweden, WCLC, Italy) |
| Countries | 4 (Canada, USA, Sweden, Italy) |
| Currencies | 4 (CAD, USD, SEK, EUR) — all normalized to USD |
| Unique Games | ~600 |
| Weekly Performance Rows | ~130,000 |
| Trained ML Models | 41 |
| Prediction Rows | 780 (game × region combinations) |
ETL Pipeline — Data Ingestion & Transformation
Built a 7-stage pipeline that ingests raw VLT performance data from 9 jurisdictions and transforms it into clean, normalized, ML-ready datasets.
Stage 1: Snowflake Extraction
Pulls data from Snowflake into 9 MySQL bronze tables plus a characteristics table. Built special handling for each jurisdiction's data format:
- Italy: year+week temporal columns (not date), chunked loading at 200K rows per batch to handle ~10.9M rows.
- AGLC: mixed granularity — monthly data pre-2024, daily data post-2024 with automatic detection.
- SEJQ: dual source tables (historical + current) merged into one bronze table.
- Sweden: year+week columns requiring date derivation.
Stage 2: Exchange Rate Loading
Pulls CAD/USD, SEK/USD, EUR/USD rates from SAP HANA. Rates stored with append-only strategy to preserve historical rates.
Stage 3: Data Quality & Currency Normalization
Implements 5 DQ rules per table: placeholder deletion, NULL field deletion, invalid entry removal, zero NTI deletion, and 99.9th percentile outlier removal. Built in-place currency normalization with double-conversion prevention using unique key constraints. Italy-specific: chunked UPDATEs (500K batches) with extended lock timeout to avoid MySQL lock waits.
Stage 4: Game Key Mapping
The most complex ETL stage. The same game appears under different names across characteristics data and 9 regional performance tables (e.g., "IGT - Buffalo Gold 0.01" vs "BUFFALO GOLD" vs "Buffalo Gold"). Solution: multi-step normalization — strip vendor prefixes, RTP percentages, denomination suffixes, version numbers; then fuzzy match using difflib with manual override CSV for edge cases. Created 5 linking tables and built silver_dim_game — the unified game dimension (~600 games).
Stage 5: Weekly Processing
Builds silver_performance (~130K rows). Computed WPUPD (Win Per Unit Per Day) = NTI / VLT_DAYS for every row. Normalized monthly regions: NTI ÷ (days_in_month / 7.0) for weekly equivalents.
Stage 6: Sequence Building
Time-series segmentation with gap detection: consecutive dates > 91 days (13 weeks) apart start a new segment — handles games removed from a market and later re-introduced. Created 3 tables with rolling-window statistics at 1/3/6/9/12 month horizons per segment.
Stage 7: Verification
6 structural integrity checks (613 lines), 12 deep accuracy checks (920+ lines), and standalone SQL validation queries (676 lines). End-to-end bronze→silver validation including random game trace.
Database Schema
Designed the complete DDL defining 23+ tables across three layers: Bronze (11 tables: raw extracts), Mapping (5 tables: game name linking), Silver (6 tables: normalized analytics-ready data). Full Medallion Architecture with referential integrity.
Machine Learning Pipeline — 41 Trained Models
Preprocessing
Transforms silver tables into fixed-size tensors. Normalization chain: Raw NTI → signed log → winsorize (1st–99th percentile) → rolling mean (3-week window) → z-score → interpolation to T=32 fixed timesteps. Computed 20 multi-horizon targets per segment and built 23 game characteristic features + interaction features.
Clustering
Implemented 5 clustering algorithms: K-Prototypes (default), K-Means, K-Shape, DTW K-Means, DBSCAN. Selected K-Prototypes for handling mixed numerical + categorical features. K=3 clusters, silhouette score 0.331.
Shape Classification
Ensemble-based trajectory labeling: classifies each segment into one of 15 fine-grained shapes reduced to 3 macro classes (Growth, Decline, Other). Uses 3 smoothing methods (Gaussian, Savitzky-Golay, EMA) with 2-of-3 consensus voting.
Model Training
| Model Category | Count | Features | Performance |
| Shape classifier | 1 | 22 char-only | Accuracy 58%, F1 0.53 |
| Profitability (full features) | 10 | 34 (char + perf) | R² 0.77–0.90 |
| Risk (full features) | 10 | 34 (char + perf) | R² 0.13–0.66 |
| Profitability (char-only) | 10 | 18 char-only | R² 0.40–0.65 |
| Risk (char-only) | 10 | 18 char-only | Intentionally lower |
Cross-validation: 5-fold GroupKFold grouped by game_id (prevents data leakage). Algorithm selection: tested CatBoost, XGBoost, LightGBM, Random Forest — selected CatBoost. SHAP pre-computation: CatBoost TreeSHAP computed during training and saved as parquet files. Runtime SHAP for single-input concept predictions takes < 100ms.
Similarity Engine — Academically-Backed Game Matching
When a game has no performance history, find the most similar existing games and use their data as a proxy for prediction. Designed a weighted game matching system grounded in 13 academic papers spanning cold-start recommendation, case-based reasoning, and collaborative filtering.
Scoring Algorithm — Weighted Gower Coefficient
Based on Gower (1971, Biometrics): handles mixed feature types (categorical, boolean, numeric) without dimensionality inflation from one-hot encoding. Feature weights derived from CatBoost model importance: game_type (10.0), reel_setup (8.0), denom (5.0), jackpot (4.0), etc.
Adaptive Weighted KNN
Based on Sarwar et al. (2001, WWW): up to k_max=3 neighbors above 0.70 similarity threshold. Dilution control: only include neighbors scoring ≥ 85% of the best match score.
Key Design Decisions (Paper-Backed)
| Decision | Academic Source |
| Cold-start content-based fallback | Schein et al. (2002), ACM SIGIR |
| Similarity imputation over regression | Razavi-Far et al. (2021), PeerJ CS |
| Gower coefficient for mixed types | Gower (1971), Biometrics |
| Feature-importance weighting | Wilson & Martinez (1997), JAIR |
| Mandatory match constraints | Richter & Weber (2013), Springer |
| Threshold-based neighbor selection | Anagnostopoulos et al. (2024), IJDSA |
| Adaptive K with dilution control | Desrosiers & Karypis (2011), Springer |
Three-Tier Prediction System
Handles any game — from well-established titles with years of data to brand-new concepts with zero history.
| Tier | When | Data Source | Confidence |
| Tier 1 (Exact Match) | Game exists in DB with perf data | Real NTI from silver tables | HIGH |
| Tier 2 (Similar Match) | Similar game(s) found ≥ 70% | Weighted proxy from KNN neighbors | MEDIUM |
| Tier 3 (No Match) | No similar game found | Characteristics only | LOW |
Built a Concept Builder API (751 lines) that accepts game name or raw characteristics, finds similar games, queries NTI aggregates for neighbors, weights by similarity score, runs Model Registry predictions, and derives WPUPD from NTI using neighbor ratios. Also built a Historical Data Export API (388 lines) for on-demand time series for dashboard graphs.
Critical Bug Fix — KENO/BINGO Scoring
Problem: KENO games could never match (max score ~0.67 vs 0.70 threshold) and BINGO queries returned wrong game types.
Root Cause: compute_score() added ALL feature weights to the denominator unconditionally, but KENO games have NULL reel_setup (97%) and NULL bonus_volatility (100%). These weights (8.0 + 2.5 = 10.5) inflated the denominator while contributing 0 to the numerator.
Fix (3 targeted changes): Moved total_weight += weight inside each scoring branch — only features where BOTH sides have data count in the denominator. Added mandatory game_type filtering. Removed score=0 median fallback that returned random REEL games.
Testing & Validation Framework
Unit Tests (60+ tests, 6 files)
- Shape set disjointness, macro encoding, algorithm validity
- GroupKFold no-leakage, DataPreparer per-fold isolation, metric ranges
- Feature toggle parsing, mode-based column selection
- R² thresholds (profitability > 0.5, risk > 0, shape F1 > 0.20), fold stability (std R² < 0.15)
- End-to-end WPUPD computation chain verification (427+ lines)
Pipeline Verification (18 checks)
- 6 structural checks: weekly aggregation ratios, monthly normalization, NTI spot-check, segment integrity, row counts, FK consistency.
- 12 accuracy checks: bronze overview, DQ impact, currency conversion, game matching funnel, NTI bronze vs silver (all 9 regions), monthly granularity proof, WPUPD math (every row), sequence week counts, segment integrity, NTI aggregate recomputation, FK integrity, random game trace.
CI/CD Automation — 13 GitHub Actions Workflows
| # | Workflow | Purpose |
| 1 | extract.yml | Snowflake → MySQL extraction |
| 2 | dq.yml | Data quality + exchange rates + currency normalization |
| 3 | mapping.yml | Game key mapping + dimension export + CSV reports |
| 4 | weekly.yml | Weekly processing + sequences + basic verification |
| 5 | ml.yml | Full ML pipeline with dropdown selection |
| 6 | predict_game.yml | Single-game 3-tier prediction |
| 7 | predict_regional.yml | Cross-region comparison with job summary |
| 8 | full-pipeline.yml | End-to-end chain, scheduled every Sunday 2 AM UTC |
| 9 | verify.yml | 6 structural integrity checks |
| 10 | validate.yml | 12 deep accuracy checks |
| 11 | dashboard_data.yml | Build dashboard JSON/parquet artifacts + S3 upload |
| 12 | deploy.yml | Docker build + deploy to EC2 + health check |
| 13 | game_bridge.yml | Qual↔quant name matching with supervisor approval |
Configured self-hosted EC2 runner for all workflows. Auto-commit workflow outputs with [skip ci] tags. Full secrets management for Snowflake, MySQL, and AWS credentials.
Production Deployment on AWS EC2
Docker Containerization
Python 3.11-slim base, multi-container setup (Streamlit + Nginx) on custom bridge network. Nginx reverse proxy with gzip compression, WebSocket upgrade headers, SSE support, 24-hour timeouts.
ChromaDB Persistence Challenge
GitHub Actions checkout@v4 wipes the entire git directory on every deploy, destroying the VectorDB store. Solution: persistent store at /home/ubuntu/chroma_store/ (outside git checkout), Docker volume mount (read-only), post-deploy restore scripts. Took 4 iterations to get right.
S3 Integration
DashboardDataLoader class with 3 modes: local (dev), s3 (production), auto (S3 with local fallback). ModelRegistry auto-downloads .pkl models from S3 when missing locally.
Dashboard Data Layer & Integration
Built a dashboard data builder (327 lines) generating: predictions_enriched.json, game_catalog.json, shap_summary.json, model_quality.json, and 35 SHAP parquet files.
Built a game name bridge with supervisor approval (381 lines) — fuzzy-matches qualitative game names to VLT game names using rapidfuzz with a 3-file supervisor approval system.
Designed the SHAP-to-qualitative category mapping: feature importance aligned with player sentiment categories. Alignment detection: both positive = ALIGNED; mismatch = DIVERGENT (the valuable insight).
Key Technical Challenges
Cross-Jurisdiction Data Normalization
9 regions report data at different frequencies, in different currencies, with different naming conventions. Built multi-stage normalization: currency conversion to USD, monthly-to-weekly normalization, fuzzy game name matching across regions.
Game Name Disambiguation
Same game appears under different names across data sources. Multi-step normalization + difflib fuzzy matching + manual override CSV. Built traceable 5-table linking chain so every join can be audited.
Time Series Segmentation
Games can be removed and re-introduced. Gap detection at 91-day threshold — each continuous deployment period becomes a separate segment for clean ML training.
Cold-Start Prediction
New games have no performance history. 3-tier architecture with graceful degradation. Similarity engine transfers data from structurally similar games with clear confidence disclosure.
NULL Feature Scoring
KENO games have NULL values for key features, inflating similarity denominators. Implemented Gower's δ indicator — only features where both games have data contribute to the denominator.
Persistent VectorDB Across Deployments
GitHub Actions checkout wipes the working directory, destroying ChromaDB. External persistent storage with Docker volume mount and post-deploy restore scripts. Took 4 iterations.
Documentation Authored
| Document | Lines | Content |
| PIPELINE_GUIDE.md | 1,787 | Complete system design: architecture, 23-table DB schema, file-by-file ETL reference |
| VLT_PIPELINE_GUIDE.md | 586 | Condensed technical guide: all stages, workflows, Docker, local running |
| EC2_DEPLOYMENT_GUIDE.md | 469 | SSH, Docker operations, dev mode, troubleshooting, AWS credentials |
| INTEGRATION_PLAN.md | 762 | 7-phase build plan covering IGT Sections A–I |
| XAI_INTEGRATION_GUIDE.md | 770 | API documentation, data schemas, join keys, error handling |
| similarity_academic_foundations.md | 368 | 13 academic papers mapped to design decisions and code locations |
Total: 4,742 lines of documentation
Quantitative Summary
205Total Commits
111,260+Lines of Code
1,037Files Touched
33Active Dev Days
7ETL Stages
23+DB Tables
41ML Models
13CI/CD Workflows
60+Unit Tests
18Verification Checks
4,742Lines of Docs
13Papers Referenced
Report by Divyanshi Kashyap · May 2026