Senior ML Engineer — Audit + AWS Production Deploy (XGBoost / FastAPI / Shadow Scoring)
Hourly contract · Estimated 30–50 hours over 4 weeks · Budget guidance: $80–150/hr depending on track record
About us
Small in-house engineering team at a public-utility-adjacent organization. We run an internal QA platform (10–15 reviewers) with a volume-heavy target of 1.5M tickets/year, and the operational stakes are real: bad model decisions create downstream cost.
We have engineering capacity. We want senior ML-deployment judgment on top before we go live in production.
What the system is
Internal QA platform that scores incoming work tickets for risk priority using a three-layer pipeline:
Rules engine — deterministic policy enforcement (always runs)
ML model — XGBoost / LightGBM trained on human-reviewed historical tickets (always runs)
LLM — conditional, fires only when rules + ML disagree or a ticket escalates
Production launch on AWS: July 2026.
Architecture
Backend stack:
Python 3.11, FastAPI (async), SQLAlchemy 2.0 (async)
Postgres 15 (partitioned by year on the hot tables) + Redis + Celery
Containerized via Docker; AWS deploy target
ML stack:
scikit-learn / XGBoost / LightGBM, joblib serialization
Production model at risk_model_latest.joblib, candidate at risk_model_candidate.joblib — both on a persistent volume (/app/models)
HMAC-signed model binaries; unsigned files refuse to load (a minimal load-check sketch follows this list)
SHAP per-prediction explanations served via API
Shadow scoring (candidate runs in parallel with production, results in ml_shadow_scores)
Human-in-the-loop training: every reviewer correction lands in ml_feedback and feeds retraining
Retrain history tracked in ml_training_runs
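For context, roughly what we mean by "unsigned files refuse to load" — a minimal sketch only, assuming an HMAC-SHA256 signature in a sidecar file next to the artifact and a key injected via the environment; the actual file layout, key management, and error handling in our code may differ:

```python
import hashlib
import hmac
import os
from pathlib import Path

import joblib

MODEL_DIR = Path("/app/models")
# Hypothetical: key injected via environment (e.g. sourced from Secrets Manager).
SIGNING_KEY = os.environ.get("MODEL_SIGNING_KEY", "").encode()


class ModelSignatureError(RuntimeError):
    """Artifact signature missing or mismatched; the model must not be loaded."""


def load_verified_model(name: str = "risk_model_latest.joblib"):
    artifact = MODEL_DIR / name
    sig_path = artifact.with_name(artifact.name + ".sig")  # hypothetical sidecar file

    if not sig_path.exists():
        raise ModelSignatureError(f"no signature found for {artifact}")

    expected = bytes.fromhex(sig_path.read_text().strip())
    actual = hmac.new(SIGNING_KEY, artifact.read_bytes(), hashlib.sha256).digest()

    # Constant-time comparison; on mismatch we refuse to deserialize the file at all.
    if not hmac.compare_digest(actual, expected):
        raise ModelSignatureError(f"signature mismatch for {artifact}")

    return joblib.load(artifact)
```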
Audit trail:
Every prediction snapshots rules_version, ml_version, llm_version, workflow_config (rough shape sketched after this list)
Permanent system_audit table for security/admin events
Versioned config history per knob
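For reference, the rough shape of that per-prediction snapshot. Only the four fields named above come from our schema; everything else here (field names, types, the dataclass itself) is illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class PredictionSnapshot:
    ticket_id: int
    risk_score: float
    rules_version: str               # e.g. "rules-2026.03"
    ml_version: str                  # e.g. "risk_model_latest / training run 42"
    llm_version: str | None          # None when the LLM layer did not fire
    workflow_config: dict[str, Any]  # config as applied at prediction time
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```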
What's already built
Full scoring pipeline is shipped and running in dev
Shadow scoring + drift comparison logic is in place (a rough sketch of the kind of comparison we mean follows this list)
HMAC sign + verify on model load is in place
Admin ops surface for ML promotion, drift dashboards, retraining triggers
Postgres partitioning, archive/purge lifecycle, response caching — all shipped
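To give a concrete sense of the shadow-score comparison you'd be auditing: line up production and candidate scores for the same tickets (as stored in ml_shadow_scores) and compute agreement/drift statistics. The metrics below (rank correlation, PSI, mean absolute delta) and the assumption that scores are probabilities in [0, 1] are illustrative, not the math we actually shipped:

```python
import numpy as np
from scipy.stats import spearmanr


def population_stability_index(prod: np.ndarray, cand: np.ndarray, bins: int = 10) -> float:
    """PSI between the two score distributions (higher = more drift).

    Assumes scores are probabilities in [0, 1].
    """
    p, _ = np.histogram(prod, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(cand, bins=bins, range=(0.0, 1.0))
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))


def compare_shadow_scores(prod_scores: np.ndarray, cand_scores: np.ndarray) -> dict:
    """Agreement / drift summary for the same tickets scored by both models."""
    rank_corr, _ = spearmanr(prod_scores, cand_scores)
    return {
        "spearman_rank_corr": float(rank_corr),
        "psi": population_stability_index(prod_scores, cand_scores),
        "mean_abs_delta": float(np.mean(np.abs(prod_scores - cand_scores))),
    }
```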
What we need
Senior ML engineer to audit + deploy, not build:
ML integration audit — review the rules→ML→LLM orchestration, agreement thresholds, shadow-scoring math, drift detection, SHAP wiring. Produce a P0/P1/P2 findings list.
Model artifact lifecycle audit — verify HMAC signing, candidate→production promotion path, rollback procedure, IAM scope for model files on AWS.
AWS production deploy — recommend artifact storage strategy (S3 vs EFS vs persistent volume), validate prod matches dev behavior, pair on the cutover and post-deploy verification.
Deliverables:
Written audit report with P0/P1/P2 findings
Promotion / rollback / emergency-demote runbook
Signed-off production deploy
1 week post-launch on-call availability for ML-related issues
We're open to a better approach
If during the audit you think our current setup is wrong for our scale or use case, tell us. Specifically, we want your honest read on:
Is XGBoost/LightGBM the right model family for this problem, or should we be looking at something else (deep tabular models, calibrated linear stacks, a different boosting library, a managed service like SageMaker)?
Is the on-disk joblib + HMAC artifact pattern the right shape for production, or should the model live somewhere else (SageMaker endpoint, MLflow registry, BentoML / KServe)?
Is our home-grown shadow-scoring + drift-detection layer worth keeping, or are we reinventing something a hosted MLOps tool would handle better at our scale?
Should we even own training infrastructure for 1.5M tickets/year, or is this a "managed retrain pipeline" use case?
We'd rather hear "scrap your candidate-promotion code and use SageMaker model registry" or "your boosting choice is fine, fix these three things" than a polite review that misses the bigger call.
If the recommendation is "scrap and replace," we'll treat that as a separate engagement decision — you're not on the hook to execute a rewrite as part of this contract.
Required experience
5+ years shipping ML systems to production (not just notebooks, not just research)
Strong with: Python, scikit-learn, XGBoost / LightGBM, joblib, FastAPI, Celery, Postgres, Redis
Deep AWS experience: ECS/EKS, S3, IAM, Secrets Manager, CloudWatch — specifically deploying ML model artifacts with proper IAM scoping and audit
Has actually done this before: model versioning, signature validation, candidate/shadow rollout, drift monitoring. We don't want someone learning these patterns on our project.
Bonus: experience with SHAP in production (per-prediction explanations served via API), HMAC-signed model loading patterns, or human-in-the-loop training pipelines
What will disqualify you
Listing "AI/ML" alongside 30 other unrelated skills
Proposals that don't reference our actual stack (XGBoost, joblib, FastAPI, AWS S3)
LLM-only specialists — this is a tree-based ML audit, not a prompt engineering project
To apply, answer these 3 questions
Walk us through how you'd validate that an HMAC-signed joblib model binary hasn't been tampered with at load time. What's the failure mode if the signature check is wrong?
We use shadow scoring — a candidate model runs in parallel with production and scores are written to ml_shadow_scores. What metrics would you track to decide when to promote? What would block a promotion?
Have you deployed an XGBoost or LightGBM model to AWS production in the last 12 months? Briefly describe the artifact storage strategy you chose and why (S3 vs EFS vs container image vs persistent volume).
Generic / template answers will be ignored. We're filtering for people who've actually done this.
Timeline
Start: ASAP — ideally this week
Production launch: July 1, 2026
Engagement window: 4 weeks (audit + deploy + 1 week post-launch on-call)