Senior ML Engineer — Audit + AWS Production Deploy (XGBoost / FastAPI / Shadow Scoring)
Hourly contract · Estimated 30–50 hours over 4 weeks · Budget guidance: $80–150/hr depending on track record
About us
Small in-house engineering team at a public-utility-adjacent organization. We run an internal QA platform (10–15 reviewers) with a volume-heavy target of 1.5M tickets/year, and the operational stakes are real: bad model decisions create downstream cost.
We have engineering capacity. We want senior ML-deployment judgment on top before we go live in production.
What the system is
Internal QA platform that scores incoming work tickets for risk priority using a three-layer pipeline:
Rules engine — deterministic policy enforcement (always runs)
ML model — XGBoost / LightGBM trained on human-reviewed historical tickets (always runs)
LLM — conditional, fires only when rules + ML disagree or a ticket escalates
Production launch on AWS: July 2026.
Architecture
Backend stack:
Python 3.11, FastAPI (async), SQLAlchemy 2.0 (async)
Postgres 15 (partitioned by year on the hot tables) + Redis + Celery
Containerized via Docker; AWS deploy target
ML stack:
scikit-learn / XGBoost / LightGBM, joblib serialization
Production model at risk_model_latest.joblib, candidate at risk_model_candidate.joblib — both on a persistent volume (/app/models)
HMAC-signed model binaries; unsigned files refuse to load (a minimal load-check sketch follows this list)
SHAP per-prediction explanations served via API
Shadow scoring (candidate runs in parallel with production, results in ml_shadow_scores)
Human-in-the-loop training: every reviewer correction lands in ml_feedback and feeds retraining
Retrain history tracked in ml_training_runs
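For context, roughly what we mean by "unsigned files refuse to load" — a minimal sketch only, assuming an HMAC-SHA256 signature in a sidecar file next to the artifact and a key injected via the environment; the actual file layout, key management, and error handling in our code may differ:

```python
import hashlib
import hmac
import os
from pathlib import Path

import joblib

MODEL_DIR = Path("/app/models")
# Hypothetical: key injected via environment (e.g. sourced from Secrets Manager).
SIGNING_KEY = os.environ.get("MODEL_SIGNING_KEY", "").encode()


class ModelSignatureError(RuntimeError):
    """Artifact signature missing or mismatched; the model must not be loaded."""


def load_verified_model(name: str = "risk_model_latest.joblib"):
    artifact = MODEL_DIR / name
    sig_path = artifact.with_name(artifact.name + ".sig")  # hypothetical sidecar file

    if not sig_path.exists():
        raise ModelSignatureError(f"no signature found for {artifact}")

    expected = bytes.fromhex(sig_path.read_text().strip())
    actual = hmac.new(SIGNING_KEY, artifact.read_bytes(), hashlib.sha256).digest()

    # Constant-time comparison; on mismatch we refuse to deserialize the file at all.
    if not hmac.compare_digest(actual, expected):
        raise ModelSignatureError(f"signature mismatch for {artifact}")

    return joblib.load(artifact)
```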
Audit trail:
Every prediction snapshots rules_version, ml_version, llm_version, workflow_config (rough shape sketched after this list)
Permanent system_audit table for security/admin events
Versioned config history per knob
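For reference, the rough shape of that per-prediction snapshot. Only the four fields named above come from our schema; everything else here (field names, types, the dataclass itself) is illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class PredictionSnapshot:
    ticket_id: int
    risk_score: float
    rules_version: str               # e.g. "rules-2026.03"
    ml_version: str                  # e.g. "risk_model_latest / training run 42"
    llm_version: str | None          # None when the LLM layer did not fire
    workflow_config: dict[str, Any]  # config as applied at prediction time
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```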
What's already built
Full scoring pipeline is shipped and running in dev
Shadow scoring + drift comparison logic is in place (a rough sketch of the kind of comparison we mean follows this list)
HMAC sign + verify on model load is in place
Admin ops surface for ML promotion, drift dashboards, retraining triggers
Postgres partitioning, archive/purge lifecycle, response caching — all shipped
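To give a concrete sense of the shadow-score comparison you'd be auditing: line up production and candidate scores for the same tickets (as stored in ml_shadow_scores) and compute agreement/drift statistics. The metrics below (rank correlation, PSI, mean absolute delta) and the assumption that scores are probabilities in [0, 1] are illustrative, not the math we actually shipped:

```python
import numpy as np
from scipy.stats import spearmanr


def population_stability_index(prod: np.ndarray, cand: np.ndarray, bins: int = 10) -> float:
    """PSI between the two score distributions (higher = more drift).

    Assumes scores are probabilities in [0, 1].
    """
    p, _ = np.histogram(prod, bins=bins, range=(0.0, 1.0))
    q, _ = np.histogram(cand, bins=bins, range=(0.0, 1.0))
    p = np.clip(p / p.sum(), 1e-6, None)
    q = np.clip(q / q.sum(), 1e-6, None)
    return float(np.sum((p - q) * np.log(p / q)))


def compare_shadow_scores(prod_scores: np.ndarray, cand_scores: np.ndarray) -> dict:
    """Agreement / drift summary for the same tickets scored by both models."""
    rank_corr, _ = spearmanr(prod_scores, cand_scores)
    return {
        "spearman_rank_corr": float(rank_corr),
        "psi": population_stability_index(prod_scores, cand_scores),
        "mean_abs_delta": float(np.mean(np.abs(prod_scores - cand_scores))),
    }
```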
What we need
Senior ML engineer to audit + deploy, not build:
ML integration audit — review the rules→ML→LLM orchestration, agreement thresholds, shadow-scoring math, drift detection, SHAP wiring. Produce a P0/P1/P2 findings list.
Model artifact lifecycle audit — verify HMAC signing, candidate→production promotion path, rollback procedure, IAM scope for model files on AWS.
AWS production deploy — recommend artifact storage strategy (S3 vs EFS vs persistent volume), validate prod matches dev behavior, pair on the cutover and post-deploy verification.
Deliverables:
Written audit report with P0/P1/P2 findings
Promotion / rollback / emergency-demote runbook
Signed-off production deploy
1 week post-launch on-call availability for ML-related issues
We're open to a better approach
If during the audit you think our current setup is wrong for our scale or use case, tell us. Specifically, we want your honest read on:
Is XGBoost/LightGBM the right model family for this problem, or should we be looking at something else (deep tabular models, calibrated linear stacks, a different boosting library, a managed service like SageMaker)?
Is the on-disk joblib + HMAC artifact pattern the right shape for production, or should the model live somewhere else (SageMaker endpoint, MLflow registry, BentoML / KServe)?
Is our home-grown shadow-scoring + drift-detection layer worth keeping, or are we reinventing something a hosted MLOps tool would handle better at our scale?
Should we even own training infrastructure for 1.5M tickets/year, or is this a "managed retrain pipeline" use case?
We'd rather hear "scrap your candidate-promotion code and use SageMaker model registry" or "your boosting choice is fine, fix these three things" than a polite review that misses the bigger call.
If the recommendation is "scrap and replace," we'll treat that as a separate engagement decision — you're not on the hook to execute a rewrite as part of this contract.
Required experience
5+ years shipping ML systems to production (not just notebooks, not just research)
Strong with: Python, scikit-learn, XGBoost / LightGBM, joblib, FastAPI, Celery, Postgres, Redis
Deep AWS experience: ECS/EKS, S3, IAM, Secrets Manager, CloudWatch — specifically deploying ML model artifacts with proper IAM scoping and audit
Has actually done this before: model versioning, signature validation, candidate/shadow rollout, drift monitoring. We don't want someone learning these patterns on our project.
Bonus: experience with SHAP in production (per-prediction explanations served via API), HMAC-signed model loading patterns, or human-in-the-loop training pipelines
What will disqualify you
Listing "AI/ML" alongside 30 other unrelated skills
Proposals that don't reference our actual stack (XGBoost, joblib, FastAPI, AWS S3)
LLM-only specialists — this is a tree-based ML audit, not a prompt engineering project
To apply, answer these 3 questions
Walk us through how you'd validate that an HMAC-signed joblib model binary hasn't been tampered with at load time. What's the failure mode if the signature check is wrong?
We use shadow scoring — a candidate model runs in parallel with production and scores are written to ml_shadow_scores. What metrics would you track to decide when to promote? What would block a promotion?
Have you deployed an XGBoost or LightGBM model to AWS production in the last 12 months? Briefly describe the artifact storage strategy you chose and why (S3 vs EFS vs container image vs persistent volume).
Generic / template answers will be ignored. We're filtering for people who've actually done this.
Timeline
Start: ASAP — ideally this week
Production launch: July 1, 2026
Engagement window: 4 weeks (audit + deploy + 1 week post-launch on-call)