The Role
• We are looking for an Applied AI Engineer/Scientist to build, evaluate, and continuously improve clinical AI agents and supervised ML models.
• You will work at the intersection of software engineering, LLM systems, evaluation, model improvement, and deep healthcare workflow understanding. Your job is to turn frontier model capability into reliable production behavior: agents that read complex medical records, use the right clinical and coding context, call the right tools, produce auditable outputs, and improve from real-world failures.
• You will be embedded in hard healthcare problems (clinical documentation integrity, medical coding, denial prevention, appeals, revenue cycle workflows, and payer logic) and will own the loop from problem framing to agent design, evaluation, deployment, trace analysis, and ongoing improvement.
• The ideal candidate is a strong engineer who thinks like an applied scientist: rigorous about measurement, comfortable with ambiguity, excited by messy real-world data, and motivated by closing the gap between impressive demos and dependable production systems.
What You'll Do
• Design, build, and iterate on agentic AI systems for complex healthcare workflows, including documentation, coding, denial management, appeals, and revenue cycle automation.
• Develop long-horizon agent behavior across context construction, retrieval, tool use, memory, routing, verification, escalation, and human-in-the-loop review.
• Define what "good" looks like for clinical agents end-to-end, translating expert workflows into specifications, rubrics, gold standards, test cases, and clinically meaningful success criteria.
• Build rigorous evaluation and feedback loops using expert review, production logs, model outputs, and benchmarks to measure performance, regressions, edge cases, safety, reliability, provenance quality, and business impact.
• Prototype new AI capabilities from 0 to 1, then harden them into reliable, explainable, auditable production systems with clear contracts, monitoring, evidence, rationale, and performance gates.
• Partner with research and ML engineering teams on model selection, fine-tuning, reward modeling, distillation, synthetic data, post-training, and internal AI infrastructure, including instrumentation, experiment tracking, benchmarking, prompt/version management, and reproducible evaluation.
What Makes This Role Different
• Most AI roles are either too research-heavy or too product-light. This role sits in the middle.
• You will not only write prompts or run experiments. You will own whether an agent actually works in production. That means understanding the workflow, designing the system, building the evals, inspecting failures, improving the agent, and proving that the improvement matters.
The right person will be excited by questions like:
• What context does this agent need to make the right decision?
• How do we know the output is clinically and operationally correct?
• Which failures are prompt problems, retrieval problems, model problems, tool problems, or product-spec problems?
• How do we turn expert feedback into a better benchmark or training set?
• When should we use prompting, RAG, rules, fine-tuning, reward modeling, or a different architecture?
• How do we make agent outputs auditable enough for clinical and operational review?
• How do we build a data flywheel that improves the system every week?
You May Be a Good Fit If You
• Bring 4+ years of software engineering, ML engineering, research engineering, or applied AI experience.
• Are highly proficient in Python and comfortable building production systems with APIs, structured data, async workflows, testing, logging, and observability.
• Have experience turning messy real-world workflows into structured AI problems, including classification, ranking, extraction, decisioning, LLM applications, agents, RAG, tool calling, structured outputs, prompting, or evaluation.
• Have built or operated evaluation systems, benchmarks, annotation workflows, experiment tracking, or regression tests for AI systems.
• Thrive in ambiguous, high-stakes domains: working with experts, debugging real-world failures, and turning model potential into reliable, correct, safe systems that work for users.
Role Leveling
We are looking for candidates at various levels, ranging from Level 2 to Staff.
• L2: Independently delivers a complete end-to-end project, owning design, implementation, and delivery of scoped work
• L3: Leads delivery of larger projects, handling increased technical complexity and ambiguity, providing light guidance to L2s on shared work
Benefits
• Top-of-market compensation (salary + equity)
• Flexible PTO
• Comprehensive health benefits
• 401(k) matching
• Inspiring, brilliant, mission-driven teammates
Hiring Flow
• Intro call - your background & our mission alignment
• Technical deep-dives - pseudo-coding exercise and systems design (not LeetCode)
• Final in-person interview at one of our hubs (SF, NYC, Austin, or Chicago; travel arranged)
• References
• Offer
Interview Logistics Notice
As part of our hiring process, selected candidates will participate in an in-person interview. Candidates located near one of our talent hubs (San Francisco, New York, Austin, or Chicago) will be scheduled to meet with team members in those locations. For candidates residing outside these areas, we will arrange travel to a hub for the interview. Travel accommodation will be provided as needed.