AI Evaluation / Testing (Evaluation Engineer)
Fort Worth, TX (Remote) | Long-Term Contract
We are seeking a skilled AI Evaluation Engineer to validate AI models and agent workflows built on AWS and Azure, which serve as the core AI foundation, with Microsoft Copilot as the primary user experience layer. The role is responsible for ensuring that AI systems meet rigorous standards for accuracy, safety, bias, and performance through structured testing, benchmarking, and continuous evaluation pipelines across the full AI lifecycle. The candidate will work closely with AI Architects, AI Engineers, and AI Security Engineers to establish evaluation frameworks that provide confidence in AI outputs before and after production deployment, including Copilot-integrated workflows and RAG-based systems.
Key Responsibilities:
• Evaluation Framework Design
• Model & Agent Testing
• RAG & Retrieval Evaluation
• Safety, Bias & Responsible AI Testing
• Continuous Evaluation Pipelines
• Benchmarking & Performance Testing
• Collaboration & Quality Advocacy
Required Qualifications:
• 6-10 years of experience in software testing, data science, or AI/ML engineering, including 3+ years focused on evaluation, testing, or quality assurance of LLM-powered or other AI systems in production.
• Hands-on experience evaluating AI workloads on AWS (Bedrock, SageMaker) and Azure (Azure AI Foundry, Azure ML), including model testing, RAG evaluation, and agent workflow validation.
• Experience testing Microsoft Copilot-integrated solutions, Copilot plugins, or Microsoft 365 AI features for quality, accuracy, and governance compliance.
• Strong understanding of LLM evaluation metrics, including BLEU, ROUGE, BERTScore, faithfulness, relevance, and coherence, as well as task-specific scoring methodologies.
• Experience with RAG evaluation frameworks and retrieval metrics (MRR, nDCG, precision, recall) across vector search platforms such as Azure AI Search and Amazon OpenSearch.
• Familiarity with responsible AI evaluation principles, including bias detection, fairness assessment, safety testing, and regulatory compliance validation.
• Experience building automated evaluation and CI/CD pipelines; proficiency in Python and familiarity with evaluation frameworks such as the Azure AI Evaluation SDK, Ragas, or DeepEval.
• Strong analytical and communication skills, with the ability to translate complex evaluation findings into clear quality assessments and release recommendations.