How to vet AI candidates with proof of real project experience

When teams ask how to vet AI candidates, the real test is whether they can show shipped work, failure modes, and output judgment inside a live workflow, not just talk fluently about the tools.
Most AI hiring breaks at the same point: teams mistake exposure for experience. A candidate has used ChatGPT, maybe built a weekend RAG demo, maybe talks fluently about agents - and then stalls when you ask what shipped, what failed in production, and how they evaluated output quality. Key takeaway: to vet AI candidates well, you need proof of real project experience - what they built, what broke, what data they touched, and how they judged whether outputs were good enough to use. Tool familiarity alone does not predict whether someone can deliver inside your workflow.
In practice, vetting AI candidates means verifying applied work, not screening for keyword matches or polished answers. That matters whether you are hiring an AI engineer in Berlin, a marketing ops lead in Hamburg using Claude and GPT-4 for campaign production, or a RevOps team in Chicago trying to automate account research without creating compliance debt. The hiring market is already heavily automated - according to Harvard Business Review, more than 90% of employers use automated systems to filter or rank applications, and 88% were already using some form of AI for initial screening in 2025. Efficient, yes. Reliable proof of real builder ability, not necessarily.
This article shows how to separate surface fluency from shipped capability. We will cover the signals that matter: artifacts, decision logs, evaluation methods, stack choices, cross-functional constraints, and ownership boundaries around code and data. If you are hiring for AI delivery - not AI theatre - this is how you avoid expensive false positives.
TL;DR
- Require candidates to walk through one shipped AI project end to end, and ask them to show the repo, prompt library, eval set, rollout note, or dashboard that proves they actually built it — for example, a LangSmith trace, a GitHub repo, or a Notion launch doc.
- Use an evidence ladder in screening: treat claims as unverified until they are demonstrated live, backed by artifacts, or confirmed by outcomes, and reject profiles that stay at the “talked about it” level; this is the same logic teams use in systems like Google’s hiring rubrics, where evidence matters more than self-description.
- Ask for the decision trail behind the work: model choice, latency budget, fallback logic, retrieval design, annotation criteria, and sign-off steps, not just stack names or buzzwords.
- Test how they judged output quality by having them explain the evaluation method they used, what “good enough” meant, and what changed after launch; if they cannot explain this clearly, treat that as a red flag, especially if they cannot describe a simple eval set or a human review loop.
- Verify ownership boundaries before hiring by asking what data they touched, what compliance or GDPR review was needed, and which parts they personally shipped versus merely supported.
What does it actually mean to vet AI candidates?
Vetting AI candidates is a verification problem, not a keyword-matching problem. By 2026, resumes across product, ops, and engineering all tend to name the same stack - OpenAI, Claude, LangChain, Azure AI - but those labels tell you little about whether the person can turn a messy workflow into something that ships. What matters is whether they can reconstruct one real project end to end and tie their story to inspectable evidence. GitHub’s own guidance for technical hiring has long pushed teams toward portfolio and contribution review over tool-name filtering, and the interview prep material many candidates now study already covers the standard AI topics, which makes generic Q&A even less useful (GitHub Docs, GitHub interview question repository).
- Start with one shipped project, not a career summary. Ask: what was the user problem, what did you personally build, and what changed after launch? Real builders can usually explain problem definition, data handling, evaluation, deployment, and iteration in sequence. Surface-level candidates jump straight to tools.
- Map claims to an evidence ladder. Treat each statement as one of five levels: claimed, described, demonstrated live, verified in artifacts, or confirmed by outcomes. Artifacts can be a repo, prompt library, eval set, dashboard, architecture diagram, rollout note, or screenshot. A sketch of this ladder in code follows the list.
- Probe decisions, not jargon. Ask why they chose a model, what constraints mattered, and how they judged output quality. The useful details are hard to fake: latency budgets, fallback logic, retrieval choices, annotation criteria, stakeholder sign-off, or GDPR review steps in EU teams. The NIST AI Risk Management Framework is useful here because it frames AI work around measurement, governance, and failure handling rather than demo fluency, and the OECD AI principles reinforce that deployable systems need accountability and traceability.
- Force the failure story. Ask what broke first, what they tried, and why it failed. In a Munich-region software team we worked with, polished candidates stalled here, while a quieter engineer could walk through rejected prompts, evaluation drift, and the GitHub/Notion trail behind the eventual internal copilot rollout. That difference - between polished recall and operational memory - is usually the line between observer and builder.
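If you want the evidence ladder to be auditable rather than a mental model, it fits in a few lines of code. The following is a minimal sketch, assuming a small in-house Python (3.10+) screening script; the five level names come straight from the ladder above, while the class and function names are illustrative, not an existing library.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """The five rungs of the evidence ladder, weakest to strongest."""
    CLAIMED = 1                 # named on the resume, nothing else
    DESCRIBED = 2               # explained coherently in conversation
    DEMONSTRATED_LIVE = 3       # reproduced or walked through in the interview
    VERIFIED_IN_ARTIFACTS = 4   # backed by a repo, eval set, rollout note, dashboard
    CONFIRMED_BY_OUTCOMES = 5   # tied to a measurable change after launch


@dataclass
class Claim:
    project: str
    statement: str
    level: EvidenceLevel
    artifacts: list[str]        # links or file names backing the claim


def stays_at_talk_level(claims: list[Claim]) -> bool:
    """True if every claim sits on the bottom two rungs - the profiles to reject."""
    return all(c.level <= EvidenceLevel.DESCRIBED for c in claims)
```

The `stays_at_talk_level` check is just the reject rule from the TL;DR made explicit: a profile whose every claim stops at claimed or described never reaches the interview loop.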
How can AI help recruiters verify experience?
AI helps recruiters verify experience by turning candidate claims into a structured evidence check: what was said, what artifacts exist, and where the gaps or inconsistencies are. It can flag mismatches fast, but it should narrow the field for human review, not replace judgment.
- Build a claim ledger from the resume and portfolio. Have AI extract each concrete claim into fields: project name, role, data used, model choice, evaluation method, deployment environment, stakeholders, and business outcome. This is where many CVs collapse: they list tools but not decisions. A candidate who says “built a support copilot” but leaves out retrieval setup, eval criteria, or handoff into Zendesk has given you a topic, not evidence. For role-specific dimensions like evaluation, RAG, agents, and cross-functional delivery, a practical taxonomy such as the AI engineering interview question set on GitHub is useful, and NIST’s AI Risk Management Framework gives a better lens for checking whether testing and governance were real or hand-waved. A minimal ledger sketch follows this list.
- Compare claims against artifacts and transcripts. Use AI to line up what the candidate wrote with what appears in GitHub commits, Notion docs, slide decks, demo videos, or interview transcripts. You are looking for inconsistency. In one Munich-region hiring project we saw polished candidates describe “production AI [workflows](/ai-workflows-for-finance-teams-month-end-reporting/)” that, once compared against artifacts, turned out to be prompt experiments with no evaluation notes or deployment trace. Meanwhile, a quieter internal engineer had commit history, rollout docs, and model eval notes that matched their interview story.
- Generate follow-up questions only where the evidence is thin. If the role is product-facing, ask for workflow integration and stakeholder tradeoffs. If it is engineering-heavy, ask for eval design, failure handling, and deployment constraints. If it is EU-based, ask how they handled approval gates, privacy, or works-council concerns rather than accepting generic “we were compliant” language; OECD’s AI principles and the EU AI Act portal both make clear that governance is operational, not rhetorical.
- Keep the final decision human. AI should flag “needs probing on evaluation,” “artifact does not support deployment claim,” or “cross-functional ownership unclear.” It should not auto-reject. Let AI compress the review workload, then let a trained interviewer test the contradictions.
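Here is what the ledger-plus-flags loop can look like. This is a minimal sketch, assuming Python 3.10+ and an extraction step upstream that fills the fields; the field names follow the ledger fields listed above and the flag wording mirrors the examples in the last bullet, but `LedgerEntry`, `FLAGS`, and `follow_up_flags` are illustrative names, not a real tool.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    """One concrete claim extracted from a resume, portfolio, or transcript."""
    project: str
    role: str
    data_used: str | None = None
    model_choice: str | None = None
    evaluation_method: str | None = None
    deployment_environment: str | None = None
    stakeholders: str | None = None
    business_outcome: str | None = None


# Map each thin field to the flag a human interviewer should follow up on.
FLAGS = {
    "evaluation_method": "needs probing on evaluation",
    "deployment_environment": "artifact does not support deployment claim",
    "stakeholders": "cross-functional ownership unclear",
    "data_used": "data access and compliance review unverified",
}


def follow_up_flags(entry: LedgerEntry) -> list[str]:
    """Return flags for missing evidence - surfaced to a human, never auto-rejected."""
    return [msg for field, msg in FLAGS.items() if getattr(entry, field) is None]
```

Calling `follow_up_flags(LedgerEntry(project="support copilot", role="PM"))` returns all four flags: the “topic, not evidence” profile described above. The output is a probing agenda for the interviewer, not a verdict.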
What are the failure modes most hiring teams miss?
The failure mode most hiring teams miss is not lack of AI exposure; it is overestimating people who sound fluent but have never delivered under real constraints. Polished language, tool familiarity, and self-reported impact can all look convincing in an interview and still collapse once the work gets messy.
The first miss is treating AI tool usage as proof of capability. It proves access, maybe curiosity, nothing more. Since 2024, generic AI usage has spread far faster than workflow redesign, which is exactly why exposure is now a weak signal; Deloitte’s human capital research showed many teams were already using AI and automation in day-to-day work years ago, but far fewer were restructuring work itself, which is the bar that matters in hiring (Deloitte Human Capital Trends). A candidate who has really shipped can usually explain where the model sat in the process, what data entered it, where human review stayed, and how output quality was judged against a business task, not a demo.
The second miss is overvaluing portfolios that show outputs without showing authorship. A polished Notion case study, nice screenshots, or a repo with generated code can hide whether the candidate defined the eval set, handled retrieval failures, or only joined for the final presentation. The GitHub documentation on contribution history is useful here because commit patterns, review comments, and issue threads often reveal actual ownership better than the portfolio itself. In one Munich-region hiring process we saw two “AI-savvy” product hires interview well but stall immediately; the quieter internal candidate had the real signal: evaluation notes, rollout docs, and evidence that the workflow had changed how the team worked.
The third miss is conversational competence without operational memory. Surface candidates stay abstract when you ask about the messy middle: bad source data, failed prompts, stakeholder resistance, approval gates, or eval drift after launch. That matters even more in Europe, where teams may need to work through GDPR constraints and consultation requirements rather than hand-wave them; the European Commission’s GDPR guidance and the EU AI Act tracking site make clear that “we’ll sort compliance later” is not a serious delivery answer. The strongest candidates do not just know LangChain, OpenAI, or Azure AI. They can show they owned a workflow that changed throughput, quality, or decision speed for a real team.
Bottom line
Most AI hiring breaks at the same point: teams mistake exposure for experience. Vet candidates by forcing proof of one shipped project end to end - repo, prompt library, eval set, rollout note, or dashboard - and if they can’t show what they personally built, what broke, and how they judged output quality, treat them as unproven. If you need a faster way to separate real builders from polished talkers across AI roles, outside help with evidence-based screening and candidate verification can save you from a bad hire.
When you need to know how to vet AI candidates, ask for the decision points, the fallback path, and the artifacts that prove they actually shipped.
FAQ
What interview questions should I ask to verify AI project experience?
Ask for the exact decision points, not a recap of the project. Good prompts are: what was the first production constraint, what did you reject and why, and what would have broken if the model latency doubled. If they really shipped, they can usually name the fallback path, the evaluation threshold, and the person who signed off on launch.
How do I check if an AI candidate really built the project themselves?
Ask them to reconstruct the work live from memory and artifacts, then compare the sequence against the repo history, issue tracker, or docs. A real builder can explain why specific commits happened, who owned data access, and which parts were adapted from existing internal code. If they cannot describe the implementation tradeoffs without reading from notes, that is a weak signal.
What artifacts should AI candidates show in an interview?
The most useful proof is a small set of concrete artifacts: a repo, prompt library, eval set, rollout note, dashboard screenshot, or incident postmortem. For non-engineering roles, ask for campaign outputs, workflow docs, or before/after metrics tied to the AI process. The point is to see evidence of judgment, not just output volume.