KYC Vendor Evaluation: How to Actually Test Accuracy Claims

TL;DR

Every KYC vendor claims high accuracy, without defining what accuracy means in their context.
“Accuracy” without pass rate, document scope, and attack-detection data is meaningless.
FATF’s risk-based framework requires evaluating the assurance level of the verification system, not vendor marketing claims.
Independent testing under ISO/IEC 30107-3 is the only externally verifiable liveness benchmark currently available.
Technology stack ownership determines how fast a vendor responds when accuracy degrades.

Walk into any KYC vendor evaluation and you will hear the same phrases: “most accurate,” “99.9% accuracy,” “industry-leading performance.” They appear on every shortlisted vendor’s deck. The numbers shift, but the claim does not, and it almost never comes with enough context to mean anything.

The problem isn’t that vendors fabricate figures. The problem is that “accuracy” in identity verification describes at least three different things, and vendors quote whichever one flatters them most. A headline number that looks strong on a slide can reflect conditions that bear no resemblance to your users, your document types, or your fraud exposure. Buying on that figure is one of the most reliable ways to discover what it really meant, but only after you have signed a contract.

This guide is for the evaluation stage, before you sign. It explains what accuracy actually measures in a KYC context, the three levers vendors use to make numbers look better than they are, and the six questions that turn a polished claim into something verifiable. Work through it with every vendor on your shortlist.

What does “accurate” actually mean in identity verification?

Accuracy is a meaningful concept in identity verification. It simply does not mean one thing. Vendors routinely quote whichever of the three performance dimensions they have optimised for in a benchmark, and rarely disclose how each relates to the others. Understanding all three is the first step of any credible KYC vendor evaluation.

Pass rate: how many legitimate users get through

Pass rate is the share of genuine verification attempts that succeed on the first submission. A vendor quoting a high pass rate is telling you something about conversion and user experience. It tells you nothing about how many fraudulent attempts also cleared the system. Strong pass rates can coexist with high fraud-acceptance rates if the rejection threshold has been tuned for volume over security.

False positive rate: how many legitimate users get blocked

The false positive rate is the share of genuine users the system incorrectly flags and rejects. A vendor who tightens fraud detection will typically raise false positives at the same time. These two figures live on opposite ends of the same dial. Any vendor claiming both are simultaneously low should be asked to demonstrate it on your data in a live test, because optimising one almost always costs the other.

Attack detection rate: how many fraud attempts get stopped

Attack detection rate measures the share of fraudulent submissions the system correctly identifies. This is the metric fraud teams care about most. A vendor optimising for conversion may sacrifice detection here. A vendor optimising for detection may degrade pass rates. Understanding how pass rate, false positive rate, and attack detection rate interact is the foundation of any evaluation that goes beyond the headline number.
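The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor’s method: the `Attempt` record, the `verification_rates` function, and the sample counts are all hypothetical. It also assumes one binary decision per attempt, with referrals counted as rejections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    genuine: bool   # ground truth: True for a legitimate user, False for a fraud attempt
    accepted: bool  # system decision on the first submission


def verification_rates(attempts):
    """Return the three rates, each computed over its own base population."""
    genuine = [a for a in attempts if a.genuine]
    fraud = [a for a in attempts if not a.genuine]
    if not genuine or not fraud:
        raise ValueError("need at least one genuine and one fraudulent attempt")
    return {
        # Genuine users accepted, out of all genuine users.
        "pass_rate": sum(a.accepted for a in genuine) / len(genuine),
        # Genuine users rejected, out of all genuine users.
        "false_positive_rate": sum(not a.accepted for a in genuine) / len(genuine),
        # Fraud attempts rejected, out of all fraud attempts.
        "attack_detection_rate": sum(not a.accepted for a in fraud) / len(fraud),
    }


# Hypothetical sample: 100 genuine attempts and 10 fraud attempts.
sample = (
    [Attempt(genuine=True, accepted=True)] * 90
    + [Attempt(genuine=True, accepted=False)] * 10
    + [Attempt(genuine=False, accepted=True)] * 3
    + [Attempt(genuine=False, accepted=False)] * 7
)
rates = verification_rates(sample)
print(rates)
```

In this made-up sample a 90% pass rate sits alongside a system that lets through 3 of 10 fraud attempts, which is why a pass rate quoted alone says nothing about security. Under the simplifying assumptions here, pass rate and false positive rate share the same denominator, so the attack detection rate is the figure a vendor must supply separately.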

Why does the same number mean different things across vendors?

A 94% pass rate from one vendor may be a worse result than 89% from another. The difference usually comes down to three factors that vendors control during benchmarking, and rarely disclose upfront.

Document scope sets the base population

A system trained primarily on Western passports and driver’s licences will score well when tested against a Western-skew dataset. Show that same system documents from Vietnam, Indonesia, or the Gulf and the figures shift, sometimes materially. Before accepting any accuracy claim, ask which document types and issuing countries are represented in the benchmark. A number that excludes your user population is not a number about your use case.
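A short worked example shows how the document mix alone can reverse a ranking. The per-segment pass rates, segment names, and traffic mixes below are invented for illustration, and `blended_pass_rate` is a hypothetical helper. The sketch re-weights each vendor’s per-segment figures by a different traffic mix.

```python
def blended_pass_rate(segment_rates, mix):
    """Weight per-segment pass rates by the share of traffic in each segment."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    missing = set(mix) - set(segment_rates)
    if missing:
        raise ValueError(f"no pass rate for segments: {sorted(missing)}")
    return sum(segment_rates[segment] * share for segment, share in mix.items())


# Hypothetical per-segment pass rates for two vendors.
vendor_a = {"western": 0.96, "southeast_asia": 0.76}
vendor_b = {"western": 0.90, "southeast_asia": 0.88}

# Each vendor's headline figure, computed on its own benchmark mix.
headline_a = blended_pass_rate(vendor_a, {"western": 0.9, "southeast_asia": 0.1})
headline_b = blended_pass_rate(vendor_b, {"western": 0.5, "southeast_asia": 0.5})

# The same vendors re-weighted to a buyer whose traffic is mostly Southeast Asian.
your_mix = {"western": 0.2, "southeast_asia": 0.8}
yours_a = blended_pass_rate(vendor_a, your_mix)
yours_b = blended_pass_rate(vendor_b, your_mix)

print(f"Headline: A {headline_a:.1%}, B {headline_b:.1%}")
print(f"Your mix: A {yours_a:.1%}, B {yours_b:.1%}")
```

With these assumed figures, vendor A’s 94% headline falls to 80% on the buyer’s mix while vendor B’s 89% headline holds at about 88%. The arithmetic only works if the vendor discloses per-segment rates, which is the point of asking for them.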

Risk threshold tuning moves the accept/reject line

Every identity verification system applies a confidence threshold: below it, the decision is reject or refer; above it, the decision is accept. Vendors can shift this threshold to inflate either the pass rate figure or the fraud-detection figure, depending on which one they want to put in front of you. Ask to see the default threshold settings, and ask how the false positive rate changes if the threshold moves by as little as 5 points on the vendor’s scoring scale.
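The effect of moving the threshold can be sketched with a simple sweep. Everything here is assumed for illustration: the 0 to 100 score scale, the rule that a score at or above the threshold is accepted, the sample scores, and the `sweep` function itself.

```python
def sweep(scored, thresholds):
    """For each threshold, report how many genuine and fraud attempts fall below it."""
    genuine = [score for score, is_genuine in scored if is_genuine]
    fraud = [score for score, is_genuine in scored if not is_genuine]
    if not genuine or not fraud:
        raise ValueError("need scores for both genuine and fraudulent attempts")
    rows = []
    for t in thresholds:
        rows.append({
            "threshold": t,
            # Genuine attempts scoring below the threshold are rejected or referred.
            "false_positive_rate": sum(s < t for s in genuine) / len(genuine),
            # Fraud attempts scoring below the threshold are stopped.
            "attack_detection_rate": sum(s < t for s in fraud) / len(fraud),
        })
    return rows


# Hypothetical (score, is_genuine) pairs from a labelled test set.
scored = [(s, True) for s in (95, 90, 88, 84, 82, 78, 75, 71, 66, 60)]
scored += [(s, False) for s in (80, 72, 65, 55, 40)]

table = sweep(scored, thresholds=[70, 75, 80])
for row in table:
    print(row)
```

On this invented data, raising the threshold from 70 to 75 lifts attack detection from 60% to 80% and false positives from 20% to 30%. Raising it again to 80 pushes false positives to 50% with no further gain in detection. A vendor can quote either end of that range as its “accuracy,” which is why the threshold behind any figure needs to be disclosed.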

Lab results behave differently from live traffic

Controlled benchmark conditions produce accuracy figures that rarely survive contact with real traffic. Good lighting, cooperative test subjects, high-quality images, and known document types are standard in a lab. None of those apply reliably in live traffic. Live users submit documents on low-end mobile devices, in poor lighting, with physical wear on the ID. Adversarial attempts are baked into the live flow. Ask specifically for live-traffic performance data from a deployment that resembles yours. A vendor who can only produce lab figures is telling you something important about their confidence in live-traffic performance.

The six questions that stress-test any accuracy claim

The fastest path from a vendor’s headline figure to something you can act on is to ask directly. The table below maps each question to what a substantive answer looks like, alongside the deflections that tell you the vendor cannot or will not answer.

#	Question	Credible answer	Red flag
1	What is your pass rate for our target geographies and document types?	A market-specific figure from live traffic, named geography	“Industry-leading overall pass rate”
2	What is your false positive rate at default threshold settings?	Named figure with threshold disclosed	“Our false positive rate is very low”
3	Has your liveness been independently tested under ISO/IEC 30107-3, and at which level?	Named level (1, 2, or 3), named lab, IAPAR disclosed	PAD Level 1 only, certificate date not given, or no answer
4	Do you own the liveness and OCR models, or do you license them from a third party?	In-house, named as proprietary	“We use best-of-breed technology partners”
5	If accuracy degrades on a specific document type, who remediates and how fast?	SLA named, owner identified, remediation path described	“We work with our technology partners on that”
6	Can you run a proof-of-concept on our actual document types before we sign?	POC offered at no charge, structured and scoped	Deflection to reference clients or published case studies

The pattern in the red-flag column is worth noticing: deflection to aggregate figures, unnamed partners, and reference clients are all ways of avoiding accountability for your specific use case. A vendor with strong numbers for your document types and geographies will not deflect to averages.

What does independent validation actually prove?

Third-party testing removes the self-reporting problem entirely. For liveness detection, the relevant standard is ISO/IEC 30107-3, evaluated by accredited testing laboratories. iBeta Quality Assurance is the most widely used NVLAP-accredited laboratory for presentation attack detection (PAD) compliance testing.

iBeta structures its evaluation across three escalating levels of attacker capability. Level 1 tests basic presentation attacks using artefacts costing under $30, including printed photographs and video replays. Level 2 raises the bar to materials under $300, including 3D-printed masks and higher-fidelity silicone artefacts. By mid-2025, more than 100 biometric products globally had demonstrated ISO/IEC 30107-3 PAD compliance at Levels 1 and 2, making conformance at those levels table stakes rather than a differentiator.

iBeta introduced Level 3 in mid-2025 to address AI-driven fraud techniques that earlier levels were not built to catch, including deepfakes, face swaps, and generative attack tools. Level 3 removes the budget constraint entirely. Expert attackers have weeks to prepare and attempt a breach using the full range of current AI spoofing techniques. Very few vendors globally hold Level 3 conformance today.

When a vendor says their liveness has been independently tested, ask specifically: which level, which laboratory, and what was the Impostor Attack Presentation Accept Rate (IAPAR)? A Level 2 certificate from 2023 is a materially different claim from Level 3 conformance earned under the 2025 framework. The question about level is the single most diagnostic question in any liveness evaluation.
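To make the IAPAR question concrete, the sketch below reads the metric simply as accepted attack presentations divided by total attack presentations. The artefact names and counts are invented, the `iapar` helper is hypothetical, and reporting the worst artefact type alongside the overall figure is a choice made here for illustration, not a description of how any laboratory formats its report.

```python
def iapar(outcomes):
    """Share of attack presentations accepted; True means the attack got through."""
    if not outcomes:
        raise ValueError("no attack presentations recorded")
    return sum(outcomes) / len(outcomes)


# Hypothetical test log: one list of outcomes per artefact type.
results = {
    "printed_photo": [False] * 50,
    "video_replay": [False] * 48 + [True] * 2,
    "silicone_mask": [False] * 45 + [True] * 5,
}

per_artefact = {name: iapar(outcomes) for name, outcomes in results.items()}
overall = iapar([x for outcomes in results.values() for x in outcomes])
worst = max(per_artefact, key=per_artefact.get)

for name, rate in per_artefact.items():
    print(f"{name}: {rate:.1%}")
print(f"overall: {overall:.1%}, worst artefact type: {worst}")
```

In this invented log the overall rate is under 5%, yet one artefact type gets through 10% of the time. An aggregate figure without the breakdown can hide exactly the attack type that matters for your exposure.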

Why does technology ownership change the accountability equation?

Identity verification systems degrade. New attack types emerge, document templates change, and model drift quietly erodes OCR accuracy on a specific document class. What happens next depends entirely on whether the vendor owns the component that failed.

Most identity verification platforms were not built end-to-end. They assembled their stacks from third-party components (liveness from one provider, document OCR from another), creating fragmented systems with no single owner. When accuracy degrades inside a third-party component, the vendor’s response time is bounded by its supplier’s backlog, not its own engineering capacity. The client sees degraded performance. The vendor has a ticket open with a third party.

FATF’s Digital Identity Guidance makes this point clearly. Regulated entities must evaluate the assurance level of a digital identity system across its technology, architecture, and governance, not just its headline accuracy figure. Governance includes accountability. If the component that failed belongs to a subprocessor, the vendor cannot be fully accountable for the failure or its remediation timeline.

Unexplained accuracy degradation mid-contract, on a document type the vendor said they supported or an attack vector they said their liveness handled, is one of the most common friction points compliance teams encounter when they skip the ownership question during the evaluation. Ask it before you sign: if your accuracy drops on a specific document type next month, what is the remediation path, and who owns it?

How does Shufti approach the ownership question?

If the answers to the checklist questions above are thin from a current or prospective vendor, the underlying cause is almost always the same: a fragmented stack means no single team can answer fully for its performance.

Shufti built and owns its entire technology stack, from OCR and liveness detection to document intelligence and AML screening, with no third-party components on the critical path. When accuracy degrades, there is one owner and one remediation path, on Shufti’s timeline. The liveness layer holds iBeta Level 3 conformance under ISO/IEC 30107-3, the highest published independent standard for liveness attack detection as of 2025, and models were trained natively on 10,000+ document types across 220+ countries, so hard-market performance figures reflect live traffic, not Western-skew lab conditions.

One platform. Fully owned technology. Global coverage with real local depth.

Test Shufti’s accuracy on your own document types: book a 20-minute demo.

Frequently Asked Questions

What is a reasonable KYC pass rate to benchmark against?

Pass rate benchmarks vary by document type and user geography. For document-plus-biometric verification in regulated financial services, a live-traffic first-attempt pass rate above 90% on Western documents is a starting baseline. On documents from hard markets such as Vietnam, Indonesia, or the Gulf, ask for market-specific figures, because aggregate numbers that include easy document types will mask the performance gaps in the markets that matter.

How do I test a vendor's accuracy before committing to a contract?

Request a proof-of-concept on a representative sample of your actual document types and user geographies. A credible vendor runs this at no charge. Any vendor who deflects to reference clients or published benchmarks rather than testing on your data is signalling something about how they handle accountability when conditions do not match the benchmark.
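When reading proof-of-concept results, it helps to check how much a pass rate can be trusted at the sample size tested. The sketch below uses the Wilson score interval, a standard way to put a range around an observed proportion. The document type names and counts are hypothetical, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for an observed pass rate (z=1.96 gives about 95%)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical POC results: (passed, total) per document type.
poc_results = {"vn_id_card": (180, 200), "uae_emirates_id": (36, 40)}

intervals = {}
for doc_type, (passed, total) in poc_results.items():
    low, high = wilson_interval(passed, total)
    intervals[doc_type] = (low, high)
    print(f"{doc_type}: {passed / total:.1%} observed, interval {low:.1%} to {high:.1%}")
```

Both hypothetical document types show a 90% observed pass rate, but the interval for the 40-document sample is far wider than for the 200-document sample. A POC scoped with only a handful of documents per hard market cannot distinguish a strong vendor from a mediocre one.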

What is the difference between PAD Level 1, 2, and 3?

iBeta's three-level framework for ISO/IEC 30107-3 escalates attacker capability. Level 1 covers basic attacks using artefacts under $30. Level 2 raises this to attacks under $300. Level 3 removes all budget constraints, uses expert attackers over weeks, and tests against current AI spoofing and deepfake techniques. By mid-2025, Levels 1 and 2 had been achieved by over 100 products globally. Level 3 conformance is what now differentiates a liveness vendor's claim.

Can a vendor's accuracy degrade over time without any explicit system change?

Yes. Model drift, new document template releases, shifts in your user population, and emerging attack types can all reduce accuracy with no system update in sight. Ask vendors whether their models are retrained on live traffic data and at what cadence. If a vendor uses third-party components for OCR or liveness, ask who owns the retraining of those models and what the SLA is for a degradation event.
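A buyer can watch for this kind of silent degradation without access to the vendor’s models. The sketch below compares recent per-document-type pass rates against an agreed baseline. The document type names, the baseline figures, the 3-point tolerance, and the 200-attempt minimum are all assumptions chosen for illustration.

```python
def flag_degradation(baseline, recent, max_drop=0.03, min_attempts=200):
    """Return (doc_type, drop) pairs where the pass rate fell by more than max_drop."""
    flagged = []
    for doc_type, (passed, total) in recent.items():
        # Skip types with no baseline or too little recent traffic to judge.
        if doc_type not in baseline or total < min_attempts:
            continue
        drop = baseline[doc_type] - passed / total
        if drop > max_drop:
            flagged.append((doc_type, round(drop, 4)))
    # Largest drop first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical baseline pass rates agreed at contract signature.
baseline = {"vn_id_card": 0.91, "id_ktp": 0.89, "uk_passport": 0.96}

# Hypothetical last-week traffic: (passed, total) per document type.
recent = {
    "vn_id_card": (410, 500),   # 82%, a 9-point drop
    "id_ktp": (352, 400),       # 88%, within tolerance
    "uk_passport": (95, 100),   # too few attempts to judge
}

alerts = flag_degradation(baseline, recent)
print(alerts)
```

A check like this gives you dated evidence to attach to a degradation ticket, which matters when the SLA clock depends on when the problem was first reported.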
