Why we ran this test
Every major LLM provider claims strong document understanding. The marketing language is consistent: accurate extraction, reliable output, production-ready. The reality is more nuanced, and the gaps become visible when you move from demo PDFs to the kinds of documents that actually show up in financial operations workflows.
We ran a structured test using real financial documents from operations use cases: bank statements, invoices, loan applications, and trade finance documents. The documents were sourced from actual workflow samples (anonymized), not benchmark datasets. We tested GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro against the same extraction tasks.
This is not a comprehensive academic benchmark. It is a practical evaluation oriented toward the question that operations teams actually care about: can I trust this output enough to route it into my systems without a human reviewing every line?
Test methodology
We used 120 documents across four categories, with 30 documents per category. Each document was processed with the same prompt structure across all three models. We measured field-level extraction accuracy against manually verified ground truth, and tracked confidence calibration (whether high-confidence outputs were actually more accurate than low-confidence ones).
The four document categories:
- Bank statements: 30 statements from 8 different financial institutions, mix of digital-native PDFs and scanned originals. Key fields: account holder name, account number, opening/closing balance, transaction count, statement period.
- Supplier invoices: 30 invoices from 15 vendors across 4 countries. Key fields: invoice number, date, total amount, line items, VAT amount, supplier name, buyer name.
- Loan applications: 30 application forms, mix of structured forms and semi-structured documents. Key fields: applicant name, income, loan amount requested, employment status, collateral description.
- Trade finance documents: 30 letters of credit and bill of lading samples. Key fields: parties, amounts, commodity description, port of loading/discharge, expiry date.
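The field-level scoring described above can be sketched in a few lines. This is a minimal illustration, not the harness we used: the field names, document dicts, and normalization rules are assumptions for the example.

```python
# Sketch of field-level accuracy scoring: compare a model's extracted
# fields against manually verified ground truth, one field at a time.
# Field names and values below are illustrative, not from the test set.

def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial differences don't count as errors."""
    return " ".join(str(value).split()).lower()

def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields the model extracted correctly."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for field, expected in ground_truth.items()
        if normalize(extracted.get(field, "")) == normalize(expected)
    )
    return correct / len(ground_truth)

# Example: a bank statement with one wrong field out of four
truth = {"account_number": "123-456", "closing_balance": "10,420.00",
         "statement_period": "2024-01", "transaction_count": "38"}
model_output = {"account_number": "123-456", "closing_balance": "10,420.00",
                "statement_period": "2024-01", "transaction_count": "36"}
print(field_accuracy(model_output, truth))  # 0.75
```

In practice, per-field normalization needs to be type-aware (dates, amounts, currencies), since a naive string comparison either over-penalizes formatting differences or masks real errors.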
Overall accuracy results
| Model | Bank statements | Supplier invoices | Loan applications | Trade finance docs | Overall |
|---|---|---|---|---|---|
| GPT-4o | 94% | 91% | 88% | 74% | 87% |
| Claude 3.5 Sonnet | 96% | 93% | 90% | 78% | 89% |
| Gemini 1.5 Pro | 92% | 89% | 85% | 71% | 84% |
The headline numbers look impressive. But they obscure the distribution of errors, which matters more than the average for operational use.
Where errors actually occurred
Aggregate accuracy figures hide a critical pattern: errors in LLM document extraction are not randomly distributed. They cluster in specific conditions.
| Error category | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|
| Multi-currency documents (wrong currency applied) | 8 errors | 6 errors | 11 errors |
| Scanned documents with low resolution | 14 errors | 11 errors | 17 errors |
| Non-Latin script fields | 9 errors | 7 errors | 13 errors |
| Date format ambiguity (e.g. 04/05/24) | 6 errors | 5 errors | 8 errors |
| Trade finance boilerplate conflated with key fields | 22 errors | 18 errors | 26 errors |
The trade finance category is where all three models showed their weakest performance. The documents contain dense legal and commercial boilerplate alongside the key fields, and models frequently conflate the two, extracting values from boilerplate clauses rather than from the operative fields. This is not solvable through prompt engineering alone; it requires document-type-specific extraction logic and a human review layer for complex documents.
Confidence calibration: the gap that matters for operations
Beyond accuracy, we looked at whether the models' stated confidence scores reliably predicted accuracy. A well-calibrated model should be more accurate on high-confidence outputs than low-confidence ones. If confidence is not calibrated, you cannot use it as a reliable signal for routing decisions.
Our findings: all three models showed meaningful confidence calibration on simple document types (bank statements, clean invoices). On complex documents (trade finance, multi-page scanned applications), calibration deteriorated: models returned high-confidence scores on incorrectly extracted fields often enough that using raw confidence as a routing signal, without tuning, would be operationally dangerous.
This is the most important finding for operations teams. High accuracy on clean documents plus poor confidence calibration on complex documents means that full automation is safe for a subset of your document volume and genuinely risky for another subset. The challenge is building a system that distinguishes between the two reliably.
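The calibration check itself is straightforward to sketch: bucket field extractions by the model's stated confidence and compare per-bucket accuracy. A calibrated model should be clearly more accurate in the high-confidence bucket. The bucket edges and the toy records below are illustrative assumptions, not our actual data.

```python
from collections import defaultdict

def calibration_by_bucket(records, edges=(0.5, 0.8)):
    """records: iterable of (confidence, was_correct) pairs.
    Returns accuracy per confidence bucket."""
    buckets = defaultdict(list)
    for conf, correct in records:
        if conf < edges[0]:
            buckets["low"].append(correct)
        elif conf < edges[1]:
            buckets["mid"].append(correct)
        else:
            buckets["high"].append(correct)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}

# Toy records: a model that is often wrong even at high stated confidence
records = [(0.95, True), (0.92, False), (0.90, False),
           (0.60, True), (0.55, False), (0.30, False)]
print(calibration_by_bucket(records))
```

If the "high" bucket's accuracy is not materially better than the others, as in this toy data, confidence cannot be used as an automation gate without per-document-type tuning.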
What this means for document workflow design
The results reinforce what experienced operations teams already know: LLMs are extraction tools, not workflow solutions. Using them directly in a production document pipeline without a validation and review layer is a reliability risk, regardless of which model you choose.
The practical design implication is the same across all three models: automate the high-confidence, standard-format documents fully, and route everything else through a human review gate. The question is not which LLM to use. The question is how to build the layer around the LLM that makes the output safe to use downstream.
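A minimal sketch of that routing gate, assuming per-document-type confidence thresholds. The threshold values and type names are illustrative, not recommendations; in a real system they would be tuned against your own calibration data.

```python
# Route a document to full automation only when every extracted field
# clears the confidence threshold for its document type.
# Thresholds below are hypothetical examples.

REVIEW_THRESHOLDS = {
    "bank_statement": 0.90,    # well-calibrated category: automate aggressively
    "supplier_invoice": 0.92,
    "loan_application": 0.95,
    "trade_finance": 1.01,     # > 1.0: always route to human review
}

def route(doc_type: str, min_field_confidence: float) -> str:
    """Return 'auto' or 'human_review' for a processed document."""
    # Unknown document types default to human review
    threshold = REVIEW_THRESHOLDS.get(doc_type, 1.01)
    return "auto" if min_field_confidence >= threshold else "human_review"

print(route("bank_statement", 0.97))  # auto
print(route("trade_finance", 0.99))   # human_review
```

Gating on the minimum field confidence, rather than the average, reflects the finding above: a single wrong currency or date is an operational incident, so one weak field should pull the whole document into review.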
For more on how to design that layer, see our guide on human-in-the-loop document automation. If you want to see how Floowed structures this for financial document workflows, talk to the team.
> "The accuracy numbers look fine until you look at where the errors happen. In financial documents, a wrong currency or a transposed date is not a minor error. It is an operational incident."
>
> Head of Operations, Southeast Asia Trade Finance Platform
Frequently Asked Questions
Which AI model performs best for financial document extraction?
Based on our structured test of 120 real financial documents, Claude 3.5 Sonnet produced the highest overall accuracy at 89%, followed by GPT-4o at 87% and Gemini 1.5 Pro at 84%. However, the headline figures are less important than the distribution of errors, which cluster in specific conditions such as multi-currency documents, low-resolution scans, and complex trade finance documents regardless of the model used.
Can I use ChatGPT, Claude, or Gemini directly in a production document workflow?
Using LLMs directly without a validation and review layer is a reliability risk in production financial workflows. All three models produce errors that cluster in predictable conditions, and their confidence calibration deteriorates on complex document types. The recommended architecture is to use the LLM as the extraction engine and build a validation and human review layer on top of it.
What is confidence calibration and why does it matter for document extraction?
Confidence calibration describes how reliably a model's confidence score predicts its accuracy. A well-calibrated model is significantly more accurate on high-confidence outputs than low-confidence ones, making the confidence score a useful signal for routing decisions. Our tests found that all three models showed acceptable calibration on simple documents, but that calibration deteriorated on complex ones, making confidence scores unreliable as a routing signal without additional tuning.
Why did all three models perform poorly on trade finance documents?
Trade finance documents contain dense legal and commercial boilerplate alongside the key fields. Models frequently conflate the two, extracting values from boilerplate sections rather than from the operative fields. This is not easily solved through prompt engineering alone. It requires document-type-specific extraction logic and a human review layer for complex documents.
Should I choose my LLM based on document accuracy benchmarks?
Benchmarks on clean, standard-format documents are not a reliable guide for production document workflows. The gaps between models are small on standard documents and larger on edge cases. A more useful evaluation is to test your specific document types, including your edge cases, and to measure confidence calibration rather than average accuracy alone.