Why we ran this test
Every major LLM provider claims strong document understanding. The marketing language is consistent: accurate extraction, reliable output, production-ready. The reality is more nuanced, and the gaps become visible when you move from demo PDFs to the kinds of documents that actually show up in financial operations workflows.
We ran a structured test using real financial documents from operations use cases: bank statements, invoices, loan applications, and trade finance documents. The documents were sourced from actual workflow samples (anonymized), not benchmark datasets. We tested GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro against the same extraction tasks.
This is not a comprehensive academic benchmark. It is a practical evaluation oriented toward the question that operations teams actually care about: can I trust this output enough to route it into my systems without a human reviewing every line?
Test methodology
We used 120 documents across four categories, with 30 documents per category. Each document was processed with the same prompt structure across all three models. We measured field-level extraction accuracy against manually verified ground truth, and tracked confidence calibration (whether high-confidence outputs were actually more accurate than low-confidence ones).
The four document categories:
- Bank statements: 30 statements from 8 different financial institutions, mix of digital-native PDFs and scanned originals. Key fields: account holder name, account number, opening/closing balance, transaction count, statement period.
- Supplier invoices: 30 invoices from 15 vendors across 4 countries. Key fields: invoice number, date, total amount, line items, VAT amount, supplier name, buyer name.
- Loan applications: 30 application forms, mix of structured forms and semi-structured documents. Key fields: applicant name, income, loan amount requested, employment status, collateral description.
- Trade finance documents: 30 letters of credit and bill of lading samples. Key fields: parties, amounts, commodity description, port of loading/discharge, expiry date.
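The field-level scoring described above can be sketched in a few lines. This is a minimal illustration, not the harness we used: the field names, document dicts, and normalization rules are assumptions for the example.

```python
# Sketch of field-level accuracy scoring: compare a model's extracted
# fields against manually verified ground truth, one field at a time.
# Field names and values below are illustrative, not from the test set.

def normalize(value: str) -> str:
    """Collapse whitespace and case so trivial differences don't count as errors."""
    return " ".join(str(value).split()).lower()

def field_accuracy(extracted: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth fields the model extracted correctly."""
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for field, expected in ground_truth.items()
        if normalize(extracted.get(field, "")) == normalize(expected)
    )
    return correct / len(ground_truth)

# Example: a bank statement with one wrong field out of four
truth = {"account_number": "123-456", "closing_balance": "10,420.00",
         "statement_period": "2024-01", "transaction_count": "38"}
model_output = {"account_number": "123-456", "closing_balance": "10,420.00",
                "statement_period": "2024-01", "transaction_count": "36"}
print(field_accuracy(model_output, truth))  # 0.75
```

In practice, per-field normalization needs to be type-aware (dates, amounts, currencies), since a naive string comparison either over-penalizes formatting differences or masks real errors.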
Overall accuracy results
| Model | Bank statements | Supplier invoices | Loan applications | Trade finance docs | Overall |
|---|---|---|---|---|---|
| GPT-4o | 94% | 91% | 88% | 74% | 87% |
| Claude 3.5 Sonnet | 96% | 93% | 90% | 78% | 89% |
| Gemini 1.5 Pro | 92% | 89% | 85% | 71% | 84% |
The headline numbers look impressive. But they obscure the distribution of errors, which matters more than the average for operational use.
Where errors actually occurred
Aggregate accuracy figures hide a critical pattern: errors in LLM document extraction are not randomly distributed. They cluster in specific conditions.
| Error category | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|
| Multi-currency documents (wrong currency applied) | 8 errors | 6 errors | 11 errors |
| Scanned documents with low resolution | 14 errors | 11 errors | 17 errors |
| Non-Latin script fields | 9 errors | 7 errors | 13 errors |
| Date format ambiguity (e.g. 04/05/24) | 6 errors | 5 errors | 8 errors |
| Trade finance boilerplate conflated with key fields | 22 errors | 18 errors | 26 errors |
The trade finance category is where all three models showed their weakest performance. The documents contain dense legal and commercial boilerplate alongside the key fields, and models frequently conflate the two, extracting values from boilerplate clauses rather than from the operative fields. This is not solvable through prompt engineering alone; it requires document-type-specific extraction logic and a human review layer for complex documents.
Confidence calibration: the gap that matters for operations
Beyond accuracy, we looked at whether the models' stated confidence scores reliably predicted accuracy. A well-calibrated model should be more accurate on high-confidence outputs than low-confidence ones. If confidence is not calibrated, you cannot use it as a reliable signal for routing decisions.
Our findings: all three models showed meaningful confidence calibration on simple document types (bank statements, clean invoices). On complex documents (trade finance, multi-page scanned applications), calibration deteriorated: models returned high-confidence scores on incorrectly extracted fields often enough that using raw confidence as a routing signal, without tuning, would be operationally dangerous.
This is the most important finding for operations teams. High accuracy on clean documents plus poor confidence calibration on complex documents means that full automation is safe for a subset of your document volume and genuinely risky for another subset. The challenge is building a system that distinguishes between the two reliably.
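The calibration check itself is straightforward to sketch: bucket field extractions by the model's stated confidence and compare per-bucket accuracy. A calibrated model should be clearly more accurate in the high-confidence bucket. The bucket edges and the toy records below are illustrative assumptions, not our actual data.

```python
from collections import defaultdict

def calibration_by_bucket(records, edges=(0.5, 0.8)):
    """records: iterable of (confidence, was_correct) pairs.
    Returns accuracy per confidence bucket."""
    buckets = defaultdict(list)
    for conf, correct in records:
        if conf < edges[0]:
            buckets["low"].append(correct)
        elif conf < edges[1]:
            buckets["mid"].append(correct)
        else:
            buckets["high"].append(correct)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}

# Toy records: a model that is often wrong even at high stated confidence
records = [(0.95, True), (0.92, False), (0.90, False),
           (0.60, True), (0.55, False), (0.30, False)]
print(calibration_by_bucket(records))
```

If the "high" bucket's accuracy is not materially better than the others, as in this toy data, confidence cannot be used as an automation gate without per-document-type tuning.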
What this means for document workflow design
The results reinforce what experienced operations teams already know: LLMs are extraction tools, not workflow solutions. Using them directly in a production document pipeline without a validation and review layer is a reliability risk, regardless of which model you choose.
The practical design implication is the same across all three models: automate the high-confidence, standard-format documents fully, and route everything else through a human review gate. The question is not which LLM to use. The question is how to build the layer around the LLM that makes the output safe to use downstream.
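A minimal sketch of that routing gate, assuming per-document-type confidence thresholds. The threshold values and type names are illustrative, not recommendations; in a real system they would be tuned against your own calibration data.

```python
# Route a document to full automation only when every extracted field
# clears the confidence threshold for its document type.
# Thresholds below are hypothetical examples.

REVIEW_THRESHOLDS = {
    "bank_statement": 0.90,    # well-calibrated category: automate aggressively
    "supplier_invoice": 0.92,
    "loan_application": 0.95,
    "trade_finance": 1.01,     # > 1.0: always route to human review
}

def route(doc_type: str, min_field_confidence: float) -> str:
    """Return 'auto' or 'human_review' for a processed document."""
    # Unknown document types default to human review
    threshold = REVIEW_THRESHOLDS.get(doc_type, 1.01)
    return "auto" if min_field_confidence >= threshold else "human_review"

print(route("bank_statement", 0.97))  # auto
print(route("trade_finance", 0.99))   # human_review
```

Gating on the minimum field confidence, rather than the average, reflects the finding above: a single wrong currency or date is an operational incident, so one weak field should pull the whole document into review.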
For more on how to design that layer, see our guide on human-in-the-loop document automation. If you want to see how Floowed structures this for financial document workflows, talk to the team.
> "The accuracy numbers look fine until you look at where the errors happen. In financial documents, a wrong currency or a transposed date is not a minor error. It is an operational incident."
>
> Head of Operations, Southeast Asia Trade Finance Platform
Frequently Asked Questions
Which AI model performs best for financial document extraction?
Based on our structured test of 120 real financial documents, Claude 3.5 Sonnet produced the highest overall accuracy at 89%, followed by GPT-4o at 87% and Gemini 1.5 Pro at 84%. However, the headline figures are less important than the distribution of errors, which cluster in specific conditions such as multi-currency documents, low-resolution scans, and complex trade finance documents regardless of the model used.
Can I use ChatGPT, Claude, or Gemini directly in a production document workflow?
Using LLMs directly without a validation and review layer is a reliability risk in production financial workflows. All three models produce errors that cluster in predictable conditions, and their confidence calibration deteriorates on complex document types. The recommended architecture is to use the LLM as the extraction engine and build a validation and human review layer on top of it.
What is confidence calibration and why does it matter for document extraction?
Confidence calibration describes how reliably a model's confidence score predicts its accuracy. A well-calibrated model is significantly more accurate on high-confidence outputs than low-confidence ones, making the confidence score a useful signal for routing decisions. Our tests found that all three models showed acceptable calibration on simple documents, but that calibration deteriorated on complex ones, making confidence scores unreliable as a routing signal without additional tuning.
Why did all three models perform poorly on trade finance documents?
Trade finance documents contain dense legal and commercial boilerplate alongside the key fields. Models frequently conflate the two, extracting values from boilerplate sections rather than from the operative fields. This is not easily solved through prompt engineering alone. It requires document-type-specific extraction logic and a human review layer for complex documents.
Should I choose my LLM based on document accuracy benchmarks?
Benchmarks on clean, standard-format documents are not a reliable guide for production document workflows. The gaps between models are small on standard documents and larger on edge cases. A more useful evaluation is to test your specific document types, including your edge cases, and to measure confidence calibration rather than average accuracy alone.