Bank Statement Extraction
From PDF to Verified, Structured Data

OCR reads the text. But a bank statement isn't text, it's a table. OCR gives you "1.250,00" but not whether it's a debit, a credit, or a running balance. It gives you "VIREMENT RECU / ÜBERWEISUNG" but not which row it belongs to. Get one assignment wrong and every balance after it is off. Holofin reconstructs the table structure, assigns every value to its row and column, and proves the result by reconciling the balance.

BANQUE LEFORT & CIE
24 avenue Marceau, 75008 Paris
Relevé de compte, Janvier 2025

Compte courant: FR76 3000 4012 3400 0107 8425 162 (EUR)
Dumont Consulting SARL
Période: 01/01 – 31/01/2025
Solde précédent: 12 450,00
Nouveau solde: 14 270,30

Date op.  Libellé                           Débit     Crédit      Solde
03/01     VIR RECU SALAIRE JANV                     3 200,00  15 650,00
05/01     PRLV SEPA ASSURANCE MMA          328,50             15 321,50
10/01     VIR SEPA LOYER BUREAU          1 250,00             14 071,50
15/01     PRLV SEPA EDF ELECTRICITE        187,40             13 884,10
18/01     CB MONOPRIX PARIS 08              62,30             13 821,80
22/01     VIR RECU CLIENT FACTURE 2401                568,50  14 390,30
28/01     VIR RECU REMB TROP PERCU                    120,00  14 510,30

✓ Balance reconciled   ✓ 23 transactions   ✓ EUR

Why Generic OCR Keeps Getting It Wrong

A bank statement looks like a simple table. It is not. Every issuer formats things differently, and the PDF format itself is working against you. Here's what actually breaks.

The core problem

Every bank does it differently

There's no standard for bank statement layout. BNP Paribas puts dates on the left and uses separate Debit/Credit columns. Deutsche Bank uses a single Amount column with D/C indicators. Revolut doesn't even include running balances. A template built for one bank produces garbage on another.

Is "1.250" one thousand two hundred fifty or 1.25?

French banks write "1 250,00 €". German ones write "1.250,00 EUR". British ones write "£1,250.00".

The same dot means "thousands" in Frankfurt and "decimals" in London. The same comma means the opposite. A space is a thousands separator in Paris and means nothing in New York.

Misread one separator and a €1,250 rent payment becomes €1.25. A check that only re-adds the extracted amounts won't catch it: the numbers still add up, just to the wrong total.
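To make the ambiguity concrete, here is a minimal Python sketch of per-document separator detection. It is an illustration, not Holofin's implementation: the function names `detect_decimal_separator` and `parse_amount` are hypothetical, and the voting rule (an amount ending in a separator plus two digits reveals the decimal mark) is an assumed heuristic.

```python
import re
from decimal import Decimal


def detect_decimal_separator(samples):
    """Vote on the decimal separator using amounts that end in <sep><2 digits>."""
    votes = {",": 0, ".": 0}
    for sample in samples:
        digits = re.sub(r"[^\d.,]", "", sample)  # drop spaces, currency marks, signs
        match = re.search(r"([.,])\d{2}$", digits)
        if match:
            votes[match.group(1)] += 1
    if votes[","] == votes["."]:
        raise ValueError("cannot infer the decimal separator from these samples")
    return "," if votes[","] > votes["."] else "."


def parse_amount(text, decimal_sep):
    """Parse one amount cell, given the decimal separator chosen for the document."""
    # ASCII minus, Unicode minus, or an opening parenthesis marks a negative value.
    negative = any(mark in text for mark in "-\u2212(")
    digits = re.sub(r"[^\d.,]", "", text)
    thousands_sep = "." if decimal_sep == "," else ","
    digits = digits.replace(thousands_sep, "").replace(decimal_sep, ".")
    value = Decimal(digits)
    return -value if negative else value


separator = detect_decimal_separator(["1.250,00 EUR", "187,40", "62,30"])
rent = parse_amount("1.250,00 EUR", separator)
```

The same string gives two different numbers depending on that one decision: `parse_amount("1.250", ",")` is 1250, while `parse_amount("1.250", ".")` is 1.25. That is why the separator has to be settled once per document rather than guessed cell by cell.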

Which column is the debit?

One column or two? Negative numbers or a "D/C" indicator? A minus on the left, on the right, or parentheses? German banks use "S" and "H" (Soll and Haben, debit and credit). Some just leave the other column blank. The table looks obvious to a human. It's a nightmare to parse programmatically.
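Whatever the layout, the goal is one signed amount per row. The sketch below shows one way to collapse the encodings listed above into that form. The function name `signed_amount` and the indicator sets are illustrative assumptions, not a complete list of what banks print.

```python
from decimal import Decimal

# Illustrative indicator sets; real statements may use other markers.
DEBIT_MARKS = {"D", "DR", "S"}   # "S" = Soll (German debit)
CREDIT_MARKS = {"C", "CR", "H"}  # "H" = Haben (German credit)


def signed_amount(debit=None, credit=None, amount=None, indicator=None):
    """Return one signed Decimal: negative for money out, positive for money in."""
    if amount is None:
        # Two-column layout: exactly one of debit/credit should be filled.
        if (debit is None) == (credit is None):
            raise ValueError("expected exactly one of debit or credit")
        return -abs(debit) if debit is not None else abs(credit)
    if indicator is None:
        return amount  # single column that already carries its sign
    mark = indicator.strip().upper()
    if mark in DEBIT_MARKS:
        return -abs(amount)
    if mark in CREDIT_MARKS:
        return abs(amount)
    raise ValueError(f"unknown debit/credit indicator: {indicator!r}")
```

For example, `signed_amount(debit=Decimal("328.50"))` and `signed_amount(amount=Decimal("328.50"), indicator="S")` both yield -328.50, so downstream checks never need to know which layout the bank used.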

Tables that break across pages

200 transactions don't fit on one page. The table continues on page 2, sometimes with headers repeated, sometimes not. A transaction might start on one page and finish on the next. You need to stitch the table back together before you can extract anything.
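A rough sketch of the stitching step, assuming each page has already been turned into rows of five text cells. The function name `stitch_pages`, the English header row, and the rule "an empty date cell means the row continues the previous transaction" are simplifying assumptions for illustration only.

```python
HEADER = ["Date", "Description", "Debit", "Credit", "Balance"]


def stitch_pages(pages, header=HEADER):
    """Concatenate per-page rows into one table.

    Repeated header rows are dropped. A row with an empty date cell is merged
    into the previous transaction (a description that wrapped onto a new page).
    """
    rows = []
    for page in pages:
        for row in page:
            if row == header:
                continue  # header repeated at the top of a page
            if row[0] == "" and rows:
                previous = rows[-1]
                previous[1] = f"{previous[1]} {row[1]}".strip()
                # Fill any cell the first fragment left empty.
                for i in range(2, len(row)):
                    if previous[i] == "":
                        previous[i] = row[i]
                continue
            rows.append(list(row))
    return rows


page_1 = [
    HEADER,
    ["03/01", "VIR RECU SALAIRE JANV", "", "3 200,00", "15 650,00"],
    ["05/01", "PRLV SEPA", "", "", ""],  # transaction cut by the page break
]
page_2 = [
    HEADER,
    ["", "ASSURANCE MMA", "328,50", "", "15 321,50"],
]
table = stitch_pages([page_1, page_2])
```

Here `table` holds two transactions, and the second one is reassembled from its two fragments before any amount is parsed.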

Multiple accounts in one PDF

Your client sends a single 47-page PDF. It contains three accounts (current, savings, credit card) across four quarters. That's 12 separate statements inside one file. Treat it as one continuous table and you get nonsense.


Not everything that looks like a transaction is one

Banks pad statements with auxiliary tables that look exactly like transactions: card payment breakdowns listing every contactless tap, SEPA transfer summaries repeating each direct debit, fee schedules, interest calculations. Extract them and you double-count. Skip the wrong one and your balance is off.

The real transactions live in the main table. Everything else is noise dressed up as data.

How It Works

Every bank statement goes through four stages. No templates, no issuer-specific configuration. The same pipeline handles BNP Paribas and Chase.

Classification

Our classifier identifies 100+ bank issuers using both content and visual clues: header positions, column structures, logos, text patterns. No templates to configure per bank.

Segmentation

Multi-account PDFs get split before extraction. We detect account boundaries by IBAN, account number, and period markers. That 47-page PDF becomes 12 segments, processed in parallel.
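As a simplified sketch of boundary detection, assume an earlier step has tagged each page with the IBAN and period found in its header, or `None` for continuation pages. The dictionary keys and the function name `segment_pages` are hypothetical, chosen only for this example.

```python
from itertools import groupby


def segment_pages(pages):
    """Group consecutive pages that belong to the same account and period.

    A page without its own IBAN/period inherits them from the page before it.
    """
    keyed, current = [], None
    for page in pages:
        if page.get("iban") and page.get("period"):
            current = (page["iban"], page["period"])
        if current is None:
            raise ValueError(f"page {page['page']} precedes any account header")
        keyed.append((current, page["page"]))
    return [
        {"iban": key[0], "period": key[1], "pages": [number for _, number in group]}
        for key, group in groupby(keyed, key=lambda item: item[0])
    ]


pages = [
    {"page": 1, "iban": "FR76 3000 4012 3400 0107 8425 162", "period": "2025-Q1"},
    {"page": 2, "iban": None, "period": None},
    {"page": 3, "iban": "FR76 3000 4012 3400 0107 8425 162", "period": "2025-Q2"},
    {"page": 4, "iban": "DE15100101232339317943", "period": "2025-Q1"},
    {"page": 5, "iban": None, "period": None},
]
segments = segment_pages(pages)
```

Each resulting segment is a self-contained statement with its own page range, which is what lets segments be extracted and reconciled independently.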

Extraction

A visual model reads the page layout and extracts accurate transaction data: date, description, debit, credit, running balance, and account metadata. No template rules. The model understands the table structure.

Every extraction produces a JSON document like this:

{
  "bank_name": "Qonto",
  "currency": "EUR",
  "account_type": "current",
  "usage_type": "business",
  "client_names": ["Starflight Dynamics GmbH"],
  "account_number": "DE15100101232339317943",
  "start_balance": 3071.69,
  "end_balance": 3030.39,
  "start_date": "2025-05-01",
  "end_date": "2025-05-31",
  "validation_status": "OK",
  "transactions": [
    {
      "transaction_date": "2025-05-02",
      "value_date": "2025-05-02",
      "amount": -963.9,
      "description": "Schmittlein Kloster Arbeitsrecht Partnerschaft",
      "credit": null,
      "debit": 963.9,
      "page": 1,
      "row": 1
    }
  ]
}
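A short sketch of how a consumer might load this output in Python. It parses numbers as `Decimal` to avoid float rounding and checks that each row's signed `amount` agrees with its `credit` and `debit` fields. The relation amount = credit − debit is inferred from the sample above, not from a published schema, and `inconsistent_rows` is a hypothetical helper.

```python
import json
from decimal import Decimal

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "currency": "EUR",
  "start_balance": 3071.69,
  "end_balance": 3030.39,
  "validation_status": "OK",
  "transactions": [
    {
      "transaction_date": "2025-05-02",
      "amount": -963.9,
      "credit": null,
      "debit": 963.9,
      "page": 1,
      "row": 1
    }
  ]
}
"""


def inconsistent_rows(statement):
    """Return (page, row) for transactions where amount != credit - debit."""
    bad = []
    for tx in statement["transactions"]:
        credit = tx["credit"] or Decimal("0")
        debit = tx["debit"] or Decimal("0")
        if tx["amount"] != credit - debit:
            bad.append((tx["page"], tx["row"]))
    return bad


# parse_float=Decimal keeps monetary values exact instead of binary floats.
statement = json.loads(SAMPLE, parse_float=Decimal)
problems = inconsistent_rows(statement)
```

Because every row carries its `page` and `row`, a failed check points straight at the place in the source document that needs a second look.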

Validation

This is where most tools stop, and where we start. Every extracted segment gets checked:

  • Balance reconciliation: opening balance + total credits − total debits = closing balance, within €2 tolerance. If the equation doesn't balance, the extraction is flagged.
  • Running balance continuity: each transaction's running balance must equal the previous balance plus/minus the transaction amount. Breaks indicate missing or mis-extracted rows.
  • Date ordering: transaction dates must be in chronological sequence within the statement period. Out-of-order dates suggest row assignment errors.
  • Duplicate detection: identical transactions (same date, description, amount) are flagged for review rather than silently included.
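The four checks above can be sketched in a few lines of Python. This is a minimal illustration rather than Holofin's validation code: the function names are hypothetical, the tolerance is passed in by the caller, and the sample rows are the five BNP Paribas transactions from the January 2025 example on this page, with the opening balance of 12,450.00 implied by the first row.

```python
from collections import Counter
from datetime import date
from decimal import Decimal


def reconciles(opening, closing, transactions, tolerance):
    """opening + total credits - total debits = closing, within tolerance."""
    credits = sum((t["amount"] for t in transactions if t["amount"] > 0), Decimal("0"))
    debits = sum((-t["amount"] for t in transactions if t["amount"] < 0), Decimal("0"))
    return abs(opening + credits - debits - closing) <= tolerance


def running_balance_breaks(opening, transactions):
    """Indexes of rows whose printed balance != previous balance + amount."""
    breaks, previous = [], opening
    for index, t in enumerate(transactions):
        if previous + t["amount"] != t["balance"]:
            breaks.append(index)
        previous = t["balance"]  # resync so one bad row is reported once
    return breaks


def out_of_order(transactions):
    """Indexes of rows dated earlier than the row before them."""
    return [i for i in range(1, len(transactions))
            if transactions[i]["date"] < transactions[i - 1]["date"]]


def duplicates(transactions):
    """(date, description, amount) keys that occur more than once."""
    counts = Counter((t["date"], t["description"], t["amount"]) for t in transactions)
    return [key for key, n in counts.items() if n > 1]


ROWS = [
    (date(2025, 1, 3), "VIR RECU SALAIRE", "3200.00", "15650.00"),
    (date(2025, 1, 15), "VIR SEPA LOYER JANV", "-1250.00", "14400.00"),
    (date(2025, 1, 18), "PRLV SEPA EDF ELEC", "-187.40", "14212.60"),
    (date(2025, 1, 22), "CB CARREFOUR MARKET", "-62.30", "14150.30"),
    (date(2025, 1, 28), "VIR RECU REMB TROP", "120.00", "14270.30"),
]
txs = [{"date": d, "description": s, "amount": Decimal(a), "balance": Decimal(b)}
       for d, s, a, b in ROWS]
opening, closing = Decimal("12450.00"), Decimal("14270.30")
ok = reconciles(opening, closing, txs, tolerance=Decimal("0.01"))
```

With the rows as printed, every check passes. Change the rent amount to -1.25 and both `reconciles` and `running_balance_breaks` fail, and the second one names the exact row.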

Balance reconciliation equation: opening balance + total credits − total debits = closing balance

Show Your Work

Every extracted value carries coordinates that point back to its exact position on the source page. Not just "this came from page 3" but the pixel-level bounding box around the original text. You can verify any number by clicking on it.
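To picture what "carries coordinates" can mean in practice, here is a hypothetical data shape. The class names, field names, and the choice of PDF points as the unit are all assumptions for illustration; the only fixed fact used is that a PDF point is 1/72 of an inch, which is how a box is scaled to pixels at a given render resolution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle on a source page, in PDF points (assumed unit)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def to_pixels(self, dpi):
        """Scale the box to pixel coordinates for a page rendered at `dpi`."""
        # 72 PDF points per inch, so pixels = points * dpi / 72.
        return tuple(round(v * dpi / 72) for v in (self.x0, self.y0, self.x1, self.y1))


@dataclass(frozen=True)
class TracedValue:
    """An extracted value together with where it was read from."""
    value: str
    box: BoundingBox


balance = TracedValue("14,270.30", BoundingBox(page=3, x0=72, y0=144, x1=216, y1=156))
highlight = balance.box.to_pixels(dpi=300)
```

A review screen only needs `highlight` and the page number to draw the rectangle over the rendered page.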

Auditors love this

When an auditor asks "where did this number come from?", you show them. The exact location on the source PDF, highlighted. No "the system said so."

Fix errors in seconds

Your reviewer spots a wrong amount. They click the value. The source region highlights on the original document. Compare, correct, move on.

Full data lineage

Trace any number from the credit decision back to the original bank statement, page, and row. The full chain is documented. Regulators don't have to take your word for it.

BNP Paribas - January 2025

Date   Description           Amount      Balance
03/01  VIR RECU SALAIRE      +3,200.00   15,650.00
15/01  VIR SEPA LOYER JANV   -1,250.00   14,400.00
18/01  PRLV SEPA EDF ELEC      -187.40   14,212.60
22/01  CB CARREFOUR MARKET      -62.30   14,150.30
28/01  VIR RECU REMB TROP      +120.00   14,270.30

Highlighted fields: Date / Description, Credit, Debit, Balance

Scale and Coverage

We process 100K+ documents a month for lending teams across Europe. Here's what the infrastructure looks like.

Infrastructure

~40 seconds per statement

Upload to validated JSON. Multi-segment documents process in parallel, so a 12-segment PDF doesn't take 12x longer.
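The effect of parallel processing can be illustrated with Python's standard thread pool. Everything here is a stand-in: `extract_segment` only sleeps briefly to simulate work, and `extract_all` is a hypothetical wrapper, not part of any published client.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def extract_segment(segment_id):
    """Stand-in for a per-segment extraction call (simulated with a short sleep)."""
    time.sleep(0.05)
    return {"segment": segment_id, "validation_status": "OK"}


def extract_all(segment_ids, max_workers=12):
    """Run all segments concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_segment, segment_ids))


results = extract_all(range(12))
```

With twelve workers, the twelve simulated segments finish in roughly the time of one, since total time is bounded by the slowest segment rather than the sum of all of them.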

REST API + webhooks

Upload via API, get a webhook when it's done. Batch upload supported.

European infrastructure, GDPR-compliant

99.9% uptime SLA. Configurable retention. Data never leaves the EU.

Banks we cover

French banks

BNP Paribas, Société Générale, Crédit Agricole, Crédit Mutuel, La Banque Postale, Boursorama, CIC, LCL, Caisse d'Épargne

German banks

Deutsche Bank, Commerzbank, Sparkasse, Volksbank, N26, DKB, ING-DiBa, HypoVereinsbank

Pan-European & international

ING, HSBC, Revolut, Wise, Barclays, Lloyds, NatWest, UniCredit, Rabobank, ABN AMRO, Santander

UK & US banks

Chase, Bank of America, Wells Fargo, Citi, HSBC UK, Barclays UK, Monzo, Starling

Don't see your bank? It probably works anyway.

We don't use templates. The extraction engine reads layout from the document itself. New issuers work without setup.

FAQ

The questions we get most from lending and accounting teams.

Which banks and statement formats are supported?

Holofin processes PDF bank statements from any issuer worldwide, including all major European, UK, and US banks. It handles both digitally generated and scanned statements. No templates or issuer-specific configuration are needed: the system reads the layout from the document itself. We actively cover 100+ issuers with validated extraction accuracy, and new issuers typically work without any configuration.

How are multi-account PDFs handled?

Holofin's segmentation engine detects account boundaries (IBAN, account number, period markers) and splits combined PDFs into individual statement segments before extraction. A 47-page PDF with 3 accounts across 4 quarters becomes 12 individual, independently validated segments. Each segment is extracted and balance-reconciled separately, then aggregated into a unified JSON response.

How accurate is the extraction?

Field-level accuracy exceeds 97% on native PDF bank statements across tested issuers. But raw accuracy isn't the full story. Every extraction includes automatic balance reconciliation (opening + credits − debits = closing), providing mathematical validation that catches extraction errors a simple accuracy metric would miss. When reconciliation fails, the extraction is flagged for human review rather than silently passed through.

Can Holofin process scanned statements?

Yes. Scanned bank statements are processed through OCR with font decoding and layout recognition. Accuracy depends on scan quality (300 DPI or higher recommended). The balance reconciliation step catches most OCR errors that affect financial totals. For degraded scans, the system flags low-confidence values so reviewers focus on the fields that need attention, not the entire document.

Is there an API?

Yes. Holofin provides a REST API for programmatic document submission and result retrieval. Upload a PDF, receive a webhook when extraction completes, then fetch the structured JSON result. Batch processing is supported: submit hundreds of documents in a single API call and collect results as they complete. Authentication uses API keys with organization-level scoping.

How does balance reconciliation work?

After extraction, Holofin verifies the accounting equation: opening balance + total credits − total debits = closing balance, within a tolerance of €0.01 in the statement currency. Running balance continuity is also checked: each transaction's running balance must equal the previous balance plus or minus the transaction amount. Date ordering and duplicate detection round out the validation suite. When any check fails, the extraction is flagged with specific error details rather than a generic failure.

How are different number formats handled?

Holofin handles all major number formats automatically: European comma decimals (1.234,56), US/UK period decimals (1,234.56), space-separated thousands (1 234.56), parenthesized negatives, and D/C indicators. Format detection is per-document, not per-issuer. The system reads the actual format used in the statement and parses accordingly. No configuration or locale settings required.

Is Holofin GDPR-compliant?

Yes. Holofin processes all data on European infrastructure. Document retention is configurable per organization. Data is encrypted at rest and in transit. No document content is used for model training. Holofin can execute data deletion requests in compliance with GDPR Article 17 (right to erasure). A Data Processing Agreement (DPA) is available for enterprise customers.


Data You Can Bank On.

Send us the bank statements that broke your last tool. The 47-page multi-account PDFs. The degraded scans. The obscure German Sparkasse format. We'll show you what comes out the other side.

97%+ accuracy
100K+ documents/month
Balance reconciliation on every extraction
Holofin