TL;DR: Bank Statement Parser is an open-source Python library that parses seven bank statement formats (CAMT.053, PAIN.001, CSV, OFX, QFX, MT940, and PDF) into pandas DataFrames. It adds a hybrid PDF pipeline with balance verification, a REST API, enrichment, and ledger export, and parses CAMT.053 at more than 27,000 tx/s.
Bank Statement Parser is an open-source Python library that parses bank statements from seven formats into structured pandas DataFrames. The deterministic core processes structured formats locally with zero network calls. The optional hybrid PDF pipeline routes through local LLMs (via Ollama) for digital and scanned statements.
Who Is This For?
- Treasury teams migrating from MT940 to CAMT.053 who need a parser that handles both old and new formats during the transition, plus PDF statements from banks that don't offer structured exports.
- Fintech developers building reconciliation, reporting, or accounting pipelines who want a single dependency with built-in balance verification, categorisation, and ledger export.
- Compliance teams who need PII redaction by default, deterministic output, and Golden Rule verification that flags discrepancies before they reach the ledger.
- Plaintext-accounting users who want automated ingestion from PDF bank statements directly into hledger or beancount journals.
- Anyone who refuses to send sensitive financial data to a third-party SaaS when a local, open-source tool can do the job.
Supported Formats
| Format | Standard | File Types | Parser/Method |
|---|---|---|---|
| CAMT.053 | ISO 20022 Bank-to-Customer Statement | .xml | `CamtParser` |
| PAIN.001 | ISO 20022 Credit Transfer Initiation | .xml | `Pain001Parser` |
| CSV | Generic bank exports | .csv | `CsvStatementParser` |
| OFX | Open Financial Exchange | .ofx | `OfxParser` |
| QFX | Quicken Financial Exchange | .qfx | `QfxParser` |
| MT940 | SWIFT standard | .mt940, .sta | `Mt940Parser` |
| PDF | Digital and scanned statements | .pdf | `smart_ingest()` |
All formats produce normalised pandas DataFrames with consistent column names, making downstream processing format-agnostic.
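
The table above can be read as an extension-to-parser lookup. The following is a minimal sketch of that reading using only the standard library. The parser names are copied from the table, but `candidate_parsers` and both constants are hypothetical helpers written for illustration. They are not the library's own `detect_statement_format()`, which presumably inspects file content as well.

```python
from pathlib import Path

# Extension -> parser name, copied from the table above.
# This mapping and the helper below are illustrative, not library API.
PARSER_BY_EXTENSION = {
    ".csv": "CsvStatementParser",
    ".ofx": "OfxParser",
    ".qfx": "QfxParser",
    ".mt940": "Mt940Parser",
    ".sta": "Mt940Parser",
    ".pdf": "smart_ingest",
}

# CAMT.053 and PAIN.001 both use .xml, so the extension alone is ambiguous.
XML_CANDIDATES = ("CamtParser", "Pain001Parser")


def candidate_parsers(filename: str) -> tuple[str, ...]:
    """Return the parser names that could handle a file, judged by extension only."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xml":
        return XML_CANDIDATES
    name = PARSER_BY_EXTENSION.get(suffix)
    return (name,) if name else ()
```

Because two formats share the `.xml` extension, an extension lookup can only narrow the choice. Telling CAMT.053 from PAIN.001 requires looking inside the document.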
Key Capabilities
- Hybrid PDF Pipeline: `smart_ingest()` routes PDFs through three paths (deterministic table extraction, text-LLM, or vision-LLM) with automatic Golden Rule balance verification.
- Format Auto-Detection: `detect_statement_format()` identifies the format; `create_parser()` instantiates the right parser.
- Balance Verification: Golden Rule check (`opening + credits − debits == closing`) with VERIFIED/DISCREPANCY/FAILED status.
- Multi-Currency Verification: `verify_balance_multi_currency()` groups transactions by currency for independent verification.
- REST API: FastAPI microservice with `/ingest` and `/health` endpoints for production deployments.
- Enrichment: LLM-powered transaction categorisation with pluggable schemas (Plaid 13-category default).
- Interactive Review: Walk through discrepancies with accept/edit/skip/delete actions via `--type review`.
- Ledger Export: `to_hledger()` and `to_beancount()` for plaintext-accounting workflows.
- Bulk Scanning: `scan_and_ingest()` processes folder trees with automatic cross-file deduplication.
- Account Mapping: Regex-based account mapping rules from JSON config for ledger export.
- Streaming Parsing: Process large files (50 MB+, 50K+ transactions) with bounded memory using `parse_streaming()`.
- Parallel Processing: Parse multiple files concurrently with `parse_files_parallel()` using ProcessPoolExecutor.
- Deduplication: Idempotent `transaction_hash` (MD5 fingerprint) for safe incremental ingestion.
- In-Memory Parsing: `from_string()` and `from_bytes()` for SFTP and API workflows with no disk I/O.
- Secure ZIP Processing: `iter_secure_xml_entries()` with compression ratio limits, entry size caps, and encrypted entry rejection.
- Export: CSV, JSON, Excel (`.xlsx`), Polars DataFrames, hledger, and beancount journals.
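
To make the Golden Rule concrete, here is a minimal standalone sketch of the check `opening + credits − debits == closing`, applied per currency. It is not the library's implementation. The function names, the argument shapes, and the meaning given to `FAILED` (a balance is missing or a direction is unrecognised, so no check is possible) are all assumptions made for illustration.

```python
from collections import defaultdict
from decimal import Decimal


def verify_golden_rule(opening, closing, transactions, tolerance=Decimal("0.00")):
    """Check that opening + credits - debits equals closing.

    `transactions` is an iterable of (amount, direction) pairs, where
    direction is "credit" or "debit". Amounts are parsed with Decimal to
    avoid binary floating-point rounding.
    """
    if opening is None or closing is None:
        return "FAILED"  # assumption: no balance means nothing to verify
    expected = Decimal(opening)
    for amount, direction in transactions:
        if direction == "credit":
            expected += Decimal(amount)
        elif direction == "debit":
            expected -= Decimal(amount)
        else:
            return "FAILED"  # assumption: unknown direction blocks the check
    if abs(expected - Decimal(closing)) <= tolerance:
        return "VERIFIED"
    return "DISCREPANCY"


def verify_by_currency(balances, transactions):
    """Run the Golden Rule independently for each currency.

    `balances` maps currency -> (opening, closing); `transactions` is an
    iterable of (currency, amount, direction) triples.
    """
    grouped = defaultdict(list)
    for currency, amount, direction in transactions:
        grouped[currency].append((amount, direction))
    return {
        currency: verify_golden_rule(opening, closing, grouped[currency])
        for currency, (opening, closing) in balances.items()
    }
```

With a EUR account that opens at 100.00, receives 50.00, pays 20.00, and closes at 130.00, the check returns `VERIFIED`. A USD account that opens and closes at 50.00 despite a 5.00 debit returns `DISCREPANCY`. Grouping by currency first keeps one account's error from masking or contaminating another's result.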
Security And Privacy
- PII Redaction: Names, IBANs, and addresses are masked by default in CLI output. Pass `--show-pii` to display them unmasked.
- XXE Protection: XML parsing uses `resolve_entities=False`, `no_network=True`, `load_dtd=False`.
- ZIP Bomb Protection: Compression ratio limits (100:1 default), entry size caps (10 MB), encrypted entry rejection.
- Path Traversal Prevention: Dangerous pattern blocklist and symlink resolution.
- Supply Chain Security: SHA-256 hash-locked dependencies, CycloneDX SBOM, build provenance attestation.
- Local LLMs Only: The hybrid PDF pipeline uses Ollama for local inference, so no data is sent to cloud APIs.
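
The three ZIP checks listed above (ratio limit, size cap, encrypted entry rejection) can be sketched with the standard-library `zipfile` module. This is an illustrative approximation under stated assumptions, not the library's `iter_secure_xml_entries()`. The name `iter_safe_entries`, the constants, and the choice to raise `ValueError` are hypothetical.

```python
import zipfile

MAX_RATIO = 100                      # 100:1 compression ratio limit, as listed above
MAX_ENTRY_BYTES = 10 * 1024 * 1024   # 10 MB entry size cap, as listed above


def iter_safe_entries(zip_file):
    """Yield (name, data) for each archive entry that passes all checks.

    `zip_file` is a path or a binary file-like object. Any violation
    raises ValueError instead of silently skipping the entry.
    """
    with zipfile.ZipFile(zip_file) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if info.flag_bits & 0x1:  # general-purpose bit 0 marks encryption
                raise ValueError(f"encrypted entry rejected: {info.filename}")
            if info.file_size > MAX_ENTRY_BYTES:
                raise ValueError(f"entry too large: {info.filename}")
            # Integer comparison avoids dividing by a zero compressed size.
            if info.file_size > MAX_RATIO * info.compress_size:
                raise ValueError(f"compression ratio too high: {info.filename}")
            # Cap the read as well, as a second line of defence against
            # archives whose declared sizes do not match their contents.
            with archive.open(info) as handle:
                data = handle.read(MAX_ENTRY_BYTES + 1)
            if len(data) > MAX_ENTRY_BYTES:
                raise ValueError(f"entry exceeded size cap on read: {info.filename}")
            yield info.filename, data
```

All metadata checks run before any bytes are decompressed, so a hostile archive is rejected cheaply. Failing loudly rather than skipping an entry is a design choice for this sketch; a production tool might prefer to log and continue.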
Performance
| Metric | Value |
|---|---|
| CAMT.053 throughput | 27,000+ tx/s |
| PAIN.001 throughput | 52,000+ tx/s |
| Per-transaction latency (CAMT) | 37 microseconds |
| Per-transaction latency (PAIN.001) | 19 microseconds |
| Time to first result | < 2 ms |
| Memory scaling (1K-50K tx) | Constant (streaming) |
| Test coverage | 100% branch coverage |
| Tests | 718 across 29 test files |
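
The per-transaction latency rows appear to be the reciprocals of the throughput rows. That reading is an assumption based on the arithmetic, not something the benchmark description states. The short check below, with a hypothetical helper name, shows the conversion.

```python
def per_transaction_latency_us(tx_per_second: float) -> float:
    """Mean time per transaction in microseconds for a given throughput."""
    return 1_000_000 / tx_per_second


camt_us = per_transaction_latency_us(27_000)   # roughly 37 microseconds
pain_us = per_transaction_latency_us(52_000)   # roughly 19 microseconds
```

If that reading is right, the latency figures are averages derived from bulk runs, not measured single-transaction round trips.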