FAQ

Common Questions About Bank Statement Parser

Data Privacy and Compliance

Does any data leave my infrastructure?

No. Bank Statement Parser operates as a stateless library. All processing -- parsing, PII redaction, archive extraction -- occurs within your local runtime memory. No API calls, no cloud services, no telemetry. XML parsers are hardened with no_network=True, blocking all outbound access at the parser level. Your financial data never leaves your environment.

How does PII redaction work?

Sensitive fields are masked before they reach your application logic. The parser identifies debtor names, creditor names, IBANs, and postal addresses, replacing them with ***REDACTED*** in console output and streaming mode.

  • Redaction is on by default in CLI output and streaming mode.
  • File exports (CSV, JSON, Excel) retain unredacted data for downstream processing.
  • Opt in to full data with --show-pii on the CLI or redact_pii=False in the API.
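
The masking itself is a plain substitution of the matched field with the placeholder. A minimal sketch of the concept for IBANs (the regex and the mask_iban helper are illustrative, not the library's internals):

```python
import re

REDACTED = "***REDACTED***"

def mask_iban(text: str) -> str:
    """Replace anything shaped like an IBAN (two letters, two check
    digits, then 11-30 alphanumerics) with the redaction placeholder."""
    return re.sub(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b", REDACTED, text)

line = "Debtor IBAN: DE89370400440532013000, amount 125.00 EUR"
print(mask_iban(line))
# Debtor IBAN: ***REDACTED***, amount 125.00 EUR
```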

Is the extraction process deterministic?

Yes -- byte-identical output on every run. Given the same input file, the parser produces the same result every time. No randomness, no model inference, no heuristic sampling. CI enforces determinism with 467 tests at 100% branch coverage, including property-based fuzzing via Hypothesis.

What compliance standards does the project follow?

The project maintains ISO 13485-aligned documentation with full traceability:

  • A quantified Risk Register with severity/probability scoring and residual risk assessment.
  • A Verification and Validation Plan with 19 gated steps across 5 phases.
  • A Change Control Procedure with impact assessment and rollback protocols.
  • A SOUP Register covering all dependencies with risk levels and EOL tracking.
  • A Traceability Matrix mapping design inputs to implementation and verification.

Every release includes a CycloneDX SBOM, SHA-256 checksums, and GitHub build provenance attestation.

Performance and Scalability

How fast is Bank Statement Parser?

Performance thresholds are validated in CI on every commit:

Metric                                Value
CAMT.053 throughput                   27,000+ transactions/second
PAIN.001 throughput                   52,000+ transactions/second
Per-transaction latency (CAMT)        37 microseconds
Per-transaction latency (PAIN.001)    19 microseconds
Time to first result                  < 2 ms

How are large files handled?

Streaming with bounded memory -- tested at 50,000 transactions per file. Use parse_streaming() to process XML files incrementally. Each transaction is yielded as a dictionary; elements are cleared after processing to prevent memory growth. Memory does not grow in proportion to file size -- the 50K-transaction test (25+ MB) uses less than 2x the memory of the 10K-transaction test.

For files exceeding 50 MB (e.g., host-to-host PAIN.001 batches with 100K+ payments), the parser streams through a temporary file with chunk-based namespace stripping -- the full document is never loaded into memory.
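
The bounded-memory pattern described above is the classic iterparse-and-clear idiom. A self-contained sketch of the concept, shown with the standard-library parser for brevity (the tag names and stream_entries helper are simplified, not a full CAMT.053 document or the library's code):

```python
import io
import xml.etree.ElementTree as ET

xml = b"""<Stmt>
  <Ntry><Amt>10.00</Amt></Ntry>
  <Ntry><Amt>20.50</Amt></Ntry>
</Stmt>"""

def stream_entries(source):
    # Yield one dict per <Ntry>, clearing each element afterwards so
    # memory stays bounded regardless of file size.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Ntry":
            yield {"amount": elem.findtext("Amt")}
            elem.clear()

for tx in stream_entries(io.BytesIO(xml)):
    print(tx)
```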

How are ZIP archives processed securely?

iter_secure_xml_entries() validates each member before extraction:

  • Entry size cap (default 10 MB per entry)
  • Total uncompressed size cap (default 50 MB)
  • Compression ratio limit (default 100:1) to prevent ZIP bombs
  • Encrypted entry rejection

No file is written to disk. XML bytes pass directly to the parser via from_bytes().
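
The same checks can be reproduced with the standard library. A simplified sketch of the validation logic (the limits mirror the defaults above; the iter_safe_entries helper is illustrative, not the library's implementation):

```python
import io
import zipfile

MAX_ENTRY = 10 * 1024 * 1024   # 10 MB per entry
MAX_TOTAL = 50 * 1024 * 1024   # 50 MB uncompressed total
MAX_RATIO = 100                # compression ratio cap

def iter_safe_entries(data: bytes):
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.flag_bits & 0x1:
                raise ValueError(f"encrypted entry rejected: {info.filename}")
            if info.file_size > MAX_ENTRY:
                raise ValueError(f"entry too large: {info.filename}")
            total += info.file_size
            if total > MAX_TOTAL:
                raise ValueError("archive exceeds total size cap")
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                raise ValueError(f"suspicious compression ratio: {info.filename}")
            yield info.filename, zf.read(info)  # bytes only -- nothing touches disk

# Build a small archive in memory and iterate it safely.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("jan.xml", "<Document/>")
for name, payload in iter_safe_entries(buf.getvalue()):
    print(name, len(payload))
```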

Can I parse multiple files in parallel?

Yes. Use parse_files_parallel(), which distributes work across a ProcessPoolExecutor:

from bankstatementparser import parse_files_parallel

results = parse_files_parallel([
    "statements/jan.xml",
    "statements/feb.xml",
    "statements/mar.xml",
])
for r in results:
    print(r.path, r.status, len(r.transactions), "rows")

Supported Formats

Which bank statement formats are supported?

Format     Standard                                File Types     Parser Class
CAMT.053   ISO 20022 Bank-to-Customer Statement    .xml           CamtParser
PAIN.001   ISO 20022 Credit Transfer Initiation    .xml           Pain001Parser
CSV        Generic bank exports                    .csv           CsvStatementParser
OFX        Open Financial Exchange                 .ofx           OfxParser
QFX        Quicken Financial Exchange              .qfx           QfxParser
MT940      SWIFT standard                          .mt940, .sta   Mt940Parser

Does the parser handle bank-specific dialects of CAMT.053?

Yes -- namespace-agnostic by design. The parser strips XML namespaces before processing, handling any CAMT.053 variant (camt.053.001.02, camt.053.001.04, or proprietary bank wrappers) without namespace-specific configuration. XPath queries target element structure, not namespace URIs.

For banks that wrap CAMT in a custom envelope, use from_string() or from_bytes() to feed the inner document directly.
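
Namespace stripping can be illustrated in a few lines. This sketch (standard-library parser for brevity, not the library's internal code) shows why structural queries then work across CAMT variants:

```python
import xml.etree.ElementTree as ET

camt = """<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02">
  <BkToCstmrStmt><Stmt><Id>STMT-001</Id></Stmt></BkToCstmrStmt>
</Document>"""

root = ET.fromstring(camt)
# Strip the namespace from every tag: '{urn:...}Stmt' becomes 'Stmt'.
for elem in root.iter():
    if "}" in elem.tag:
        elem.tag = elem.tag.split("}", 1)[1]

# Plain structural queries now work regardless of the CAMT version.
print(root.find("./BkToCstmrStmt/Stmt/Id").text)
# STMT-001
```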

Can I map custom CSV column headers to the standard schema?

Yes -- automatic normalisation, zero configuration. CsvStatementParser recognises common header variations: "Date", "Transaction Date", "Booking Date" all map to the date field. "Amount", "Value", "Sum" map to amount. Split credit/debit columns (e.g., "Credit" and "Debit") are detected and combined into a single signed amount automatically.
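
A sketch of the normalisation idea (the alias table and normalise helper below are illustrative, not the parser's full mapping):

```python
import io
import pandas as pd

ALIASES = {
    "date": {"date", "transaction date", "booking date"},
    "amount": {"amount", "value", "sum"},
}

def normalise(df: pd.DataFrame) -> pd.DataFrame:
    renames = {}
    for col in df.columns:
        for canonical, variants in ALIASES.items():
            if col.strip().lower() in variants:
                renames[col] = canonical
    df = df.rename(columns=renames)
    # Combine split credit/debit columns into one signed amount.
    if {"Credit", "Debit"} <= set(df.columns):
        df["amount"] = df["Credit"].fillna(0) - df["Debit"].fillna(0)
        df = df.drop(columns=["Credit", "Debit"])
    return df

raw = "Booking Date,Credit,Debit\n2024-01-15,100.00,\n2024-01-16,,40.00\n"
print(normalise(pd.read_csv(io.StringIO(raw))))
```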

What is the output format?

All parsers produce standardised pandas DataFrames with consistent column types:

Format               Key Columns
CAMT                 Amount, Currency, DrCr, Debtor, Creditor, Reference, ValDt, BookgDt, AccountId
PAIN.001             PmtInfId, PmtMtd, InstdAmt, Currency, CdtrNm, EndToEndId, MsgId, CreDtTm, NbOfTxs
CSV/OFX/QFX/MT940    date, description, amount (normalised)

You can also export to CSV, JSON, Excel, or convert to Polars DataFrames.
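
Because every parser returns a pandas DataFrame, exports use pandas' own methods. A sketch with a toy frame (the columns here are illustrative, not parser output):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-15"],
    "description": ["Invoice 4711"],
    "amount": [125.00],
})

csv_text = df.to_csv(index=False)           # CSV string (or pass a file path)
json_text = df.to_json(orient="records")    # JSON array of row objects
# df.to_excel("statement.xlsx", index=False)  # Excel export requires openpyxl
print(json_text)
```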

Treasury Workflows

How does the parser handle multi-currency statements?

Each transaction preserves its original currency -- no implicit conversion. The Currency field is extracted from the XML Ccy attribute per transaction. Multi-currency statements remain as-is. The get_account_balances() method returns opening and closing balances per account with original currency codes. Cross-currency reconciliation is left to your downstream logic, where you control the exchange rate source.
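
Downstream, keeping currencies separate is ordinary pandas. A sketch assuming the CAMT-style Amount/Currency/DrCr columns (the aggregation below is user code, not a library method):

```python
import pandas as pd

txns = pd.DataFrame({
    "Amount": [100.0, 250.0, 80.0],
    "Currency": ["EUR", "USD", "EUR"],
    "DrCr": ["CRDT", "CRDT", "DBIT"],
})

# Sign the debits, then total per currency -- no implicit FX conversion.
signed = txns["Amount"].where(txns["DrCr"] == "CRDT", -txns["Amount"])
totals = signed.groupby(txns["Currency"]).sum()
print(totals.to_dict())
# {'EUR': 20.0, 'USD': 250.0}
```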

Does the parser support both outgoing and incoming formats?

Yes. Pain001Parser handles ISO 20022 PAIN.001 credit transfer initiation files (outgoing payments). CamtParser handles CAMT.053 bank-to-customer statement files (incoming reporting). Both support streaming, PII redaction, and export to CSV, JSON, and Excel. Use detect_statement_format() to identify the format automatically.
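
Detection can be as simple as sniffing distinctive markers in the first bytes. An illustrative sketch (sniff_format is a hypothetical helper, not the library's detect_statement_format() implementation):

```python
def sniff_format(data: bytes) -> str:
    """Guess the statement format from content, not the file extension."""
    head = data.lstrip()[:200]
    if b"BkToCstmrStmt" in data:        # CAMT.053 root child element
        return "CAMT.053"
    if b"CstmrCdtTrfInitn" in data:     # PAIN.001 root child element
        return "PAIN.001"
    if b"OFXHEADER" in data or b"<OFX>" in data:
        return "OFX/QFX"
    if head.startswith(b":20:") or b":61:" in data:  # SWIFT tag lines
        return "MT940"
    return "CSV (fallback)"

print(sniff_format(b"<Document><BkToCstmrStmt/></Document>"))
# CAMT.053
```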

What happens when a transaction entry is malformed?

Behaviour depends on the parsing mode:

  • parse() (batch mode) -- Malformed entries missing required fields (Amount, Currency, or CdtDbtInd) are skipped with a warning log. The rest of the statement parses normally.
  • parse_streaming() (streaming mode) -- Parse errors propagate immediately as exceptions. No silent data loss. This fail-fast behaviour is intentional for financial workflows where every transaction must be accounted for.

How does deduplication work?

The Deduplicator class detects exact duplicates and suspected matches with explainable confidence scores:

from bankstatementparser import CamtParser, Deduplicator

parser = CamtParser("statement.xml")
dedup = Deduplicator()
result = dedup.deduplicate(dedup.from_dataframe(parser.parse()))

print(f"Unique: {len(result.unique_transactions)}")
print(f"Exact duplicates: {len(result.exact_duplicates)}")
print(f"Suspected matches: {len(result.suspected_matches)}")

Installation and Compatibility

How do I install Bank Statement Parser?

pip install bankstatementparser

For optional Polars DataFrame support:

pip install bankstatementparser[polars]

Which Python versions are supported?

Python 3.9 through 3.14. All versions are tested in CI with 467 tests at 100% branch coverage.

What are the dependencies?

The library has 5 direct dependencies:

  • lxml -- XML parsing with security hardening
  • pandas -- DataFrames and data manipulation
  • openpyxl -- Excel export
  • pydantic -- Data validation and models
  • defusedxml -- XXE protection

All dependencies have SHA-256 hash-locked versions. The CycloneDX SBOM maps every runtime component.

Does it work on macOS, Linux, and Windows?

Yes. The library runs natively on macOS and Linux, and on Windows via WSL. It has no platform-specific dependencies.

Reproducibility and Security

How can I verify reproducibility?

python -m pytest                              # 467 tests, 100% branch coverage
python scripts/verify_locked_hashes.py        # SHA-256 hash verification
git log --show-signature -1                   # Verify commit signature

What security protections are built in?

  • XXE Protection: resolve_entities=False, no_network=True, load_dtd=False
  • ZIP Bomb Protection: Compression ratio limits, entry size caps, encrypted entry rejection
  • Path Traversal Prevention: Dangerous pattern blocklist and symlink resolution
  • Input Validation: File size limits (100 MB default), extension/format validation
  • Supply Chain: SHA-256 hash-locked dependencies, CycloneDX SBOM, build provenance attestation
  • Signed Commits: Enforced in CI
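
The three XXE flags listed above are standard lxml.etree.XMLParser options. A sketch of constructing a parser hardened this way:

```python
from lxml import etree

# Disable entity expansion, DTD loading, and network access at parse time.
hardened = etree.XMLParser(
    resolve_entities=False,
    no_network=True,
    load_dtd=False,
)

root = etree.fromstring(b"<Document><Stmt/></Document>", parser=hardened)
print(root.tag)
# Document
```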

How does Bank Statement Parser compare to pyiso20022?

pyiso20022 is a broad ISO 20022 toolkit that generates Python dataclasses from ISO XML schemas. It covers a wide range of ISO 20022 message types (PACS, PAIN, CAMT, ADMI) with schema validation. Bank Statement Parser is purpose-built for bank statement parsing with streaming support, PII redaction, deduplication, and a unified API across six formats including non-ISO formats (CSV, OFX, QFX, MT940). If you need to parse bank statements into DataFrames with production-grade security, use Bank Statement Parser. If you need to work with the full ISO 20022 message catalogue, use pyiso20022.

What are the SWIFT ISO 20022 migration deadlines?

SWIFT has published a phased migration timeline:

  • November 2026: Structured and hybrid addresses become mandatory. MT101 multi-instruction messages will be rejected. Case Management Phase 1 begins.
  • November 2027: All financial institutions must be able to receive CAMT.053 statements natively. SWIFT will stop converting MT to ISO format.
  • November 2028: Full retirement of MT940, MT942, MT950, MT900, and MT910. These will be replaced by CAMT.052, CAMT.053, and CAMT.054 equivalents.

Bank Statement Parser supports both the legacy MT940 format and the modern CAMT.053/PAIN.001 formats, making it ideal for the transition period.