Bank Statement Parser Security: Data Protection and Supply Chain

TL;DR: Bank Statement Parser processes all data locally, redacts PII by default, hardens XML parsing against XXE attacks, runs LLMs locally via Ollama, and ships with SHA-256 hash-locked dependencies and a CycloneDX SBOM.

Security by Design

Bank Statement Parser is built for processing sensitive financial data. Every design decision prioritises security, privacy, and auditability.

Zero Cloud Dependency

All processing happens locally within your runtime. The deterministic parsers make zero network calls. The hybrid PDF pipeline uses Ollama for local LLM inference, so no statement data is sent to cloud APIs. XML parsers are explicitly configured with no_network=True, resolve_entities=False, and load_dtd=False so that parsing cannot trigger outbound access.

PII Redaction

Personally identifiable information (names, IBANs, postal addresses) is redacted by default in CLI output and in streaming mode.

CLI: Sensitive fields show as ***REDACTED***
Streaming: parse_streaming(redact_pii=True) (default)
Exports: CSV, JSON, and Excel exports are not redacted; they retain full data for downstream processing
Opt-out: Use --show-pii or redact_pii=False when you need unredacted output
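
As a rough illustration of the default-on behaviour, redaction can be thought of as masking a fixed set of sensitive fields before a record is displayed. The sketch below is not the parser's implementation: redact_record, SENSITIVE_FIELDS, and the field names are hypothetical. Only the ***REDACTED*** marker and the redact_pii flag come from the description above.

```python
from typing import Any, Dict

REDACTED = "***REDACTED***"

# Hypothetical field names, chosen for illustration only.
SENSITIVE_FIELDS = frozenset({"name", "iban", "address"})


def redact_record(record: Dict[str, Any], redact_pii: bool = True) -> Dict[str, Any]:
    """Return a copy of record with sensitive fields masked unless redaction is off."""
    if not redact_pii:
        return dict(record)
    return {
        key: REDACTED if key in SENSITIVE_FIELDS and value is not None else value
        for key, value in record.items()
    }


row = {"name": "Jane Doe", "iban": "DE89370400440532013000", "amount": "12.50"}
print(redact_record(row))                    # name and iban are masked
print(redact_record(row, redact_pii=False))  # full record, as with --show-pii
```

Returning a copy rather than mutating the record keeps the original data available for exports, which retain full data.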

XML Security (XXE Protection)

All XML parsing uses lxml with hardened settings:

resolve_entities=False -- leaves entity references unexpanded, which blocks XXE and entity expansion attacks
no_network=True -- blocks all outbound network access from the parser
load_dtd=False -- does not load external DTDs, which prevents DTD-based attacks
Namespace stripping before processing -- handles any CAMT.053 variant safely
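
The three settings listed above are keyword arguments of lxml's etree.XMLParser. A minimal sketch of such a parser follows. The helper names make_hardened_parser and parse_xml_bytes are hypothetical, and the project's own wiring may differ.

```python
from lxml import etree


def make_hardened_parser() -> etree.XMLParser:
    """Build an lxml parser with the hardened settings listed above."""
    return etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # never fetch remote resources
        load_dtd=False,          # do not load external DTDs
    )


def parse_xml_bytes(data: bytes):
    """Parse XML bytes with a fresh hardened parser and return the root element."""
    return etree.fromstring(data, parser=make_hardened_parser())


# A document that declares an entity: the reference is kept, not expanded.
payload = b'<!DOCTYPE r [<!ENTITY x "secret">]><r>&x;</r>'
root = parse_xml_bytes(payload)
print(etree.tostring(root))
```

Creating a new parser per call is a conservative choice for the sketch; a shared parser instance would also work if it is never reconfigured.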

ZIP Archive Security

iter_secure_xml_entries() validates every ZIP member before extraction:

Entry size cap: 10 MB per entry (configurable)
Total size cap: 50 MB total uncompressed (configurable)
Compression ratio limit: 100:1 default -- detects ZIP bombs
Encrypted entry rejection: Encrypted entries are skipped with a warning
No disk writes: XML bytes pass directly to the parser via from_bytes()
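
The checks above can be sketched with the standard zipfile module. This is an illustration under stated assumptions, not the code of iter_secure_xml_entries(): the function name iter_checked_xml_entries, the choice to raise ValueError on a violated limit, and the .xml filename filter are all hypothetical. The limits mirror the documented defaults.

```python
import io
import logging
import zipfile
from typing import Iterator, Tuple

logger = logging.getLogger(__name__)

MAX_ENTRY_BYTES = 10 * 1024 * 1024   # 10 MB per entry
MAX_TOTAL_BYTES = 50 * 1024 * 1024   # 50 MB total uncompressed
MAX_RATIO = 100                      # uncompressed : compressed


def iter_checked_xml_entries(
    zip_bytes: bytes,
    max_entry: int = MAX_ENTRY_BYTES,
    max_total: int = MAX_TOTAL_BYTES,
    max_ratio: int = MAX_RATIO,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, bytes) for each XML member that passes the size checks."""
    total = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            if info.flag_bits & 0x1:  # bit 0 marks an encrypted entry
                logger.warning("Skipping encrypted entry: %s", info.filename)
                continue
            if info.file_size > max_entry:
                raise ValueError(f"{info.filename}: entry exceeds size cap")
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                raise ValueError(f"{info.filename}: suspicious compression ratio")
            total += info.file_size
            if total > max_total:
                raise ValueError("archive exceeds total uncompressed size cap")
            # Read into memory only; nothing is extracted to disk.
            yield info.filename, archive.read(info)
```

All checks use the sizes declared in the archive's central directory, so they run before any member is decompressed.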

Path Traversal Prevention

Input validation blocks dangerous file paths:

Null bytes, directory traversal patterns (../), and symlinks are rejected
File extension validation against expected formats
File size limits (100 MB default, configurable)
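
A minimal sketch of these input checks, in the order listed, might look like the following. The function name validate_input_path and the contents of ALLOWED_EXTENSIONS are assumptions made for illustration; the parser's actual validator and accepted extensions may differ.

```python
from pathlib import Path

# Hypothetical allow-list; the real set of accepted extensions may differ.
ALLOWED_EXTENSIONS = frozenset({".xml", ".csv", ".ofx", ".qfx", ".pdf", ".zip"})
MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB default


def validate_input_path(raw_path: str, max_bytes: int = MAX_FILE_BYTES) -> Path:
    """Return a Path for raw_path, or raise ValueError if any check fails."""
    if "\x00" in raw_path:
        raise ValueError("path contains a null byte")
    # Treat both separators alike so "..\\" is caught on any platform.
    if ".." in raw_path.replace("\\", "/").split("/"):
        raise ValueError("path contains a directory traversal segment")
    path = Path(raw_path)
    if path.is_symlink():
        raise ValueError("symlinks are not accepted")
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unexpected file extension: {path.suffix!r}")
    if not path.is_file():
        raise ValueError("path is not a regular file")
    if path.stat().st_size > max_bytes:
        raise ValueError("file exceeds the size limit")
    return path
```

The null-byte check runs first because some filesystem calls raise their own errors on embedded null bytes.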

Balance Verification (Golden Rule)

Every PDF extraction is verified with the equation: opening balance + credits − debits == closing balance. Results are tagged as VERIFIED, DISCREPANCY, or FAILED. Discrepancies can be reviewed interactively with --type review.
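
The equation can be checked exactly with decimal arithmetic. The sketch below is illustrative only: verify_balance and BalanceStatus are hypothetical names, and it assumes that FAILED means a balance could not be extracted, which the text above does not specify.

```python
from decimal import Decimal
from enum import Enum
from typing import Iterable, Optional


class BalanceStatus(Enum):
    VERIFIED = "VERIFIED"
    DISCREPANCY = "DISCREPANCY"
    FAILED = "FAILED"


def verify_balance(
    opening: Optional[Decimal],
    closing: Optional[Decimal],
    credits: Iterable[Decimal],
    debits: Iterable[Decimal],
) -> BalanceStatus:
    """Check opening + credits - debits == closing using exact decimals."""
    # Assumption: a missing balance means the check cannot be performed.
    if opening is None or closing is None:
        return BalanceStatus.FAILED
    expected = opening + sum(credits, Decimal("0")) - sum(debits, Decimal("0"))
    return BalanceStatus.VERIFIED if expected == closing else BalanceStatus.DISCREPANCY


status = verify_balance(
    Decimal("100.00"),
    Decimal("120.50"),
    credits=[Decimal("50.50")],
    debits=[Decimal("30.00")],
)
print(status.value)
```

Decimal is used instead of float so that amounts such as 0.10 + 0.20 compare equal to 0.30 without rounding error.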

Deterministic Output

For structured formats (CAMT, PAIN.001, CSV, OFX, QFX, MT940), given the same input file, the parser produces byte-identical output every run. No randomness, no model inference, no heuristic sampling. This is critical for:

Audit reproducibility: Run the same file twice and diff the output
Regulatory compliance: Demonstrate consistent processing
CI verification: 718 tests, run at 100% branch coverage, enforce determinism
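
One way to check byte-identical output is to render the same input several times and compare digests. In the sketch below, is_byte_identical is a hypothetical helper, and the render argument stands in for whatever function turns input bytes into output bytes; it is not part of the parser's API.

```python
import hashlib
from typing import Callable


def is_byte_identical(render: Callable[[bytes], bytes], data: bytes, runs: int = 2) -> bool:
    """Run render on the same input several times and compare SHA-256 digests."""
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    digests = {hashlib.sha256(render(data)).hexdigest() for _ in range(runs)}
    return len(digests) == 1


# Stand-in renderer: a pure function of its input, so every run matches.
print(is_byte_identical(lambda raw: raw.upper(), b"sample statement"))
```

Comparing digests rather than raw output keeps the check cheap to log and to store alongside an audit record.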

Supply Chain Security

SHA-256 hash-locked dependencies: Every package in poetry.lock has verified file hashes
CycloneDX SBOM: Every release includes a Software Bill of Materials
GitHub build provenance: Attestation links each artifact to its source commit
Signed commits: All commits are SSH-signed and verified in CI
Dependency verification: scripts/verify_locked_hashes.py validates all hashes locally
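
Hash locking rests on recomputing a file's SHA-256 digest and comparing it with the recorded value. The sketch below shows that comparison for a single file, assuming the "sha256:<hex>" notation that poetry.lock uses for file hashes. It is not the content of scripts/verify_locked_hashes.py, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_locked_hash(path: Path, locked_hash: str) -> bool:
    """Compare a file against a lock entry of the form 'sha256:<hex digest>'."""
    algorithm, _, expected = locked_hash.partition(":")
    if algorithm != "sha256" or not expected:
        return False
    return sha256_of_file(path) == expected.lower()
```

Reading in chunks keeps memory use flat for large wheels and source archives.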

Verify Locally

python -m pytest                          # 718 tests, 100% branch coverage
python scripts/verify_locked_hashes.py    # SHA-256 hash verification
git log --show-signature -1               # Verify commit signature