TL;DR: Bank Statement Parser processes all data locally, redacts PII by default, hardens XML parsing against XXE attacks, runs LLMs locally via Ollama, and ships with SHA-256 hash-locked dependencies and a CycloneDX SBOM.
Security by Design
Bank Statement Parser is built for processing sensitive financial data. Every design decision prioritises security, privacy, and auditability.
Zero Cloud Dependency
All processing happens locally within your runtime. The deterministic parsers make zero network calls. The hybrid PDF pipeline uses Ollama for local LLM inference — no data is sent to cloud APIs. XML parsers are explicitly configured with no_network=True, resolve_entities=False, and load_dtd=False to prevent any outbound access.
PII Redaction
Personally identifiable information (names, IBANs, postal addresses) is automatically redacted in CLI output and streaming mode. This is on by default.
- CLI: Sensitive fields show as ***REDACTED***
- Streaming: parse_streaming(redact_pii=True) (default)
- Exports: CSV/JSON/Excel retain full data for downstream processing
- Opt-in: Use --show-pii or redact_pii=False when you need unredacted output
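As a rough illustration of the default-on behaviour, the sketch below masks sensitive fields in a single record. The function name redact_record, the field names, and the SENSITIVE_FIELDS set are hypothetical and are not taken from the library's API; only the ***REDACTED*** placeholder and the redact_pii flag name come from the description above.

```python
from typing import Any, Mapping

REDACTED = "***REDACTED***"

# Hypothetical field names; the real parser's schema may differ.
SENSITIVE_FIELDS = frozenset({"name", "iban", "address"})


def redact_record(record: Mapping[str, Any], redact_pii: bool = True) -> dict[str, Any]:
    """Return a copy of record with sensitive fields masked unless redact_pii is False."""
    if not redact_pii:
        return dict(record)
    return {
        key: REDACTED if key in SENSITIVE_FIELDS and value is not None else value
        for key, value in record.items()
    }


tx = {"name": "Jane Doe", "iban": "DE89370400440532013000", "amount": "12.50"}
print(redact_record(tx))                     # name and iban are masked
print(redact_record(tx, redact_pii=False))   # explicit opt-in keeps the original values
```

Keeping redaction as the default argument means a caller has to ask for unredacted output explicitly, which mirrors the opt-in flag described above.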
XML Security (XXE Protection)
All XML parsing uses lxml with hardened settings:
- resolve_entities=False -- prevents XML entity expansion attacks
- no_network=True -- blocks all outbound network access from the parser
- load_dtd=False -- prevents DTD-based attacks
- Namespace stripping before processing -- handles any CAMT.053 variant safely
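A minimal sketch of a parser configured with the three settings listed above, assuming lxml is installed. The helper name parse_hardened is hypothetical and is not the project's own entry point. The XMLParser keyword arguments are the ones named in this section.

```python
from lxml import etree


def parse_hardened(xml_bytes: bytes) -> etree._Element:
    """Parse XML bytes with entity resolution, DTD loading, and network access disabled."""
    parser = etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # never fetch external resources
        load_dtd=False,          # do not load external DTDs
    )
    return etree.fromstring(xml_bytes, parser=parser)


# A document that declares an entity and then references it in the body.
payload = b'<!DOCTYPE foo [<!ENTITY e "SECRET">]><foo>&e;</foo>'
root = parse_hardened(payload)
print(etree.tostring(root))
```

With these settings the entity reference should be left unexpanded rather than replaced by its declared value, which is the property that blocks entity-based data exfiltration.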
ZIP Archive Security
iter_secure_xml_entries() validates every ZIP member before extraction:
- Entry size cap: 10 MB per entry (configurable)
- Total size cap: 50 MB total uncompressed (configurable)
- Compression ratio limit: 100:1 default -- detects ZIP bombs
- Encrypted entry rejection: Encrypted entries are skipped with a warning
- No disk writes: XML bytes pass directly to the parser via from_bytes()
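The checks above can be sketched with the standard library's zipfile module. This is not the implementation of iter_secure_xml_entries(). The function name iter_checked_xml_entries, its parameters, and the choice to raise ValueError on a limit violation are assumptions made for illustration. The default limits and the skip-with-warning behaviour for encrypted entries follow the list above.

```python
import io
import logging
import zipfile
from typing import Iterator, Tuple

log = logging.getLogger(__name__)


def iter_checked_xml_entries(
    zip_bytes: bytes,
    max_entry_bytes: int = 10 * 1024 * 1024,
    max_total_bytes: int = 50 * 1024 * 1024,
    max_ratio: float = 100.0,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, xml_bytes) for each XML member that passes every check."""
    total = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            if info.flag_bits & 0x1:  # bit 0 marks an encrypted entry
                log.warning("skipping encrypted entry %s", info.filename)
                continue
            if info.file_size > max_entry_bytes:
                raise ValueError(f"{info.filename}: entry exceeds size cap")
            if info.file_size > max_ratio * max(info.compress_size, 1):
                raise ValueError(f"{info.filename}: suspicious compression ratio")
            # Declared sizes can be forged, so cap the actual read as well.
            with zf.open(info) as member:
                data = member.read(max_entry_bytes + 1)
            if len(data) > max_entry_bytes:
                raise ValueError(f"{info.filename}: entry exceeds size cap")
            total += len(data)
            if total > max_total_bytes:
                raise ValueError("archive exceeds total uncompressed size cap")
            yield info.filename, data  # bytes stay in memory, nothing is extracted to disk


# Usage: build a small archive in memory and iterate over its XML members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("statement.xml", "<Document/>")
for name, xml_bytes in iter_checked_xml_entries(buf.getvalue()):
    print(name, len(xml_bytes))
```

Checking the declared sizes first rejects obvious ZIP bombs cheaply, while the capped read guards against archives whose headers understate the real payload.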
Path Traversal Prevention
Input validation blocks dangerous file paths:
- Null bytes, directory traversal patterns (../), and symlinks are rejected
- File extension validation against expected formats
- File size limits (100 MB default, configurable)
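One way these checks might be combined is sketched below, using only the standard library. The function name validate_input_path and the ALLOWED_EXTENSIONS set are hypothetical, and the library's actual validation order and error types may differ.

```python
from pathlib import Path

# Hypothetical allow-list; the real set of accepted extensions may differ.
ALLOWED_EXTENSIONS = frozenset({".xml", ".csv", ".ofx", ".qfx", ".pdf"})


def validate_input_path(raw: str, max_bytes: int = 100 * 1024 * 1024) -> Path:
    """Return a Path for raw, or raise ValueError if any check fails."""
    if "\x00" in raw:
        raise ValueError("null byte in path")
    path = Path(raw)
    if ".." in path.parts:
        raise ValueError("directory traversal pattern in path")
    if path.is_symlink():
        raise ValueError("symlinks are not accepted")
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unexpected file extension: {path.suffix!r}")
    if not path.is_file():
        raise ValueError("not a regular file")
    if path.stat().st_size > max_bytes:
        raise ValueError("file exceeds size limit")
    return path
```

The string-level checks run before any filesystem call, so a hostile path is rejected without touching the disk.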
Balance Verification (Golden Rule)
Every PDF extraction is verified with the equation: opening balance + credits − debits == closing balance. Results are tagged as VERIFIED, DISCREPANCY, or FAILED. Discrepancies can be reviewed interactively with --type review.
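The equation can be checked in a few lines. The sketch below is illustrative only: the function name verify_balance is hypothetical, and treating a missing opening or closing balance as FAILED is an assumption, since the conditions for that tag are not spelled out here.

```python
from decimal import Decimal
from typing import Iterable, Optional


def verify_balance(
    opening: Optional[Decimal],
    credits: Iterable[Decimal],
    debits: Iterable[Decimal],
    closing: Optional[Decimal],
) -> str:
    """Tag a statement by checking opening + credits - debits == closing."""
    if opening is None or closing is None:
        return "FAILED"  # assumption: a missing balance means the check cannot run
    expected = opening + sum(credits, Decimal("0")) - sum(debits, Decimal("0"))
    return "VERIFIED" if expected == closing else "DISCREPANCY"


print(verify_balance(
    Decimal("100.00"), [Decimal("50.00")], [Decimal("30.00")], Decimal("120.00")
))
```

Decimal is used instead of float so that amounts such as 0.10 compare exactly, which matters when the check is a strict equality.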
Deterministic Output
For structured formats (CAMT, PAIN.001, CSV, OFX, QFX, MT940), given the same input file, the parser produces byte-identical output every run. No randomness, no model inference, no heuristic sampling. This is critical for:
- Audit reproducibility: Run the same file twice and diff the output
- Regulatory compliance: Demonstrate consistent processing
- CI verification: 718 tests enforce determinism with 100% branch coverage
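A reproducibility check of this kind can be sketched by hashing the output of repeated runs. The helper is_byte_identical is hypothetical, and the lambda below is only a stand-in for a real parser invocation.

```python
import hashlib
import json
from typing import Callable


def is_byte_identical(produce: Callable[[], bytes], runs: int = 2) -> bool:
    """Call produce() several times and report whether every output hashes the same."""
    digests = {hashlib.sha256(produce()).hexdigest() for _ in range(runs)}
    return len(digests) == 1


# Stand-in for "parse the same file and serialise the result".
def render() -> bytes:
    return json.dumps({"b": 1, "a": 2}, sort_keys=True).encode()


print(is_byte_identical(render))
```

Comparing digests rather than raw output keeps the check cheap for large exports and gives an auditor a short value to record.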
Supply Chain Security
- SHA-256 hash-locked dependencies: Every package in poetry.lock has verified file hashes
- CycloneDX SBOM: Every release includes a Software Bill of Materials
- GitHub build provenance: Attestation links each artifact to its source commit
- Signed commits: All commits are SSH-signed and verified in CI
- Dependency verification: scripts/verify_locked_hashes.py validates all hashes locally
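The core of hash verification is comparing a file's SHA-256 digest against the locked value. The sketch below shows that single step under stated assumptions. It is not the contents of scripts/verify_locked_hashes.py, and the helper name sha256_matches is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through SHA-256 and compare against the expected hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

A verification script would typically call such a helper once per locked artifact and fail on the first mismatch.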
Verify Locally
python -m pytest # 718 tests, 100% branch coverage
python scripts/verify_locked_hashes.py # SHA-256 hash verification
git log --show-signature -1 # Verify commit signature