TL;DR: Bank Statement Parser makes zero network calls, redacts PII by default, hardens XML parsing against XXE attacks, and ships with SHA-256 hash-locked dependencies and a CycloneDX SBOM.
Security by Design
Bank Statement Parser is built for processing sensitive financial data. Every design decision prioritises security, privacy, and auditability.
Zero Network Access
All processing happens locally within your runtime. The library makes zero API calls, zero cloud connections, and collects zero telemetry. XML parsers are explicitly configured with no_network=True, resolve_entities=False, and load_dtd=False to prevent any outbound access.
PII Redaction
Personally identifiable information (names, IBANs, postal addresses) is automatically redacted in CLI output and streaming mode. This is on by default.
- CLI: Sensitive fields show as ***REDACTED***
- Streaming: parse_streaming(redact_pii=True) (default)
- Exports: CSV/JSON/Excel retain full data for downstream processing
- Opt-in: Use --show-pii or redact_pii=False when you need unredacted output
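
The default-on redaction described above can be pictured as a masking step applied to each record before it is displayed. The following is only a minimal sketch: the function `redact_record` and the field names `name`, `iban`, and `address` are hypothetical and chosen for illustration. Only the `***REDACTED***` marker and the `redact_pii` flag name come from this page.

```python
from typing import Any, Mapping

REDACTED = "***REDACTED***"

# Hypothetical field names, chosen for illustration only.
PII_FIELDS = frozenset({"name", "iban", "address"})


def redact_record(record: Mapping[str, Any], redact_pii: bool = True) -> dict[str, Any]:
    """Return a copy of the record with PII fields masked unless redact_pii is False."""
    if not redact_pii:
        return dict(record)
    return {
        key: REDACTED if key in PII_FIELDS and value is not None else value
        for key, value in record.items()
    }


row = {"name": "Jane Doe", "iban": "DE89370400440532013000", "amount": "12.50"}
print(redact_record(row))                    # masked view, the default
print(redact_record(row, redact_pii=False))  # explicit opt-in to full data
```

Making redaction the default argument means a caller has to ask for unredacted output explicitly, which matches the opt-in behaviour listed above.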
XML Security (XXE Protection)
All XML parsing uses lxml with hardened settings:
- resolve_entities=False -- prevents XML entity expansion attacks
- no_network=True -- blocks all outbound network access from the parser
- load_dtd=False -- prevents DTD-based attacks
- Namespace stripping before processing -- handles any CAMT.053 variant safely
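
As a rough illustration of the settings listed above, the sketch below builds an lxml parser with those three options and then removes namespaces from element tags. The helper names `make_hardened_parser` and `parse_and_strip_namespaces` are hypothetical, and the namespace-stripping approach is an assumption about how such a step might look, not the library's actual code.

```python
from lxml import etree


def make_hardened_parser() -> etree.XMLParser:
    # The three lxml options named in this section.
    return etree.XMLParser(resolve_entities=False, no_network=True, load_dtd=False)


def parse_and_strip_namespaces(xml_bytes: bytes) -> etree._Element:
    """Parse with the hardened parser, then drop the namespace from every element tag."""
    root = etree.fromstring(xml_bytes, parser=make_hardened_parser())
    for element in root.iter():
        # Comments, processing instructions, and entity nodes have non-string tags.
        if isinstance(element.tag, str):
            element.tag = etree.QName(element).localname
    return root


sample = (
    b'<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02">'
    b"<BkToCstmrStmt/></Document>"
)
root = parse_and_strip_namespaces(sample)
print(root.tag, [child.tag for child in root])
```

With namespaces removed, the same lookup code can address elements by local name regardless of which CAMT.053 schema version a file declares.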
ZIP Archive Security
iter_secure_xml_entries() validates every ZIP member before extraction:
- Entry size cap: 10 MB per entry (configurable)
- Total size cap: 50 MB total uncompressed (configurable)
- Compression ratio limit: 100:1 default -- detects ZIP bombs
- Encrypted entry rejection: Encrypted entries are skipped with a warning
- No disk writes: XML bytes pass directly to the parser via from_bytes()
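
The checks above can be approximated with the standard `zipfile` module. This is a sketch under stated assumptions, not the implementation of iter_secure_xml_entries(): the function name `iter_checked_xml_entries` is hypothetical, and raising `ValueError` on a limit violation and filtering members by the `.xml` suffix are guesses made for illustration. The limits mirror the defaults listed above.

```python
import io
import warnings
import zipfile
from typing import Iterator, Tuple

MAX_ENTRY_BYTES = 10 * 1024 * 1024  # 10 MB per entry
MAX_TOTAL_BYTES = 50 * 1024 * 1024  # 50 MB total uncompressed
MAX_RATIO = 100                     # 100:1 compression ratio


def iter_checked_xml_entries(
    zip_bytes: bytes,
    max_entry: int = MAX_ENTRY_BYTES,
    max_total: int = MAX_TOTAL_BYTES,
    max_ratio: float = MAX_RATIO,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, xml_bytes) for each XML member that passes the checks."""
    total = 0
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            # Bit 0 of the general-purpose flags marks an encrypted entry.
            if info.flag_bits & 0x1:
                warnings.warn(f"skipping encrypted entry: {info.filename}")
                continue
            if info.file_size > max_entry:
                raise ValueError(f"entry too large: {info.filename}")
            total += info.file_size
            if total > max_total:
                raise ValueError("archive exceeds total uncompressed size cap")
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                raise ValueError(f"suspicious compression ratio: {info.filename}")
            # Read in memory only, and do not trust the declared size alone.
            with archive.open(info) as member:
                data = member.read(max_entry + 1)
            if len(data) > max_entry:
                raise ValueError(f"entry larger than declared: {info.filename}")
            yield info.filename, data


buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("statement.xml", b"<Document/>")

for name, data in iter_checked_xml_entries(buffer.getvalue()):
    print(name, len(data))
```

Checking the declared sizes before reading keeps a ZIP bomb from being decompressed at all, while the bounded `read` guards against headers that understate the real size.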
Path Traversal Prevention
Input validation blocks dangerous file paths:
- Null bytes, directory traversal patterns (../), and symlinks are rejected
- File extension validation against expected formats
- File size limits (100 MB default, configurable)
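
A validation routine along these lines might look like the sketch below. The function name `validate_input_path` and the allowed extension set are hypothetical. The 100 MB limit is the default stated above.

```python
import tempfile
from pathlib import Path

ALLOWED_SUFFIXES = frozenset({".xml", ".zip"})  # hypothetical set of expected formats
MAX_FILE_BYTES = 100 * 1024 * 1024              # 100 MB default


def validate_input_path(raw: str, max_bytes: int = MAX_FILE_BYTES) -> Path:
    """Return a Path for raw, or raise ValueError if any check fails."""
    if "\x00" in raw:
        raise ValueError("null byte in path")
    path = Path(raw)
    if ".." in path.parts:
        raise ValueError("directory traversal pattern in path")
    # This only inspects the final path component, not its parent directories.
    if path.is_symlink():
        raise ValueError("symlinks are not accepted")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unexpected file extension: {path.suffix!r}")
    if not path.is_file():
        raise ValueError("not a regular file")
    if path.stat().st_size > max_bytes:
        raise ValueError("file exceeds size limit")
    return path


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "statement.xml"
    sample_path.write_bytes(b"<Document/>")
    print(validate_input_path(str(sample_path)).name)
```

The cheap string checks run first so that a hostile path is rejected before any filesystem call touches it.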
Deterministic Output
Given the same input file, the parser produces byte-identical output every run. No randomness, no model inference, no heuristic sampling. This is critical for:
- Audit reproducibility: Run the same file twice and diff the output
- Regulatory compliance: Demonstrate consistent processing
- CI verification: 467 tests enforce determinism with 100% branch coverage
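
One way to check the byte-identical claim yourself is to hash the serialised output of repeated runs and compare the digests. The sketch below assumes the parse result is JSON-serialisable. The helpers `output_digest` and `is_deterministic` are hypothetical, and `fake_parse` is a stand-in for whichever library entry point you actually call.

```python
import hashlib
import json
from typing import Any, Callable


def output_digest(parse: Callable[[bytes], Any], data: bytes) -> str:
    """Serialise the parse result canonically and return its SHA-256 hex digest."""
    result = parse(data)
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(parse: Callable[[bytes], Any], data: bytes, runs: int = 2) -> bool:
    """True if every run over the same input produces the same digest."""
    digests = {output_digest(parse, data) for _ in range(runs)}
    return len(digests) == 1


# Stand-in parser for demonstration only; a real check would call the library.
def fake_parse(data: bytes) -> dict:
    return {"bytes": len(data), "first": data[:1].decode("ascii")}


print(is_deterministic(fake_parse, b"<Document/>"))
```

Storing the digest alongside an audit record gives a compact way to show later that reprocessing the same file yields the same result.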
Supply Chain Security
- SHA-256 hash-locked dependencies: Every package in poetry.lock has verified file hashes
- CycloneDX SBOM: Every release includes a Software Bill of Materials
- GitHub build provenance: Attestation links each artifact to its source commit
- Signed commits: All commits are SSH-signed and verified in CI
- Dependency verification: scripts/verify_locked_hashes.py validates all hashes locally
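
The core of hash-locked verification is comparing a file's SHA-256 digest against the `sha256:<hex>` value recorded in the lock file. The sketch below shows that comparison only. It is not the contents of scripts/verify_locked_hashes.py, and the helper names `sha256_of_file` and `matches_locked_hash` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large artifacts are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_locked_hash(path: Path, locked: str) -> bool:
    """Compare a file against a lock-file style 'sha256:<hex>' entry."""
    algorithm, _, expected = locked.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    return sha256_of_file(path) == expected.lower()


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "package.whl"  # placeholder file, not a real wheel
    artifact.write_bytes(b"example artifact bytes")
    locked_value = "sha256:" + hashlib.sha256(b"example artifact bytes").hexdigest()
    verified = matches_locked_hash(artifact, locked_value)
    tampered = matches_locked_hash(artifact, "sha256:" + "0" * 64)
    print(verified, tampered)
```

Any change to a downloaded artifact, however small, changes its digest, so a mismatch is a reliable signal that the file differs from what was locked.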
Verify Locally
python -m pytest # 467 tests, 100% branch coverage
python scripts/verify_locked_hashes.py # SHA-256 hash verification
git log --show-signature -1 # Verify commit signature