Data Privacy and Compliance
Does any data leave my infrastructure?
No, not even for PDF extraction. Bank Statement Parser runs as a stateless library. All processing -- parsing, PII redaction, and artifact extraction -- happens in local runtime memory. The hybrid PDF pipeline uses Ollama for local LLM processing, with no cloud APIs. The XML parsers are hardened with no_network=True, which blocks all outbound access at the parser level. Your financial data never leaves your environment.
How does PII redaction work?
Sensitive fields are masked before they reach your application logic. The parser detects debtor names, creditor names, IBANs, and postal addresses, and replaces them with ***REDACTED*** in console output and streaming mode.
- Redaction is enabled by default in CLI output and streaming mode.
- File exports (CSV, JSON, Excel) retain unredacted data for downstream processing.
- Enable full details with --show-pii in the CLI or redact_pii=False in the API.
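The masking behaviour described above can be pictured with a minimal sketch. It is not the library's implementation: the field names in SENSITIVE_FIELDS and the helper redact_record() are hypothetical, chosen only to mirror the redact_pii switch mentioned in the list.

```python
REDACTED = "***REDACTED***"

# Hypothetical field names, chosen for illustration only.
SENSITIVE_FIELDS = frozenset({"Debtor", "Creditor", "IBAN", "Address"})


def redact_record(record: dict, redact_pii: bool = True) -> dict:
    """Return a copy of the record with sensitive, non-empty fields masked."""
    if not redact_pii:
        return dict(record)
    return {
        key: REDACTED if key in SENSITIVE_FIELDS and value else value
        for key, value in record.items()
    }


row = {
    "Amount": "125.00",
    "Currency": "EUR",
    "Debtor": "Jane Doe",
    "IBAN": "DE00 0000 0000 0000 0000 00",
}
print(redact_record(row))                    # Debtor and IBAN are masked
print(redact_record(row, redact_pii=False))  # full details retained
```

Returning a copy rather than mutating the input keeps the unredacted record available for file export, which matches the split between console output and exports described above.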
Is the extraction process deterministic?
Yes for structured formats -- byte-identical output on every run. Given the same input file, the deterministic parsers (CAMT, PAIN.001, CSV, OFX, QFX, MT940) produce identical results every time. There is no randomness, no model inference, and no heuristic sampling.
For the hybrid PDF pipeline, the LLM extraction paths can produce slight variations between runs. That is why every PDF extraction is verified against the Golden Rule (opening + credits − debits == closing), and flagged discrepancies can be reviewed interactively.
CI enforces determinism with 718 tests at 100% branch coverage, including fuzzing via Hypothesis.
Which standards does the project follow?
The project maintains ISO 13485-aligned documentation with full traceability:
- Risk Register with severity/probability scoring and residual risk assessment.
- Verification and Validation Plan with 19 gated checks across 5 phases.
- Change Control Procedure with impact assessment and rollback rules.
- SOUP Register covering all dependencies with risk levels and EOL tracking.
- Traceability Matrix mapping design inputs to implementation and verification.
Every release includes a CycloneDX SBOM, SHA-256 checksums, and a GitHub build provenance attestation.
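As a rough illustration of how a published SHA-256 checksum can be checked against a downloaded artifact, the sketch below uses only the Python standard library. The helper names and the demo file are assumptions made for the example; the project's own release tooling may work differently.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the published one."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo with a throwaway file standing in for a release artifact.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "example.whl"
    artifact.write_bytes(b"demo artifact bytes")
    published = hashlib.sha256(b"demo artifact bytes").hexdigest()
    print(matches_checksum(artifact, published))
```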
Performance and Scalability
How fast is Bank Statement Parser?
Performance benchmarks are validated in CI on every commit:
| Metric | Value |
|---|---|
| CAMT.053 throughput | 27,000+ transactions/second |
| PAIN.001 throughput | 52,000+ transactions/second |
| Per-transaction latency (CAMT) | 37 microseconds |
| Per-transaction latency (PAIN.001) | 19 microseconds |
| Time to first result | < 2 ms |
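The latency rows appear to be the reciprocals of the throughput rows. Assuming single-stream processing (an assumption, since the benchmark setup is not described here), the arithmetic can be checked as follows:

```python
def per_item_latency_us(items_per_second: float) -> float:
    """Convert a throughput into an average per-item latency in microseconds."""
    return 1_000_000 / items_per_second


for name, throughput in (("CAMT.053", 27_000), ("PAIN.001", 52_000)):
    print(f"{name}: {per_item_latency_us(throughput):.1f} microseconds")
# CAMT.053: 37.0 microseconds
# PAIN.001: 19.2 microseconds
```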
PDF extraction speed depends on the routing path: deterministic (under one second), text-LLM (seconds), vision-LLM (seconds per page).
How are large files handled?
Streaming with bounded memory -- tested at 50,000 transactions per file. Use parse_streaming() to process XML files incrementally. Each transaction is yielded as a dictionary; elements are cleared after processing to prevent memory growth. Memory does not scale with file size -- the 50K-transaction test (25+ MB) uses less than 2x the memory of the 10K-transaction test.
For files over 50 MB (for example, host-to-host PAIN.001 batches with 100K+ payments), the parser streams through a temporary file with chunked namespace stripping -- the full document is never loaded into memory.
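The yield-then-clear pattern described above can be sketched with the standard library's xml.etree.ElementTree.iterparse. This is an illustration of the general technique, not the code behind parse_streaming(); the function name stream_entries() and the flat leaf-to-dictionary mapping are assumptions.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def stream_entries(source: BinaryIO, entry_tag: str = "Ntry") -> Iterator[dict]:
    """Yield one dictionary per entry element without building the full tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != entry_tag:
            continue
        # Collect the text of leaf elements below the entry.
        record = {
            local_name(child.tag): (child.text or "").strip()
            for child in elem.iter()
            if len(child) == 0
        }
        yield record
        # Free the entry's subtree; only an empty shell stays attached
        # to the parent, so memory stays close to constant per entry.
        elem.clear()


sample = (
    b'<Document xmlns="urn:iso:std:iso:20022:tech:xsd:camt.053.001.02">'
    b"<BkToCstmrStmt><Stmt>"
    b'<Ntry><Amt Ccy="EUR">10.00</Amt><CdtDbtInd>CRDT</CdtDbtInd></Ntry>'
    b'<Ntry><Amt Ccy="EUR">4.50</Amt><CdtDbtInd>DBIT</CdtDbtInd></Ntry>'
    b"</Stmt></BkToCstmrStmt></Document>"
)
for entry in stream_entries(io.BytesIO(sample)):
    print(entry)
```

Because the generator clears each element only after the consumer asks for the next record, the caller always sees a complete dictionary while the XML subtree behind it is released.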
How are ZIP archives handled safely?
iter_secure_xml_entries() validates every member before extraction:
- Per-entry size limit (default 10 MB per entry)
- Total uncompressed size limit (default 50 MB)
- Compression ratio limit (default 100:1) to prevent ZIP bombs
- Rejection of encrypted entries
No files are written to disk. XML bytes pass directly to the parser via from_bytes().
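The checks listed above can be approximated with the standard zipfile module. The sketch below is an independent illustration, not the body of iter_secure_xml_entries(); the function name iter_checked_xml() and the exact error handling are assumptions, while the limits reuse the defaults quoted in the list.

```python
import io
import zipfile
from typing import Iterator

MAX_ENTRY_BYTES = 10 * 1024 * 1024   # per-entry limit
MAX_TOTAL_BYTES = 50 * 1024 * 1024   # total uncompressed limit
MAX_RATIO = 100                      # uncompressed : compressed


def iter_checked_xml(archive: bytes) -> Iterator[tuple[str, bytes]]:
    """Yield (name, bytes) for each XML member that passes every check."""
    total = 0
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            if info.is_dir() or not info.filename.lower().endswith(".xml"):
                continue
            if info.flag_bits & 0x1:  # bit 0 marks an encrypted member
                raise ValueError(f"encrypted entry: {info.filename}")
            if info.file_size > MAX_ENTRY_BYTES:
                raise ValueError(f"entry too large: {info.filename}")
            total += info.file_size
            if total > MAX_TOTAL_BYTES:
                raise ValueError("archive exceeds total size limit")
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                raise ValueError(f"suspicious compression ratio: {info.filename}")
            # Declared sizes can be forged, so the read itself is capped too.
            with zf.open(info) as member:
                data = member.read(MAX_ENTRY_BYTES + 1)
            if len(data) > MAX_ENTRY_BYTES:
                raise ValueError(f"entry too large when read: {info.filename}")
            yield info.filename, data


# Demo: build a small archive in memory and read it back, never touching disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("jan.xml", b"<Document/>")
for name, payload in iter_checked_xml(buffer.getvalue()):
    print(name, len(payload))
```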
Can I parse multiple files at once?
Yes. Use parse_files_parallel(), which distributes work across a ProcessPoolExecutor:
from bankstatementparser import parse_files_parallel
results = parse_files_parallel([
    "statements/jan.xml",
    "statements/feb.xml",
    "statements/mar.xml",
])
for r in results:
    print(r.path, r.status, len(r.transactions), "rows")
For bulk PDF ingestion, use scan_and_ingest(), which processes an entire directory tree with automatic deduplication.
Supported Formats
Which bank statement formats are supported?
| Format | Standard | File Type | Parser/Method |
|---|---|---|---|
| CAMT.053 | ISO 20022 Bank-to-Customer Statement | .xml | CamtParser |
| PAIN.001 | ISO 20022 Credit Transfer Initiation | .xml | Pain001Parser |
| CSV | Generic bank exports | .csv | CsvStatementParser |
| OFX | Open Financial Exchange | .ofx | OfxParser |
| QFX | Quicken Financial Exchange | .qfx | QfxParser |
| MT940 | SWIFT standard | .mt940, .sta | Mt940Parser |
| PDF | Digital and scanned statements | .pdf | smart_ingest() |
How does the hybrid PDF pipeline work?
The hybrid pipeline (v0.0.5+) routes PDFs intelligently through three extraction paths:
- Path A (Deterministic): Structured PDF tables are parsed directly -- free, fastest, no LLM required.
- Path B (Text-LLM): Digital PDFs with complex layouts are extracted by a local LLM (LiteLLM/Ollama).
- Path C (Vision-LLM): Scanned or photocopied statements are processed with a vision model.
Every extraction is verified against the Golden Rule (opening + credits − debits == closing). Discrepancies can be reviewed interactively with --type review.
Does the parser handle bank-specific CAMT.053 dialects?
Yes -- namespace-agnostic by design. The parser strips XML namespaces before processing, so it handles any CAMT.053 variant (camt.053.001.02, camt.053.001.04, or proprietary bank wrappers) without hard-coded namespaces. XPath queries target element structure, not namespace URIs.
For banks that wrap CAMT in a custom envelope, use from_string() or from_bytes() to feed the inner document directly.
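To see why stripping namespaces makes one query work across schema versions, consider this standard-library sketch. It illustrates the idea only; strip_namespaces() and entry_amounts() are hypothetical helpers, and namespaced attributes are left untouched for brevity.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root: ET.Element) -> ET.Element:
    """Rewrite '{uri}Tag' to 'Tag' in place for every element under root."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and "}" in elem.tag:
            elem.tag = elem.tag.split("}", 1)[1]
    return root


def entry_amounts(xml_text: str) -> list:
    """Return entry amounts using a path that names structure, not namespaces."""
    root = strip_namespaces(ET.fromstring(xml_text))
    return [amt.text for amt in root.findall("./BkToCstmrStmt/Stmt/Ntry/Amt")]


TEMPLATE = (
    '<Document xmlns="urn:iso:std:iso:20022:tech:xsd:{version}">'
    '<BkToCstmrStmt><Stmt><Ntry><Amt Ccy="EUR">12.34</Amt></Ntry></Stmt>'
    "</BkToCstmrStmt></Document>"
)
for version in ("camt.053.001.02", "camt.053.001.04"):
    print(version, entry_amounts(TEMPLATE.format(version=version)))
```

Both schema versions yield the same result because the path never mentions a namespace URI.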
Can I map custom CSV column headers to the standard schema?
Yes -- automatic mapping, zero configuration. CsvStatementParser recognises common header variants: "Date", "Transaction Date", and "Booking Date" all map to the date field. "Amount", "Value", and "Sum" map to amount. Split credit/debit columns (for example, "Credit" and "Debit") are detected and merged into a single signed amount automatically.
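A simplified picture of this mapping, using only the standard library, might look like the sketch below. It is not CsvStatementParser's code: the alias table contains just the variants named above plus an assumed "Description" alias, and it assumes the debit column holds positive magnitudes.

```python
import csv
import io
from decimal import Decimal

# Aliases named in the text, plus "description" as an assumed extra.
HEADER_ALIASES = {
    "date": "date",
    "transaction date": "date",
    "booking date": "date",
    "amount": "amount",
    "value": "amount",
    "sum": "amount",
    "description": "description",
    "credit": "credit",
    "debit": "debit",
}


def normalise_rows(csv_text: str) -> list:
    """Map known header variants to standard fields and sign the amount."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            if header is None or value is None:
                continue  # ragged row: ignore surplus or missing cells
            field = HEADER_ALIASES.get(header.strip().lower())
            if field:
                row[field] = value.strip()
        if "amount" in row:
            row["amount"] = Decimal(row["amount"])
        else:
            # Assumes debits are stored as positive magnitudes.
            credit = Decimal(row.pop("credit", "") or "0")
            debit = Decimal(row.pop("debit", "") or "0")
            row["amount"] = credit - debit
        rows.append(row)
    return rows


sample = (
    "Booking Date,Description,Credit,Debit\n"
    "2024-01-02,Salary,1000.00,\n"
    "2024-01-03,Rent,,750.00\n"
)
for row in normalise_rows(sample):
    print(row)
```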
What is the output format?
All parsers produce consistent pandas DataFrames with consistent column types:
| Format | Key Columns |
|---|---|
| CAMT | Amount, Currency, DrCr, Debtor, Creditor, Reference, ValDt, BookgDt, AccountId |
| PAIN.001 | PmtInfId, PmtMtd, InstdAmt, Currency, CdtrNm, EndToEndId, MsgId, CreDtTm, NbOfTxs |
| CSV/OFX/QFX/MT940 | date, description, amount (normalised) |
You can also export to CSV, JSON, Excel, Polars DataFrames, hledger, or beancount journal format.
PDF and LLM Features
Which LLM models does the hybrid pipeline support?
The pipeline uses LiteLLM as the model abstraction layer, with a direct Ollama bridge for vision requests. Recommended models:
- Text extraction: Any LiteLLM-compatible model (local or remote).
- Vision extraction: ollama/minicpm-v (recommended) for scanned PDFs.
- Categorisation: Any LiteLLM-compatible model.
All models can run 100% locally via Ollama, with no API keys required.
What is Golden Rule verification?
Every PDF extraction is verified with the equation: opening balance + credits − debits == closing balance. Results are labelled as:
- VERIFIED: Balances match exactly.
- DISCREPANCY: Balances do not match -- review is recommended.
- FAILED: Verification could not be performed (balance information is incomplete).
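The three outcomes above follow directly from the equation. A minimal sketch, assuming signed amounts (credits positive, debits negative) and exact Decimal arithmetic, could look like this; golden_rule_status() is a hypothetical name, not a function of the library.

```python
from decimal import Decimal
from typing import Optional, Sequence


def golden_rule_status(
    opening: Optional[Decimal],
    closing: Optional[Decimal],
    amounts: Sequence[Decimal],
) -> str:
    """Classify a statement as VERIFIED, DISCREPANCY, or FAILED."""
    if opening is None or closing is None:
        return "FAILED"  # balance information is incomplete
    credits = sum((a for a in amounts if a > 0), Decimal("0"))
    debits = sum((-a for a in amounts if a < 0), Decimal("0"))
    if opening + credits - debits == closing:
        return "VERIFIED"
    return "DISCREPANCY"


movements = [Decimal("300.00"), Decimal("-50.00")]
print(golden_rule_status(Decimal("100.00"), Decimal("350.00"), movements))  # VERIFIED
print(golden_rule_status(Decimal("100.00"), Decimal("349.99"), movements))  # DISCREPANCY
print(golden_rule_status(None, Decimal("350.00"), movements))               # FAILED
```

Decimal is used instead of float so that an exact match means exactly that; binary floating point would introduce spurious one-cent discrepancies.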
Can I categorise transactions automatically?
Yes. The enrichment module (v0.0.6+) provides LLM-based transaction categorisation:
from bankstatementparser.enrichment import Categorizer
categorizer = Categorizer()
enriched = categorizer.categorize_batch(transactions)
The default scheme uses 13 Plaid-compatible categories. You can supply your own category scheme.
Can I export to hledger or beancount?
Yes (v0.0.8+). Export transactions to plaintext-accounting journal format with account mapping:
from bankstatementparser.export import to_hledger, to_beancount
journal = to_hledger(transactions, account="Assets:Bank:Checking")
Treasury Operations
How does the parser handle multi-currency statements?
Every transaction keeps its original currency -- no conversion is applied. The Currency field is extracted from the Ccy XML attribute of each entry. Multi-currency statements are preserved as they are. The get_account_balances() method returns the opening and closing balance for each account with the original currency codes.
Since v0.0.8, verify_balance_multi_currency() groups transactions by currency and runs the Golden Rule separately for each group, which is useful for accounts holding several currencies.
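The grouping step can be sketched as follows. This is an illustration of the idea rather than the signature or behaviour of verify_balance_multi_currency(): the dictionary keys, the balances mapping, and both helper names are assumptions, and amounts are taken to be signed.

```python
from collections import defaultdict
from decimal import Decimal


def net_movement_by_currency(transactions: list) -> dict:
    """Sum signed amounts per currency code, with no conversion."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for tx in transactions:
        totals[tx["Currency"]] += tx["Amount"]
    return dict(totals)


def check_balances(transactions: list, balances: dict) -> dict:
    """balances maps currency -> (opening, closing); returns currency -> bool."""
    movement = net_movement_by_currency(transactions)
    return {
        currency: opening + movement.get(currency, Decimal("0")) == closing
        for currency, (opening, closing) in balances.items()
    }


txs = [
    {"Currency": "EUR", "Amount": Decimal("100.00")},
    {"Currency": "USD", "Amount": Decimal("-20.00")},
    {"Currency": "EUR", "Amount": Decimal("-30.00")},
]
balances = {
    "EUR": (Decimal("500.00"), Decimal("570.00")),
    "USD": (Decimal("80.00"), Decimal("60.00")),
}
print(check_balances(txs, balances))
```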
Does the parser support both outbound and inbound formats?
Yes. Pain001Parser handles ISO 20022 PAIN.001 credit transfer initiation files (outbound payments). CamtParser handles CAMT.053 bank-to-customer statement files (inbound reporting). Both support streaming, PII redaction, and export to CSV, JSON, Excel, hledger, and beancount. Use detect_statement_format() to detect the format automatically.
What happens when a transaction entry is malformed?
Behaviour depends on the parsing mode:
- parse() (batch mode) -- Malformed entries that are missing required fields (Amount, Currency, or CdtDbtInd) are skipped with a warning. The rest of the statement parses normally.
- parse_streaming() (streaming mode) -- Errors propagate immediately as exceptions. There is no silent data loss. This fail-fast behaviour is intentional for financial workflows where every transaction must be accounted for.
- smart_ingest() (hybrid PDF) -- Extraction errors are captured in the IngestResult together with the verification status, allowing interactive review.
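The difference between the first two modes can be shown with a small sketch. The functions parse_lenient() and parse_strict() are hypothetical stand-ins for the two policies, not the library's parse() and parse_streaming(); only the required field names are taken from the text.

```python
import warnings
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("Amount", "Currency", "CdtDbtInd")


def missing_fields(entry: dict) -> list:
    """List required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


def parse_lenient(entries: Iterable[dict]) -> list:
    """Batch-style policy: skip malformed entries with a warning."""
    kept = []
    for index, entry in enumerate(entries):
        missing = missing_fields(entry)
        if missing:
            warnings.warn(f"skipping entry {index}: missing {missing}")
            continue
        kept.append(entry)
    return kept


def parse_strict(entries: Iterable[dict]) -> Iterator[dict]:
    """Streaming-style policy: fail fast on the first malformed entry."""
    for index, entry in enumerate(entries):
        missing = missing_fields(entry)
        if missing:
            raise ValueError(f"entry {index} is missing {missing}")
        yield entry


entries = [
    {"Amount": "10.00", "Currency": "EUR", "CdtDbtInd": "CRDT"},
    {"Amount": "5.00", "Currency": ""},  # malformed
    {"Amount": "7.50", "Currency": "EUR", "CdtDbtInd": "DBIT"},
]
print(len(parse_lenient(entries)), "entries kept in lenient mode")
try:
    list(parse_strict(entries))
except ValueError as error:
    print("strict mode stopped:", error)
```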
How does deduplication work?
Each transaction is given an idempotent transaction_hash (an MD5 fingerprint) based on its key fields. This enables safe incremental ingestion: reprocessing the same file yields the same hashes, so duplicates are detected automatically.
from bankstatementparser import CamtParser, Deduplicator
parser = CamtParser("statement.xml")
dedup = Deduplicator()
result = dedup.deduplicate(dedup.from_dataframe(parser.parse()))
print(f"Unique: {len(result.unique_transactions)}")
print(f"Exact duplicates: {len(result.exact_duplicates)}")
print(f"Suspected matches: {len(result.suspected_matches)}")
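For intuition about why reprocessing is idempotent, the fingerprinting idea can be sketched with hashlib. The choice of key fields and the canonical string layout below are assumptions for illustration; the library's actual transaction_hash may combine different fields.

```python
import hashlib

# Hypothetical key fields; the real fingerprint may use a different set.
KEY_FIELDS = ("date", "amount", "currency", "reference")


def transaction_hash(tx: dict) -> str:
    """Build a stable MD5 fingerprint from a transaction's key fields."""
    canonical = "|".join(str(tx.get(field, "")).strip() for field in KEY_FIELDS)
    # MD5 serves as a fingerprint here, not as a security control.
    return hashlib.md5(canonical.encode("utf-8"), usedforsecurity=False).hexdigest()


def unique_transactions(transactions: list) -> list:
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for tx in transactions:
        fingerprint = transaction_hash(tx)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(tx)
    return unique


first_run = [{"date": "2024-01-02", "amount": "10.00", "currency": "EUR", "reference": "INV-1"}]
second_run = first_run + [{"date": "2024-01-03", "amount": "4.50", "currency": "EUR", "reference": "INV-2"}]
print(len(unique_transactions(first_run + second_run)))  # 2
```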
Installation and Compatibility
How do I install Bank Statement Parser?
# Core install (deterministic parsers only)
pip install bankstatementparser
# PDF hybrid pipeline
pip install 'bankstatementparser[hybrid]' # Text-LLM path
pip install 'bankstatementparser[hybrid-vision]' # Vision-LLM path
# Extras
pip install 'bankstatementparser[enrichment]' # Transaction categorisation
pip install 'bankstatementparser[api]' # REST API microservice
pip install 'bankstatementparser[polars]' # Polars DataFrame support
Which Python versions are supported?
Python 3.10 to 3.14. Python 3.9 support was dropped in v0.0.6 (EOL 2025-10-31). All versions are tested in CI with 718 tests at 100% branch coverage.
What are the dependencies?
The library has 5 direct dependencies:
- lxml -- XML parsing with security hardening
- pandas -- DataFrames and data manipulation
- openpyxl -- Excel export
- pydantic -- Data validation and models
- defusedxml -- XXE protection
Optional extras add: litellm, pypdf, pdfplumber, pypdfium2, fastapi, uvicorn, polars.
All dependencies are pinned to SHA-256 hash-locked versions. The CycloneDX SBOM maps every runtime component.
Does it work on macOS, Linux, and Windows?
Yes. The library runs on macOS, Linux, and Windows (via WSL). It has no platform-specific dependencies.
Is there a REST API?
Yes (v0.0.8+). Install with pip install 'bankstatementparser[api]' and run:
bankstatementparser-api --port 8000
Endpoints: POST /ingest (parse a statement) and GET /health (health check).
Reproducibility and Security
How can I verify reproducibility?
python -m pytest # 718 tests, 100% branch coverage
python scripts/verify_locked_hashes.py # SHA-256 hash verification
git log --show-signature -1 # Verify commit signature
Which security protections are built in?
- XXE Protection: resolve_entities=False, no_network=True, load_dtd=False
- ZIP Bomb Protection: Compression ratio limits, entry size limits, rejection of encrypted entries
- Path Traversal Prevention: Blocking of dangerous patterns and symlink resolution
- Input Validation: File size limit (default 100 MB), extension/format validation
- Supply Chain: SHA-256 hash-locked dependencies, CycloneDX SBOM, build provenance attestation
- Signed Commits: Enforced in CI
- Local LLMs: The hybrid PDF pipeline uses Ollama, with no cloud API calls
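As one example from the list above, path traversal prevention with symlink resolution is commonly implemented by resolving the candidate path and checking that it stays inside an allowed base directory. The sketch below shows that general technique with pathlib; resolve_inside() is a hypothetical helper, not part of the library.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses '..' segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path}")
    return candidate


with tempfile.TemporaryDirectory() as base:
    print(resolve_inside(base, "statements/jan.xml").name)
    try:
        resolve_inside(base, "../outside.xml")
    except ValueError as error:
        print("rejected:", error)
```

An absolute user_path replaces the base when joined, so it is rejected by the same check unless it already points inside the base directory.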
How does Bank Statement Parser compare with pyiso20022?
pyiso20022 is a general-purpose ISO 20022 toolkit that generates Python dataclasses from ISO XML schemas. It covers many ISO 20022 message types (PACS, PAIN, CAMT, ADMI) with schema validation. Bank Statement Parser is purpose-built for bank statement parsing, with hybrid PDF support, balance verification, enrichment, ledger export, and a unified API across seven formats, including non-ISO formats (CSV, OFX, QFX, MT940, PDF). If you need to parse bank statements into DataFrames with production-grade security controls, use Bank Statement Parser. If you need to work with the full ISO 20022 message catalogue, use pyiso20022.
What is the SWIFT ISO 20022 migration deadline?
SWIFT has published a phased migration timeline:
- November 2026: Structured and hybrid addresses become mandatory. Multi-instruction MT101 messages will be rejected. Case Management Phase 1 begins.
- November 2027: All financial institutions must be able to receive native CAMT.053 statements. SWIFT will stop converting MT messages to ISO format.
- November 2028: Full retirement of MT940, MT942, MT950, MT900, and MT910. These will be replaced by CAMT.052, CAMT.053, and CAMT.054.
Bank Statement Parser supports both the legacy MT940 format and the modern CAMT.053/PAIN.001 formats, which makes it well suited to the transition period.