Repository: github.com/bertorico/pdf2conta
A self-hosted document processing pipeline for Italian accounting workflows. Converts bank statements (estratti conto) and paper invoices into structured CSV files ready for import into Ago Zucchetti.
Built around dots.ocr, a specialized OCR model for structured documents, served via vLLM.
Manually entering bank statement lines into Ago Zucchetti takes tens of minutes per document. This tool converts the PDF directly into a CSV ready for import, with causali already assigned.
| Bank | Format |
|---|---|
| Intesa Sanpaolo | Official layout (with balance summary) |
| Intesa Sanpaolo | Generic format |
| BNL | Standard with ABI causale |
| BNL | Lista Movimenti |
| BNL | POS / financing statement |
- Upload the bank statement PDF in the web interface (port 8224)
- The system recognizes the bank automatically and assigns the causali
- Review the transactions in the (editable) table
- Download the CSV, ready for import into Ago Zucchetti
All processing happens locally: no document is sent to external services.
Need support or a template for your bank? Contact me on LinkedIn.
Converts bank statement PDFs into structured transaction data:
PDF → page images → dots.ocr → bank template parser → normalizer → CSV (Ago Zucchetti)
- Auto-detects the bank from the first page (or select manually)
- Normalizes OCR output: handles dots.ocr-specific artifacts like "3,420, 00" → 3420.00
- Assigns causali automatically via configurable pattern matching
- Gradio UI: preview and edit transactions before exporting
- Exports: CSV with Dare/Avere columns or single signed column
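The amount normalization mentioned in the list above can be sketched in a few lines. This is a minimal illustration, not the project's normalizer.py: the function name `normalize_amount` is hypothetical, and it assumes that the last separator followed by exactly two digits marks the decimal part, with every other separator treated as a thousands separator.

```python
import re


def normalize_amount(raw: str) -> float:
    """Parse an OCR'd amount string into a float (illustrative sketch)."""
    # Drop spaces, currency symbols and anything that is not a digit,
    # a separator or a minus sign.
    cleaned = re.sub(r"[^0-9.,-]", "", raw)
    negative = cleaned.startswith("-")
    cleaned = cleaned.replace("-", "")
    if not re.search(r"[0-9]", cleaned):
        raise ValueError(f"no digits in amount: {raw!r}")

    # Assumption: a final separator followed by exactly two digits is the
    # decimal mark; all earlier separators are thousands separators.
    match = re.fullmatch(r"([0-9.,]*?)[.,]([0-9]{2})", cleaned)
    if match:
        whole, cents = match.groups()
    else:
        whole, cents = cleaned, "00"

    digits = re.sub(r"[.,]", "", whole) or "0"
    value = float(f"{digits}.{cents}")
    return -value if negative else value
```

For accounting data, `decimal.Decimal` may be a better return type than `float`; the sketch uses `float` only to keep the example short.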
Converts paper pharmacy invoices (PDF with selectable text) into structured data:
PDF → pdftotext -layout → regex parser → Fattura dataclass → CSV
- Extracts: document type (TD01/TD04), document number, date, tax codes, VAT breakdown by rate
- Handles: FATTURA, NOTA DI CREDITO, multiple VAT rates, exempt amounts
- Batch processing: moves processed PDFs to processate/ subfolder
- Configurable CF_CEDENTE via environment variable
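The pdftotext → regex → dataclass flow above can be illustrated with a small sketch. Everything specific here is an assumption made for the example: the header wording `FATTURA N. <number> DEL <date>`, the reduced three-field `Fattura` dataclass, and the function names `pdf_to_text` and `parse_fattura`. The real parser also extracts tax codes and the VAT breakdown, which this sketch leaves out.

```python
import re
import subprocess
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fattura:  # hypothetical, reduced version of the project's dataclass
    tipo_documento: str   # "TD01" (fattura) or "TD04" (nota di credito)
    numero_documento: str
    data_documento: str   # DD/MM/YYYY


# Assumed header layout; real invoices may word this differently.
HEADER_RE = re.compile(
    r"(?P<tipo>NOTA\s+DI\s+CREDITO|FATTURA)\s+N\.?\s*(?P<numero>\S+)"
    r"\s+DEL\s+(?P<data>\d{2}/\d{2}/\d{4})",
    re.IGNORECASE,
)


def pdf_to_text(pdf_path: str) -> str:
    """Run `pdftotext -layout`; the trailing '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_fattura(text: str) -> Optional[Fattura]:
    """Return the header fields, or None if no header line is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    is_credit_note = match["tipo"].upper().startswith("NOTA")
    return Fattura(
        tipo_documento="TD04" if is_credit_note else "TD01",
        numero_documento=match["numero"],
        data_documento=match["data"],
    )
```

Keeping the text extraction and the parsing in separate functions means the regex logic can be unit-tested on plain strings without any PDF on disk.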
Watches an input folder and automatically OCRs any PDF dropped in:
/input/*.pdf → vLLM OCR → /output/*.md + *.json + *_tables.csv
- Exports to Markdown, JSON, and CSV (tables only)
- Configurable DPI, prompt mode, check interval
- Moves processed files to /input/processed/
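A folder watcher of this kind is usually a simple polling loop. The sketch below shows the general shape under stated assumptions: `process_pending`, `watch` and the `handle` callback are hypothetical names, and the OCR call itself is left to the callback rather than implemented.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List


def process_pending(input_dir: Path, handle: Callable[[Path], None]) -> List[str]:
    """Run `handle` on every PDF in input_dir, then move it to processed/."""
    processed_dir = input_dir / "processed"
    processed_dir.mkdir(exist_ok=True)
    done = []
    # glob is non-recursive, so files already in processed/ are not seen again.
    for pdf in sorted(input_dir.glob("*.pdf")):
        handle(pdf)
        shutil.move(str(pdf), str(processed_dir / pdf.name))
        done.append(pdf.name)
    return done


def watch(input_dir: Path, handle: Callable[[Path], None],
          check_interval: float = 30.0) -> None:
    """Poll the folder forever, sleeping check_interval seconds between scans."""
    while True:
        process_pending(input_dir, handle)
        time.sleep(check_interval)
```

Moving a file only after `handle` returns means a PDF whose processing raises an exception stays in the input folder and is retried on the next scan.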
┌─────────────────────────────────────────────────────┐
│                   Docker Compose                    │
│                                                     │
│  ┌──────────────┐    ┌─────────────────────────┐    │
│  │   dots-ocr   │    │       dots-ocr-ui       │    │
│  │    (vLLM)    │◄───│       (Gradio UI)       │    │
│  │  port 8222   │    │        port 8223        │    │
│  └──────┬───────┘    └─────────────────────────┘    │
│         │                                           │
│         │            ┌─────────────────────────┐    │
│         └───────────►│     ec-converter-ui     │    │
│         │            │       (Gradio UI)       │    │
│         │            │        port 8224        │    │
│         │            └─────────────────────────┘    │
│         │                                           │
│         └───────────►┌─────────────────────────┐    │
│                      │        processor        │    │
│                      │     (batch watcher)     │    │
│                      └─────────────────────────┘    │
│                                                     │
│  ┌───────────────────────────────────────────────┐  │
│  │  fatture-converter (standalone batch script)  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘
- Docker + Docker Compose
- NVIDIA GPU with CUDA (tested on RTX 3080 10GB)
- NVIDIA Container Toolkit
- ~8GB VRAM for dots.ocr model
git clone https://github.com/bertorico/pdf2conta
cd pdf2conta
docker compose up -d

First startup downloads the rednote-hilab/dots.ocr model (~8GB). Wait for the healthcheck to pass before using the UIs.
docker compose ps
curl http://localhost:8222/health # dots-ocr vLLM

| Service | URL | Description |
|---|---|---|
| dots-ocr-ui | http://localhost:8223 | Generic document OCR |
| ec-converter | http://localhost:8224 | Bank statement → CSV |
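Because the first startup can take a while, it may be convenient to script the wait instead of re-running curl by hand. The following is a minimal sketch using only the standard library; `http_ok` and `wait_until_healthy` are hypothetical helper names, and the retry count and delay are arbitrary defaults, not values taken from the project.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError are OSError subclasses: refused connections,
        # timeouts and non-2xx answers all count as "not ready yet".
        return False


def wait_until_healthy(url: str,
                       probe: Callable[[str], bool] = http_ok,
                       retries: int = 60,
                       delay: float = 5.0) -> bool:
    """Poll url until the probe succeeds or the retries run out."""
    for _ in range(retries):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

A call such as `wait_until_healthy("http://localhost:8222/health")` would block until the vLLM container answers, or return False after the configured number of attempts.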
- Open http://localhost:8224
- Upload a bank statement PDF
- Select bank (or leave "Auto-detect")
- Click Elabora PDF
- Review and edit the transaction table
- Select CSV format (two columns or signed single column)
- Click Genera CSV → download
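The two CSV layouts offered in the steps above differ only in how the sign of each amount is represented. The sketch below illustrates the idea; the column names, the `;` delimiter, the `export_csv` function and the convention that negative amounts go into the Dare column are all assumptions for the example, not the project's actual export format.

```python
import csv
import io
from typing import Iterable, Tuple

# (date, description, signed amount); negative means money leaving the account
Movimento = Tuple[str, str, float]


def export_csv(movimenti: Iterable[Movimento], two_columns: bool = True) -> str:
    """Render transactions as CSV with Dare/Avere columns or one signed column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    if two_columns:
        writer.writerow(["Data", "Descrizione", "Dare", "Avere"])
        for data, descrizione, importo in movimenti:
            # Assumed convention: outflows in Dare, inflows in Avere,
            # both written as positive numbers.
            dare = f"{-importo:.2f}" if importo < 0 else ""
            avere = f"{importo:.2f}" if importo >= 0 else ""
            writer.writerow([data, descrizione, dare, avere])
    else:
        writer.writerow(["Data", "Descrizione", "Importo"])
        for data, descrizione, importo in movimenti:
            writer.writerow([data, descrizione, f"{importo:.2f}"])
    return buffer.getvalue()
```

Which side counts as Dare depends on whose books are being kept, so the mapping should be checked against the import settings in Ago Zucchetti before relying on it.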
| Bank | Template | Notes |
|---|---|---|
| Intesa Sanpaolo | intesa_sanpaolo_ufficiale | Official layout with balance summary |
| Intesa Sanpaolo | intesa_sanpaolo | Generic fallback |
| BNL | bnl | Includes ABI causale column |
| BNL | bnl_lista_movimenti | "Lista movimenti" variant |
| BNL | bnl_pos | POS / financing statement |
Auto-detected from the first page. To add a new bank, create a template class in ec_converter/templates/ implementing estrai_movimenti(pages_html: list[str]) -> list[Movimento].
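As a rough idea of what such a template class could look like, here is a minimal sketch. Only the `estrai_movimenti(pages_html: list[str]) -> list[Movimento]` signature comes from the text above. The `Movimento` fields, the class name `BancaEsempioTemplate`, the helper `_RowCollector`, and the assumed three-column table layout (date, description, amount) are hypothetical; the project's real `Movimento` class and base class may differ.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Movimento:  # hypothetical stand-in for the project's own class
    data: str
    descrizione: str
    importo: str


class _RowCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


class BancaEsempioTemplate:
    """Example template for an imaginary bank with 3-column tables."""

    def estrai_movimenti(self, pages_html: list[str]) -> list[Movimento]:
        movimenti = []
        for html in pages_html:
            collector = _RowCollector()
            collector.feed(html)
            for row in collector.rows:
                cells = [cell.strip() for cell in row]
                # Assumed layout: date | description | amount.
                # Rows whose first cell does not start with digits
                # (headers, totals) are skipped.
                if len(cells) == 3 and cells[0][:2].isdigit():
                    movimenti.append(Movimento(*cells))
        return movimenti
```

Amounts are kept as raw strings here on the assumption that normalization happens in a later pipeline stage, as the pipeline diagram at the top suggests.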
Open the Gestione Causali tab to define automatic matching rules:
{
"causali": [
{
"codice": "BBAN",
"nome": "Bonifico bancario",
"pattern": ["bonifico", "accredito stipendio", "rimessa"]
}
]
}

Patterns are matched case-insensitively as substrings of the transaction description.
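The matching rule is simple enough to sketch directly. The function name `assegna_causale` is hypothetical, and returning the first rule that matches is an assumption: the text does not say how overlapping rules are resolved.

```python
import json
from typing import Optional

CONFIG = json.loads("""
{
  "causali": [
    {
      "codice": "BBAN",
      "nome": "Bonifico bancario",
      "pattern": ["bonifico", "accredito stipendio", "rimessa"]
    }
  ]
}
""")


def assegna_causale(descrizione: str, causali: list) -> Optional[str]:
    """Return the code of the first rule with a pattern found in the description."""
    testo = descrizione.casefold()
    for causale in causali:
        # Case-insensitive substring match, as described above.
        if any(pattern.casefold() in testo for pattern in causale["pattern"]):
            return causale["codice"]
    return None  # no rule matched: leave the causale for manual review
```

With the configuration above, `assegna_causale("BONIFICO A VOSTRO FAVORE", CONFIG["causali"])` yields `"BBAN"`, while a description with no matching pattern yields `None`.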
Open Gestione Replace to configure text substitutions applied before matching:
{
"replace": [
{"trova": "PAGAMENTO TRAMITE POS", "sostituisci": "POS", "nota": "Semplifica descrizioni POS"}
]
}

Place pharmacy invoice PDFs in fatture_converter/e_fatture/ and run:
docker compose run --rm fatture-converter
# Or with custom paths:
docker compose run --rm fatture-converter \
python -m fatture_converter.process_fatture \
--input-dir /app/e_fatture \
--output-dir /app/output \
--output-file fatture.csv

Compatible with Ago Zucchetti import. Each row contains:
- tipo_documento: TD01 (fattura) or TD04 (nota di credito)
- numero_documento
- data_documento (DD/MM/YYYY)
- cf_cedente, cf_cessionario
- nome_cessionario, cognome_cessionario
- One row per VAT rate with netto, iva, aliquota
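The "one row per VAT rate" layout means the invoice header fields are repeated on every row. A minimal sketch of that expansion follows; the column order, the default comma delimiter, and the function names `righe_per_aliquota` and `to_csv` are assumptions for illustration, and the values in the usage note are made up.

```python
import csv
import io

# Column names taken from the list above; their order is an assumption.
COLUMNS = [
    "tipo_documento", "numero_documento", "data_documento",
    "cf_cedente", "cf_cessionario",
    "nome_cessionario", "cognome_cessionario",
    "netto", "iva", "aliquota",
]


def righe_per_aliquota(testata: dict, riepilogo_iva: list) -> list:
    """Repeat the header fields once for each VAT-rate line."""
    return [{**testata, **riga} for riga in riepilogo_iva]


def to_csv(rows: list) -> str:
    """Serialize the rows with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

An invoice with a 10% line and a 22% line would therefore produce two CSV rows that share the same `numero_documento` and differ only in `netto`, `iva` and `aliquota`.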
# Environment variable for cedente tax code
CF_CEDENTE=YOUR_TAX_CODE docker compose run --rm fatture-converter

Or set in docker-compose.yml:
environment:
  - CF_CEDENTE=YOUR_TAX_CODE

Drop PDFs into the input/ folder. The processor picks them up automatically.
cp document.pdf input/
# Wait for CHECK_INTERVAL seconds
ls output/ # document.md, document.json, document_tables.csv

| Variable | Default | Description |
|---|---|---|
| VLLM_URL | http://dots-ocr:8000 | vLLM server URL |
| CHECK_INTERVAL | 30 | Seconds between folder checks |
| DPI | 200 | PDF to image resolution |
| PROMPT_MODE | full | OCR prompt: full/ocr/layout/tables/ordered |
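For readers wiring these variables into their own scripts, the table can be mirrored by a small loader. This is a sketch, not the processor's actual code: `ProcessorConfig` and `load_config` are hypothetical names, and rejecting unknown `PROMPT_MODE` values is an assumption about desirable behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PROMPT_MODES = {"full", "ocr", "layout", "tables", "ordered"}


@dataclass(frozen=True)
class ProcessorConfig:
    vllm_url: str
    check_interval: int  # seconds between folder checks
    dpi: int             # PDF-to-image resolution
    prompt_mode: str


def load_config(env: Optional[Mapping[str, str]] = None) -> ProcessorConfig:
    """Read the settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    prompt_mode = env.get("PROMPT_MODE", "full")
    if prompt_mode not in PROMPT_MODES:
        raise ValueError(f"unsupported PROMPT_MODE: {prompt_mode!r}")
    return ProcessorConfig(
        vllm_url=env.get("VLLM_URL", "http://dots-ocr:8000"),
        check_interval=int(env.get("CHECK_INTERVAL", "30")),
        dpi=int(env.get("DPI", "200")),
        prompt_mode=prompt_mode,
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_config({"DPI": "300"})`.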
The normalizer.py module handles dots.ocr-specific output artifacts:
# dots.ocr commonly produces:
"3,420, 00" → 3420.00 # spurious spaces
"7, 244, 87" → 7244.87 # commas as thousands separators
"2.500,00" → 2500.00 # standard Italian format
"26, 452, 20 €" → 26452.20 # with currency symbol

The normalizer tells decimal and thousands separators apart by pattern matching: the last separator followed by two digits is the decimal mark, and every earlier separator is a thousands separator. It uses no heuristics and makes no locale assumptions.
- All OCR inference runs locally via vLLM
- No documents are sent to external services
- The dots.ocr model runs entirely on your GPU
Runtime dependencies are pinned per service (each */requirements.txt); the host-side
dev tooling lives in pyproject.toml.
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
ruff check . # lint
ruff format . # format
mypy # type-check (scoped)
pytest --cov # tests + coverage
pre-commit install # run lint/format automatically on every commit

CI (GitHub Actions) runs lint, format-check, mypy and pytest on every push and pull request.
The unit suite (tests/, 131 tests) covers the pure, testable modules:
amount/date/description normalization, bank-template parsing, invoice parsing
and CSV export. Coverage is measured on these modules only — the Gradio UIs
and the CLI/batch entry-points (app.py, pipeline.py, process_fatture.py)
need a live GPU / vLLM / Gradio runtime and are exercised manually rather
than in unit tests. The CI fails when coverage on the measured set drops
below 75%; the badge above reflects the current value (update it when it
shifts meaningfully).
MIT
@bertorico — self-hosted AI, Italian fiscal domain, n8n automation.
LinkedIn · Built for real Italian accounting workflows with Ago Zucchetti.