📄 Italian Document OCR Stack — dots.ocr + vLLM + Ago Zucchetti

Repository: github.com/bertorico/pdf2conta

A self-hosted document processing pipeline for Italian accounting workflows. Converts bank statements (estratti conto) and paper invoices into structured CSV files ready for import into Ago Zucchetti.

Built around dots.ocr, a specialized OCR model for structured documents, served via vLLM.

🇮🇹 For accounting firms: bank statement PDFs → Ago Zucchetti

Manually entering bank statement lines into Ago Zucchetti takes tens of minutes per document. This tool converts the PDF directly into a CSV ready for import, with causali already assigned.

Supported banks

Bank	Format
Intesa Sanpaolo	Official layout (with balance summary)
Intesa Sanpaolo	Generic format
BNL	Standard with ABI causale
BNL	Lista Movimenti
BNL	POS / financing statement

How it works

Upload the bank statement PDF in the web interface (port 8224)
The system recognizes the bank automatically and assigns the causali
Review the transactions in the (editable) table
Download the CSV, ready for import into Ago Zucchetti

Processing happens entirely locally: no document is sent to external services.

Need support or a template for your bank? Contact me on LinkedIn

What it does

EC Converter (Estratto Conto)

Converts bank statement PDFs into structured transaction data:

PDF → page images → dots.ocr → bank template parser → normalizer → CSV (Ago Zucchetti)

Auto-detects the bank from the first page (or select manually)
Normalizes OCR output: handles dots.ocr-specific artifacts like "3,420, 00" → 3420.00
Assigns causali automatically via configurable pattern matching
Gradio UI: preview and edit transactions before exporting
Exports: CSV with Dare/Avere columns or single signed column
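
The two export layouts can be illustrated with a minimal sketch. The column names, the `;` delimiter, the `export_csv` helper and the convention that negative amounts go to Dare are assumptions made for illustration; the actual exporter in the repository may differ.

```python
import csv
import io
from decimal import Decimal


def export_csv(movimenti: list[tuple[str, str, Decimal]], two_columns: bool = True) -> str:
    """Render (date, description, signed amount) tuples as CSV text.

    Assumption: negative amounts are debits (Dare), the rest are credits (Avere).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
    if two_columns:
        writer.writerow(["Data", "Descrizione", "Dare", "Avere"])
        for data, descrizione, importo in movimenti:
            # Exactly one of the two amount columns is filled, always unsigned.
            dare = f"{-importo:.2f}" if importo < 0 else ""
            avere = f"{importo:.2f}" if importo >= 0 else ""
            writer.writerow([data, descrizione, dare, avere])
    else:
        writer.writerow(["Data", "Descrizione", "Importo"])
        for data, descrizione, importo in movimenti:
            # Single column keeps the sign of the amount.
            writer.writerow([data, descrizione, f"{importo:.2f}"])
    return buffer.getvalue()
```

Calling `export_csv(rows, two_columns=False)` produces the single signed column variant from the same input.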

Fatture Converter

Converts paper pharmacy invoices (PDF with selectable text) into structured data:

PDF → pdftotext -layout → regex parser → Fattura dataclass → CSV

Extracts: document type (TD01/TD04), document number, date, tax codes, VAT breakdown by rate
Handles: FATTURA, NOTA DI CREDITO, multiple VAT rates, exempt amounts
Batch processing: moves processed PDFs to processate/ subfolder
Configurable CF_CEDENTE via environment variable

Batch Processor

Watches an input folder and automatically OCRs any PDF dropped in:

/input/*.pdf → vLLM OCR → /output/*.md + *.json + *_tables.csv

Exports to Markdown, JSON, and CSV (tables only)
Configurable DPI, prompt mode, check interval
Moves processed files to /input/processed/
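
The watch-and-move behaviour described above could look roughly like the sketch below. `process_pending`, `watch` and the `handle_pdf` callback are hypothetical names; in the real processor the callback would be the vLLM OCR and export step.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def process_pending(input_dir: Path, handle_pdf: Callable[[Path], None]) -> list[str]:
    """Handle every PDF directly inside input_dir, then move it to processed/."""
    processed_dir = input_dir / "processed"
    processed_dir.mkdir(exist_ok=True)
    done = []
    # glob is not recursive, so files already in processed/ are not picked up again.
    for pdf in sorted(input_dir.glob("*.pdf")):
        handle_pdf(pdf)
        shutil.move(str(pdf), str(processed_dir / pdf.name))
        done.append(pdf.name)
    return done


def watch(input_dir: Path, handle_pdf: Callable[[Path], None], check_interval: int = 30) -> None:
    """Poll the folder forever, sleeping check_interval seconds between scans."""
    while True:
        process_pending(input_dir, handle_pdf)
        time.sleep(check_interval)
```

Polling keeps the sketch dependency-free; a filesystem-event library would react faster but is not implied by the README.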

Architecture

┌─────────────────────────────────────────────────────┐
│                  Docker Compose                      │
│                                                      │
│  ┌──────────────┐    ┌─────────────────────────┐    │
│  │  dots-ocr    │    │    dots-ocr-ui          │    │
│  │  (vLLM)      │◄───│    (Gradio UI)          │    │
│  │  port 8222   │    │    port 8223            │    │
│  └──────┬───────┘    └─────────────────────────┘    │
│         │                                            │
│         │            ┌─────────────────────────┐    │
│         └───────────►│    ec-converter-ui      │    │
│         │            │    (Gradio UI)          │    │
│         │            │    port 8224            │    │
│         │            └─────────────────────────┘    │
│         │                                            │
│         └───────────►┌─────────────────────────┐    │
│                      │    processor            │    │
│                      │    (batch watcher)      │    │
│                      └─────────────────────────┘    │
│                                                      │
│  ┌──────────────────────────────────────────────┐   │
│  │  fatture-converter (standalone batch script) │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘

Requirements

Docker + Docker Compose
NVIDIA GPU with CUDA (tested on RTX 3080 10GB)
NVIDIA Container Toolkit
~8GB VRAM for dots.ocr model

Quick Start

1. Clone and start

git clone https://github.com/bertorico/pdf2conta
cd pdf2conta
docker compose up -d

First startup downloads the rednote-hilab/dots.ocr model (~8GB). Wait for the healthcheck to pass before using the UIs.

2. Check service status

docker compose ps
curl http://localhost:8222/health  # dots-ocr vLLM
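
For scripted setups, the same health check can be polled from Python. This is a sketch using only the standard library; `http_ok` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

Usage, assuming the default port mapping shown above: `wait_until_healthy(lambda: http_ok("http://localhost:8222/health"))`.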

3. Access the interfaces

Service	URL	Description
dots-ocr-ui	http://localhost:8223	Generic document OCR
ec-converter	http://localhost:8224	Bank statement → CSV

EC Converter Usage

Via UI (recommended)

Open http://localhost:8224
Upload a bank statement PDF
Select bank (or leave "Auto-detect")
Click Elabora PDF
Review and edit the transaction table
Select CSV format (two columns or signed single column)
Click Genera CSV → download

Supported banks

Bank	Template	Notes
Intesa Sanpaolo	`intesa_sanpaolo_ufficiale`	Official layout with balance summary
Intesa Sanpaolo	`intesa_sanpaolo`	Generic fallback
BNL	`bnl`	Includes ABI causale column
BNL	`bnl_lista_movimenti`	"Lista movimenti" variant
BNL	`bnl_pos`	POS / financing statement

The bank is auto-detected from the first page. To add a new bank, create a template class in ec_converter/templates/ that implements estrai_movimenti(pages_html: list[str]) -> list[Movimento].
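
A new template might be shaped like the sketch below. Only the `estrai_movimenti` signature comes from this README: the `Movimento` fields, the `BancaEsempioTemplate` class, the three-column table layout and the already-normalized amounts are assumptions, and the repository may require a base class or registration step not shown here.

```python
from dataclasses import dataclass
from decimal import Decimal
from html.parser import HTMLParser


@dataclass
class Movimento:
    # Hypothetical fields; the real Movimento class in the repository may differ.
    data: str
    descrizione: str
    importo: Decimal


class _RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: list[list[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


class BancaEsempioTemplate:
    """Hypothetical template for a statement table with date, description, amount."""

    def estrai_movimenti(self, pages_html: list[str]) -> list[Movimento]:
        movimenti = []
        for html in pages_html:
            collector = _RowCollector()
            collector.feed(html)
            for cells in collector.rows:
                if len(cells) != 3:
                    continue  # skips header rows (<th> only) and malformed rows
                data, descrizione, importo = (cell.strip() for cell in cells)
                # Assumes the amount is already normalized, e.g. "-12.50".
                movimenti.append(Movimento(data, descrizione, Decimal(importo)))
        return movimenti
```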

Causali configuration

Open the Gestione Causali tab to define automatic matching rules:

{
  "causali": [
    {
      "codice": "BBAN",
      "nome": "Bonifico bancario",
      "pattern": ["bonifico", "accredito stipendio", "rimessa"]
    }
  ]
}

Patterns are matched case-insensitively as substrings of the transaction description.
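
The matching rule can be sketched in a few lines. `assign_causale` is a hypothetical name, and returning the first rule that matches is an assumption; the README does not say how conflicts between rules are resolved.

```python
import json
from typing import Optional

CONFIG = json.loads("""
{
  "causali": [
    {
      "codice": "BBAN",
      "nome": "Bonifico bancario",
      "pattern": ["bonifico", "accredito stipendio", "rimessa"]
    }
  ]
}
""")


def assign_causale(description: str, causali: list[dict]) -> Optional[str]:
    """Return the codice of the first rule with a pattern contained in the description."""
    text = description.casefold()
    for rule in causali:
        # Case-insensitive substring test, as described above.
        if any(pattern.casefold() in text for pattern in rule["pattern"]):
            return rule["codice"]
    return None  # no rule matched; the row is left for manual review
```

For example, `assign_causale("BONIFICO A VOSTRO FAVORE", CONFIG["causali"])` returns `"BBAN"`.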

Description cleanup

Open Gestione Replace to configure text substitutions applied before matching:

{
  "replace": [
    {"trova": "PAGAMENTO TRAMITE POS", "sostituisci": "POS", "nota": "Semplifica descrizioni POS"}
  ]
}
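
As a rough illustration of how such rules might be applied, the sketch below runs each substitution in order. `clean_description` is a hypothetical name, and literal, case-sensitive replacement is an assumption; the tool may match differently.

```python
def clean_description(description: str, rules: list[dict]) -> str:
    """Apply each trova/sostituisci pair in order, as a literal text replacement."""
    for rule in rules:
        description = description.replace(rule["trova"], rule["sostituisci"])
    return description


RULES = [
    {"trova": "PAGAMENTO TRAMITE POS", "sostituisci": "POS", "nota": "Semplifica descrizioni POS"},
]
```

With these rules, `clean_description("PAGAMENTO TRAMITE POS FARMACIA", RULES)` yields `"POS FARMACIA"`.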

Fatture Converter Usage

Place pharmacy invoice PDFs in fatture_converter/e_fatture/ and run:

docker compose run --rm fatture-converter

# Or with custom paths:
docker compose run --rm fatture-converter \
  python -m fatture_converter.process_fatture \
  --input-dir /app/e_fatture \
  --output-dir /app/output \
  --output-file fatture.csv

Output CSV format

Compatible with Ago Zucchetti import. Each row contains:

tipo_documento: TD01 (fattura) or TD04 (nota di credito)
numero_documento
data_documento (DD/MM/YYYY)
cf_cedente, cf_cessionario
nome_cessionario, cognome_cessionario
One row per VAT rate with netto, iva, aliquota
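
The "one row per VAT rate" layout can be sketched as follows. The column names come from the list above; the `RigaIva` helper, the exact shape of the `Fattura` dataclass and the `fattura_to_rows` function are assumptions for illustration.

```python
from dataclasses import asdict, dataclass, field
from decimal import Decimal


@dataclass
class RigaIva:
    # Hypothetical helper: one VAT bucket of an invoice.
    aliquota: Decimal
    netto: Decimal
    iva: Decimal


@dataclass
class Fattura:
    # Field names follow the CSV columns listed above; the real dataclass may differ.
    tipo_documento: str  # "TD01" (fattura) or "TD04" (nota di credito)
    numero_documento: str
    data_documento: str  # DD/MM/YYYY
    cf_cedente: str
    cf_cessionario: str
    nome_cessionario: str
    cognome_cessionario: str
    righe_iva: list[RigaIva] = field(default_factory=list)


def fattura_to_rows(fattura: Fattura) -> list[dict]:
    """Repeat the invoice header once per VAT rate, adding netto, iva and aliquota."""
    header = asdict(fattura)
    header.pop("righe_iva")
    return [{**header, **asdict(riga)} for riga in fattura.righe_iva]
```

An invoice with two VAT rates therefore produces two CSV rows that share the same header fields.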

Configuration

# Environment variable for cedente tax code
CF_CEDENTE=YOUR_TAX_CODE docker compose run --rm fatture-converter

Or set it in the Compose file (compose.yml in this repository):

environment:
  - CF_CEDENTE=YOUR_TAX_CODE

Batch Processor Usage

Drop PDFs into the input/ folder. The processor picks them up automatically.

cp document.pdf input/
# Wait for CHECK_INTERVAL seconds
ls output/  # document.md, document.json, document_tables.csv

Configuration

Variable	Default	Description
`VLLM_URL`	`http://dots-ocr:8000`	vLLM server URL
`CHECK_INTERVAL`	`30`	Seconds between folder checks
`DPI`	`200`	PDF to image resolution
`PROMPT_MODE`	`full`	OCR prompt: full/ocr/layout/tables/ordered
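
A sketch of how these variables might be read, using the defaults from the table. `load_settings` is a hypothetical helper, and rejecting unknown prompt modes is an assumption rather than documented behaviour.

```python
import os
from typing import Mapping

PROMPT_MODES = {"full", "ocr", "layout", "tables", "ordered"}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read the processor settings from the environment, falling back to the defaults."""
    settings = {
        "vllm_url": env.get("VLLM_URL", "http://dots-ocr:8000"),
        "check_interval": int(env.get("CHECK_INTERVAL", "30")),
        "dpi": int(env.get("DPI", "200")),
        "prompt_mode": env.get("PROMPT_MODE", "full"),
    }
    if settings["prompt_mode"] not in PROMPT_MODES:
        raise ValueError(f"unknown PROMPT_MODE: {settings['prompt_mode']!r}")
    return settings
```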

OCR Normalization

The normalizer.py module handles dots.ocr-specific output artifacts:

# dots.ocr commonly produces:
"3,420, 00"   → 3420.00   # spurious spaces
"7, 244, 87"  → 7244.87   # commas as thousands separators
"2.500,00"    → 2500.00   # standard Italian format
"26, 452, 20 €" → 26452.20  # with currency symbol

The normalizer inspects the last separator in the string: if it is followed by exactly two digits it is read as the decimal mark, and all earlier separators are read as thousands separators. The rule is purely structural and makes no locale assumptions.
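
A minimal sketch of that rule, written from the description above and not copied from normalizer.py. The `normalize_amount` name and the choice of `Decimal` as the return type are assumptions.

```python
import re
from decimal import Decimal


def normalize_amount(raw: str) -> Decimal:
    """Parse an OCR'd amount such as "3,420, 00" into a Decimal."""
    # Drop spaces, currency symbols and anything else that is not a digit, separator or minus.
    cleaned = re.sub(r"[^\d.,-]", "", raw)
    if not re.search(r"\d", cleaned):
        raise ValueError(f"no digits in amount: {raw!r}")
    # A final separator followed by exactly two digits marks the decimal part.
    match = re.search(r"[.,](\d{2})$", cleaned)
    if match:
        integer_part, decimals = cleaned[: match.start()], match.group(1)
    else:
        integer_part, decimals = cleaned, "00"
    # Every remaining separator is a thousands separator and is removed.
    digits = re.sub(r"[.,]", "", integer_part)
    if digits in ("", "-"):
        digits += "0"
    return Decimal(f"{digits}.{decimals}")
```

Returning `Decimal` avoids binary floating-point rounding on monetary values; a float-based variant would work the same way for display purposes.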

Privacy

All OCR inference runs locally via vLLM
No documents are sent to external services
The dots.ocr model runs entirely on your GPU

Development

Runtime dependencies are pinned per service (each */requirements.txt); the host-side dev tooling lives in pyproject.toml.

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

ruff check .          # lint
ruff format .         # format
mypy                  # type-check (scoped)
pytest --cov          # tests + coverage
pre-commit install    # run lint/format automatically on every commit

CI (GitHub Actions) runs lint, format-check, mypy and pytest on every push and pull request.

Testing

The unit suite (tests/, 131 tests) covers the pure, testable modules: amount/date/description normalization, bank-template parsing, invoice parsing and CSV export. Coverage is measured on these modules only — the Gradio UIs and the CLI/batch entry-points (app.py, pipeline.py, process_fatture.py) need a live GPU / vLLM / Gradio runtime and are exercised manually rather than in unit tests. The CI fails when coverage on the measured set drops below 75%; the badge above reflects the current value (update it when it shifts meaningfully).

License

MIT

Author

@bertorico — self-hosted AI, Italian fiscal domain, n8n automation.
LinkedIn · Built for real Italian accounting workflows with Ago Zucchetti.
