📄 Italian Document OCR Stack — dots.ocr + vLLM + Ago Zucchetti

CI · Coverage · License: MIT · Python 3.12+

Repository: github.com/bertorico/pdf2conta

A self-hosted document processing pipeline for Italian accounting workflows. Converts bank statements (estratti conto) and paper invoices into structured CSV files ready for import into Ago Zucchetti.

Built around dots.ocr, a specialized OCR model for structured documents, served via vLLM.


🇮🇹 For accounting firms: PDF bank statements → Ago Zucchetti

Manually entering bank statement lines into Ago Zucchetti takes tens of minutes per document. This tool converts the PDF directly into a CSV ready for import, with causali (transaction reason codes) already assigned.

Supported banks

| Bank | Format |
| --- | --- |
| Intesa Sanpaolo | Official layout (with balance summary) |
| Intesa Sanpaolo | Generic format |
| BNL | Standard with ABI causale |
| BNL | Lista Movimenti |
| BNL | POS / financing statement |

How it works

  1. Upload the bank statement PDF in the web interface (port 8224)
  2. The system detects the bank automatically and assigns the causali
  3. Review the transactions in the editable table
  4. Download the CSV, ready for import into Ago Zucchetti

Processing happens entirely on your machine: no document is sent to external services.

Need support or a template for your bank? Contact me on LinkedIn.


What it does

EC Converter (Estratto Conto)

Converts bank statement PDFs into structured transaction data:

PDF → page images → dots.ocr → bank template parser → normalizer → CSV (Ago Zucchetti)
  • Auto-detects the bank from the first page (or select manually)
  • Normalizes OCR output: handles dots.ocr-specific artifacts like "3,420, 00" → 3420.00
  • Assigns causali automatically via configurable pattern matching
  • Gradio UI: preview and edit transactions before exporting
  • Exports: CSV with Dare/Avere columns or single signed column

Fatture Converter

Converts paper pharmacy invoices (PDF with selectable text) into structured data:

PDF → pdftotext -layout → regex parser → Fattura dataclass → CSV
  • Extracts: document type (TD01/TD04), document number, date, tax codes, VAT breakdown by rate
  • Handles: FATTURA, NOTA DI CREDITO, multiple VAT rates, exempt amounts
  • Batch processing: moves processed PDFs to processate/ subfolder
  • Configurable CF_CEDENTE via environment variable
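The first two stages of this pipeline can be sketched as follows. This is a minimal illustration, not the project's parser: `pdf_to_text` and `tipo_documento` are hypothetical names, and the keyword rules are an assumption based only on the FATTURA / NOTA DI CREDITO and TD01 / TD04 pairs listed above. It assumes the `pdftotext` binary (Poppler) is on the PATH.

```python
import re
import subprocess
from typing import Optional


def pdf_to_text(pdf_path: str) -> str:
    """Run `pdftotext -layout`; the trailing "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def tipo_documento(text: str) -> Optional[str]:
    """Map the document heading to a type code (assumed keyword rules)."""
    # Check credit notes first: they often mention the invoice they refer to.
    if re.search(r"\bNOTA\s+DI\s+CREDITO\b", text, re.IGNORECASE):
        return "TD04"
    if re.search(r"\bFATTURA\b", text, re.IGNORECASE):
        return "TD01"
    return None
```

Testing for the credit note before the invoice keyword is a deliberate choice in this sketch, since a credit note that cites an invoice would otherwise be classified as TD01.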

Batch Processor

Watches an input folder and automatically OCRs any PDF dropped in:

/input/*.pdf → vLLM OCR → /output/*.md + *.json + *_tables.csv
  • Exports to Markdown, JSON, and CSV (tables only)
  • Configurable DPI, prompt mode, check interval
  • Moves processed files to /input/processed/
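A polling watcher of this kind could look roughly like the sketch below. The function names and the injected `process` callback are hypothetical; only the `*.pdf` pattern, the `processed/` subfolder and the periodic check come from the description above.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def scan_once(input_dir: Path, process: Callable[[Path], None]) -> list:
    """Process every PDF directly inside input_dir, then move it to processed/."""
    done_dir = input_dir / "processed"
    done_dir.mkdir(parents=True, exist_ok=True)
    handled = []
    for pdf in sorted(input_dir.glob("*.pdf")):
        process(pdf)  # hypothetical hook: OCR the file and write the exports
        shutil.move(str(pdf), str(done_dir / pdf.name))
        handled.append(pdf.name)
    return handled


def watch(input_dir: Path, process: Callable[[Path], None], interval: float = 30.0) -> None:
    """Poll the folder forever, sleeping `interval` seconds between scans."""
    while True:
        scan_once(input_dir, process)
        time.sleep(interval)
```

Because `glob("*.pdf")` is not recursive, files already moved into `processed/` are not picked up again on the next scan.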

Architecture

┌─────────────────────────────────────────────────────┐
│                  Docker Compose                      │
│                                                      │
│  ┌──────────────┐    ┌─────────────────────────┐    │
│  │  dots-ocr    │    │    dots-ocr-ui          │    │
│  │  (vLLM)      │◄───│    (Gradio UI)          │    │
│  │  port 8222   │    │    port 8223            │    │
│  └──────┬───────┘    └─────────────────────────┘    │
│         │                                            │
│         │            ┌─────────────────────────┐    │
│         └───────────►│    ec-converter-ui      │    │
│         │            │    (Gradio UI)          │    │
│         │            │    port 8224            │    │
│         │            └─────────────────────────┘    │
│         │                                            │
│         └───────────►┌─────────────────────────┐    │
│                      │    processor            │    │
│                      │    (batch watcher)      │    │
│                      └─────────────────────────┘    │
│                                                      │
│  ┌──────────────────────────────────────────────┐   │
│  │  fatture-converter (standalone batch script) │   │
│  └──────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────┘

Requirements

  • Docker + Docker Compose
  • NVIDIA GPU with CUDA (tested on RTX 3080 10GB)
  • NVIDIA Container Toolkit
  • ~8GB VRAM for dots.ocr model

Quick Start

1. Clone and start

git clone https://github.com/bertorico/pdf2conta
cd pdf2conta
docker compose up -d

First startup downloads the rednote-hilab/dots.ocr model (~8GB). Wait for the healthcheck to pass before using the UIs.

2. Check service status

docker compose ps
curl http://localhost:8222/health  # dots-ocr vLLM
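If you script the startup, the same check can be polled from Python. This is a small sketch using only the standard library; `wait_until_healthy` and its parameters are hypothetical names, and the retry counts are arbitrary assumptions rather than project defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    url: str,
    attempts: int = 60,
    delay: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe the URL until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

For example, `wait_until_healthy("http://localhost:8222/health")` blocks until the vLLM container responds or the attempts are exhausted.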

3. Access the interfaces

| Service | URL | Description |
| --- | --- | --- |
| dots-ocr-ui | http://localhost:8223 | Generic document OCR |
| ec-converter | http://localhost:8224 | Bank statement → CSV |

EC Converter Usage

Via UI (recommended)

  1. Open http://localhost:8224
  2. Upload a bank statement PDF
  3. Select bank (or leave "Auto-detect")
  4. Click Elabora PDF
  5. Review and edit the transaction table
  6. Select CSV format (two columns or signed single column)
  7. Click Genera CSV → download
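The two CSV layouts offered in step 6 differ only in how the amount is written. The sketch below illustrates the idea; the `Riga` type, the column names, the `;` delimiter and the sign convention (negative amounts go to Dare) are all assumptions for illustration, not the format the tool actually emits.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Riga:
    """Hypothetical transaction row; the project's real model may differ."""
    data: str
    descrizione: str
    importo: Decimal  # assumption: negative = outflow (Dare), positive = inflow (Avere)


def to_csv(righe: list, two_columns: bool = True) -> str:
    """Render rows with Dare/Avere columns, or with one signed Importo column."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=";", lineterminator="\n")
    if two_columns:
        writer.writerow(["Data", "Descrizione", "Dare", "Avere"])
        for r in righe:
            dare = f"{-r.importo:.2f}" if r.importo < 0 else ""
            avere = f"{r.importo:.2f}" if r.importo >= 0 else ""
            writer.writerow([r.data, r.descrizione, dare, avere])
    else:
        writer.writerow(["Data", "Descrizione", "Importo"])
        for r in righe:
            writer.writerow([r.data, r.descrizione, f"{r.importo:.2f}"])
    return buf.getvalue()
```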

Supported banks

| Bank | Template | Notes |
| --- | --- | --- |
| Intesa Sanpaolo | intesa_sanpaolo_ufficiale | Official layout with balance summary |
| Intesa Sanpaolo | intesa_sanpaolo | Generic fallback |
| BNL | bnl | Includes ABI causale column |
| BNL | bnl_lista_movimenti | "Lista movimenti" variant |
| BNL | bnl_pos | POS / financing statement |

The bank is auto-detected from the first page. To add a new bank, create a template class in ec_converter/templates/ implementing estrai_movimenti(pages_html: list[str]) -> list[Movimento].
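As a rough idea of what such a template might look like, here is a sketch for an imaginary bank whose statements contain three-column tables (date, description, amount). Only the `estrai_movimenti` signature comes from this README: the `Movimento` fields, the `BancaEsempio` class and the assumption that each page arrives as HTML with `<tr>`/`<td>` tables are all hypothetical, so check the existing templates for the real types.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Movimento:
    """Hypothetical stand-in for the project's Movimento type."""
    data: str
    descrizione: str
    importo: str


class _RowCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


class BancaEsempio:
    """Hypothetical template for Data | Descrizione | Importo tables."""

    def estrai_movimenti(self, pages_html: list) -> list:
        movimenti = []
        for html in pages_html:
            parser = _RowCollector()
            parser.feed(html)
            for cells in parser.rows:
                # Header rows use <th>, so they yield no <td> cells and are skipped.
                if len(cells) == 3:
                    data, descrizione, importo = (c.strip() for c in cells)
                    movimenti.append(Movimento(data, descrizione, importo))
        return movimenti
```

The amount is left as a raw string here on purpose, on the assumption that amount cleanup happens in the separate normalizer stage of the pipeline.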

Causali configuration

Open the Gestione Causali tab to define automatic matching rules:

{
  "causali": [
    {
      "codice": "BBAN",
      "nome": "Bonifico bancario",
      "pattern": ["bonifico", "accredito stipendio", "rimessa"]
    }
  ]
}

Patterns are matched case-insensitively as substrings of the transaction description.
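The matching rule described above can be sketched in a few lines. `assegna_causale` is a hypothetical name, and returning the first rule that matches is an assumption; the README does not say how ties between rules are resolved.

```python
from typing import Optional


def assegna_causale(descrizione: str, causali: list) -> Optional[str]:
    """Return the codice of the first rule with a pattern found in the description."""
    testo = descrizione.casefold()
    for causale in causali:
        # Case-insensitive substring match, as described above.
        if any(pattern.casefold() in testo for pattern in causale["pattern"]):
            return causale["codice"]
    return None
```

With the configuration shown above, a description such as "BONIFICO A VOSTRO FAVORE" would be assigned the code BBAN, while a description matching no pattern is left unassigned.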

Description cleanup

Open Gestione Replace to configure text substitutions applied before matching:

{
  "replace": [
    {"trova": "PAGAMENTO TRAMITE POS", "sostituisci": "POS", "nota": "Semplifica descrizioni POS"}
  ]
}
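One plausible reading of these rules is a sequence of literal substitutions applied in order, as in the sketch below. `applica_replace` is a hypothetical name, and treating `trova` as a literal, case-sensitive string (rather than a regular expression) is an assumption.

```python
def applica_replace(descrizione: str, regole: list) -> str:
    """Apply each trova -> sostituisci rule in order, as literal text."""
    for regola in regole:
        # The "nota" field is documentation only and is ignored here.
        descrizione = descrizione.replace(regola["trova"], regola["sostituisci"])
    return descrizione
```

Running the cleanup before causali matching means patterns can target the simplified text (for example "POS") instead of every long variant a bank prints.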

Fatture Converter Usage

Place pharmacy invoice PDFs in fatture_converter/e_fatture/ and run:

docker compose run --rm fatture-converter

# Or with custom paths:
docker compose run --rm fatture-converter \
  python -m fatture_converter.process_fatture \
  --input-dir /app/e_fatture \
  --output-dir /app/output \
  --output-file fatture.csv

Output CSV format

Compatible with Ago Zucchetti import. Each row contains:

  • tipo_documento: TD01 (fattura) or TD04 (nota di credito)
  • numero_documento
  • data_documento (DD/MM/YYYY)
  • cf_cedente, cf_cessionario
  • nome_cessionario, cognome_cessionario
  • One row per VAT rate with netto, iva, aliquota
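The "one row per VAT rate" layout means the invoice header fields repeat on every row. The sketch below illustrates that expansion; the field names follow the list above, but the `Fattura` and `RigaIva` shapes, the column order and the `;` delimiter are assumptions, not the project's actual dataclass or export format.

```python
import csv
import io
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class RigaIva:
    """One VAT bucket of an invoice (hypothetical shape)."""
    aliquota: Decimal
    netto: Decimal
    iva: Decimal


@dataclass
class Fattura:
    """Hypothetical stand-in for the project's Fattura dataclass."""
    tipo_documento: str
    numero_documento: str
    data_documento: str
    cf_cedente: str
    cf_cessionario: str
    nome_cessionario: str
    cognome_cessionario: str
    righe_iva: list = field(default_factory=list)


TESTATA = ["tipo_documento", "numero_documento", "data_documento", "cf_cedente",
           "cf_cessionario", "nome_cessionario", "cognome_cessionario"]
COLONNE = TESTATA + ["netto", "iva", "aliquota"]


def fattura_to_rows(f: Fattura) -> list:
    """Repeat the header fields once per VAT rate."""
    testata = {nome: getattr(f, nome) for nome in TESTATA}
    return [
        {**testata, "netto": f"{r.netto:.2f}", "iva": f"{r.iva:.2f}", "aliquota": str(r.aliquota)}
        for r in f.righe_iva
    ]


def to_csv(fatture: list) -> str:
    """Write all invoices to a single CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLONNE, delimiter=";", lineterminator="\n")
    writer.writeheader()
    for f in fatture:
        writer.writerows(fattura_to_rows(f))
    return buf.getvalue()
```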

Configuration

# Environment variable for cedente tax code
CF_CEDENTE=YOUR_TAX_CODE docker compose run --rm fatture-converter

Or set in docker-compose.yml:

environment:
  - CF_CEDENTE=YOUR_TAX_CODE

Batch Processor Usage

Drop PDFs into the input/ folder. The processor picks them up automatically.

cp document.pdf input/
# Wait for CHECK_INTERVAL seconds
ls output/  # document.md, document.json, document_tables.csv

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| VLLM_URL | http://dots-ocr:8000 | vLLM server URL |
| CHECK_INTERVAL | 30 | Seconds between folder checks |
| DPI | 200 | PDF to image resolution |
| PROMPT_MODE | full | OCR prompt: full/ocr/layout/tables/ordered |
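For reference, reading these variables with their documented defaults might look like the sketch below. The `Settings` class and `load_settings` function are hypothetical, and rejecting unknown prompt modes is an assumption about how the processor could validate its input.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PROMPT_MODES = {"full", "ocr", "layout", "tables", "ordered"}


@dataclass(frozen=True)
class Settings:
    vllm_url: str
    check_interval: int
    dpi: int
    prompt_mode: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the batch processor settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    mode = env.get("PROMPT_MODE", "full")
    if mode not in PROMPT_MODES:
        raise ValueError(f"unsupported PROMPT_MODE: {mode!r}")
    return Settings(
        vllm_url=env.get("VLLM_URL", "http://dots-ocr:8000"),
        check_interval=int(env.get("CHECK_INTERVAL", "30")),
        dpi=int(env.get("DPI", "200")),
        prompt_mode=mode,
    )
```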

OCR Normalization

The normalizer.py module handles dots.ocr-specific output artifacts:

# dots.ocr commonly produces:
"3,420, 00"      → 3420.00    # spurious spaces
"7, 244, 87"     → 7244.87    # commas as thousands separators
"2.500,00"       → 2500.00    # standard Italian format
"26, 452, 20 €"  → 26452.20   # with currency symbol

The normalizer decides between decimal and thousands separators by pattern matching on the last separator followed by two digits. It applies no other heuristics and makes no locale assumptions.
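The rule can be illustrated with a short sketch. This is not the code in normalizer.py: `normalizza_importo` is a hypothetical name, and treating every separator as a thousands separator when the text does not end in "separator + 2 digits" is an assumption about the fallback case.

```python
import re
from decimal import Decimal


def normalizza_importo(raw: str) -> Decimal:
    """Parse an OCR'd amount using the 'last separator + 2 digits' rule."""
    # Drop spaces, currency symbols and anything that is not a digit, separator or sign.
    cleaned = re.sub(r"[^\d,.\-]", "", raw)
    # A separator followed by exactly two digits at the end marks the decimals.
    match = re.search(r"[,.](\d{2})$", cleaned)
    if match:
        intero = re.sub(r"[,.]", "", cleaned[: match.start()])
        return Decimal(f"{intero or '0'}.{match.group(1)}")
    # Otherwise every separator is treated as a thousands separator (assumption).
    return Decimal(re.sub(r"[,.]", "", cleaned))
```

Returning `Decimal` rather than `float` keeps the cents exact, which matters when the values end up in an accounting import.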


Privacy

  • All OCR inference runs locally via vLLM
  • No documents are sent to external services
  • The dots.ocr model runs entirely on your GPU

Development

Runtime dependencies are pinned per service (each */requirements.txt); the host-side dev tooling lives in pyproject.toml.

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

ruff check .          # lint
ruff format .         # format
mypy                  # type-check (scoped)
pytest --cov          # tests + coverage
pre-commit install    # run lint/format automatically on every commit

CI (GitHub Actions) runs lint, format-check, mypy and pytest on every push and pull request.

Testing

The unit suite (tests/, 131 tests) covers the pure, testable modules: amount/date/description normalization, bank-template parsing, invoice parsing and CSV export. Coverage is measured on these modules only — the Gradio UIs and the CLI/batch entry-points (app.py, pipeline.py, process_fatture.py) need a live GPU / vLLM / Gradio runtime and are exercised manually rather than in unit tests. The CI fails when coverage on the measured set drops below 75%; the badge above reflects the current value (update it when it shifts meaningfully).


License

MIT


Author

@bertorico — self-hosted AI, Italian fiscal domain, n8n automation.
LinkedIn · Built for real Italian accounting workflows with Ago Zucchetti.
