This is Shubo's custom model repository: the single home for the models Shubo serves and the model research it owns. It holds two kinds of thing:
- Served models: model definitions + serving code that model-backend deploys and that the rest of the stack triggers (chat / VLM / embedding / detection models, and the docling document parser that powers structure-aware RAG ingestion).
- Model research projects: Shubo's own MLX/graph work that produces or evaluates models (linkpred-mlx, gidn, plnlp: Apple-Silicon link-prediction experiments).
It is part of the Shubo workspace and is managed by buckle: buckle init clones it as a
sibling of backend/frontend/deploy, and provisions the shared agent context
(AGENTS.md/CLAUDE.md/.claude/skills) into it. It is not a sandbox service tier: it does not
boot in buckle sandbox and is not worktreed per task (it sits alongside deploy/cloud: present and
agent-aware, not part of the running stack). See ../buckle and the workspace ../AGENTS.md.
models/
├── docling/            # the docling document parser served by model-backend (v0.1.x);
│                       # emits DoclingDocument structure for structure-aware RAG (see below)
├── custom/             # other custom served-model scaffolding
├── <served-model>/     # one dir per served model (LLM / VLM / embedding / detection),
│                       # each with its own README + versioned vX.Y.Z/ folders
│                       # e.g. qwen-2-5-vl-7b-instruct, gte-Qwen2-1.5B-instruct, yolov7, …
├── linkpred-mlx/       # research: MLX link-prediction (ogbl-collab / arxiv-semantic)
├── gidn/               # research: Graph Inception Diffusion Networks link-prediction
└── plnlp/              # research: Pairwise Learning for Neural Link Prediction
Each served model folder carries its own README.md (config, weights, build/push steps) and one
or more vX.Y.Z/ version folders. Open the folder README for that model's specifics.
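The vX.Y.Z/ convention makes it straightforward to pick the newest version of a served model programmatically. The following is a minimal sketch, not tooling that exists in this repository: the helper name latest_version is hypothetical, and it assumes version folders are named exactly v<major>.<minor>.<patch>.

```python
import re
from pathlib import Path
from typing import Optional

# Matches version folders such as v0.1.0 or v1.12.3 (the vX.Y.Z convention).
VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def latest_version(model_dir: Path) -> Optional[Path]:
    """Return the highest vX.Y.Z subfolder of model_dir, or None if there is none.

    Hypothetical helper for illustration; not part of the repository.
    """
    versions = []
    for child in model_dir.iterdir():
        match = VERSION_RE.match(child.name)
        if child.is_dir() and match:
            # Compare numerically so that v0.10.0 sorts after v0.9.0.
            key = tuple(int(part) for part in match.groups())
            versions.append((key, child))
    if not versions:
        return None
    return max(versions, key=lambda item: item[0])[1]
```

Comparing the parsed integers rather than the folder names avoids the lexicographic trap where "v0.9.0" would sort above "v0.10.0".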
Served models run on model-backend (historically the Ray-Serve plane; Ray is disabled in
production today, see backend/services/model). Because the production fleet is Apple-Silicon
MacBook Pro k3s nodes, GPU-accelerated inference does not run inside the Linux containers (no
Metal passthrough). Instead the established pattern is host-managed model servers: an MLX/Metal
FastAPI process runs on the macOS host (supervised by buckle via launchd), and model-backend
routes to it through the staticruntime / runtime_ref seam (the same way gemma/mlx-vlm/ASR are
served today). See buckle/scripts/sandbox/qwen3-asr-server.py for the host-server template and
backend/services/model/pkg/llm/runtime/ for the routing seam.
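To illustrate the routing idea only: a runtime_ref can be thought of as a key that resolves to the base URL of a server running on the macOS host. The sketch below is an assumption-laden toy in Python, not the actual seam in backend/services/model/pkg/llm/runtime/, which may be structured quite differently. The registry, the HostRuntime record, the resolve_endpoint function, and every name, host, port and path in it are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRuntime:
    """A model server running on the macOS host (hypothetical record)."""

    name: str
    base_url: str


# Hypothetical registry: runtime_ref -> host server. The names, hosts and ports
# are placeholders, not values taken from the Shubo configuration.
HOST_RUNTIMES = {
    "asr-host": HostRuntime("asr-host", "http://host.example.internal:9001"),
    "vlm-host": HostRuntime("vlm-host", "http://host.example.internal:9002/"),
}


def resolve_endpoint(runtime_ref: str, path: str) -> str:
    """Map a runtime_ref and request path to the URL an in-cluster caller would use."""
    try:
        runtime = HOST_RUNTIMES[runtime_ref]
    except KeyError:
        raise LookupError(f"no host runtime registered for {runtime_ref!r}") from None
    # Normalise slashes so base_url and path join with exactly one separator.
    return f"{runtime.base_url.rstrip('/')}/{path.lstrip('/')}"
```

The point of the indirection is that the in-cluster service never needs to know whether a model runs in a container or on the host; it only holds a reference that is resolved at call time.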
docling is the document parser behind structure-aware RAG (it must emit the
DoclingDocument export_to_dict() tree the backend consumes; see
backend/docs/artifact/m7-w1b-producer-wiring.md). To get Metal acceleration on the Apple-Silicon
fleet, docling is hosted as an MLX host server (mirroring the ASR/VLM host servers) rather than a
Ray container. The design, covering the host server, buckle role registration, and the two routing
options (redirect the parsing-router model_url, vs. a model-backend external-utility runtime), lives in
docling/docs/mlx-host-serving.md.
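For readers unfamiliar with the shape of that tree, the sketch below walks a small hand-written dictionary that imitates an export_to_dict() result. It rests on assumptions to verify against the docling-core schema: that the document has a body node whose children are {"$ref": "#/texts/0"}-style pointers into top-level lists such as texts and groups, and that items carry label and text fields. SAMPLE is invented for illustration, real output carries many more fields, and resolve_ref and iter_items are hypothetical helpers, not docling or backend APIs.

```python
from typing import Iterator, Optional, Tuple


def resolve_ref(doc: dict, ref: str) -> dict:
    """Follow a pointer such as '#/texts/0' from the document root."""
    node = doc
    for part in ref.lstrip("#/").split("/"):
        # List segments are numeric indices; dict segments are keys.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_items(
    doc: dict, node: Optional[dict] = None, depth: int = 0
) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, label, text) for every item reachable from the body, in order."""
    node = doc["body"] if node is None else node
    for child in node.get("children", []):
        item = resolve_ref(doc, child["$ref"])
        yield depth, item.get("label", ""), item.get("text", "")
        yield from iter_items(doc, item, depth + 1)


# Hand-written, heavily simplified stand-in for an export_to_dict() result.
SAMPLE = {
    "body": {
        "self_ref": "#/body",
        "children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}],
    },
    "groups": [
        {"self_ref": "#/groups/0", "label": "list", "children": [{"$ref": "#/texts/1"}]},
    ],
    "texts": [
        {"self_ref": "#/texts/0", "label": "section_header", "text": "Intro", "children": []},
        {"self_ref": "#/texts/1", "label": "list_item", "text": "First point", "children": []},
    ],
}

outline = list(iter_items(SAMPLE))
```

Preserving depth and label alongside the text is what lets an ingestion step chunk by section or list rather than by raw character offsets.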
| Runtime | AMD64 CPU | ARM64 CPU | AMD64 GPU (CUDA) | Apple GPU (Metal/MLX) |
|---|---|---|---|---|
| vLLM | β | β | β | β |
| mlx-vlm | β | β | β | β |
| Transformers | β | β | β | β (MPS) |
| llama.cpp | β | β | β | β |
On the Apple-Silicon fleet, MLX/Metal runtimes are the accelerated path (host-managed, as above).
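A process can check whether it is running natively on Apple Silicon before choosing the MLX/Metal path. This is a small sketch under stated assumptions: the function names and the returned labels ("mlx-host", "container") are hypothetical and do not correspond to identifiers in model-backend or buckle.

```python
import platform
from typing import Optional


def is_apple_silicon(system: Optional[str] = None, machine: Optional[str] = None) -> bool:
    """Return True when running natively on macOS on an arm64 CPU.

    Arguments default to the current platform; passing them explicitly makes
    the check easy to exercise on any machine.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


def accelerated_path(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Pick a (hypothetical) serving-path label for the given platform."""
    return "mlx-host" if is_apple_silicon(system, machine) else "container"
```

A Linux container on an Apple-Silicon node reports a Linux system and therefore does not pass this check, which matches the constraint that Metal is only reachable from the macOS host.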
- linkpred-mlx: Apple-Silicon (MLX) link prediction on ogbl-collab / arxiv-semantic graphs.
- gidn: Graph Inception Diffusion Networks for link prediction.
- plnlp: Pairwise Learning for Neural Link Prediction.
These are reproducible experiments (data + scripts + logs), not served models; they feed model design.
MIT. See the workspace LICENSE.