Universal Go core for project-scoped context management, retrieval, indexing, lexical evidence, and agent orchestration. Context Runtime — project-scoped corpus engine that assembles evidence-backed context packs.
fastygo/context is not a chat application, a generic RAG wrapper,
or a product companion. It is a reusable context operating layer for systems
that need to turn user intent, files, documents, logs, tool outputs, rules,
lexicons, and external sources into precise, inspectable, auditable context for
automated work.
The repository is in an early planning-and-implementation stage: architecture
decisions are recorded, and the first proof-of-concept code path is defined in
.project/progress.md. Target capabilities below describe the intended design,
not a finished runtime.
Large language models are useful, but they do not solve context management. A serious agent system needs to know what evidence was selected, where it came from, why other evidence was rejected, which tool was allowed to run, which model saw which context, and how the final result can be replayed or debugged.
This module exists because plain RAG is not enough for long-lived projects:
- vector search misses exact facts, citations, wordforms, morphology, source authority, and operational boundaries;
- unbounded chat history is noisy, lossy, and hard to audit;
- normalized text cannot replace original source text, snippets, quotes, or attestations;
- tool calls without typed policy create hidden side effects;
- background agents need ownership, cancellation, traces, and verification;
- source-backed work must preserve spans, versions, checksums, and decisions;
- model, vector, storage, language, lexicon, and tool providers must remain replaceable.
The engine is designed to combine deterministic information retrieval, source-backed context packs, typed tools, replaceable adapters, and agent/subagent orchestration without baking product identity, language-specific grammar, dictionary content, or scenario-specific products into the core.
The module is designed to provide foundations for:
- project-scoped source and artifact memory;
- deterministic indexing with manifests and incremental updates;
- hybrid retrieval over dense vectors, sparse/keyword indexes, exact matching, morphology-aware lexical paths, graph traversal, source filters, recency, and tool outputs;
- language-neutral contracts for tokens, lemmas, lexemes, wordforms, morphology features, analyses, and query expansion;
- lexicographic contracts for senses, concepts, attestations, variants, registers, regions, time periods, and lexicon sources;
- focus policies that keep retrieval and context packing scoped to the current task;
- replayable
ContextPackobjects for model calls, tools, and subagents; - typed tool registration with permissions, risk levels, and structured results;
- foreground and background
AgentRuntraces; - verification and evaluation loops for retrieval quality and factuality;
- adapter boundaries for LLMs, embeddings, rerankers, vector stores, metadata stores, artifact stores, language analyzers, lexicon resources, crawlers, and product integrations.
Nearest work follows .project/progress.md plan chunks:
| Phase | Status | Goal |
|---|---|---|
| 0 — Architecture baseline | ADRs accepted (Chunk 01) | Lock package, storage, index, trace, and adapter boundaries before broad code |
| 1 — Proof of concept | Chunks 02–12 pending | Prove ingest → retrieve → ContextPack → fake model/tool step → verifier → replayable trace |
Immediate next step: Chunk 02 — internal package skeleton, neutral domain
models, and interface-only ports with go test ./... and no external services.
Before or during Chunk 02, add follow-up ADRs for multilingual linguistic
contracts and lexicographic context contracts so domain types such as
TokenOccurrence, MorphAnalysis, Sense, and Attestation stay in core
without importing context-lang-* or dictionary adapters.
PoC storage progression (from accepted ADRs):
- Artifacts: local filesystem first; object storage later.
- Metadata: in-memory first; PostgreSQL when the live stack is introduced.
- Dense vectors:
VectorStoreport first; PostgreSQL + pgvector as the first live backend behind that port. - Sparse search: fake or PostgreSQL full-text baseline first; dedicated sparse
services such as
context-sparseonly when measurements justify them. - Models and language: fake deterministic providers and simple/no-op linguistic
fixtures first; real providers and
context-lang-*adapters after the CLI proof.
Downstream UX shells such as Lab may consume CLI JSON and proof artifacts.
They must not become imports or dependencies of the core module.
- Project scoped by default: every index, artifact, run, and decision belongs to an explicit project or workspace.
- Evidence before generation: model calls receive selected context, not unbounded history.
- Provenance is mandatory: facts point back to source spans, versions, checksums, attestations, or tool outputs.
- Original text is preserved: normalization, lemmatization, query expansion, and concept mapping never replace source text.
- Hybrid retrieval wins: dense vectors, sparse search, exact matching, morphology, graph traversal, recency, sense/concept filters, and citation signals cooperate.
- Multilingual by contract: the core defines stable contracts; language
complexity lives in
context-lang-*adapters. - Lexicons are evidence resources: dictionaries, thesauri, historical lexicons, slang, and regional vocabularies live in resource adapters, not in the core.
- Models are replaceable: LLMs, embedding models, and rerankers are adapters, not hardcoded infrastructure.
- Tools are typed: every tool has a name, schema, permission policy, risk level, and structured result.
- Agents are configurations: orchestration policy, rules, skills, tools, and model preferences are data-driven where possible.
- Background work is explicit: scheduled or event-triggered agents are observable, cancellable, and auditable.
- Brand neutrality is required: downstream products and companions configure identity on top of the core; core packages stay generic.
TaskIntent
-> PolicySnapshot
-> FocusProfile
-> RetrievalPlan
-> RetrieverCalls
-> CandidateSet
-> RerankedEvidence
-> ContextPack
-> ModelCall | ToolCall | SubagentRun
-> Verification
-> Decision | Artifact | Result
-> EvaluationTrace
The central handoff object is ContextPack: selected, ranked, budget-aware,
source-backed context for a model, tool, verifier, or subagent. Every step should
remain traceable to a task, policy snapshot, retrieval plan, accepted and
rejected evidence, source spans, checksums, permissions, and evaluation
signals. ContextPack and AgentRun records must be versioned and replayable
so bad retrieval, generation, or tool decisions can be debugged later.
Project: an isolated workspace or tenant boundary.Source: a file, document, URL, log stream, database snapshot, spec, chat history, or tool output.Artifact: stored source material or generated intermediate output.IndexSnapshot: an immutable index commit with parser, chunker, embedding, morphology, and lexicon-resource version fields.ManifestNode: a Merkle-style manifest node for incremental source-tree sync.Chunk: an indexed source span with metadata and provenance.ContextRef,PathAlias: stable references for model-visible source paths without leaking host filesystem layout.VectorNamespace,SparseIndexRef: project- and snapshot-scoped index handles behind replaceable vector and sparse backends.TokenOccurrence: original token text with stable source offsets.Lexeme,Lemma,WordForm: language-neutral references for lexical forms.MorphAnalysis: one possible morphology analysis; ambiguity stays explicit.QueryExpansion: explainable lexical or morphology-driven expansion.Sense: a specific meaning of a lexeme.Concept: a language-independent or domain concept connected to labels and senses.Attestation: witnessed usage in a source, with quote, span, date, region, register, authority, and confidence.Variant: orthographic, historical, regional, slang, spelling, or script variant.MultiwordExpression: lexical unit spanning multiple tokens or syntactic words.LexiconSource: dictionary, corpus, thesaurus, glossary, authority list, or community vocabulary source.PolicySnapshot: frozen permission, risk, and approval policy for one run.FocusProfile: the task-specific lens that defines scope, freshness, exactness, citation strictness, budgets, allowed tools, and irrelevant areas.RetrievalPlan: chosen retriever paths, filters, and budgets for a task.EvidenceItem: one source-backed candidate with rank signals and rejection reasons.ContextPack: selected evidence and instructions for a model/tool/agent step.ModelCall: one model invocation with inputs, outputs, and provider version.AgentRun: a foreground, background, scheduled, or event-triggered execution trace.ToolCall: a typed invocation with input, output, status, permissions, and side-effect metadata.Evaluation: a reproducible check for retrieval quality or task correctness.
The core is multilingual by contract, not by embedding every language inside the repository.
fastygo/context
-> language-neutral contracts
-> source spans, snapshots, retrieval, ContextPack, traces
-> no language-specific dictionaries or grammar rules
context-lang-*
-> normalization
-> tokenization
-> lexeme and wordform analysis
-> morphology generation
-> query expansion
-> language-specific eval fixtures
Language adapters may support Russian, English, German, Spanish, French, Hindi, Indic languages, and future languages. They must preserve source offsets, analyzer versions, dictionary versions, ambiguity candidates, and expansion provenance. Core contracts should stay compatible with portable schemes such as Universal Dependencies and UniMorph while allowing adapter-owned raw metadata.
Lexicon resource adapters are separate from language analyzers. They map
dictionaries, TEI resources, SKOS/ISO 25964 concept schemes, historical
lexicons, regional vocabularies, slang, and community terminology to neutral
contracts such as Sense, Concept, Attestation, and LexiconSource.
TokenOccurrence
-> WordForm
-> Lemma
-> Lexeme
-> Sense
-> Concept
-> Attestation
-> SourceSpan
-> ContextPackEvidence
Lexeme and morphology answer "which form." Sense, concept, and attestation answer "which meaning, where, when, in which register, and according to which evidence."
The indexing pipeline is source-agnostic and snapshot-oriented:
source adapter
-> artifact store
-> parser
-> chunker
-> tokenizer
-> language adapter
-> enricher
-> dual Merkle manifest (source tree + chunk set)
-> IndexSnapshot commit
-> dense vector index (VectorStore)
-> sparse/exact index (SparseIndexRef)
-> graph index
-> metadata store
Incremental sync compares manifests and checksums so unchanged sources and chunks are not reprocessed. Different source types need different chunking strategies. Source code, technical documentation, scientific text, legal text, dictionary entries, usage citations, chat history, logs, web captures, and tool output should not be split with the same rules.
First PoC scope: local file and artifact adapters, plain text and Markdown parsing, paragraph- and section-aware chunking, neutral token-span capture, and simple or no-op language hooks—not production morphology or dictionary imports.
Retrieval is a planning problem, not one vector query. Every retriever call should
carry project_id, snapshot_id, analyzer or embedding versions, and
explainable match reasons:
task intent
-> policy snapshot
-> focus profile
-> retrieval plan
-> parallel retriever calls
-> candidate merge
-> deduplication
-> reranking
-> evidence validation
-> context pack
Supported retrieval paths include:
- dense vector search;
- sparse/BM25-style search;
- exact phrase and source-span search;
- lemma, wordform, and morphology-expanded search;
- sense, concept, attestation, register, region, and time-period filters;
- entity and metadata filters;
- citation lookup;
- graph traversal;
- recent activity retrieval;
- tool result retrieval;
- external source retrieval through explicit adapters.
Every retrieval contribution should be explainable. A candidate that matched through a generated wordform, fuzzy variant, concept label, or attestation must preserve the original surface text and source span.
Tools are registered through typed metadata:
tool
name
description
input schema
output schema
permission policy
risk level
side-effect class
timeout
background support
Agents coordinate retrieval, context packs, model calls, tools, verifiers, and subagents. Subagents run with isolated context and return structured summaries or artifacts to the parent run. Concrete products should integrate through adapters, tools, graph projections, rules, skills, contracts, or companion configuration.
The canonical planning documents live in .project/:
.project/roadmap-context-core.md: architectural baseline and phased roadmap..project/progress.md: copy-paste plan chunks from baseline to PoC..project/decisions/: accepted ADRs for package, storage, index, trace, and adapter boundaries..project/future-layer.md: deferred production-grade layers and review gates..project/plugins/language-adapters.md: roadmap forcontext-lang-*language adapters..project/plugins/lexicon-resources.md: roadmap for dictionaries, thesauri, attestations, historical lexicons, and controlled vocabulary resources.
The project skill lives in .cursor/skills/context-core-steward/ and should be
used when planning, implementing, reviewing, or debugging this repository. It
keeps work aligned with DDD, Clean Architecture, SOLID, DRY, TDD, traceability,
brand-neutral API boundaries, and the active plan chunk in progress.md.
The public API should stay small until boundaries are stable. Layout below is the target skeleton for Chunk 02 onward, not an existing tree.
cmd/
context-dev/ # local developer CLI for indexing, search, evals
internal/
agentruntime/ # agent runs, orchestration, subagents, scheduling
artifacts/ # artifact metadata and stores
config/ # project config, rules, ignore patterns
corpus/ # projects, sources, chunks, provenance
evals/ # retrieval and task evaluation harnesses
graph/ # entity, citation, co-occurrence, dependency edges
indexing/ # parsing, chunking, enrichment, manifests
lexicon/ # sense, concept, attestation, resource contracts
linguistic/ # language-neutral contracts and simple adapters
models/ # LLM, embedding, reranker interfaces
policy/ # permissions, risks, approvals
retrieval/ # planners, retrievers, rerankers, context packs
# and focus profiles
storage/ # metadata store abstractions and adapters
tools/ # registry, schemas, execution
tracing/ # append-only runtime events and redaction
pkg/
contextkit/ # stable public interfaces, added only when proven
Prefer internal while interfaces are changing. Move packages to pkg only
when another module needs a stable import surface.
The hypothesis-validation path is split into two stages in progress.md:
Stage A — no external services (Chunks 02–08)
domain models and store ports
-> local artifact store + in-memory metadata
-> deterministic indexing with Merkle manifest and IndexSnapshot
-> exact/sparse/vector retrieval through ports and fakes
-> context pack builder and verifier
-> fake model/tool agent run
-> context-dev CLI with machine-readable JSON
Stage B — live local stack (Chunks 09–12)
local project corpus
-> deterministic indexing
-> PostgreSQL metadata + pgvector-backed VectorStore
-> PostgreSQL full-text or fake sparse baseline
-> CLI ingest, search, context pack, and agent-run trace
-> source-backed verification and replay/debug output
The first proof is not a polished product or chat UI. It is a working CLI loop
that shows the architecture can index project sources incrementally, retrieve
relevant evidence through hybrid paths, build an inspectable ContextPack,
execute a typed model/tool step with policy enforced outside the model, verify
source-backed claims, and replay the trace. It must prove neutral linguistic and
lexicographic contracts with simple or fake fixtures before adding production
context-lang-* adapters or TEI/SKOS/dictionary importers.
Expected first demo corpus: repository docs such as README.md and
.project/*.md, with proof artifacts suitable for downstream UX replay.
- Chat UI, browser shell, or downstream
Labcode inside the core module. - Generic autonomous control of arbitrary systems.
- Unlimited web crawling.
- Plugin marketplace.
- Scenario-specific products in the core package tree.
- Multi-tenant billing.
- Complex UI framework ownership.
- Hard dependency on one model provider.
- Hard dependency on one vector database; QDrant and Turbopuffer remain optional
later adapters behind
VectorStore. - Language-specific dictionaries, grammar rules, or morphology engines in core.
- TEI/SKOS importers, historical dictionaries, regional vocabularies, slang lexicons, or community lexicon resources in the first PoC.
- Implicit background writes without audit and approval policy.
- Full distributed worker orchestration.
- Production-grade query language.
- Keep raw source storage, embeddings, metadata, indexes, traces, and generated artifacts separate.
- Version embedding models, parsers, chunkers, tokenizers, analyzers, dictionaries, sparse indexes, and graph schemas.
- Never replace original source text with normalized text.
- Do not collapse senses into lemmas or concepts into labels.
- Treat generated wordforms as expansion candidates, not as attestations unless they are witnessed in a source.
- Treat long tool outputs as artifacts that can be searched and read in slices.
- Prefer deterministic verification for claims that depend on source material.
- Record enough run data to reproduce bad retrieval and bad tool decisions.
- Enforce permission and side-effect decisions outside the model.
- Design every background action as an event with policy, trace, owner, and cancellation.
| Item | State |
|---|---|
| Planning baseline | Accepted (roadmap-context-core.md) |
| Architecture decisions | 14 ADRs under .project/decisions/ (Chunk 01 complete) |
| Go implementation | Not started; Chunk 02 is next |
| Public API | None exported yet; pkg/contextkit deferred until proven |
| Dependencies | go.mod intentionally minimal |
Public APIs are expected to change until the PoC CLI loop and core runtime stabilize.
License: MIT