SPEC.md — subagent-fleet
Project
Name: subagent-fleet
Repository: https://github.com/adityak74/subagent-fleet
Tagline: Run Claude Code-style subagents across your local model fleet.
One-line description:
subagent-fleet is a config-first CLI that discovers local/remote Ollama nodes, maps Claude Code-style subagents to the best model/machine, and generates LiteLLM + .claude/agents configuration so developers can run a private local subagent fleet.
1. Product Vision
Modern coding agents increasingly use subagents for planning, implementation, review, testing, and summarization. Local model users often have multiple capable machines — MacBooks, Mac minis, GPU workstations, home servers — but their workflow usually still points to a single local Ollama endpoint.
subagent-fleet turns those machines into a private local subagent fleet.
Example:
planner → small fast model on M4 Mac mini 16GB
implementer → large coding model on M4 Mac mini 64GB
reviewer → large coding model on M4 Mac mini 64GB
summarizer → small local model on laptop
The tool should not replace Ollama or LiteLLM. It should sit above them as a workflow/configuration layer.
2. Core Problem
Today, a developer can manually configure:
- Ollama on multiple machines
- LiteLLM proxy routing
- Claude Code subagent markdown files
- Environment variables for Claude Code
- Model warmup calls
- Health checks and status checks
But this is tedious and error-prone.
subagent-fleet should make this one config-driven workflow.
3. Non-Goals
This project is not:
- a new inference engine
- a replacement for Ollama
- a replacement for LiteLLM
- a model sharding framework
- Kubernetes for local LLMs
- a cloud orchestration platform
- a public model hosting tool
It should avoid overengineering.
The MVP should focus on:
- config
- discovery
- generation
- health checks
- warmup
- clear local developer workflow
4. Target Users
Primary users:
- developers using Claude Code or Claude-Code-like coding harnesses
- developers running Ollama locally
- people with multiple Macs/workstations
- local-first AI developers
- open-source builders trying to reduce cloud token usage
- privacy-conscious developers
5. MVP Scope
Implement a Python CLI named:
subagent-fleet
MVP commands:
subagent-fleet init
subagent-fleet discover
subagent-fleet validate
subagent-fleet generate
subagent-fleet warmup
subagent-fleet status
Optional but useful:
subagent-fleet doctor
subagent-fleet clean
Do not implement a daemon, dashboard, dynamic scheduler, or full proxy in the MVP.
6. Recommended Tech Stack
Use Python.
Suggested dependencies:
typer CLI framework
pydantic config validation
pyyaml YAML parsing/writing
httpx HTTP calls to Ollama
jinja2 templates for generated files
rich pretty terminal output
Suggested packaging:
pyproject.toml
src/subagent_fleet/
7. Repository Structure
Create this structure:
subagent-fleet/
README.md
SPEC.md
LICENSE
pyproject.toml
.gitignore
src/
subagent_fleet/
__init__.py
cli.py
config.py
discovery.py
health.py
warmup.py
status.py
generators/
__init__.py
litellm.py
claude_agents.py
env_file.py
templates/
litellm_config.yaml.j2
claude_agent.md.j2
env.subagent-fleet.j2
examples/
fleet.yaml
litellm_config.generated.yaml
claude-agents/
planner.md
implementer.md
reviewer.md
tests/
test_config.py
test_discovery.py
test_generate_litellm.py
test_generate_claude_agents.py
8. Configuration File
Primary config file:
fleet.yaml
The config should define:
- project metadata
- gateway settings
- Ollama nodes
- model aliases
- agent mappings
Example:
project:
name: local-dev
gateway:
provider: litellm
host: 0.0.0.0
port: 4000
master_key_env: LITELLM_MASTER_KEY
nodes:
m5-local:
endpoint: http://localhost:11434
tags:
- controller
- local
- fast
m4-mini-64gb:
endpoint: http://192.168.1.50:11434
tags:
- heavy
- coder
- reviewer
m4-mini-16gb:
endpoint: http://192.168.1.51:11434
tags:
- small
- planner
- summarizer
models:
heavy-coder:
node: m4-mini-64gb
ollama_model: qwen2.5-coder:32b
litellm_alias: claude-sonnet-local
context: 32768
timeout: 600
max_parallel: 1
small-coder:
node: m4-mini-16gb
ollama_model: qwen2.5-coder:7b
litellm_alias: claude-haiku-local
context: 8192
timeout: 300
max_parallel: 1
agents:
planner:
model: small-coder
description: Use for planning, file discovery, task decomposition, and summarization.
tools:
- Read
- Grep
- Glob
prompt: |
You are a fast local planning agent.
Do not edit files.
Return a concise response with:
- plan
- relevant files
- risks
- next recommended agent
implementer:
model: heavy-coder
description: Use for implementation, bug fixes, refactors, and patch creation.
tools:
- Read
- Grep
- Glob
- Edit
- MultiEdit
- Bash
prompt: |
You are a senior implementation agent.
Make minimal, correct changes.
Prefer small patches.
Run relevant checks when possible.
Explain what changed and why.
reviewer:
model: heavy-coder
description: Use after implementation to review diffs, tests, regressions, and maintainability.
tools:
- Read
- Grep
- Glob
- Bash
prompt: |
You are a strict code reviewer.
Focus on correctness, regressions, missing tests, security issues,
over-engineering, and maintainability.
Review the diff and test output.
Return only actionable issues.
9. Config Validation Rules
Validate fleet.yaml with Pydantic.
Rules:
Project
project.namerequired.project.gateway.providerdefaults tolitellm.project.gateway.portdefaults to4000.project.gateway.hostdefaults to127.0.0.1unless explicitly set.
Nodes
Each node must have:
endpoint: http://host:port
Validation:
- endpoint must be valid HTTP/HTTPS URL.
- tags optional; default empty list.
- duplicate node names disallowed.
Models
Each model must have:
nodeollama_modellitellm_alias
Validation:
nodemust reference existingnodeskey.contextdefault:8192.timeoutdefault:300.max_paralleldefault:1.
Agents
Each agent must have:
modeldescription
Validation:
modelmust reference existingmodelskey.toolsdefaults to[].promptdefaults to a generic role prompt if omitted.- agent names should be filesystem-safe: lowercase letters, numbers, hyphens, underscores.
10. CLI Command Details
10.1 subagent-fleet init
Creates a starter fleet.yaml.
Behavior:
- If
fleet.yamlexists, do not overwrite unless--force. - Generate a useful local example with:
- local Ollama node
- one heavy-coder model placeholder
- planner, implementer, reviewer agents
Command:
subagent-fleet init
Options:
--force
--output fleet.yaml
Expected output:
Created fleet.yaml
Edit it with your Ollama node endpoints, then run:
subagent-fleet discover
subagent-fleet generate
10.2 subagent-fleet discover
Discovers models available on configured Ollama nodes.
For each node:
Call:
GET {node.endpoint}/api/tags
Expected Ollama response contains a models list.
Behavior:
- Load
fleet.yaml. - Check every node.
- Display online/offline status.
- Display discovered models.
- Optionally write discovery metadata to
.subagent-fleet/discovery.json.
Command:
subagent-fleet discover
Options:
--config fleet.yaml
--json
--write
Expected terminal output:
Fleet: local-dev
Node Status Models
-----------------------------------------------
m5-local online qwen-coder:14b, llama3.2:3b
m4-mini-64gb online qwen2.5-coder:32b
m4-mini-16gb online qwen2.5-coder:7b
Error handling:
- If a node fails, show it as offline.
- Do not crash the whole command unless config is invalid.
- Include connection error message in verbose mode.
10.3 subagent-fleet validate
Validates fleet.yaml.
Command:
subagent-fleet validate
Options:
--config fleet.yaml
Checks:
- config schema valid
- node references valid
- model references valid
- agent references valid
- endpoint format valid
- no duplicate aliases creating unintended collisions
Expected output:
fleet.yaml is valid.
If invalid:
Invalid fleet.yaml:
models.heavy-coder.node references unknown node: m4-mini-64
10.4 subagent-fleet generate
Generates:
litellm_config.yaml
.claude/agents/*.md
.env.subagent-fleet
Command:
subagent-fleet generate
Options:
--config fleet.yaml
--out .
--litellm-only
--claude-only
--force
Behavior:
- Validate config first.
- Create output directories if missing.
- Do not overwrite generated files unless
--force. - Add a generated-file comment header.
Expected output:
Generated:
litellm_config.yaml
.claude/agents/planner.md
.claude/agents/implementer.md
.claude/agents/reviewer.md
.env.subagent-fleet
10.5 subagent-fleet warmup
Preloads configured Ollama models.
For each configured model:
Call:
POST {node.endpoint}/api/chat
Payload:
{
"model": "qwen2.5-coder:32b",
"messages": [],
"keep_alive": -1
}
If Ollama does not accept empty messages, use a minimal warmup prompt:
{
"model": "qwen2.5-coder:32b",
"messages": [
{
"role": "user",
"content": "Reply with ok."
}
],
"stream": false,
"keep_alive": -1
}
Command:
subagent-fleet warmup
Options:
--config fleet.yaml
--model heavy-coder
--agent implementer
Expected output:
Warming models:
heavy-coder m4-mini-64gb qwen2.5-coder:32b ok
small-coder m4-mini-16gb qwen2.5-coder:7b ok
10.6 subagent-fleet status
Shows health/status of nodes and routes.
Command:
subagent-fleet status
Behavior:
- Validate config.
- Check
/api/tagsfor every node. - Optionally call
/api/psif available to show loaded models. - Show agent routing table.
Expected output:
Fleet: local-dev
Node Status Endpoint Models
---------------------------------------------------------------------------
m5-local online http://localhost:11434 qwen-coder:14b
m4-mini-64gb online http://192.168.1.50:11434 qwen2.5-coder:32b
m4-mini-16gb online http://192.168.1.51:11434 qwen2.5-coder:7b
Agent routing:
planner -> m4-mini-16gb -> qwen2.5-coder:7b -> claude-haiku-local
implementer -> m4-mini-64gb -> qwen2.5-coder:32b -> claude-sonnet-local
reviewer -> m4-mini-64gb -> qwen2.5-coder:32b -> claude-sonnet-local
Options:
--json
--config fleet.yaml
11. Generated Files
11.1 Generated LiteLLM Config
Output file:
litellm_config.yaml
Template output:
# Generated by subagent-fleet.
# Do not edit manually unless you know what you are doing.
model_list:
{% for model_name, model in models.items() %}
- model_name: {{ model.litellm_alias }}
litellm_params:
model: ollama_chat/{{ model.ollama_model }}
api_base: {{ nodes[model.node].endpoint }}
api_key: ollama
timeout: {{ model.timeout }}
model_info:
max_input_tokens: {{ model.context }}
{% endfor %}
litellm_settings:
drop_params: true
master_key: os.environ/{{ project.gateway.master_key_env | default("LITELLM_MASTER_KEY") }}
router_settings:
routing_strategy: simple-shuffle
num_retries: 1
timeout: 600
Important:
- Use
ollama_chat/provider prefix for LiteLLM. - Use
litellm_aliasas the exposed model name. - Multiple models may share the same
litellm_aliasonly if the user intentionally wants load balancing. - Warn if multiple models share the same alias but point to different Ollama model names.
11.2 Generated Claude Code Agent Files
Output directory:
.claude/agents/
For each agent, create:
.claude/agents/{agent_name}.md
Template:
---
name: {{ agent_name }}
description: {{ agent.description }}
model: {{ model.litellm_alias }}
tools: {{ agent.tools | join(", ") }}
---
{{ agent.prompt }}
Example:
---
name: planner
description: Use for planning, file discovery, task decomposition, and summarization.
model: claude-haiku-local
tools: Read, Grep, Glob
---
You are a fast local planning agent.
Do not edit files.
Return a concise response with:
- plan
- relevant files
- risks
- next recommended agent
11.3 Generated Environment File
Output file:
.env.subagent-fleet
Template:
# Generated by subagent-fleet.
export LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY:-sk-local-dev}"
export ANTHROPIC_BASE_URL="http://localhost:{{ project.gateway.port }}"
export ANTHROPIC_AUTH_TOKEN="$LITELLM_MASTER_KEY"
{% if default_sonnet_model %}
export ANTHROPIC_DEFAULT_SONNET_MODEL="{{ default_sonnet_model }}"
{% endif %}
{% if default_haiku_model %}
export ANTHROPIC_DEFAULT_HAIKU_MODEL="{{ default_haiku_model }}"
{% endif %}
Also print usage:
source .env.subagent-fleet
claude
12. LiteLLM Launch Instructions
The generated output should include a terminal hint:
export LITELLM_MASTER_KEY="sk-local-dev"
litellm \
--config ./litellm_config.yaml \
--host 0.0.0.0 \
--port 4000
If gateway host is 127.0.0.1, use that instead.
13. Ollama Node Setup Instructions
README and/or CLI output should mention:
On each worker machine:
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"
killall Ollama
open -a Ollama
Then from controller:
curl http://NODE_IP:11434/api/tags
Security warning:
Do not expose Ollama or LiteLLM to the public internet. Use LAN, firewall, Tailscale, or WireGuard.
14. Security Requirements
The tool should assume private local networking.
Warnings should appear in README and maybe doctor command:
- Do not expose Ollama directly to the public internet.
- Do not expose LiteLLM without authentication.
- Prefer Tailscale, WireGuard, or LAN.
- Use a non-default
LITELLM_MASTER_KEYfor anything beyond local dev.
The generated LiteLLM config should use:
master_key: os.environ/LITELLM_MASTER_KEY
Never hardcode a real secret.
15. UX Principles
The CLI should feel practical and simple.
Good UX:
subagent-fleet init
subagent-fleet discover
subagent-fleet generate
subagent-fleet warmup
subagent-fleet status
Avoid making users manually understand every LiteLLM detail.
Make the output feel like:
Your fleet is ready.
Planner goes to your small model.
Implementer goes to your big model.
Reviewer goes to your big model.
Claude Code can now connect to LiteLLM.
16. Recommended Defaults
Default roles:
planner:
small model
tools: Read, Grep, Glob
summarizer:
small model
tools: Read
implementer:
heavy model
tools: Read, Grep, Glob, Edit, MultiEdit, Bash
reviewer:
heavy model
tools: Read, Grep, Glob, Bash
Default context:
small model: 8192
heavy model: 32768
Default timeouts:
small model: 300
heavy model: 600
Default max_parallel:
1
Reason: local machines often hit VRAM/memory limits with parallel context buffers.
17. Implementation Details
17.1 Pydantic Models
Create models roughly like:
class GatewayConfig(BaseModel):
provider: str = "litellm"
host: str = "127.0.0.1"
port: int = 4000
master_key_env: str = "LITELLM_MASTER_KEY"
class ProjectConfig(BaseModel):
name: str
gateway: GatewayConfig = GatewayConfig()
class NodeConfig(BaseModel):
endpoint: AnyHttpUrl
tags: list[str] = []
class ModelConfig(BaseModel):
node: str
ollama_model: str
litellm_alias: str
context: int = 8192
timeout: int = 300
max_parallel: int = 1
class AgentConfig(BaseModel):
model: str
description: str
tools: list[str] = []
prompt: str | None = None
class FleetConfig(BaseModel):
project: ProjectConfig
nodes: dict[str, NodeConfig]
models: dict[str, ModelConfig]
agents: dict[str, AgentConfig]
Add cross-field validation:
- model.node exists
- agent.model exists
- agent name valid
- optional duplicate alias warning
17.2 Discovery
Use httpx.AsyncClient or simple synchronous httpx.Client.
Function:
def get_ollama_tags(endpoint: str, timeout: float = 5.0) -> list[str]:
...
Call:
GET /api/tags
Return:
["qwen2.5-coder:7b", "llama3.2:3b"]
Handle:
- connection refused
- timeout
- invalid JSON
- missing models key
17.3 Status
Status should combine:
- config routes
- node health
- discovered models
Optional call:
GET /api/ps
If supported, show loaded/running models.
17.4 Generation
Use Jinja2 templates.
Functions:
generate_litellm_config(config: FleetConfig, output_path: Path) -> None
generate_claude_agents(config: FleetConfig, output_dir: Path) -> None
generate_env_file(config: FleetConfig, output_path: Path) -> None
Do not overwrite unless force.
Add headers:
# Generated by subagent-fleet.
# Source: fleet.yaml
For markdown:
<!-- Generated by subagent-fleet. Source: fleet.yaml -->
18. Testing Requirements
Unit Tests
Config tests:
- valid example config loads
- missing node reference fails
- missing model reference fails
- invalid URL fails
- default context applied
- default timeout applied
Generator tests:
- LiteLLM output contains
ollama_chat/model-name - LiteLLM output contains correct
api_base - Claude agent markdown has correct frontmatter
- environment file contains
ANTHROPIC_BASE_URL
Discovery tests:
- mock
/api/tags - online node returns models
- offline node returns offline status without crashing
CLI tests:
initcreatesfleet.yamlvalidatepasses on examplegeneratecreates expected files in temp dir
19. Acceptance Criteria for MVP
MVP is complete when:
- User can install the CLI locally.
- User can run:
subagent-fleet init
- User can edit
fleet.yaml. - User can run:
subagent-fleet validate
- User can run:
subagent-fleet discover
and see Ollama models from configured nodes.
- User can run:
subagent-fleet generate
and receive:
litellm_config.yaml
.claude/agents/*.md
.env.subagent-fleet
- User can start LiteLLM using the generated config.
- User can source the generated env file and run Claude Code.
- Claude Code subagent files reference the generated LiteLLM aliases.
- The README explains the local network security model.
20. Example First Implementation Plan
Build in this order:
Step 1: package skeleton
pyproject.tomlsrc/subagent_fleet/cli.py- basic Typer CLI
subagent-fleet --help
Step 2: config parser
- Pydantic models
- YAML loading
validatecommand- tests
Step 3: init command
- write starter
fleet.yaml - do not overwrite by default
Step 4: discovery
- call
/api/tags - display table
- support offline nodes gracefully
Step 5: generators
- LiteLLM config generator
- Claude agents generator
- env file generator
Step 6: warmup
- call
/api/chat - support all configured models
- show success/failure
Step 7: status
- show nodes
- show model routes
- show agent routes
Step 8: docs polish
- README examples
- security warning
- quickstart
21. Future Roadmap
After MVP:
subagent-fleet benchmark
subagent-fleet recommend
subagent-fleet dashboard
subagent-fleet trace
Possible future features:
- latency benchmarking
- automatic role recommendation
- Tailscale-aware node discovery
- dynamic fallback models
- LiteLLM health/fallback generation
- model load monitoring
- Claude Code request tracing
- subagent execution trace viewer
- OpenAI-compatible harness examples
- support for vLLM, LM Studio, llama.cpp, OpenRouter, cloud APIs
22. Viral Demo Goal
The repo should support this demo:
One Claude Code task.
Planner runs on Mac mini 16GB.
Implementer runs on Mac mini 64GB.
Reviewer runs on laptop or big node.
Terminal shows:
planner -> m4-mini-16gb -> qwen-coder:7b
implementer -> m4-mini-64gb -> qwen-coder:32b
reviewer -> m4-mini-64gb -> qwen-coder:32b
All local.
No cloud token burn.
Demo tagline:
I turned 3 Macs into a private Claude Code subagent swarm.
23. README Positioning
Use this phrasing:
Run Claude Code-style subagents across your local model fleet.
Avoid positioning as:
distributed Ollama
because that sounds like model sharding and is already a crowded space.
Better:
local subagent fleet manager
or:
config-driven subagent orchestration for Ollama + LiteLLM
24. Important Design Principle
Prefer role-based routing over blind load balancing.
Good:
planner -> small model
implementer -> big coding model
reviewer -> big coding model
summarizer -> small model
Less useful for this project:
send any request to any machine randomly
Load balancing only makes sense when the same model is loaded on multiple similar machines.
25. Final MVP Definition
The first release should be a small, reliable CLI that:
reads fleet.yaml
checks Ollama nodes
generates LiteLLM config
generates Claude Code subagent files
warms models
shows status
Do not build a complex scheduler yet.
The value is making multi-machine local subagents easy, visible, and reproducible.