
SPEC.md — subagent-fleet

Project

Name: subagent-fleet

Repository: https://github.com/adityak74/subagent-fleet

Tagline: Run Claude Code-style subagents across your local model fleet.

One-line description:
subagent-fleet is a config-first CLI that discovers local/remote Ollama nodes, maps Claude Code-style subagents to the best model/machine, and generates LiteLLM + .claude/agents configuration so developers can run a private local subagent fleet.


1. Product Vision

Modern coding agents increasingly use subagents for planning, implementation, review, testing, and summarization. Local model users often have multiple capable machines — MacBooks, Mac minis, GPU workstations, home servers — but their workflow usually still points to a single local Ollama endpoint.

subagent-fleet turns those machines into a private local subagent fleet.

Example:

planner     → small fast model on M4 Mac mini 16GB
implementer → large coding model on M4 Mac mini 64GB
reviewer    → large coding model on M4 Mac mini 64GB
summarizer  → small local model on laptop

The tool should not replace Ollama or LiteLLM. It should sit above them as a workflow/configuration layer.


2. Core Problem

Today, a developer can manually configure:

  • Ollama on multiple machines
  • LiteLLM proxy routing
  • Claude Code subagent markdown files
  • Environment variables for Claude Code
  • Model warmup calls
  • Health checks and status checks

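Doing all of this by hand is tedious and error-prone, and the pieces drift out of sync as machines and models change.

subagent-fleet should collapse these steps into a single config-driven workflow.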


3. Non-Goals

This project is not:

  • a new inference engine
  • a replacement for Ollama
  • a replacement for LiteLLM
  • a model sharding framework
  • Kubernetes for local LLMs
  • a cloud orchestration platform
  • a public model hosting tool

It should avoid overengineering.

The MVP should focus on:

  • config
  • discovery
  • generation
  • health checks
  • warmup
  • clear local developer workflow

4. Target Users

Primary users:

  • developers using Claude Code or Claude-Code-like coding harnesses
  • developers running Ollama locally
  • people with multiple Macs/workstations
  • local-first AI developers
  • open-source builders trying to reduce cloud token usage
  • privacy-conscious developers

5. MVP Scope

Implement a Python CLI named:

subagent-fleet

MVP commands:

subagent-fleet init
subagent-fleet discover
subagent-fleet validate
subagent-fleet generate
subagent-fleet warmup
subagent-fleet status

Optional but useful:

subagent-fleet doctor
subagent-fleet clean

Do not implement a daemon, dashboard, dynamic scheduler, or full proxy in the MVP.


6. Tech Stack

Use Python.

Suggested dependencies:

typer          CLI framework
pydantic       config validation
pyyaml         YAML parsing/writing
httpx          HTTP calls to Ollama
jinja2         templates for generated files
rich           pretty terminal output

Suggested packaging:

pyproject.toml
src/subagent_fleet/

7. Repository Structure

Create this structure:

subagent-fleet/
  README.md
  SPEC.md
  LICENSE
  pyproject.toml
  .gitignore

  src/
    subagent_fleet/
      __init__.py
      cli.py
      config.py
      discovery.py
      health.py
      warmup.py
      status.py
      generators/
        __init__.py
        litellm.py
        claude_agents.py
        env_file.py
      templates/
        litellm_config.yaml.j2
        claude_agent.md.j2
        env.subagent-fleet.j2

  examples/
    fleet.yaml
    litellm_config.generated.yaml
    claude-agents/
      planner.md
      implementer.md
      reviewer.md

  tests/
    test_config.py
    test_discovery.py
    test_generate_litellm.py
    test_generate_claude_agents.py

8. Configuration File

Primary config file:

fleet.yaml

The config should define:

  • project metadata
  • gateway settings
  • Ollama nodes
  • model aliases
  • agent mappings

Example:

project:
  name: local-dev
  gateway:
    provider: litellm
    host: 0.0.0.0
    port: 4000
    master_key_env: LITELLM_MASTER_KEY

nodes:
  m5-local:
    endpoint: http://localhost:11434
    tags:
      - controller
      - local
      - fast

  m4-mini-64gb:
    endpoint: http://192.168.1.50:11434
    tags:
      - heavy
      - coder
      - reviewer

  m4-mini-16gb:
    endpoint: http://192.168.1.51:11434
    tags:
      - small
      - planner
      - summarizer

models:
  heavy-coder:
    node: m4-mini-64gb
    ollama_model: qwen2.5-coder:32b
    litellm_alias: claude-sonnet-local
    context: 32768
    timeout: 600
    max_parallel: 1

  small-coder:
    node: m4-mini-16gb
    ollama_model: qwen2.5-coder:7b
    litellm_alias: claude-haiku-local
    context: 8192
    timeout: 300
    max_parallel: 1

agents:
  planner:
    model: small-coder
    description: Use for planning, file discovery, task decomposition, and summarization.
    tools:
      - Read
      - Grep
      - Glob
    prompt: |
      You are a fast local planning agent.
      Do not edit files.
      Return a concise response with:
      - plan
      - relevant files
      - risks
      - next recommended agent

  implementer:
    model: heavy-coder
    description: Use for implementation, bug fixes, refactors, and patch creation.
    tools:
      - Read
      - Grep
      - Glob
      - Edit
      - MultiEdit
      - Bash
    prompt: |
      You are a senior implementation agent.
      Make minimal, correct changes.
      Prefer small patches.
      Run relevant checks when possible.
      Explain what changed and why.

  reviewer:
    model: heavy-coder
    description: Use after implementation to review diffs, tests, regressions, and maintainability.
    tools:
      - Read
      - Grep
      - Glob
      - Bash
    prompt: |
      You are a strict code reviewer.
      Focus on correctness, regressions, missing tests, security issues,
      over-engineering, and maintainability.
      Review the diff and test output.
      Return only actionable issues.

9. Config Validation Rules

Validate fleet.yaml with Pydantic.

Rules:

Project

  • project.name required.
  • project.gateway.provider defaults to litellm.
  • project.gateway.port defaults to 4000.
  • project.gateway.host defaults to 127.0.0.1 unless explicitly set.

Nodes

Each node must have:

endpoint: http://host:port

Validation:

  • endpoint must be valid HTTP/HTTPS URL.
  • tags optional; default empty list.
  • duplicate node names disallowed.

Models

Each model must have:

  • node
  • ollama_model
  • litellm_alias

Validation:

  • node must reference existing nodes key.
  • context default: 8192.
  • timeout default: 300.
  • max_parallel default: 1.

Agents

Each agent must have:

  • model
  • description

Validation:

  • model must reference existing models key.
  • tools defaults to [].
  • prompt defaults to a generic role prompt if omitted.
  • agent names should be filesystem-safe: lowercase letters, numbers, hyphens, underscores.
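
The filesystem-safety rule for agent names could be checked with a small regular expression. This is a minimal sketch: the helper name is_valid_agent_name and the exact pattern are assumptions for illustration, not part of the spec.

```python
import re

# Assumed pattern: one or more lowercase letters, digits, hyphens, or underscores.
AGENT_NAME_RE = re.compile(r"[a-z0-9_-]+")


def is_valid_agent_name(name: str) -> bool:
    """Return True if the name is safe to use as .claude/agents/{name}.md."""
    return AGENT_NAME_RE.fullmatch(name) is not None
```

Using fullmatch rather than match rejects names that merely start with valid characters, such as "planner/../x".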

10. CLI Command Details

10.1 subagent-fleet init

Creates a starter fleet.yaml.

Behavior:

  • If fleet.yaml exists, do not overwrite unless --force.
  • Generate a useful local example with:
      • local Ollama node
      • one heavy-coder model placeholder
      • planner, implementer, reviewer agents

Command:

subagent-fleet init

Options:

--force
--output fleet.yaml

Expected output:

Created fleet.yaml
Edit it with your Ollama node endpoints, then run:

  subagent-fleet discover
  subagent-fleet generate

10.2 subagent-fleet discover

Discovers models available on configured Ollama nodes.

For each node:

Call:

GET {node.endpoint}/api/tags

Expected Ollama response contains a models list.

Behavior:

  • Load fleet.yaml.
  • Check every node.
  • Display online/offline status.
  • Display discovered models.
  • Optionally write discovery metadata to .subagent-fleet/discovery.json.

Command:

subagent-fleet discover

Options:

--config fleet.yaml
--json
--write

Expected terminal output:

Fleet: local-dev

Node              Status   Models
-----------------------------------------------
m5-local          online   qwen2.5-coder:14b, llama3.2:3b
m4-mini-64gb      online   qwen2.5-coder:32b
m4-mini-16gb      online   qwen2.5-coder:7b

Error handling:

  • If a node fails, show it as offline.
  • Do not crash the whole command unless config is invalid.
  • Include connection error message in verbose mode.
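
The error-handling rules above can be sketched as a loop that records a per-node failure instead of letting it propagate. The function name discover_nodes, the result dictionary shape, and the injected fetch_models callable are all assumptions made for this example; the callable stands in for whatever function performs the /api/tags request.

```python
from typing import Callable


def discover_nodes(
    endpoints: dict[str, str],
    fetch_models: Callable[[str], list[str]],
) -> dict[str, dict]:
    """Query every node; record failures as offline instead of raising."""
    results: dict[str, dict] = {}
    for name, endpoint in endpoints.items():
        try:
            models = fetch_models(endpoint)
            results[name] = {"status": "online", "models": models, "error": None}
        except Exception as exc:  # any per-node failure marks only that node offline
            results[name] = {"status": "offline", "models": [], "error": str(exc)}
    return results
```

Keeping the error string in the result lets the CLI show it only when a verbose flag is set, while the default table prints just the status column.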

10.3 subagent-fleet validate

Validates fleet.yaml.

Command:

subagent-fleet validate

Options:

--config fleet.yaml

Checks:

  • config schema valid
  • node references valid
  • model references valid
  • agent references valid
  • endpoint format valid
  • no duplicate litellm_alias values that create unintended collisions

Expected output:

fleet.yaml is valid.

If invalid:

Invalid fleet.yaml:

models.heavy-coder.node references unknown node: m4-mini-64

10.4 subagent-fleet generate

Generates:

litellm_config.yaml
.claude/agents/*.md
.env.subagent-fleet

Command:

subagent-fleet generate

Options:

--config fleet.yaml
--out .
--litellm-only
--claude-only
--force

Behavior:

  • Validate config first.
  • Create output directories if missing.
  • Do not overwrite generated files unless --force.
  • Add a generated-file comment header.
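
The overwrite and directory-creation rules above could be implemented by one small write helper shared by all generators. This is a sketch under assumptions: the name write_generated_file and its boolean return value are hypothetical and not defined by the spec.

```python
from pathlib import Path


def write_generated_file(path: Path, content: str, force: bool = False) -> bool:
    """Write content to path. Return False and leave the file untouched
    if it already exists and force is not set."""
    if path.exists() and not force:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)  # create output directories if missing
    path.write_text(content, encoding="utf-8")
    return True
```

Returning a boolean rather than raising lets the CLI list skipped files and suggest --force in a single message.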

Expected output:

Generated:
  litellm_config.yaml
  .claude/agents/planner.md
  .claude/agents/implementer.md
  .claude/agents/reviewer.md
  .env.subagent-fleet

10.5 subagent-fleet warmup

Preloads configured Ollama models.

For each configured model:

Call:

POST {node.endpoint}/api/chat

Payload:

{
  "model": "qwen2.5-coder:32b",
  "messages": [],
  "keep_alive": -1
}

If Ollama does not accept empty messages, use a minimal warmup prompt:

{
  "model": "qwen2.5-coder:32b",
  "messages": [
    {
      "role": "user",
      "content": "Reply with ok."
    }
  ],
  "stream": false,
  "keep_alive": -1
}
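
Both payload variants above can come from one helper. The following is a minimal sketch; the function name build_warmup_payload and the with_prompt flag are assumptions introduced for this example.

```python
import json


def build_warmup_payload(ollama_model: str, with_prompt: bool = False) -> dict:
    """Build the /api/chat body used to preload a model and keep it resident."""
    payload: dict = {"model": ollama_model, "messages": [], "keep_alive": -1}
    if with_prompt:
        # Fallback variant: a tiny non-streaming prompt instead of empty messages.
        payload["messages"] = [{"role": "user", "content": "Reply with ok."}]
        payload["stream"] = False
    return payload


print(json.dumps(build_warmup_payload("qwen2.5-coder:32b", with_prompt=True), indent=2))
```

A caller could try the empty-messages form first and retry with with_prompt=True only if the node rejects it.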

Command:

subagent-fleet warmup

Options:

--config fleet.yaml
--model heavy-coder
--agent implementer

Expected output:

Warming models:

heavy-coder  m4-mini-64gb  qwen2.5-coder:32b  ok
small-coder  m4-mini-16gb  qwen2.5-coder:7b   ok

10.6 subagent-fleet status

Shows health/status of nodes and routes.

Command:

subagent-fleet status

Behavior:

  • Validate config.
  • Check /api/tags for every node.
  • Optionally call /api/ps if available to show loaded models.
  • Show agent routing table.

Expected output:

Fleet: local-dev

Node              Status   Endpoint                    Models
---------------------------------------------------------------------------
m5-local          online   http://localhost:11434       qwen2.5-coder:14b
m4-mini-64gb      online   http://192.168.1.50:11434    qwen2.5-coder:32b
m4-mini-16gb      online   http://192.168.1.51:11434    qwen2.5-coder:7b

Agent routing:

planner      -> m4-mini-16gb  -> qwen2.5-coder:7b   -> claude-haiku-local
implementer  -> m4-mini-64gb  -> qwen2.5-coder:32b  -> claude-sonnet-local
reviewer     -> m4-mini-64gb  -> qwen2.5-coder:32b  -> claude-sonnet-local

Options:

--json
--config fleet.yaml
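
The agent routing table shown above is a two-step lookup: agent to model, then model to node. A minimal sketch over plain dictionaries, shaped like the parsed fleet.yaml, might look like this; resolve_routes is a hypothetical name.

```python
def resolve_routes(config: dict) -> list[tuple[str, str, str, str]]:
    """Return (agent, node, ollama_model, litellm_alias) rows for the routing table."""
    rows = []
    for agent_name, agent in config["agents"].items():
        model = config["models"][agent["model"]]
        rows.append((agent_name, model["node"], model["ollama_model"], model["litellm_alias"]))
    return rows


example = {
    "models": {
        "small-coder": {
            "node": "m4-mini-16gb",
            "ollama_model": "qwen2.5-coder:7b",
            "litellm_alias": "claude-haiku-local",
        },
    },
    "agents": {"planner": {"model": "small-coder"}},
}

for row in resolve_routes(example):
    print(" -> ".join(row))
```

The sketch assumes the config has already passed validation, so every agent.model key is present in models.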

11. Generated Files

11.1 Generated LiteLLM Config

Output file:

litellm_config.yaml

Template output:

# Generated by subagent-fleet.
# Do not edit manually unless you know what you are doing.

model_list:
{% for model_name, model in models.items() %}
  - model_name: {{ model.litellm_alias }}
    litellm_params:
      model: ollama_chat/{{ model.ollama_model }}
      api_base: {{ nodes[model.node].endpoint }}
      api_key: ollama
      timeout: {{ model.timeout }}
    model_info:
      max_input_tokens: {{ model.context }}
{% endfor %}

litellm_settings:
  drop_params: true

general_settings:
  master_key: os.environ/{{ project.gateway.master_key_env | default("LITELLM_MASTER_KEY") }}

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 1
  timeout: 600

Important:

  • Use ollama_chat/ provider prefix for LiteLLM.
  • Use litellm_alias as the exposed model name.
  • Multiple models may share the same litellm_alias only if the user intentionally wants load balancing.
  • Warn if multiple models share the same alias but point to different Ollama model names.
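
The alias warning described above could be computed by grouping models by litellm_alias and flagging any alias that maps to more than one Ollama model name. This is a sketch only; find_alias_conflicts is a hypothetical helper operating on plain dictionaries.

```python
from collections import defaultdict


def find_alias_conflicts(models: dict[str, dict]) -> dict[str, set[str]]:
    """Return aliases shared by models whose ollama_model names differ,
    mapped to the set of conflicting names."""
    by_alias: dict[str, set[str]] = defaultdict(set)
    for model in models.values():
        by_alias[model["litellm_alias"]].add(model["ollama_model"])
    return {alias: names for alias, names in by_alias.items() if len(names) > 1}
```

Two entries that share an alias and the same ollama_model on different nodes are not reported, since that is the intentional load-balancing case.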

11.2 Generated Claude Code Agent Files

Output directory:

.claude/agents/

For each agent, create:

.claude/agents/{agent_name}.md

Template:

---
name: {{ agent_name }}
description: {{ agent.description }}
model: {{ model.litellm_alias }}
tools: {{ agent.tools | join(", ") }}
---

{{ agent.prompt }}

Example:

---
name: planner
description: Use for planning, file discovery, task decomposition, and summarization.
model: claude-haiku-local
tools: Read, Grep, Glob
---

You are a fast local planning agent.
Do not edit files.
Return a concise response with:
- plan
- relevant files
- risks
- next recommended agent
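
To make the mapping from config to file concrete, here is a sketch that builds the same output with plain string formatting instead of the Jinja2 template the spec calls for. The function name render_agent_markdown is an assumption, and the sketch does not escape YAML-special characters in the description.

```python
def render_agent_markdown(agent_name: str, agent: dict, litellm_alias: str) -> str:
    """Render one .claude/agents/{agent_name}.md file from config values."""
    frontmatter = [
        "---",
        f"name: {agent_name}",
        f"description: {agent['description']}",
        f"model: {litellm_alias}",
        f"tools: {', '.join(agent.get('tools', []))}",
        "---",
    ]
    # Blank line between frontmatter and prompt; exactly one trailing newline.
    return "\n".join(frontmatter) + "\n\n" + agent["prompt"].rstrip() + "\n"
```

Note that the model field carries the LiteLLM alias, not the Ollama model name, so the agent file stays valid if the underlying model is swapped in fleet.yaml.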

11.3 Generated Environment File

Output file:

.env.subagent-fleet

Template:

# Generated by subagent-fleet.

export LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY:-sk-local-dev}"

export ANTHROPIC_BASE_URL="http://localhost:{{ project.gateway.port }}"
export ANTHROPIC_AUTH_TOKEN="$LITELLM_MASTER_KEY"

{% if default_sonnet_model %}
export ANTHROPIC_DEFAULT_SONNET_MODEL="{{ default_sonnet_model }}"
{% endif %}

{% if default_haiku_model %}
export ANTHROPIC_DEFAULT_HAIKU_MODEL="{{ default_haiku_model }}"
{% endif %}

Also print usage:

source .env.subagent-fleet
claude
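
The environment template can be sketched in plain Python as follows. The spec does not say how default_sonnet_model and default_haiku_model are chosen, so this example simply takes them as optional arguments; render_env_file and its parameter names are assumptions.

```python
from typing import Optional


def render_env_file(port: int, sonnet: Optional[str] = None, haiku: Optional[str] = None) -> str:
    """Render .env.subagent-fleet; the optional exports are emitted only when set."""
    lines = [
        "# Generated by subagent-fleet.",
        "",
        'export LITELLM_MASTER_KEY="${LITELLM_MASTER_KEY:-sk-local-dev}"',
        "",
        f'export ANTHROPIC_BASE_URL="http://localhost:{port}"',
        'export ANTHROPIC_AUTH_TOKEN="$LITELLM_MASTER_KEY"',
    ]
    if sonnet:
        lines += ["", f'export ANTHROPIC_DEFAULT_SONNET_MODEL="{sonnet}"']
    if haiku:
        lines += ["", f'export ANTHROPIC_DEFAULT_HAIKU_MODEL="{haiku}"']
    return "\n".join(lines) + "\n"
```

The file contains only a fallback development key and a reference to the environment variable, never a real secret.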

12. LiteLLM Launch Instructions

The generated output should include a terminal hint:

export LITELLM_MASTER_KEY="sk-local-dev"

litellm \
  --config ./litellm_config.yaml \
  --host 0.0.0.0 \
  --port 4000

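Substitute the configured project.gateway.host and project.gateway.port. With the default host 127.0.0.1 the proxy is reachable only from the local machine; 0.0.0.0 exposes it on every network interface.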
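
The terminal hint could be built from the gateway settings rather than hardcoded. A minimal sketch, assuming a hypothetical helper named litellm_launch_command:

```python
import shlex


def litellm_launch_command(
    host: str = "127.0.0.1",
    port: int = 4000,
    config_path: str = "./litellm_config.yaml",
) -> str:
    """Return the shell command that starts LiteLLM with the generated config."""
    args = ["litellm", "--config", config_path, "--host", host, "--port", str(port)]
    # Quote each argument so paths containing spaces survive copy and paste.
    return " ".join(shlex.quote(arg) for arg in args)


print(litellm_launch_command(host="0.0.0.0"))
```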


13. Ollama Node Setup Instructions

README and/or CLI output should mention:

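On each macOS worker machine (launchctl is macOS-specific; other platforms set the same environment variables through their own service manager):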

launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
launchctl setenv OLLAMA_NUM_PARALLEL "1"
launchctl setenv OLLAMA_MAX_LOADED_MODELS "1"

killall Ollama
open -a Ollama

Then from controller:

curl http://NODE_IP:11434/api/tags

Security warning:

Do not expose Ollama or LiteLLM to the public internet. Use LAN, firewall, Tailscale, or WireGuard.


14. Security Requirements

The tool should assume private local networking.

The following warnings should appear in the README and, optionally, in the output of the doctor command:

  • Do not expose Ollama directly to the public internet.
  • Do not expose LiteLLM without authentication.
  • Prefer Tailscale, WireGuard, or LAN.
  • Use a non-default LITELLM_MASTER_KEY for anything beyond local dev.

The generated LiteLLM config should set, under general_settings:

master_key: os.environ/LITELLM_MASTER_KEY

Never hardcode a real secret.


15. UX Principles

The CLI should feel practical and simple.

Good UX:

subagent-fleet init
subagent-fleet discover
subagent-fleet generate
subagent-fleet warmup
subagent-fleet status

Avoid making users manually understand every LiteLLM detail.

Make the output feel like:

Your fleet is ready.
Planner goes to your small model.
Implementer goes to your big model.
Reviewer goes to your big model.
Claude Code can now connect to LiteLLM.

16. Defaults

Default roles:

planner:
  small model
  tools: Read, Grep, Glob

summarizer:
  small model
  tools: Read

implementer:
  heavy model
  tools: Read, Grep, Glob, Edit, MultiEdit, Bash

reviewer:
  heavy model
  tools: Read, Grep, Glob, Bash

Default context:

small model: 8192
heavy model: 32768

Default timeouts:

small model: 300
heavy model: 600

Default max_parallel:

1

Reason: local machines often hit VRAM/memory limits with parallel context buffers.


17. Implementation Details

17.1 Pydantic Models

Create models roughly like:

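from pydantic import AnyHttpUrl, BaseModel, Field


class GatewayConfig(BaseModel):
    provider: str = "litellm"
    host: str = "127.0.0.1"  # loopback unless the user sets another host explicitly
    port: int = 4000
    master_key_env: str = "LITELLM_MASTER_KEY"  # name of the env var, never the secret itself

class ProjectConfig(BaseModel):
    name: str
    gateway: GatewayConfig = Field(default_factory=GatewayConfig)

class NodeConfig(BaseModel):
    endpoint: AnyHttpUrl
    tags: list[str] = Field(default_factory=list)

class ModelConfig(BaseModel):
    node: str  # key into FleetConfig.nodes
    ollama_model: str
    litellm_alias: str
    context: int = 8192
    timeout: int = 300  # seconds
    max_parallel: int = 1

class AgentConfig(BaseModel):
    model: str  # key into FleetConfig.models
    description: str
    tools: list[str] = Field(default_factory=list)
    prompt: str | None = None  # None means: fall back to a generic role prompt

class FleetConfig(BaseModel):
    project: ProjectConfig
    nodes: dict[str, NodeConfig]
    models: dict[str, ModelConfig]
    agents: dict[str, AgentConfig]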

Add cross-field validation:

  • model.node exists
  • agent.model exists
  • agent name valid
  • optional duplicate alias warning

17.2 Discovery

Use httpx.AsyncClient or simple synchronous httpx.Client.

Function:

def get_ollama_tags(endpoint: str, timeout: float = 5.0) -> list[str]:
    ...

Call:

GET /api/tags

Return:

["qwen2.5-coder:7b", "llama3.2:3b"]

Handle:

  • connection refused
  • timeout
  • invalid JSON
  • missing models key
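
One possible shape for get_ollama_tags, with the failure cases above mapped to exceptions. To stay dependency-free, this sketch uses urllib from the standard library in place of the httpx client the spec suggests; parse_tags_response is a hypothetical helper, and the sketch assumes each entry in the models list has a name field.

```python
import json
import urllib.request


def parse_tags_response(body: bytes) -> list[str]:
    """Extract model names from an /api/tags response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON from Ollama: {exc}") from exc
    if not isinstance(data, dict) or "models" not in data:
        raise ValueError("Ollama response has no 'models' key")
    return [entry["name"] for entry in data["models"]]


def get_ollama_tags(endpoint: str, timeout: float = 5.0) -> list[str]:
    """Fetch model names from a node. Connection failures and timeouts
    surface as OSError subclasses (for example urllib.error.URLError)."""
    url = endpoint.rstrip("/") + "/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_tags_response(resp.read())
```

Separating parsing from the network call keeps the invalid-JSON and missing-key cases testable without a running Ollama node.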

17.3 Status

Status should combine:

  • config routes
  • node health
  • discovered models

Optional call:

GET /api/ps

If supported, show loaded/running models.


17.4 Generation

Use Jinja2 templates.

Functions:

generate_litellm_config(config: FleetConfig, output_path: Path) -> None
generate_claude_agents(config: FleetConfig, output_dir: Path) -> None
generate_env_file(config: FleetConfig, output_path: Path) -> None

Do not overwrite unless force.

Add headers:

# Generated by subagent-fleet.
# Source: fleet.yaml

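For markdown, place the comment after the closing --- of the frontmatter so that the frontmatter stays at the top of the file:

<!-- Generated by subagent-fleet. Source: fleet.yaml -->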

18. Testing Requirements

Unit Tests

Config tests:

  • valid example config loads
  • missing node reference fails
  • missing model reference fails
  • invalid URL fails
  • default context applied
  • default timeout applied

Generator tests:

  • LiteLLM output contains ollama_chat/model-name
  • LiteLLM output contains correct api_base
  • Claude agent markdown has correct frontmatter
  • environment file contains ANTHROPIC_BASE_URL

Discovery tests:

  • mock /api/tags
  • online node returns models
  • offline node returns offline status without crashing

CLI tests:

  • init creates fleet.yaml
  • validate passes on example
  • generate creates expected files in temp dir

19. Acceptance Criteria for MVP

MVP is complete when:

  1. User can install the CLI locally.
  2. User can run subagent-fleet init.
  3. User can edit fleet.yaml.
  4. User can run subagent-fleet validate.
  5. User can run subagent-fleet discover and see Ollama models from configured nodes.
  6. User can run subagent-fleet generate and receive:

     litellm_config.yaml
     .claude/agents/*.md
     .env.subagent-fleet

  7. User can start LiteLLM using the generated config.
  8. User can source the generated env file and run Claude Code.
  9. Claude Code subagent files reference the generated LiteLLM aliases.
  10. The README explains the local network security model.

20. Example First Implementation Plan

Build in this order:

Step 1: package skeleton

  • pyproject.toml
  • src/subagent_fleet/cli.py
  • basic Typer CLI
  • subagent-fleet --help

Step 2: config parser

  • Pydantic models
  • YAML loading
  • validate command
  • tests

Step 3: init command

  • write starter fleet.yaml
  • do not overwrite by default

Step 4: discovery

  • call /api/tags
  • display table
  • support offline nodes gracefully

Step 5: generators

  • LiteLLM config generator
  • Claude agents generator
  • env file generator

Step 6: warmup

  • call /api/chat
  • support all configured models
  • show success/failure

Step 7: status

  • show nodes
  • show model routes
  • show agent routes

Step 8: docs polish

  • README examples
  • security warning
  • quickstart

21. Future Roadmap

After MVP:

subagent-fleet benchmark
subagent-fleet recommend
subagent-fleet dashboard
subagent-fleet trace

Possible future features:

  • latency benchmarking
  • automatic role recommendation
  • Tailscale-aware node discovery
  • dynamic fallback models
  • LiteLLM health/fallback generation
  • model load monitoring
  • Claude Code request tracing
  • subagent execution trace viewer
  • OpenAI-compatible harness examples
  • support for vLLM, LM Studio, llama.cpp, OpenRouter, cloud APIs

22. Viral Demo Goal

The repo should support this demo:

One Claude Code task.

Planner runs on Mac mini 16GB.
Implementer runs on Mac mini 64GB.
Reviewer runs on laptop or big node.

Terminal shows:
  planner     -> m4-mini-16gb  -> qwen2.5-coder:7b
  implementer -> m4-mini-64gb  -> qwen2.5-coder:32b
  reviewer    -> m4-mini-64gb  -> qwen2.5-coder:32b

All local.
No cloud token burn.

Demo tagline:

I turned 3 Macs into a private Claude Code subagent swarm.

23. README Positioning

Use this phrasing:

Run Claude Code-style subagents across your local model fleet.

Avoid positioning as:

distributed Ollama

because that sounds like model sharding and is already a crowded space.

Better:

local subagent fleet manager

or:

config-driven subagent orchestration for Ollama + LiteLLM

24. Important Design Principle

Prefer role-based routing over blind load balancing.

Good:

planner     -> small model
implementer -> big coding model
reviewer    -> big coding model
summarizer  -> small model

Less useful for this project:

send any request to any machine randomly

Load balancing only makes sense when the same model is loaded on multiple similar machines.


25. Final MVP Definition

The first release should be a small, reliable CLI that:

reads fleet.yaml
checks Ollama nodes
generates LiteLLM config
generates Claude Code subagent files
warms models
shows status

Do not build a complex scheduler yet.

The value is making multi-machine local subagents easy, visible, and reproducible.