LiteLLM Local LLM Setup

Status: 🟑 In progress β€” connected but cold-start timeouts persist
Started: 2026-05-11
Goal: Run Claude Code against local qwen3_8b / DeepSeek via LiteLLM proxy


Progress log

2026-05-11

  • Got /v1/chat/completions working βœ…
  • Got /v1/messages (Anthropic pass-through) working βœ…
  • Claude Code connecting but timing out on first message
  • Discovered root cause: 5000ms+ cold-start on local inference server

2026-05-12 (morning)

  • Added all Claude model name aliases to config.yaml βœ…
  • Confirmed streaming works βœ…
  • Cold-start still causing Claude Code retries (attempt 4/10)
  • DISABLE_INTERLEAVED_THINKING=1 set β€” reduced noise but timeouts persist

2026-05-12 (evening)

  • Tested with deepseek-r1-distill-qwen-32B model specifically
  • settings.json config used:
    "ANTHROPIC_AUTH_TOKEN": "sk-i5Qh...",
    "ANTHROPIC_BASE_URL": "http://172.18.0.1:4001",
    "API_TIMEOUT_MS": "3000000",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-r1-distill-qwen-32B",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-r1-distill-qwen-32B",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-r1-distill-qwen-32B",
    "CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-r1-distill-qwen-32B",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "TRUE"
  • curl to LiteLLM works fine; Claude Code still fails with β€œRetrying in 0s Β· attempt 1/10”
  • Even with API_TIMEOUT_MS=3000000, Claude Code shows timeout immediately
  • Suspicion: Claude Code now requires /v1/responses endpoint (Responses API), not just /v1/messages
  • LiteLLM v1.66.3+ added Responses API support β€” need to verify version in use

Current config snapshot

# config.yaml (working)
model_list:
  - model_name: claude-sonnet-4-20250514
    litellm_params:
      model: openai/qwen3_8b
      api_base: http://192.168.35.9:8000/v1
      api_key: "dummy"
      supports_system_message: false
      timeout: 300
 
general_settings:
  enable_anthropic_pass_through: true
# .env (Claude Code side)
ANTHROPIC_BASE_URL=http://172.18.0.1:4000
ANTHROPIC_AUTH_TOKEN=sk-...
DISABLE_PROMPT_CACHING=1
DISABLE_INTERLEAVED_THINKING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

Open tasks β€” for this week

  • Check LiteLLM version: litellm --version β€” need β‰₯1.66.3 for Responses API support
  • Test /v1/responses endpoint directly with curl to confirm LiteLLM supports it
  • If not supported: pin LiteLLM to a version that has /v1/responses or find workaround
  • Implement pre-warm script (ping every 20s before claude)
  • Investigate if inference server has a /preload or keep-alive endpoint
  • Check docker logs -f litellm while launching claude to trace exact failure point

Reference

β†’ reference/litellm-claude-code-setup