LiteLLM Local LLM Setup
Status: π‘ In progress β connected but cold-start timeouts persist
Started: 2026-05-11
Goal: Run Claude Code against local qwen3_8b / DeepSeek via LiteLLM proxy
Progress log
2026-05-11
- Got
/v1/chat/completionsworking β - Got
/v1/messages(Anthropic pass-through) working β - Claude Code connecting but timing out on first message
- Discovered root cause: 5000ms+ cold-start on local inference server
2026-05-12 (morning)
- Added all Claude model name aliases to
config.yamlβ - Confirmed streaming works β
- Cold-start still causing Claude Code retries (
attempt 4/10) DISABLE_INTERLEAVED_THINKING=1set β reduced noise but timeouts persist
2026-05-12 (evening)
- Tested with
deepseek-r1-distill-qwen-32Bmodel specifically settings.jsonconfig used:"ANTHROPIC_AUTH_TOKEN": "sk-i5Qh...", "ANTHROPIC_BASE_URL": "http://172.18.0.1:4001", "API_TIMEOUT_MS": "3000000", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-r1-distill-qwen-32B", "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-r1-distill-qwen-32B", "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-r1-distill-qwen-32B", "CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-r1-distill-qwen-32B", "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "TRUE"- curl to LiteLLM works fine; Claude Code still fails with βRetrying in 0s Β· attempt 1/10β
- Even with
API_TIMEOUT_MS=3000000, Claude Code shows timeout immediately - Suspicion: Claude Code now requires
/v1/responsesendpoint (Responses API), not just/v1/messages - LiteLLM v1.66.3+ added Responses API support β need to verify version in use
Current config snapshot
# config.yaml (working)
model_list:
- model_name: claude-sonnet-4-20250514
litellm_params:
model: openai/qwen3_8b
api_base: http://192.168.35.9:8000/v1
api_key: "dummy"
supports_system_message: false
timeout: 300
general_settings:
enable_anthropic_pass_through: true# .env (Claude Code side)
ANTHROPIC_BASE_URL=http://172.18.0.1:4000
ANTHROPIC_AUTH_TOKEN=sk-...
DISABLE_PROMPT_CACHING=1
DISABLE_INTERLEAVED_THINKING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1Open tasks β for this week
- Check LiteLLM version:
litellm --versionβ need β₯1.66.3 for Responses API support - Test
/v1/responsesendpoint directly with curl to confirm LiteLLM supports it - If not supported: pin LiteLLM to a version that has
/v1/responsesor find workaround - Implement pre-warm script (ping every 20s before
claude) - Investigate if inference server has a
/preloador keep-alive endpoint - Check
docker logs -f litellmwhile launchingclaudeto trace exact failure point