From 00083627f788beaa83d9dfd4e6360c8008679c85 Mon Sep 17 00:00:00 2001
From: aszerW <aszer27937@gmail.com>
Date: Sun, 19 Apr 2026 01:06:55 +0800
Subject: [PATCH] docs: add project README
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

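Contents:
- Background: why LLM routing is needed; inspired by tx402.ai
- Core approach: 8-dimension analysis with the NVIDIA multi-head classifier + composite scoring formula
- Architecture: layered FastAPI + NVIDIA Classifier + LiteLLM stack
- Apple Silicon optimization: notes on MPS + FP16 acceleration
- Results: routing accuracy table and cost data (16 calls for $0.011)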
- Routing latency: 60-90ms steady state on M4 Pro, typically under 5% of LLM call time
- Quick start: full install/configure/launch guide
- API usage: Python OpenAI SDK example and response format
- Project structure and roadmap
---
 README.md | 281 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 281 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..55dbb3e
--- /dev/null
+++ b/README.md
@@ -0,0 +1,281 @@
+# LLM Compass
+
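+An intelligent LLM routing service that uses the NVIDIA multi-head classifier with Apple Silicon MPS acceleration to pick the best-fit model for each query, balancing quality against cost.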
+
+---
+
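+## Background
+
+When LLMs are used at scale, queries of different complexity call for models of different sizes:
+- A simple greeting only needs `qwen-flash`: low cost, low latency
+- Code generation needs `qwen-plus` to ensure quality
+- Only complex analysis tasks justify calling `qwen-max`
+
+Choosing a model by hand is inefficient, while sending everything to the strongest model wastes money. **LLM Compass** aims to automatically pick the "just right" model for every query.
+
+Inspired by the three-layer routing architecture of [tx402.ai](https://tx402.ai), this project implements a similar capability with NVIDIA's open-source multi-head classifier.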
+
+---
+
+## Core Approach
+
+### NVIDIA Multi-Head Classifier
+
+The router uses [nvidia/prompt-task-and-complexity-classifier](https://huggingface.co/nvidia/prompt-task-and-complexity-classifier) (184M parameters, DeBERTa-v3-base architecture):
+
+```
+User query → DeBERTa backbone → 8 classification heads → composite score → 3-tier routing
+                                    ↓
+                    task_type (12 classes)
+                    creativity (3 classes)
+                    reasoning (2 classes)
+                    domain_knowledge (4 classes)
+                    contextual_knowledge
+                    number_of_few_shots
+                    no_label_reason
+                    constraint_ct
+```
+
+### Complexity Scoring Formula
+
+```python
+score = (
+    0.4 * domain_knowledge  +  # High=1.0, Medium=0.6, Low=0.3, No=0.0
+    0.3 * reasoning         +  # Yes=1.0, No=0.0
+    0.2 * creativity        +  # High=1.0, Low=0.4, No=0.0
+    0.1 * task_type            # Code=0.8, QA=0.5, Chatbot=0.2, ...
+)
+
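+# 3-tier routing: simple -> qwen-flash, medium -> qwen-plus, complex -> qwen-max
+tier = "simple" if score < 0.35 else "medium" if score < 0.65 else "complex"
+models = {"simple": "qwen-flash", "medium": "qwen-plus", "complex": "qwen-max"}
+model = models[tier]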
+```
+
+---
+
+## Architecture
+
+```
+┌──────────────────────────────────────────────────────────┐
+│                       LLM Compass                        │
+├──────────────────────────────────────────────────────────┤
+│  API Layer: FastAPI (OpenAI-compatible)                  │
+│  ├─ POST /v1/chat/completions (streaming/non-streaming)  │
+│  ├─ GET  /v1/models                                      │
+│  ├─ GET  /stats                                          │
+│  └─ GET  /docs (Swagger UI)                              │
+├──────────────────────────────────────────────────────────┤
+│  Routing Layer: NVIDIA Multi-Head Classifier (184M)      │
+│  ├─ 8-dimension query analysis                           │
+│  ├─ Composite complexity score                           │
+│  └─ 3-tier smart routing                                 │
+├──────────────────────────────────────────────────────────┤
+│  LLM Backend: LiteLLM (unified multi-provider API)       │
+│  ├─ DashScope (Qwen)                                     │
+│  ├─ OpenAI (GPT)                                         │
+│  ├─ Anthropic (Claude)                                   │
+│  └─ Google (Gemini)                                      │
+└──────────────────────────────────────────────────────────┘
+```
+
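+### Apple Silicon Optimization (M4 Pro)
+
+- **MPS acceleration**: runs on the Metal Performance Shaders GPU backend
+- **FP16 inference**: half-precision floats, which avoids dtype conflicts in MPS matrix multiplication
+- **Unified memory**: the M4 Pro's 64GB of unified memory lets the model load with zero copies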
+
+---
+
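+## Results
+
+### Routing Accuracy
+
+| Example query | Tier | Score | Routed model |
+|---------|------|-------|---------|
+| "你好" ("Hello") | simple | 0.17 | qwen-flash |
+| "1+1等于几" ("What is 1+1?") | simple | 0.17 | qwen-flash |
+| "Write quicksort in Python" | medium | 0.45 | qwen-plus |
+| "分析深度学习的注意力机制原理" ("Analyze how the attention mechanism in deep learning works") | medium | 0.47 | qwen-plus |
+| "请详细分析量子计算对密码学的影响" ("Analyze in detail the impact of quantum computing on cryptography") | complex | 0.72 | qwen-max |
+
+### Cost Optimization
+
+From actual call statistics:
+- **Total cost of 16 calls**: $0.011
+- **Model distribution**: qwen-flash 87.5% (14 calls), qwen-plus 12.5% (2 calls)
+- **Task type**: mostly Open QA (13 calls)
+- **Complexity distribution**: simple 73%, medium 13%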
+
+---
+
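+## Routing Latency
+
+| Environment | First load | Steady-state latency | Notes |
+|------|---------|---------|------|
+| **M4 Pro MPS + FP16** | ~2s (model load) | **~60-90ms** | Current production environment |
+| x86 CPU | ~3s | ~100-150ms | Docker container |
+| NVIDIA official figures | - | 5-15ms | Data-center CPU |
+
+**Notes**:
+- The first load includes model download and MPS kernel compilation; later requests do not reload the model
+- Steady-state latency is about 60-90ms, of which classifier inference takes ~53ms and the rest is FastAPI overhead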
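+- Relative to the LLM call itself (typically 2-10s), routing adds roughly 0.6-4.5% overhead (60ms on a 10s call up to 90ms on a 2s call)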
+
+---
+
+## Quick Start
+
+### Prerequisites
+
+- Python 3.12+
+- macOS (Apple Silicon recommended for MPS acceleration)
+- DashScope API key (Alibaba Cloud Qwen)
+
+### Installation
+
+```bash
+# 1. Clone the project
+git clone <repo-url>
+cd llm-compass
+
+# 2. Create a virtual environment
+python3 -m venv .venv
+source .venv/bin/activate
+
+# 3. Install dependencies
+pip install -r requirements.txt
+
+# 4. Configure the API key
+cp .env.example .env
+# Edit .env and set DASHSCOPE_API_KEY
+```
+
+### Start the Service
+
+```bash
+./start.sh           # default port 8402
+./start.sh 9000      # custom port
+```
+
+### Testing
+
+```bash
+# Health check
+curl http://localhost:8402/health
+
+# API test (automatic routing)
+curl http://localhost:8402/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Hello"}]}'
+
+# Swagger UI
+# Open http://localhost:8402/docs
+```
+
+---
+
+## API Usage
+
+### OpenAI-Compatible Interface
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8402/v1",
+    api_key="not-needed"  # placeholder; the SDK requires a value even though the service does not need one
+)
+
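+# Automatic routing (recommended). The OpenAI SDK requires `model`; "auto" is an assumed placeholder
+response = client.chat.completions.create(
+    model="auto", messages=[{"role": "user", "content": "Explain quantum computing"}]
+)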
+print(response.choices[0].message.content)
+print(response.routing)  # routing details
+
+# Pin a specific model
+response = client.chat.completions.create(
+    model="qwen-plus",
+    messages=[{"role": "user", "content": "Write a sorting algorithm"}]
+)
+```
+
+### Response Format
+
+```json
+{
+  "id": "chatcmpl-xxx",
+  "model": "qwen-flash",
+  "choices": [{"message": {"content": "..."}}],
+  "usage": {"prompt_tokens": 13, "completion_tokens": 25, "total_tokens": 38},
+  "routing": {
+    "method": "nvidia_classifier",
+    "tier": "simple",
+    "complexity_score": 0.17,
+    "task_type": "Open QA",
+    "domain_knowledge": "Low",
+    "reasoning": false,
+    "creativity": "No",
+    "routing_latency_ms": 63.27
+  }
+}
+```
+
+---
+
+## Tech Stack
+
+- **Web framework**: FastAPI + Uvicorn
+- **Routing model**: NVIDIA Multi-Head Classifier (DeBERTa-v3-base)
+- **LLM calls**: LiteLLM (unified multi-provider interface)
+- **GPU acceleration**: PyTorch MPS (Metal Performance Shaders)
+- **Token counting**: tiktoken
+
+---
+
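+## Project Structure
+
+```
+llm-compass/
+├── main.py                 # FastAPI main service
+├── nvidia_router.py        # NVIDIA classifier implementation
+├── config.py               # Model configuration and routing thresholds
+├── start.sh                # Startup script
+├── .env                    # Environment variables (not committed to Git)
+├── .env.example            # Environment variable template
+├── requirements.txt        # Python dependencies
+├── Dockerfile              # Docker build file
+├── docker-compose.yml      # Docker Compose configuration
+├── data/                   # Call history logs (created automatically)
+└── docs/                   # Technical documentation
+    └── llm-router-open-source-research.md
+```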
+
+---
+
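+## Known Limitations
+
+1. **Routing scope**: only 3 Qwen models are supported today; the pool could be extended to 40+ models
+2. **Online learning**: no online learning mechanism such as a multi-armed bandit
+3. **Semantic caching**: query caching is not implemented
+4. **Docker MPS**: Docker containers on macOS cannot use the Metal GPU, so MPS acceleration requires running natively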
+
+---
+
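+## Roadmap
+
+- [ ] Layer 2: multi-armed bandit online learning (Thompson Sampling)
+- [ ] Layer 3: semantic caching + batch optimization
+- [ ] Expand the model pool to 40+ models
+- [ ] Fine-tune the NVIDIA classifier on business data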
+
+---
+
+## License
+
+Apache 2.0
+
+---
+
+**LLM Compass** - the best-fit model for every query.