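feat(repo): organize Qoder Skills and MCP configuration into the repository

- Add 5 user-level Skills:
  - auto-commit: automatic Git commits
  - karpathy-guidelines: coding guidelines
  - opencli-websearch: multi-source web search
  - pdf-reader: PDF content extraction
  - repo-analyzer: in-depth project analysis
- Add Playwright MCP configuration (21 browser automation tools)
- Create complete README.md documentation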
286
skills/opencli-websearch/SKILL.md
Normal file
@@ -0,0 +1,286 @@
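---
name: opencli-websearch
description: Runs Qoder WebSearch and OpenCLI together for multi-source web search and merges the results for comprehensive information retrieval. Supports Google, Zhihu, ArXiv, Xiaohongshu, StackOverflow, HackerNews, and other data sources. Used automatically when the user needs to search for information, research a topic, or gather data from multiple sources.
---

# OpenCLI Multi-Source Web Search

## Overview

This Skill provides **parallel search**: it queries Qoder's built-in WebSearch and OpenCLI's multi-source adapters at the same time, then merges the results for broader coverage.
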
## Search Strategy

### Parallel Search Architecture

```
User query
    ├──→ Qoder WebSearch (general search)
    └──→ OpenCLI multi-source search
            ├──→ Academic: arxiv
            ├──→ Technical: stackoverflow, hackernews
            ├──→ Social: xiaohongshu, zhihu (browser required)
            ├──→ News: 36kr, bbc, reuters
            └──→ General: google (browser required)
```

### Data Source Categories

| Category | Data sources | Mode | Use cases |
|----------|--------------|------|-----------|
| **Academic** | arxiv | Public | Papers, research |
| **Technical** | stackoverflow, hackernews | Public | Programming, technical discussion |
| **Chinese social** | zhihu, xiaohongshu | Browser | Chinese community content |
| **News** | 36kr, bbc, reuters | Public | Current events |
| **General** | google | Browser | Broad search |

## Usage

### Basic Search

When the user needs to search for information, run the following flow automatically:

1. **Start the parallel search**
   - Call Qoder WebSearch
   - Call OpenCLI multi-source search at the same time

2. **Run the OpenCLI searches**

   ```bash
   # Academic search
   opencli arxiv search "{query}" --limit 5

   # Technical search
   opencli stackoverflow search "{query}" --limit 5
   opencli hackernews top   # or search for related items

   # News search (36kr supports Chinese)
   opencli 36kr search "{query}" --limit 5

   # Other public sources
   opencli gitee search "{query}" --limit 5
   ```

3. **Merge and deduplicate the results**
   - Merge the results from all sources
   - Rank by relevance and source diversity
   - Label each result with its source

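
One way to get source diversity in the merged list is to interleave the per-source results round-robin while tagging each item with its source. The sketch below is an illustration, not part of the Skill's scripts: the function name `merge_by_source` is hypothetical, and it assumes each source's output has already been parsed into dicts with `title` and `url` keys (the scripts in this commit store raw OpenCLI text instead).

```python
from itertools import zip_longest


def merge_by_source(results_by_source):
    """Interleave per-source result lists round-robin and tag each item with its source.

    Assumes results_by_source maps a source name to a list of dicts that
    each have at least a "url" key (hypothetical parsed format).
    """
    merged = []
    seen_urls = set()
    sources = sorted(results_by_source)
    columns = [results_by_source[name] for name in sources]
    # Each "row" holds the n-th result of every source (None once a source runs out)
    for row in zip_longest(*columns):
        for source, item in zip(sources, row):
            if item is None or item["url"] in seen_urls:
                continue
            seen_urls.add(item["url"])
            merged.append({**item, "source": source})
    return merged
```

Taking one item per source in turn keeps a single prolific source from crowding out the others at the top of the list; exact-URL duplicates are dropped on the way.
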
### Fetching Full Content

For important results, use OpenCLI to download the full content:

```bash
# Download a web page as Markdown
opencli web read --url "{url}" --output "{output_path}"
```

### Temporary Data Storage

All downloaded content is stored under:

```
~/Downloads/opencli-websearch-data/
├── {timestamp}_{query_hash}/
│   ├── metadata.json      # search metadata
│   ├── results.json       # merged search results
│   └── content/
│       ├── arxiv_{id}.md
│       ├── web_{hash}.md
│       └── ...
```

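## Execution Flow

### Step 1: Analyze Query Intent

Classify the query to pick the best data sources:
- **Academic/research** → prefer arxiv, google-scholar
- **Programming/technical** → prefer stackoverflow, hackernews
- **Chinese content** → prefer zhihu, xiaohongshu, 36kr
- **News/current events** → prefer bbc, reuters, 36kr
- **General queries** → search all sources

### Step 2: Run Searches in Parallel

```python
# Sketch only: select_sources, websearch, and opencli_search are assumed
# to be defined elsewhere; websearch stands in for the Qoder tool call.
from concurrent.futures import ThreadPoolExecutor

sources = select_sources(query_intent)

with ThreadPoolExecutor() as pool:
    # Submit Qoder WebSearch and every OpenCLI source at once
    futures = {"qoder": pool.submit(websearch, query)}
    for source in sources:
        futures[source] = pool.submit(opencli_search, source, query)
    # Collect each result once its search has finished
    results = {name: future.result() for name, future in futures.items()}
```
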
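### Step 3: Process Results

1. **Normalize**: convert results from different sources into one format
2. **Deduplicate**: drop duplicates based on URL and title similarity
3. **Rank**: order by source authority and relevance
4. **Summarize**: generate a short summary for each result
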
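
The deduplication step can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the Skill's implementation: the helper names `normalize_url` and `deduplicate`, the `title`/`url` result shape, and the 0.9 similarity threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def normalize_url(url):
    """Drop scheme, fragment, and trailing slash; lowercase the host; keep the query."""
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def deduplicate(results, title_threshold=0.9):
    """Keep the first result for each normalized URL and each near-identical title."""
    kept = []
    seen_urls = set()
    for item in results:
        key = normalize_url(item["url"])
        if key in seen_urls:
            continue
        title = item["title"].lower()
        # Compare against every title kept so far (quadratic, fine for small result sets)
        if any(
            SequenceMatcher(None, title, other["title"].lower()).ratio() >= title_threshold
            for other in kept
        ):
            continue
        seen_urls.add(key)
        kept.append(item)
    return kept
```

The query string is kept in the URL key on purpose: sites such as HackerNews distinguish items only by a query parameter, so dropping it would collapse unrelated results.
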
### Step 4: Fetch Full Content (Optional)

For highly relevant results:
1. Fetch the full content with `opencli web read`
2. Store it in the local temporary directory
3. Give the user a summary of the content

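
The Skill does not define how "highly relevant" is measured. As one possible stand-in, the sketch below scores each result by how many query words appear in its title and keeps the top few. The function name `pick_for_download`, the `title` key, and the scoring rule are assumptions made for illustration; whitespace splitting also will not segment Chinese text, so a real implementation would need a different tokenizer for Chinese queries.

```python
def pick_for_download(results, query, top_n=3):
    """Return up to top_n results whose titles share the most words with the query."""
    terms = set(query.lower().split())

    def score(item):
        # Number of distinct query words that also occur in the title
        return len(terms & set(item["title"].lower().split()))

    ranked = sorted(results, key=score, reverse=True)
    # Never download results that share no word with the query
    return [item for item in ranked[:top_n] if score(item) > 0]
```

The selected items would then be passed one by one to `opencli web read`, which keeps downloads limited to a handful of pages per search.
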
## Output Format

### Search Results Report

```markdown
## Search results: {query}

### Overview
- Sources searched: {sources}
- Total results: {count}
- Storage location: ~/Downloads/opencli-websearch-data/{timestamp}_{query_hash}/

### By Source

#### Academic sources
1. [Title](url) - arxiv
   - Summary: ...

#### Technical sources
1. [Title](url) - stackoverflow
   - Summary: ...

#### Chinese sources
1. [Title](url) - zhihu
   - Summary: ...

### Recommended In-Depth Reading
- [Document 1](path) - full content downloaded
- [Document 2](path) - full content downloaded
```

## Helper Functions

### Run an OpenCLI Search

```python
def opencli_search(source: str, query: str, limit: int = 5) -> list:
    """
    Search one data source with OpenCLI.

    Args:
        source: data source name (arxiv, stackoverflow, etc.)
        query: search query
        limit: maximum number of results

    Returns:
        List of search results
    """
    # Build the command as an argument list so the query needs no shell quoting
    cmd = ["opencli", source, "search", query, "--limit", str(limit)]
    # Run it and parse the results
    ...
```

### Download Document Content

```python
import hashlib
import os


def download_content(url: str, output_dir: str) -> str:
    """
    Download a document with OpenCLI web read.

    Args:
        url: document URL
        output_dir: output directory

    Returns:
        Local path of the downloaded file
    """
    # hash(url) returns an int that changes between runs; use a stable MD5 prefix instead
    filename = f"web_{hashlib.md5(url.encode()).hexdigest()[:12]}.md"
    output_path = os.path.join(output_dir, filename)
    cmd = ["opencli", "web", "read", "--url", url, "--output", output_path]
    # Run the command
    ...
    return output_path
```

### Create the Storage Directory

```python
import hashlib
import os
from datetime import datetime


def create_storage_dir(query: str) -> str:
    """
    Create the temporary storage directory.

    Returns:
        Path of the storage directory
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    query_hash = hashlib.md5(query.encode()).hexdigest()[:8]
    dir_name = f"{timestamp}_{query_hash}"
    storage_path = os.path.expanduser(f"~/Downloads/opencli-websearch-data/{dir_name}")
    os.makedirs(storage_path, exist_ok=True)
    os.makedirs(os.path.join(storage_path, "content"), exist_ok=True)
    return storage_path
```

## Error Handling

### Common Errors

| Error code | Cause | Resolution |
|------------|-------|------------|
| BROWSER_CONNECT | Browser extension not connected | Ask the user to open Chrome and enable the extension |
| TIMEOUT | Search timed out | Reduce the result count or switch data sources |
| NOT_FOUND | No search results | Try other data sources or rephrase the query |

### Fallback Strategy

When a data source fails:
1. Log the error
2. Continue with the remaining data sources
3. Mark the failed data source in the results
4. Suggest alternatives the user can try

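
The four fallback steps can be sketched as a loop that never lets one failing source stop the rest. Everything named here is hypothetical: `search_with_fallback`, the `search_fn` callback, and the assumption that a failure surfaces as an exception whose message is one of the error codes from the table above. How OpenCLI actually reports these codes is not specified in this Skill.

```python
import logging

# Suggestions keyed by the error codes listed in the table above
SUGGESTIONS = {
    "BROWSER_CONNECT": "Open Chrome and enable the extension",
    "TIMEOUT": "Reduce the result count or switch data sources",
    "NOT_FOUND": "Try other data sources or rephrase the query",
}


def search_with_fallback(sources, search_fn):
    """Run search_fn(source) for each source, collecting failures instead of raising."""
    succeeded, failed = {}, {}
    for source in sources:
        try:
            succeeded[source] = search_fn(source)
        except Exception as exc:  # keep going whatever went wrong with this source
            code = str(exc)
            logging.warning("source %s failed: %s", source, code)        # step 1: log
            failed[source] = {                                            # step 3: mark
                "error": code,
                "suggestion": SUGGESTIONS.get(code, "Try another data source"),  # step 4
            }
    return succeeded, failed
```

Returning the failures alongside the successes lets the report list which sources were skipped and what the user can do about each one.
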
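## Best Practices

1. **Query optimization**: for Chinese queries, prefer Chinese data sources
2. **Result count**: fetch 5 results per source by default to avoid excess noise
3. **Full-content fetching**: download full content only for highly relevant results
4. **Storage management**: regularly clean up old data under ~/Downloads/opencli-websearch-data/
5. **Source labeling**: always label each result with its source so the user can judge its credibility
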
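
Storage cleanup (practice 4) is easy to automate. The sketch below removes search directories that have not been modified for a given number of days; the function name `clean_old_searches` and the 7-day default are assumptions, not something the Skill ships.

```python
import shutil
import time
from pathlib import Path


def clean_old_searches(base_dir, max_age_days=7, now=None):
    """Delete subdirectories of base_dir last modified more than max_age_days ago.

    Returns the names of the removed directories.
    """
    base = Path(base_dir).expanduser()
    if not base.is_dir():
        return []
    current = now if now is not None else time.time()
    cutoff = current - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for entry in sorted(base.iterdir()):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return removed
```

It would typically be called as `clean_old_searches("~/Downloads/opencli-websearch-data")`. Because it deletes directories recursively, point it only at the Skill's own data directory.
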
## Examples

### Example 1: Academic Research

User: "Search for papers on large language model routing"

Run:
```bash
# Parallel searches
opencli arxiv search "large language model routing" --limit 5
opencli arxiv search "LLM router" --limit 5
# Qoder WebSearch runs at the same time
```

### Example 2: Technical Question

User: "Python async programming best practices"

Run:
```bash
opencli stackoverflow search "python async best practices" --limit 5
opencli hackernews top | grep -i python
# Qoder WebSearch runs at the same time
```

### Example 3: Chinese Content

User: "AI tool recommendations on Xiaohongshu"

Run:
```bash
opencli xiaohongshu search "AI工具推荐" --limit 5
# Note: the browser extension must be connected
```

193
skills/opencli-websearch/scripts/download_content.py
Normal file
@@ -0,0 +1,193 @@
#!/usr/bin/env python3
"""
Download document content with OpenCLI web read
"""

import os
import sys
import json
import hashlib
import subprocess
import argparse
from pathlib import Path
from typing import Optional, List
from urllib.parse import urlparse


def download_with_opencli(url: str, output_dir: str, timeout: int = 60) -> Optional[str]:
    """
    Download document content with OpenCLI web read

    Args:
        url: URL to download
        output_dir: output directory
        timeout: timeout in seconds

    Returns:
        Local path of the downloaded file, or None on failure
    """
    # Build the file name from the domain plus a short hash of the full URL
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    parsed = urlparse(url)
    # Replace characters that are awkward in file names (dots, port separator)
    domain = parsed.netloc.replace(".", "_").replace(":", "_")
    filename = f"{domain}_{url_hash}.md"
    output_path = os.path.join(output_dir, filename)

    # Build the command
    cmd = ["opencli", "web", "read", "--url", url, "--output", output_path]

    print(f"Downloading: {url}")
    print(f"Output: {output_path}")

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout
        )

        if result.returncode == 0:
            if os.path.exists(output_path):
                file_size = os.path.getsize(output_path)
                print(f"✓ Downloaded ({file_size} bytes)")
                return output_path
            else:
                print("✗ File was not created")
                return None
        else:
            print(f"✗ Download failed: {result.stderr[:200]}")
            return None

    except subprocess.TimeoutExpired:
        print("✗ Download timed out")
        return None
    except Exception as e:
        print(f"✗ Error: {str(e)}")
        return None


def batch_download(urls: List[str], output_dir: str, max_workers: int = 3) -> dict:
    """
    Download multiple URLs in a batch

    Args:
        urls: list of URLs
        output_dir: output directory
        max_workers: maximum number of parallel downloads

    Returns:
        Dict of download results {url: local_path or None}
    """
    from concurrent.futures import ThreadPoolExecutor, as_completed

    results = {}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {
            executor.submit(download_with_opencli, url, output_dir): url
            for url in urls
        }

        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as e:
                print(f"✗ {url} raised: {str(e)}")
                results[url] = None

    return results


def load_results_from_search(search_dir: str) -> List[str]:
    """
    Load the URL list from an earlier search's results

    Args:
        search_dir: search results directory

    Returns:
        List of URLs
    """
    import re  # hoisted out of the inner loop

    results_file = os.path.join(search_dir, "results.json")

    if not os.path.exists(results_file):
        print(f"Results file not found: {results_file}")
        return []

    with open(results_file, "r", encoding="utf-8") as f:
        data = json.load(f)

    urls = []
    for source, result in data.items():
        if result.get("success") and result.get("output"):
            # Naive extraction of URLs from the raw output text
            output = result["output"]
            for line in output.split("\n"):
                if "url:" in line.lower() or "http" in line:
                    # Extract the first URL on the line
                    url_match = re.search(r'https?://[^\s\'"<>]+', line)
                    if url_match:
                        urls.append(url_match.group())

    # Deduplicate while keeping first-seen order
    return list(dict.fromkeys(urls))


def main():
    parser = argparse.ArgumentParser(description="Download document content with OpenCLI")
    parser.add_argument("--url", help="Download a single URL")
    parser.add_argument("--urls", nargs="+", help="Download multiple URLs")
    parser.add_argument("--from-search", help="Load URLs from a search results directory")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--max-workers", type=int, default=3, help="Maximum number of parallel downloads")

    args = parser.parse_args()

    # Make sure the output directory exists
    os.makedirs(args.output_dir, exist_ok=True)

    # Collect the URL list
    urls = []

    if args.url:
        urls.append(args.url)

    if args.urls:
        urls.extend(args.urls)

    if args.from_search:
        search_urls = load_results_from_search(args.from_search)
        urls.extend(search_urls)
        print(f"Loaded {len(search_urls)} URLs from search results")

    if not urls:
        print("Error: no URL provided")
        return 1

    # Deduplicate while keeping the order in which URLs were given
    urls = list(dict.fromkeys(urls))
    print(f"\n{len(urls)} unique URLs to download\n")

    # Batch download
    results = batch_download(urls, args.output_dir, args.max_workers)

    # Summary
    success_count = sum(1 for v in results.values() if v is not None)
    print(f"\n{'='*60}")
    print(f"Download finished: {success_count}/{len(urls)} succeeded")
    print(f"{'='*60}")

    # Save the download record
    record_file = os.path.join(args.output_dir, "download_record.json")
    with open(record_file, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    print(f"Download record: {record_file}")

    return 0


if __name__ == "__main__":
    sys.exit(main())
245
skills/opencli-websearch/scripts/unified_search.py
Normal file
@@ -0,0 +1,245 @@
#!/usr/bin/env python3
"""
Unified search entry point - combines Qoder WebSearch and OpenCLI multi-source search
"""

import os
import sys
import json
import hashlib
import subprocess
import argparse
from datetime import datetime
from typing import List, Dict, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed


def create_storage_dir(query: str) -> str:
    """Create the temporary storage directory"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    query_hash = hashlib.md5(query.encode()).hexdigest()[:8]
    dir_name = f"{timestamp}_{query_hash}"
    storage_path = os.path.expanduser(f"~/Downloads/opencli-websearch-data/{dir_name}")

    os.makedirs(storage_path, exist_ok=True)
    os.makedirs(os.path.join(storage_path, "content"), exist_ok=True)

    return storage_path


def run_opencli_search(source: str, query: str, limit: int = 5) -> Dict:
    """Run an OpenCLI search"""
    if source == "hackernews":
        # hackernews is read through its top-stories listing, so the query is not used
        cmd = ["opencli", "hackernews", "top", "--limit", str(limit)]
    else:
        cmd = ["opencli", source, "search", query, "--limit", str(limit)]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=60
        )

        return {
            "source": source,
            "success": result.returncode == 0,
            "output": result.stdout if result.returncode == 0 else None,
            "error": result.stderr if result.returncode != 0 else None
        }
    except subprocess.TimeoutExpired:
        return {"source": source, "success": False, "output": None, "error": "Timeout"}
    except Exception as e:
        return {"source": source, "success": False, "output": None, "error": str(e)}


def run_qoder_websearch(query: str) -> Dict:
    """
    Run Qoder WebSearch
    Note: this function needs the Qoder environment; in real use the search is invoked through the Qoder tool
    """
    # This is a placeholder: in real use Qoder calls websearch directly.
    # Return a marker indicating that Qoder has to handle this entry.
    return {
        "source": "qoder_websearch",
        "success": True,
        "output": "[Qoder WebSearch results will be merged here]",
        "error": None,
        "needs_qoder": True
    }


def parallel_search(query: str, sources: List[str], use_qoder: bool = True) -> Dict[str, Dict]:
    """Run the multi-source search in parallel"""
    results = {}

    # If Qoder is enabled, add its placeholder entry first
    if use_qoder:
        results["qoder_websearch"] = run_qoder_websearch(query)

    # Run the OpenCLI searches in parallel
    with ThreadPoolExecutor(max_workers=5) as executor:
        future_to_source = {
            executor.submit(run_opencli_search, source, query, 5): source
            for source in sources
        }

        for future in as_completed(future_to_source):
            source = future_to_source[future]
            try:
                results[source] = future.result()
            except Exception as e:
                results[source] = {"source": source, "success": False, "output": None, "error": str(e)}

    return results


def select_sources(query: str, intent: Optional[str] = None) -> List[str]:
    """Select data sources based on the query intent"""
    sources = []

    # Data source configuration
    SOURCE_CONFIG = {
        "academic": ["arxiv"],
        "technical": ["stackoverflow", "hackernews", "gitee"],
        "chinese": ["36kr", "zhihu", "xiaohongshu"],
        "news": ["bbc", "reuters"],
        "general": ["google"]
    }

    if intent and intent in SOURCE_CONFIG:
        sources = SOURCE_CONFIG[intent]
    else:
        # Infer the intent from the query
        query_lower = query.lower()

        # Academic keywords ("论文" is Chinese for "paper")
        if any(kw in query_lower for kw in ["paper", "论文", "arxiv", "research", "study"]):
            sources.extend(SOURCE_CONFIG["academic"])

        # Technical keywords
        if any(kw in query_lower for kw in ["python", "javascript", "code", "programming", "bug", "error"]):
            sources.extend(["stackoverflow", "hackernews"])

        # Query contains Chinese characters - prefer the public source
        if any('\u4e00' <= char <= '\u9fff' for char in query):
            sources.extend(["36kr"])

        # Default sources
        if not sources:
            sources = ["arxiv", "stackoverflow", "36kr", "hackernews"]

    # Deduplicate while keeping a stable order
    return list(dict.fromkeys(sources))


def generate_report(query: str, results: Dict, storage_path: str) -> str:
    """Generate the search report in Markdown"""
    report = []

    report.append(f"# Search report: {query}\n")
    report.append(f"**Search time**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    report.append(f"**Storage location**: `{storage_path}`\n")

    # Summary
    success_count = sum(1 for r in results.values() if r.get("success"))
    report.append(f"**Data sources**: {len(results)} | **Succeeded**: {success_count}\n")

    report.append("---\n")

    # One section per source
    for source, result in sorted(results.items()):
        status_icon = "✅" if result.get("success") else "❌"
        report.append(f"\n## {status_icon} {source.upper()}\n")

        if result.get("success") and result.get("output"):
            output = result["output"]
            # Truncate overly long output
            if len(output) > 2000:
                output = output[:2000] + "\n\n... (content truncated)"
            report.append(f"```\n{output}\n```\n")
        elif result.get("error"):
            report.append(f"```\nError: {result['error'][:200]}\n```\n")

        if result.get("needs_qoder"):
            report.append("> 📝 **Note**: Qoder WebSearch results are provided directly through the Qoder tool\n")

    report.append("\n---\n")
    report.append("*Generated by the OpenCLI WebSearch Skill*\n")

    return "\n".join(report)


def save_results(storage_path: str, query: str, results: Dict, report: str):
    """Save the search results"""
    # Save the metadata
    metadata = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "sources": list(results.keys()),
        "success_count": sum(1 for r in results.values() if r.get("success"))
    }

    with open(os.path.join(storage_path, "metadata.json"), "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)

    # Save the raw results
    with open(os.path.join(storage_path, "results.json"), "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    # Save the report
    with open(os.path.join(storage_path, "report.md"), "w", encoding="utf-8") as f:
        f.write(report)


def main():
    parser = argparse.ArgumentParser(description="Unified search entry point - Qoder + OpenCLI")
    parser.add_argument("query", help="Search query")
    parser.add_argument("--intent", choices=["academic", "technical", "chinese", "news", "general"],
                        help="Search intent type")
    parser.add_argument("--sources", nargs="+", help="OpenCLI data sources to use")
    parser.add_argument("--no-qoder", action="store_true", help="Do not use Qoder WebSearch")
    parser.add_argument("--output", help="Output directory")

    args = parser.parse_args()

    # Create the storage directory
    if args.output:
        storage_path = args.output
        os.makedirs(storage_path, exist_ok=True)
    else:
        storage_path = create_storage_dir(args.query)

    print(f"📁 Storage path: {storage_path}\n")

    # Select data sources
    if args.sources:
        sources = args.sources
    else:
        sources = select_sources(args.query, args.intent)

    print(f"🔍 OpenCLI data sources: {', '.join(sources)}")
    print(f"🔍 Qoder WebSearch: {'disabled' if args.no_qoder else 'enabled'}\n")

    # Run the parallel search
    print("⏳ Searching in parallel...\n")
    results = parallel_search(args.query, sources, use_qoder=not args.no_qoder)

    # Generate the report
    report = generate_report(args.query, results, storage_path)

    # Save the results
    save_results(storage_path, args.query, results, report)

    # Print the report
    print(report)

    print("\n✅ Search complete!")
    print(f"📄 Report: {os.path.join(storage_path, 'report.md')}")
    print(f"📊 Data: {os.path.join(storage_path, 'results.json')}")

    return 0


if __name__ == "__main__":
    sys.exit(main())
322
skills/opencli-websearch/scripts/websearch.py
Normal file
@@ -0,0 +1,322 @@
#!/usr/bin/env python3
"""
OpenCLI multi-source web search script
Supports parallel search with Qoder WebSearch and OpenCLI
"""

import os
import sys
import json
import hashlib
import subprocess
import argparse
from datetime import datetime
from pathlib import Path
from typing import List, Dict, Optional
from concurrent.futures import ThreadPoolExecutor, as_completed


# Data source configuration
SOURCES = {
    "academic": {
        "arxiv": {"type": "public", "limit": 5},
    },
    "technical": {
        "stackoverflow": {"type": "public", "limit": 5},
        "hackernews": {"type": "public", "limit": 5},
        "gitee": {"type": "public", "limit": 5},
    },
    "chinese": {
        "zhihu": {"type": "browser", "limit": 5},
        "xiaohongshu": {"type": "browser", "limit": 5},
        "36kr": {"type": "public", "limit": 5},
    },
    "news": {
        "bbc": {"type": "public", "limit": 5},
        "reuters": {"type": "public", "limit": 5},
    },
    "general": {
        "google": {"type": "browser", "limit": 5},
    }
}


def create_storage_dir(query: str) -> str:
    """Create the temporary storage directory"""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    query_hash = hashlib.md5(query.encode()).hexdigest()[:8]
    dir_name = f"{timestamp}_{query_hash}"
    storage_path = os.path.expanduser(f"~/Downloads/opencli-websearch-data/{dir_name}")

    os.makedirs(storage_path, exist_ok=True)
    os.makedirs(os.path.join(storage_path, "content"), exist_ok=True)

    return storage_path


def run_opencli_search(source: str, query: str, limit: int = 5) -> Dict:
    """Run an OpenCLI search"""
    cmd = ["opencli", source, "search", query, "--limit", str(limit)]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=60
        )

        if result.returncode == 0:
            return {
                "source": source,
                "success": True,
                "output": result.stdout,
                "error": None
            }
        else:
            return {
                "source": source,
                "success": False,
                "output": None,
                "error": result.stderr
            }
    except subprocess.TimeoutExpired:
        return {
            "source": source,
            "success": False,
            "output": None,
            "error": "Timeout"
        }
    except Exception as e:
        return {
            "source": source,
            "success": False,
            "output": None,
            "error": str(e)
        }


def run_opencli_hackernews(limit: int = 5) -> Dict:
    """Fetch the top HackerNews stories"""
    cmd = ["opencli", "hackernews", "top", "--limit", str(limit)]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=30
        )

        return {
            "source": "hackernews",
            "success": result.returncode == 0,
            "output": result.stdout if result.returncode == 0 else None,
            "error": result.stderr if result.returncode != 0 else None
        }
    except Exception as e:
        # Also covers subprocess.TimeoutExpired
        return {
            "source": "hackernews",
            "success": False,
            "output": None,
            "error": str(e)
        }


def download_content(url: str, output_dir: str) -> Optional[str]:
    """Download document content with OpenCLI web read"""
    url_hash = hashlib.md5(url.encode()).hexdigest()[:12]
    filename = f"web_{url_hash}.md"
    output_path = os.path.join(output_dir, filename)

    cmd = ["opencli", "web", "read", "--url", url, "--output", output_path]

    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=60
        )

        if result.returncode == 0 and os.path.exists(output_path):
            return output_path
        return None
    except Exception:
        return None


def select_sources(query: str, intent: Optional[str] = None) -> List[str]:
    """Select data sources based on the query intent"""
    sources = []

    if intent:
        if intent in ("academic", "technical", "chinese", "news"):
            sources.extend(SOURCES[intent].keys())
        else:
            # General search - use every public source
            for category in ["academic", "technical", "chinese", "news"]:
                for source, config in SOURCES[category].items():
                    if config["type"] == "public":
                        sources.append(source)
    else:
        # Infer the intent from the query
        query_lower = query.lower()

        # Academic keywords ("论文" is Chinese for "paper")
        if any(kw in query_lower for kw in ["paper", "论文", "arxiv", "research", "study"]):
            sources.extend(SOURCES["academic"].keys())

        # Technical keywords
        if any(kw in query_lower for kw in ["python", "javascript", "code", "programming", "bug", "error"]):
            sources.extend(["stackoverflow", "hackernews"])

        # Query contains Chinese characters
        if any('\u4e00' <= char <= '\u9fff' for char in query):
            sources.extend(["36kr"])  # prefer the public source

        # Fall back to the default sources
        if not sources:
            sources = ["arxiv", "stackoverflow", "36kr"]

    # Deduplicate while keeping a stable order
    return list(dict.fromkeys(sources))


def parallel_search(query: str, sources: List[str]) -> Dict[str, Dict]:
    """Run the multi-source search in parallel"""
    results = {}

    with ThreadPoolExecutor(max_workers=5) as executor:
        future_to_source = {}

        for source in sources:
            if source == "hackernews":
                future = executor.submit(run_opencli_hackernews)
            else:
                # Look up the per-source limit in SOURCES, defaulting to 5
                limit = 5
                for category in SOURCES.values():
                    if source in category:
                        limit = category[source].get("limit", 5)
                        break
                future = executor.submit(run_opencli_search, source, query, limit)

            future_to_source[future] = source

        for future in as_completed(future_to_source):
            source = future_to_source[future]
            try:
                results[source] = future.result()
            except Exception as e:
                results[source] = {
                    "source": source,
                    "success": False,
                    "output": None,
                    "error": str(e)
                }

    return results


def save_results(storage_path: str, query: str, results: Dict) -> str:
    """Save the search results locally"""
    # Save the metadata
    metadata = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "sources": list(results.keys()),
        "success_count": sum(1 for r in results.values() if r["success"])
    }

    metadata_path = os.path.join(storage_path, "metadata.json")
    with open(metadata_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, ensure_ascii=False, indent=2)

    # Save the full results
    results_path = os.path.join(storage_path, "results.json")
    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    return storage_path


def print_report(query: str, results: Dict, storage_path: str):
    """Print the search results report"""
    print(f"\n{'='*60}")
    print(f"Search report: {query}")
    print(f"{'='*60}")

    print(f"\nStorage location: {storage_path}")
    print(f"Data sources: {', '.join(results.keys())}")

    success_count = sum(1 for r in results.values() if r["success"])
    print(f"Succeeded: {success_count}/{len(results)}")

    print("\n" + "-"*60)
    print("Results by source:")
    print("-"*60)

    for source, result in results.items():
        status = "✓" if result["success"] else "✗"
        print(f"\n[{status}] {source}")

        if result["success"] and result["output"]:
            # Truncate the output to keep the report short
            output = result["output"][:500]
            if len(result["output"]) > 500:
                output += "..."
            print(output)
        elif result["error"]:
            print(f"  Error: {result['error'][:100]}")

    print(f"\n{'='*60}\n")


def main():
    parser = argparse.ArgumentParser(description="OpenCLI multi-source web search")
    parser.add_argument("query", help="Search query")
    parser.add_argument("--intent", choices=["academic", "technical", "chinese", "news", "general"],
                        help="Search intent type")
    parser.add_argument("--sources", nargs="+", help="Data sources to use")
    # NOTE: --download is parsed but not acted on anywhere in this script
    parser.add_argument("--download", action="store_true", help="Download highly relevant documents")
    parser.add_argument("--output", help="Output directory")

    args = parser.parse_args()

    # Create the storage directory
    if args.output:
        storage_path = args.output
        os.makedirs(storage_path, exist_ok=True)
    else:
        storage_path = create_storage_dir(args.query)

    print(f"Storage path: {storage_path}")

    # Select data sources
    if args.sources:
        sources = args.sources
    else:
        sources = select_sources(args.query, args.intent)

    print(f"Search sources: {', '.join(sources)}")
    print("Searching in parallel...")

    # Run the parallel search
    results = parallel_search(args.query, sources)

    # Save the results
    save_results(storage_path, args.query, results)

    # Print the report
    print_report(args.query, results, storage_path)

    return 0


if __name__ == "__main__":
    sys.exit(main())