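feat(repo): organize Qoder Skills and MCP configuration into the repository

- Add 5 user-level Skills:
  - auto-commit: automatic Git commits
  - karpathy-guidelines: coding guidelines
  - opencli-websearch: multi-source web search
  - pdf-reader: PDF content extraction
  - repo-analyzer: in-depth project analysis
- Add Playwright MCP configuration (21 browser automation tools)
- Create complete README.md documentation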
skills/pdf-reader/SKILL.md
@@ -0,0 +1,130 @@
---
name: pdf-reader
description: Extract text and tables from PDF files and convert to Markdown format using pymupdf4llm. Use when working with PDF files, extracting content from PDFs, converting PDFs to text or Markdown, or when the user mentions reading or processing PDF documents.
---

# PDF Reader

Extract text and tables from PDF files and convert them to Markdown format using pymupdf4llm.

## Prerequisites

Before using this skill, ensure pymupdf4llm is installed:

```bash
python -c "import pymupdf4llm" 2>/dev/null && echo "Installed" || echo "Not installed"
```

If not installed, run:

```bash
pip install pymupdf4llm
```

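The same check can also be done from Python, which may be convenient when the skill is driven by a script rather than a shell. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an illustrative assumption, not part of pymupdf4llm.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    # find_spec() locates the module without importing it.
    return importlib.util.find_spec(module_name) is not None


if is_installed("pymupdf4llm"):
    print("Installed")
else:
    print("Not installed; run: pip install pymupdf4llm")
```
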
## Basic Usage

### Extract PDF to Markdown

```python
import pymupdf4llm

# Convert PDF to Markdown
md_text = pymupdf4llm.to_markdown("path/to/file.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_text)

# Or print directly
print(md_text)
```

### Extract with Options

```python
import pymupdf4llm

# Convert specific pages
md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    pages=[0, 1, 2],  # First 3 pages (0-indexed)
)

# Convert all pages (default)
md_text = pymupdf4llm.to_markdown("document.pdf")
```

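Because `pages` is 0-indexed while page numbers shown in PDF viewers usually start at 1, an off-by-one slip is easy to make. A small conversion helper is one way to guard against it; `page_range` below is a hypothetical helper written for this illustration, not a pymupdf4llm function.

```python
def page_range(first: int, last: int) -> list[int]:
    """Convert a 1-based, inclusive page range into 0-based page indices."""
    if first < 1 or last < first:
        raise ValueError(f"invalid page range: {first}-{last}")
    return list(range(first - 1, last))


# Pages 1 to 3 as a viewer shows them map to indices 0, 1, 2.
print(page_range(1, 3))

# Intended use (requires pymupdf4llm and a real file):
# md_text = pymupdf4llm.to_markdown("document.pdf", pages=page_range(1, 3))
```
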
## Workflow

When processing a PDF:

1. **Check dependencies**: Verify pymupdf4llm is installed
2. **Extract content**: Use `pymupdf4llm.to_markdown()` to convert the PDF
3. **Save to same directory**: Output the Markdown to the same directory as the PDF
4. **Verify**: Check the output for completeness

### Quick Checklist

```
Task Progress:
- [ ] Step 1: Verify pymupdf4llm is installed
- [ ] Step 2: Extract PDF content to Markdown
- [ ] Step 3: Save .md file to same directory as PDF
- [ ] Step 4: Verify extraction quality
```

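The four workflow steps can be sketched as a single function. This is a minimal illustration, not a definitive implementation: the names `markdown_path_for`, `looks_complete`, and `convert_pdf` are assumptions made here, and the "verify" step is only a crude non-empty check, since what counts as complete depends on the document.

```python
import importlib.util
from pathlib import Path


def markdown_path_for(pdf_path: Path) -> Path:
    """Step 3 helper: same directory and file name as the PDF, with a .md suffix."""
    return pdf_path.with_suffix(".md")


def looks_complete(md_text: str, min_chars: int = 1) -> bool:
    """Step 4 helper: crude check that extraction produced some non-blank text."""
    return len(md_text.strip()) >= min_chars


def convert_pdf(pdf_path: Path) -> Path:
    # Step 1: verify the dependency before importing it.
    if importlib.util.find_spec("pymupdf4llm") is None:
        raise RuntimeError("pymupdf4llm is not installed; run: pip install pymupdf4llm")
    import pymupdf4llm

    # Step 2: extract the PDF content as Markdown.
    md_text = pymupdf4llm.to_markdown(str(pdf_path))

    # Step 3: save the Markdown next to the source PDF.
    output_path = markdown_path_for(pdf_path)
    output_path.write_text(md_text, encoding="utf-8")

    # Step 4: verify that something was actually extracted.
    if not looks_complete(md_text):
        raise ValueError(f"No text extracted from {pdf_path}")
    return output_path
```

Calling `convert_pdf(Path("report.pdf"))` would write `report.md` beside the PDF and return its path, assuming the file exists and pymupdf4llm is installed.
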
## Common Patterns

### Read and Display PDF Content

```python
import pymupdf4llm

# Extract and print
md_content = pymupdf4llm.to_markdown("report.pdf")
print(md_content)
```

### Extract and Save to Same Directory

```python
import pymupdf4llm
from pathlib import Path

# Convert and save to same directory as PDF
pdf_path = Path("input.pdf")
# str() avoids relying on Path support in to_markdown()
md_text = pymupdf4llm.to_markdown(str(pdf_path))

# Create output path with same name but .md extension
output_path = pdf_path.with_suffix('.md')

with open(output_path, "w", encoding="utf-8") as f:
    f.write(md_text)

print(f"Saved to {output_path} ({len(md_text)} characters)")
```

### Process Multiple PDFs

```python
import pymupdf4llm
from pathlib import Path

pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

for pdf_file in pdf_files:
    pdf_path = Path(pdf_file)
    # str() avoids relying on Path support in to_markdown()
    md_text = pymupdf4llm.to_markdown(str(pdf_path))

    # Save to same directory as PDF
    output_path = pdf_path.with_suffix('.md')
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(md_text)

    print(f"Converted {pdf_file} -> {output_path}")
```

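When the list of files is not known in advance, the batch pattern can be extended to scan a directory and to keep going if one file fails. The sketch below is one possible variant; `find_pdfs` and `convert_directory` are hypothetical names, and catching every exception is a deliberate simplification for illustration.

```python
from pathlib import Path


def find_pdfs(directory: Path) -> list[Path]:
    """Return the PDF files directly inside `directory` (case-insensitive suffix)."""
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def convert_directory(directory: Path) -> dict[Path, str]:
    """Convert every PDF in `directory`; return failures as {path: error message}."""
    import pymupdf4llm  # imported lazily so find_pdfs() works without it

    failures: dict[Path, str] = {}
    for pdf_path in find_pdfs(directory):
        try:
            md_text = pymupdf4llm.to_markdown(str(pdf_path))
            pdf_path.with_suffix(".md").write_text(md_text, encoding="utf-8")
        except Exception as exc:  # record the failure and continue with the rest
            failures[pdf_path] = str(exc)
    return failures
```

An empty dictionary from `convert_directory(Path("docs"))` would mean every PDF in that directory was converted without raising.
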
## Tips

- pymupdf4llm automatically extracts text, tables, and basic formatting
- Output is in Markdown format, preserving document structure
- Handles multi-column layouts and complex formatting
- For scanned PDFs requiring OCR, additional tools may be needed

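One rough way to notice a scanned PDF is to look at how little text came back per page. The heuristic below is only a sketch: `likely_needs_ocr` is a hypothetical helper, the threshold of 50 characters per page is an arbitrary assumption, and the page count has to be supplied by the caller.

```python
def likely_needs_ocr(md_text: str, page_count: int, min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is scanned, based on extracted text density."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Ignore all whitespace so blank lines do not count as content.
    visible = "".join(md_text.split())
    return len(visible) / page_count < min_chars_per_page


# Almost no text across 3 pages suggests image-only (scanned) pages.
print(likely_needs_ocr("\n\n", page_count=3))
```

A `True` result is only a hint that OCR might be needed; PDFs with very sparse text pages can trigger it as well.
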