---
name: pdf-reader
description: Extract text and tables from PDF files and convert to Markdown format using pymupdf4llm. Use when working with PDF files, extracting content from PDFs, converting PDFs to text or Markdown, or when the user mentions reading or processing PDF documents.
---

# PDF Reader

Extract text and tables from PDF files and convert them to Markdown format using pymupdf4llm.

## Prerequisites

Before using this skill, check whether pymupdf4llm is installed:

```bash
python -c "import pymupdf4llm" 2>/dev/null && echo "Installed" || echo "Not installed"
```

If not installed, run:

```bash
python -m pip install pymupdf4llm
```
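
The same check can be done from Python, which is handy when the conversion runs inside a script. This is a minimal sketch using only the standard library; the helper name `is_installed` is an illustration, not part of pymupdf4llm.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("pymupdf4llm"):
        print("Installed")
    else:
        print("Not installed; run: python -m pip install pymupdf4llm")
```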

## Basic Usage

### Extract PDF to Markdown

```python
import pymupdf4llm

# Convert PDF to Markdown
md_text = pymupdf4llm.to_markdown("path/to/file.pdf")

# Save to file
with open("output.md", "w", encoding="utf-8") as f:
    f.write(md_text)

# Or print directly
print(md_text)
```

### Extract with Options

```python
import pymupdf4llm

# Convert specific pages
md_text = pymupdf4llm.to_markdown(
    "document.pdf",
    pages=[0, 1, 2],  # First 3 pages (0-indexed)
)

# Convert all pages (default)
md_text = pymupdf4llm.to_markdown("document.pdf")
```
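
Because `pages` is 0-indexed while people usually quote page numbers starting at 1, a small conversion helper can prevent off-by-one mistakes. The sketch below is illustrative; `pages_arg` is a hypothetical helper name, not a pymupdf4llm function.

```python
def pages_arg(first_page: int, last_page: int) -> list[int]:
    """Convert a 1-based, inclusive page range to a list of 0-based indices."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return list(range(first_page - 1, last_page))


print(pages_arg(1, 3))  # [0, 1, 2]
print(pages_arg(5, 6))  # [4, 5]
```

The result can be passed straight through, for example `pymupdf4llm.to_markdown("document.pdf", pages=pages_arg(1, 3))`.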

## Workflow

When processing a PDF:

1. **Check dependencies**: Verify pymupdf4llm is installed
2. **Extract content**: Use `pymupdf4llm.to_markdown()` to convert the PDF
3. **Save to same directory**: Output the Markdown to the same directory as the PDF
4. **Verify**: Check that the output is non-empty and that headings, tables, and page order match the PDF
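
Part of the verification step can be automated. The sketch below counts characters, heading lines, and table rows in the extracted Markdown so that an unexpectedly empty or flat result stands out. The helper name `summarize_markdown` is hypothetical, and the line-prefix checks are a rough heuristic (for instance, a `#` comment inside a code fence is counted as a heading), so treat the numbers as a prompt for manual review rather than a verdict.

```python
def summarize_markdown(md_text: str) -> dict[str, int]:
    """Return rough structural counts for a Markdown string."""
    lines = md_text.splitlines()
    return {
        "characters": len(md_text),
        # Lines that begin with '#' are treated as headings (heuristic).
        "headings": sum(1 for line in lines if line.lstrip().startswith("#")),
        # Lines that begin with '|' are treated as table rows (heuristic).
        "table_rows": sum(1 for line in lines if line.lstrip().startswith("|")),
    }


sample = "# Title\n\nSome text.\n\n|a|b|\n|---|---|\n|1|2|\n"
print(summarize_markdown(sample))
```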

### Quick Checklist

```
Task Progress:
- [ ] Step 1: Verify pymupdf4llm is installed
- [ ] Step 2: Extract PDF content to Markdown
- [ ] Step 3: Save .md file to same directory as PDF
- [ ] Step 4: Verify extraction quality
```
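
The four steps can be wrapped in one function. The following is a sketch under stated assumptions, not a prescribed implementation: `convert_pdf` and its `converter` parameter are hypothetical names introduced here so the function can be exercised without a real PDF. When no converter is given, it falls back to `pymupdf4llm.to_markdown`.

```python
from pathlib import Path
from typing import Callable, Optional


def convert_pdf(
    pdf_path: str,
    converter: Optional[Callable[[str], str]] = None,
) -> Path:
    """Convert one PDF to Markdown beside the source file; return the output path."""
    if converter is None:
        # Step 1: the import raises ImportError if pymupdf4llm is missing.
        import pymupdf4llm

        converter = pymupdf4llm.to_markdown

    source = Path(pdf_path)
    if not source.is_file():
        raise FileNotFoundError(source)

    # Step 2: extract the content as Markdown.
    md_text = converter(str(source))

    # Step 4 (minimal check): refuse to write an empty result.
    if not md_text.strip():
        raise ValueError(f"No text extracted from {source}; the PDF may be scanned")

    # Step 3: save next to the PDF, same name with a .md extension.
    output_path = source.with_suffix(".md")
    output_path.write_text(md_text, encoding="utf-8")
    return output_path
```

A typical call would be `convert_pdf("report.pdf")`, which writes `report.md` in the same directory.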

## Common Patterns

### Read and Display PDF Content

```python
import pymupdf4llm

# Extract and print
md_content = pymupdf4llm.to_markdown("report.pdf")
print(md_content)
```

### Extract and Save to Same Directory

```python
import pymupdf4llm
from pathlib import Path

# Convert and save to same directory as PDF
pdf_path = Path("input.pdf")
md_text = pymupdf4llm.to_markdown(pdf_path)

# Create output path with same name but .md extension
output_path = pdf_path.with_suffix('.md')

with open(output_path, "w", encoding="utf-8") as f:
    f.write(md_text)
print(f"Saved to {output_path} ({len(md_text)} characters)")
```

### Process Multiple PDFs

```python
import pymupdf4llm
from pathlib import Path

pdf_files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

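for pdf_file in pdf_files:
    pdf_path = Path(pdf_file)
    try:
        md_text = pymupdf4llm.to_markdown(pdf_path)
    except Exception as exc:
        # Report the failure and continue with the remaining files
        print(f"Failed to convert {pdf_file}: {exc}")
        continue

    # Save to same directory as PDF
    output_path = pdf_path.with_suffix('.md')
    with open(output_path, "w", encoding="utf-8") as f:
        f.write(md_text)
    print(f"Converted {pdf_file} -> {output_path}")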
```
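
When the PDFs live in one folder, the file list can be built automatically instead of typed by hand. The sketch below uses only `pathlib`; `pending_pdfs` is a hypothetical helper name. It skips any PDF that already has a `.md` file beside it, so a rerun picks up only the remaining work. Note that `glob("*.pdf")` may miss files named `.PDF` on case-sensitive filesystems.

```python
from pathlib import Path


def pending_pdfs(directory: str) -> list[Path]:
    """Return PDFs in `directory` with no sibling .md file yet, sorted by name."""
    return sorted(
        pdf
        for pdf in Path(directory).glob("*.pdf")
        if not pdf.with_suffix(".md").exists()
    )
```

The result can replace the hard-coded list, for example `pdf_files = pending_pdfs("reports")`, where `reports` is a placeholder folder name.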

## Tips

- pymupdf4llm extracts text, tables, and basic formatting (such as headings, bold, and italics) without extra configuration
- Output is in Markdown format, preserving document structure
- Handles multi-column layouts; for complex layouts, spot-check the reading order in the output
- Scanned PDFs with no text layer may yield little or no text; run OCR with an additional tool first
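
A quick way to spot pages that probably need OCR is to look for pages with almost no extracted text. The sketch below works on a list of per-page strings; the helper name `sparse_pages` and the `min_chars` threshold of 20 are assumptions chosen for illustration, not values from pymupdf4llm. One way to obtain per-page text, assuming the installed version supports it, is `pymupdf4llm.to_markdown(path, page_chunks=True)`, which is expected to return one dict per page with a `text` key.

```python
def sparse_pages(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 0-based indices of pages whose stripped text is shorter than min_chars."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


pages = ["# Intro\n\nA full paragraph of extracted text.", "", "   \n"]
print(sparse_pages(pages))  # [1, 2]
```

Pages flagged this way are candidates for OCR, though a genuinely blank page would also be flagged.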