Semantic Scholar API Verification Protocol
Status: v3.3
Used by: source_verification_agent, bibliography_agent, integrity_verification_agent
API base: https://api.semanticscholar.org/graph/v1
Rate limit: 1 request/second (unauthenticated), 10 requests/second (with API key)
API key env var: S2_API_KEY (optional; graceful degradation if unset)
Purpose
Provides programmatic verification of reference existence and bibliographic accuracy using the Semantic Scholar Academic Graph API. This supplements, but does not replace, WebSearch-based verification by adding a structured, API-grounded check that returns machine-readable metadata.
PaperOrchestra (Song et al., 2026) demonstrated that a two-phase citation pipeline — (1) broad discovery via web search, (2) sequential verification via Semantic Scholar API — achieves significantly higher citation coverage (P0 Recall +2-6%, P1 Recall +12-14% over baselines). ARS adopts the verification phase as an additional tier in the existing multi-tier verification strategy.
Query Patterns
Pattern 1: Title Search (primary)
GET /paper/search?query={url_encoded_title}&limit=5&fields=title,authors,year,externalIds,venue,publicationDate
Matching rule: Compute Levenshtein similarity between query title and each result title (case-insensitive, stripped of punctuation). Accept if similarity >= 0.70 (matching PaperOrchestra's threshold). If multiple results >= 0.70, prefer the one with matching year.
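The matching rule can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation. The helper names (`normalize_title`, `title_similarity`, `pick_match`) are hypothetical, and the similarity is assumed to be normalized as 1 minus the edit distance divided by the longer title's length, since the protocol does not specify the normalization.

```python
import string


def normalize_title(title: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer title."""
    na, nb = normalize_title(a), normalize_title(b)
    longest = max(len(na), len(nb))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(na, nb) / longest


def pick_match(query_title, query_year, results, threshold=0.70):
    """Return the best result at or above the threshold, or None."""
    scored = [(title_similarity(query_title, r.get("title") or ""), r)
              for r in results]
    candidates = [(s, r) for s, r in scored if s >= threshold]
    if not candidates:
        return None
    # Prefer a matching year first, then the higher similarity score.
    candidates.sort(key=lambda sr: (sr[1].get("year") == query_year, sr[0]),
                    reverse=True)
    score, best = candidates[0]
    return {**best, "match_score": score}
```

Here `results` is assumed to be the list of paper objects returned by the title search, each carrying at least `title` and `year`.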
Pattern 2: DOI Lookup (when DOI is available)
GET /paper/DOI:{doi}?fields=title,authors,year,externalIds,venue,publicationDate,citationCount
Matching rule: DOI match is exact. Cross-check that returned title matches the reference title (Levenshtein >= 0.70). If title mismatch despite DOI match, flag as DOI_MISMATCH — a known hallucination pattern where a fabricated DOI resolves to an unrelated paper.
Pattern 3: Semantic Scholar ID Lookup (for re-verification)
GET /paper/{paperId}?fields=title,authors,year,externalIds,venue,publicationDate,citationCount
Used when re-verifying a reference that was previously resolved to a Semantic Scholar ID (stored in the bibliography's semantic_scholar_id field).
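As an illustration, the three query patterns above could be assembled with the standard library as follows. The function names are hypothetical, and leaving the slash in a DOI unencoded is an assumption of this sketch rather than something the protocol states.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://api.semanticscholar.org/graph/v1"
SEARCH_FIELDS = "title,authors,year,externalIds,venue,publicationDate"
LOOKUP_FIELDS = SEARCH_FIELDS + ",citationCount"


def title_search_url(title: str, limit: int = 5) -> str:
    """Pattern 1: title search with a percent-encoded query."""
    params = urlencode(
        {"query": title, "limit": limit, "fields": SEARCH_FIELDS},
        quote_via=quote,  # encode spaces as %20 rather than '+'
        safe=",",         # keep the comma-separated field list readable
    )
    return f"{API_BASE}/paper/search?{params}"


def doi_lookup_url(doi: str) -> str:
    """Pattern 2: DOI lookup (assumption: '/' in the DOI stays unencoded)."""
    return f"{API_BASE}/paper/DOI:{quote(doi, safe='/')}?fields={LOOKUP_FIELDS}"


def id_lookup_url(paper_id: str) -> str:
    """Pattern 3: lookup by a previously stored Semantic Scholar ID."""
    return f"{API_BASE}/paper/{quote(paper_id, safe='')}?fields={LOOKUP_FIELDS}"
```

Keeping the field lists in two constants makes it harder for the search and lookup patterns to drift apart when a field is added.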
Verification Tiers (Updated with S2 API)
| Tier | Method | Coverage | Purpose |
|---|---|---|---|
| Tier 0 (NEW) | Semantic Scholar API | 100% of references | Programmatic existence check + metadata extraction |
| Tier 1 | DOI resolution | 100% of DOI-bearing refs | URL-level existence check |
| Tier 2 | WebSearch spot-check | 50% of sources | Human-readable verification |
Execution order: Tier 0 first (batch, 1 req/sec). References that PASS Tier 0 skip Tier 2 unless flagged for other reasons. References that FAIL Tier 0 proceed to Tier 1 + Tier 2 for manual investigation.
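The routing described above could be expressed as a small function. This is a sketch under stated assumptions: the function name and tier labels are hypothetical, and it assumes Tier 1 applies only when the reference carries a DOI, as the table's coverage column suggests.

```python
from typing import List


def remaining_tiers(tier0_passed: bool, has_doi: bool,
                    flagged: bool = False) -> List[str]:
    """Decide which tiers still run after the Tier 0 (S2 API) check."""
    tiers = []
    # Tier 1 covers every DOI-bearing reference (assumption: it cannot
    # run for references without a DOI).
    if has_doi:
        tiers.append("tier1_doi_resolution")
    # Tier 2 is skipped on a Tier 0 pass unless the reference was
    # flagged for other reasons; a Tier 0 failure always reaches it.
    if not tier0_passed or flagged:
        tiers.append("tier2_websearch")
    return tiers
```

For example, a reference that passes Tier 0 and has a DOI would only go through DOI resolution, while a Tier 0 failure without a DOI would go straight to the WebSearch check.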
Response Handling
On successful match
Record the following in the reference's verification audit trail:
- semantic_scholar_id: the S2 paper ID (e.g., "649def34f8be52c8b66281af98ae884c09aef38b")
- s2_title: returned title
- s2_authors: returned author list
- s2_year: returned year
- s2_venue: returned venue
- s2_citation_count: citation count (informational)
- match_score: Levenshtein similarity score
- verification_method: "s2_title_search" or "s2_doi_lookup"
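One way to hold these audit trail fields is a small dataclass populated from the API response. This is only a sketch: the class and function names are hypothetical, and it assumes the response carries `paperId`, `title`, `authors` (a list of objects with a `name` key), `year`, `venue`, and `citationCount`, matching the fields requested in the query patterns.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class S2AuditRecord:
    semantic_scholar_id: str
    s2_title: str
    s2_authors: List[str]
    s2_year: Optional[int]
    s2_venue: Optional[str]
    s2_citation_count: Optional[int]  # informational only
    match_score: float
    verification_method: str  # "s2_title_search" or "s2_doi_lookup"


def audit_record_from_paper(paper: Dict[str, Any], match_score: float,
                            verification_method: str) -> S2AuditRecord:
    """Map one S2 paper object onto the audit trail fields."""
    return S2AuditRecord(
        semantic_scholar_id=paper["paperId"],
        s2_title=paper.get("title") or "",
        s2_authors=[a.get("name", "") for a in paper.get("authors") or []],
        s2_year=paper.get("year"),
        s2_venue=paper.get("venue"),
        # citationCount is absent from the title-search field list.
        s2_citation_count=paper.get("citationCount"),
        match_score=match_score,
        verification_method=verification_method,
    )
```

Calling `asdict(record)` yields a plain dictionary that can be written into the reference's audit trail.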
On no match
- If 0 results with Levenshtein >= 0.70: classify as S2_NOT_FOUND
- S2_NOT_FOUND does NOT automatically mean fabrication: the paper may exist but not be indexed in Semantic Scholar (e.g., very recent, non-English, grey literature)
- Proceed to Tier 1 (DOI) and Tier 2 (WebSearch) for further investigation
- If ALL tiers fail: classify as NOT_FOUND per existing protocol
On API failure
- HTTP 429 (rate limit): back off 2 seconds, retry up to 3 times
- HTTP 5xx: skip S2 for this reference, proceed to Tier 1
- Network error: skip S2 entirely for remaining batch, log [S2-API-UNAVAILABLE]
- Never block the pipeline on S2 API failure; degrade gracefully to existing WebSearch-only verification
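The failure rules above might be wrapped around whatever HTTP client the agents use, roughly as follows. This is a hedged sketch: `fetch_with_policy` and `S2Unavailable` are hypothetical names, `fetch` is an injected callable returning `(status, body)`, and treating an exhausted 429 retry budget as "skip S2 for this reference" is an assumption, since the protocol does not say what happens after the third retry.

```python
import logging
import time
from typing import Any, Callable, Optional, Tuple

logger = logging.getLogger(__name__)

Response = Tuple[int, Any]


class S2Unavailable(Exception):
    """Network-level failure: the caller should skip S2 for the rest of the batch."""


def fetch_with_policy(fetch: Callable[[str], Response], url: str,
                      sleep: Callable[[float], None] = time.sleep,
                      max_retries: int = 3,
                      backoff_seconds: float = 2.0) -> Optional[Response]:
    """Return (status, body), or None when S2 should be skipped for this reference."""
    retries = 0
    while True:
        try:
            status, body = fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            logger.warning("[S2-API-UNAVAILABLE] %s", exc)
            raise S2Unavailable(str(exc)) from exc
        if status == 429 and retries < max_retries:
            retries += 1
            sleep(backoff_seconds)  # fixed 2-second back-off, then retry
            continue
        if status == 429 or 500 <= status < 600:
            return None  # skip S2 for this reference and proceed to Tier 1
        return status, body
```

A batch loop would catch `S2Unavailable` once, stop issuing S2 requests, and continue with the remaining tiers so the pipeline never blocks on the API.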
Deduplication via S2 ID
When two references resolve to the same semantic_scholar_id, flag as duplicate. The bibliography_agent uses this for deduplication during search (matching PaperOrchestra's approach of deduplicating via Semantic Scholar IDs).
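A minimal sketch of this duplicate check groups references by their resolved ID. The function name and the `key` field (a per-reference citation key) are assumptions for illustration; only `semantic_scholar_id` comes from the protocol.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def find_duplicates(references: Iterable[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each shared semantic_scholar_id to the reference keys that resolved to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ref in references:
        s2_id = ref.get("semantic_scholar_id")
        if s2_id:  # unresolved references cannot be compared by S2 ID
            groups[s2_id].append(ref["key"])
    return {s2_id: keys for s2_id, keys in groups.items() if len(keys) > 1}
```

References that never resolved to an S2 ID are left out of the comparison, so they would still need whatever deduplication the bibliography already performs.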
Cost and Performance
- API calls per paper: ~30-80 (one per reference, typical paper has 30-80 references)
- Time: At 1 req/sec (unauthenticated), 30-80 seconds for a full paper. With API key (10 req/sec): 3-8 seconds
- Cost: Free (Semantic Scholar API is free for academic use)
- Recommendation: Set S2_API_KEY for faster verification. Obtain a key from https://www.semanticscholar.org/product/api#api-key
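The timing figures above follow directly from the request rate, which in turn depends on whether S2_API_KEY is set. The sketch below shows one way to select the settings and estimate batch time. The function names are hypothetical; `x-api-key` is the request header the Semantic Scholar API uses for keys, and the rates are the ones stated in this document.

```python
import os
from typing import Any, Dict, Mapping


def s2_request_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Pick headers and request rate; fall back to unauthenticated if the key is unset."""
    api_key = env.get("S2_API_KEY")
    if api_key:
        return {"headers": {"x-api-key": api_key}, "requests_per_second": 10}
    return {"headers": {}, "requests_per_second": 1}


def estimated_seconds(reference_count: int, requests_per_second: int) -> float:
    """One API call per reference, issued at the permitted rate."""
    return reference_count / requests_per_second
```

With 80 references, `estimated_seconds(80, 1)` gives 80.0 seconds unauthenticated and `estimated_seconds(80, 10)` gives 8.0 seconds with a key, matching the ranges listed above.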
References
- Song, Y., Song, Y., Pfister, T., & Yoon, J. (2026). PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv preprint arXiv:2604.05018. — Section 4 Step 3 (Literature Review Agent), Appendix D.3 (Citation Verification).
- Semantic Scholar API documentation: https://api.semanticscholar.org/