qoder-config/skills/deep-research/references/semantic_scholar_api_protocol.md
aszerW f571b20598 feat(skills): add deep-research skill
Copy deep-research skill from local Qoder installation to config repo for version control
2026-06-06 13:22:55 +08:00

Semantic Scholar API Verification Protocol

  • Status: v3.3
  • Used by: source_verification_agent, bibliography_agent, integrity_verification_agent
  • API base: https://api.semanticscholar.org/graph/v1
  • Rate limit: 1 request/second (unauthenticated), 10 requests/second (with API key)
  • API key env var: S2_API_KEY (optional; graceful degradation if unset)


Purpose

Provides programmatic verification of reference existence and bibliographic accuracy using the Semantic Scholar Academic Graph API. This supplements (not replaces) WebSearch-based verification by adding a structured, API-grounded check that returns machine-readable metadata.

PaperOrchestra (Song et al., 2026) demonstrated that a two-phase citation pipeline — (1) broad discovery via web search, (2) sequential verification via Semantic Scholar API — achieves significantly higher citation coverage (P0 Recall +2-6%, P1 Recall +12-14% over baselines). ARS adopts the verification phase as an additional tier in the existing multi-tier verification strategy.


Query Patterns

Pattern 1: Title Search (primary)

GET /paper/search?query={url_encoded_title}&limit=5&fields=title,authors,year,externalIds,venue,publicationDate

Matching rule: Compute the Levenshtein similarity between the query title and each result title (case-insensitive, stripped of punctuation). Accept a result if its similarity is >= 0.70 (matching PaperOrchestra's threshold). If multiple results score >= 0.70, prefer the one with a matching year.
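
The title-matching rule can be sketched as follows. This is a minimal illustration, not the protocol's reference implementation: the protocol does not say how the Levenshtein distance is turned into a similarity, so dividing by the longer title's length is an assumption, and the helper names (`normalize_title`, `title_similarity`, `pick_match`) are hypothetical.

```python
import string
from typing import Optional

def normalize_title(title: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    stripped = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def title_similarity(query_title: str, result_title: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer title."""
    a, b = normalize_title(query_title), normalize_title(result_title)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

def pick_match(query_title: str, query_year: Optional[int],
               results: list, threshold: float = 0.70) -> Optional[tuple]:
    """Return (score, result) for the best acceptable result, or None."""
    scored = [(title_similarity(query_title, r.get("title") or ""), r) for r in results]
    candidates = [(s, r) for s, r in scored if s >= threshold]
    if not candidates:
        return None
    # Prefer a matching year first, then the higher similarity score.
    candidates.sort(key=lambda sr: (sr[1].get("year") == query_year, sr[0]), reverse=True)
    return candidates[0]
```

Here `results` stands for the list of paper objects returned by the title search; a return value of `None` corresponds to the no-match case.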

Pattern 2: DOI Lookup (when DOI is available)

GET /paper/DOI:{doi}?fields=title,authors,year,externalIds,venue,publicationDate,citationCount

Matching rule: DOI match is exact. Cross-check that the returned title matches the reference title (Levenshtein similarity >= 0.70). If the titles do not match even though the DOI resolves, flag the reference as DOI_MISMATCH, a known hallucination pattern in which a fabricated DOI resolves to an unrelated paper.
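
A small sketch of the DOI cross-check, assuming the title similarity function is supplied by the caller. The function name `check_doi_result` and the dictionary shape it returns are hypothetical; only the DOI_MISMATCH label and the 0.70 threshold come from the protocol, and "PASS" is used here as a placeholder success label.

```python
from typing import Callable

def check_doi_result(reference_title: str, s2_title: str,
                     similarity: Callable[[str, str], float],
                     threshold: float = 0.70) -> dict:
    """Cross-check the title returned for a DOI against the cited title."""
    score = similarity(reference_title, s2_title)
    status = "PASS" if score >= threshold else "DOI_MISMATCH"
    return {"status": status, "match_score": score}

# Stand-in similarity for demonstration only (exact match, ignoring case);
# a real run would pass a Levenshtein-based similarity instead.
def exact_match(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0

outcome = check_doi_result("A Cited Title", "An Unrelated Paper", exact_match)
print(outcome["status"])
```

Keeping the similarity function as a parameter lets the same check reuse whatever title-matching routine the title search already applies.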

Pattern 3: Semantic Scholar ID Lookup (for re-verification)

GET /paper/{paperId}?fields=title,authors,year,externalIds,venue,publicationDate,citationCount

Used when re-verifying a reference that was previously resolved to a Semantic Scholar ID (stored in the bibliography's semantic_scholar_id field).
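
The three query patterns map to request URLs roughly as sketched below. The function names are hypothetical, and leaving the DOI's slash unescaped in the path is an assumption about what the API accepts; the endpoints and field lists are the ones given above.

```python
from urllib.parse import quote, urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1"
SEARCH_FIELDS = "title,authors,year,externalIds,venue,publicationDate"
DETAIL_FIELDS = SEARCH_FIELDS + ",citationCount"

def title_search_url(title: str, limit: int = 5) -> str:
    """Pattern 1: title search. urlencode escapes the title; commas stay readable."""
    params = urlencode({"query": title, "limit": limit, "fields": SEARCH_FIELDS}, safe=",")
    return f"{S2_BASE}/paper/search?{params}"

def doi_lookup_url(doi: str) -> str:
    """Pattern 2: DOI lookup. quote() keeps '/' unescaped by default."""
    return f"{S2_BASE}/paper/DOI:{quote(doi)}?{urlencode({'fields': DETAIL_FIELDS}, safe=',')}"

def paper_id_url(paper_id: str) -> str:
    """Pattern 3: lookup by a previously stored Semantic Scholar ID."""
    return f"{S2_BASE}/paper/{quote(paper_id)}?{urlencode({'fields': DETAIL_FIELDS}, safe=',')}"

print(title_search_url("Attention Is All You Need"))
```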


Verification Tiers (Updated with S2 API)

| Tier | Method | Coverage | Purpose |
|---|---|---|---|
| Tier 0 (NEW) | Semantic Scholar API | 100% of references | Programmatic existence check + metadata extraction |
| Tier 1 | DOI resolution | 100% of DOI-bearing refs | URL-level existence check |
| Tier 2 | WebSearch spot-check | 50% of sources | Human-readable verification |

Execution order: Tier 0 first (batch, 1 req/sec). References that PASS Tier 0 skip Tier 2 unless flagged for other reasons. References that FAIL Tier 0 proceed to Tier 1 + Tier 2 for manual investigation.
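
One way to read the execution order is the routing function below. It is a sketch under two assumptions: Tier 1 runs for every DOI-bearing reference regardless of the Tier 0 outcome (following the coverage column), and a reference without a DOI cannot enter Tier 1. The function and tier label names are hypothetical, and the 50% Tier 2 sampling is not modeled.

```python
def remaining_tiers(tier0_passed: bool, has_doi: bool, flagged: bool = False) -> list:
    """Return the tiers a reference still has to go through after Tier 0."""
    tiers = []
    if has_doi:
        # Assumed: DOI resolution applies to all DOI-bearing references.
        tiers.append("tier1_doi_resolution")
    if not tier0_passed or flagged:
        # Tier 0 failures, and passes flagged for other reasons, go to WebSearch.
        tiers.append("tier2_websearch")
    return tiers

print(remaining_tiers(tier0_passed=True, has_doi=True))
print(remaining_tiers(tier0_passed=False, has_doi=False))
```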


Response Handling

On successful match

Record the following in the reference's verification audit trail:

  • semantic_scholar_id: the S2 paper ID (e.g., "649def34f8be52c8b66281af98ae884c09aef38b")
  • s2_title: returned title
  • s2_authors: returned author list
  • s2_year: returned year
  • s2_venue: returned venue
  • s2_citation_count: citation count (informational)
  • match_score: Levenshtein similarity score
  • verification_method: "s2_title_search" or "s2_doi_lookup"
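
The audit-trail fields listed above could be captured in a small record type like the one sketched here. The class name `S2AuditRecord` and the `from_s2_paper` helper are hypothetical; the response keys used (`paperId`, `title`, `authors[].name`, `year`, `venue`, `citationCount`) are assumed to follow the Semantic Scholar paper object.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional

ALLOWED_METHODS = {"s2_title_search", "s2_doi_lookup"}

@dataclass
class S2AuditRecord:
    semantic_scholar_id: str
    s2_title: str
    s2_authors: List[str]
    s2_year: Optional[int]
    s2_venue: Optional[str]
    s2_citation_count: Optional[int]  # informational; absent unless requested
    match_score: float
    verification_method: str

def from_s2_paper(paper: dict, match_score: float, method: str) -> S2AuditRecord:
    """Build an audit record from one paper object in an API response."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown verification method: {method}")
    return S2AuditRecord(
        semantic_scholar_id=paper["paperId"],
        s2_title=paper["title"],
        s2_authors=[author["name"] for author in paper.get("authors", [])],
        s2_year=paper.get("year"),
        s2_venue=paper.get("venue"),
        s2_citation_count=paper.get("citationCount"),
        match_score=match_score,
        verification_method=method,
    )
```

Calling `asdict(record)` yields a plain dictionary that can be appended to the reference's audit trail.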

On no match

  • If no result has a Levenshtein similarity >= 0.70: classify as S2_NOT_FOUND
  • S2_NOT_FOUND does NOT automatically mean fabrication — the paper may exist but not be indexed in Semantic Scholar (e.g., very recent, non-English, grey literature)
  • Proceed to Tier 1 (DOI) and Tier 2 (WebSearch) for further investigation
  • If ALL tiers fail: classify as NOT_FOUND per existing protocol

On API failure

  • HTTP 429 (rate limit): back off 2 seconds, retry up to 3 times
  • HTTP 5xx: skip S2 for this reference, proceed to Tier 1
  • Network error: skip S2 entirely for remaining batch, log [S2-API-UNAVAILABLE]
  • Never block the pipeline on S2 API failure — graceful degradation to existing WebSearch-only verification
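
The failure policy above can be sketched as a wrapper around an injected HTTP function. Everything named here (`fetch_with_policy`, `S2Unavailable`, the `http_get` callable returning a `(status, body)` pair) is a hypothetical illustration. Treating any other non-200 status as "skip S2 for this reference" is an assumption the protocol does not spell out.

```python
import logging
import time
from typing import Any, Callable, Optional, Tuple

class S2Unavailable(Exception):
    """Network-level failure: the caller should skip S2 for the rest of the batch."""

def fetch_with_policy(url: str,
                      http_get: Callable[[str], Tuple[int, Any]],
                      sleep: Callable[[float], None] = time.sleep,
                      max_retries: int = 3,
                      backoff_seconds: float = 2.0) -> Optional[Any]:
    """Return the response body, or None when S2 is skipped for this reference."""
    retries = 0
    while True:
        try:
            status, body = http_get(url)
        except OSError as exc:  # covers connection errors and urllib's URLError
            logging.warning("[S2-API-UNAVAILABLE] %s", exc)
            raise S2Unavailable(str(exc)) from exc
        if status == 200:
            return body
        if status == 429 and retries < max_retries:
            retries += 1
            sleep(backoff_seconds)  # fixed 2-second back-off, up to 3 retries
            continue
        # HTTP 5xx, exhausted 429 retries, or (assumed) any other status:
        # skip S2 for this reference and let the caller proceed to Tier 1.
        return None
```

Injecting `http_get` and `sleep` keeps the policy testable without network access; a caller that catches `S2Unavailable` can stop issuing S2 requests and fall back to WebSearch-only verification.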

Deduplication via S2 ID

When two references resolve to the same semantic_scholar_id, flag as duplicate. The bibliography_agent uses this for deduplication during search (matching PaperOrchestra's approach of deduplicating via Semantic Scholar IDs).
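
Duplicate detection by S2 ID amounts to grouping references on the stored identifier. In this sketch the `key` field (a per-reference citation key) and the function name are hypothetical; only `semantic_scholar_id` comes from the protocol.

```python
from collections import defaultdict

def find_duplicates(references: list) -> dict:
    """Map each shared semantic_scholar_id to the reference keys that resolve to it."""
    by_id = defaultdict(list)
    for ref in references:
        s2_id = ref.get("semantic_scholar_id")
        if s2_id:  # unresolved references cannot be deduplicated this way
            by_id[s2_id].append(ref["key"])
    return {s2_id: keys for s2_id, keys in by_id.items() if len(keys) > 1}

refs = [
    {"key": "ref_a", "semantic_scholar_id": "abc123"},
    {"key": "ref_b", "semantic_scholar_id": "def456"},
    {"key": "ref_c", "semantic_scholar_id": "abc123"},
    {"key": "ref_d", "semantic_scholar_id": None},
]
print(find_duplicates(refs))
```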


Cost and Performance

  • API calls per paper: ~30-80 (one per reference, typical paper has 30-80 references)
  • Time: At 1 req/sec (unauthenticated), 30-80 seconds for a full paper. With API key (10 req/sec): 3-8 seconds
  • Cost: Free (Semantic Scholar API is free for academic use)
  • Recommendation: Set S2_API_KEY for faster verification. Obtain from https://www.semanticscholar.org/product/api#api-key
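
The timing figures above follow from dividing the reference count by the request rate. A minimal sketch, assuming the rate is chosen solely by whether S2_API_KEY is set (the function names are hypothetical):

```python
import os
from typing import Mapping

def requests_per_second(env: Mapping[str, str] = os.environ) -> float:
    """10 req/sec when S2_API_KEY is set and non-empty, otherwise 1 req/sec."""
    return 10.0 if env.get("S2_API_KEY") else 1.0

def estimated_seconds(reference_count: int, rate: float) -> float:
    """Lower bound on batch time: one request per reference at the given rate."""
    return reference_count / rate

for count in (30, 80):
    print(count, estimated_seconds(count, 1.0), estimated_seconds(count, 10.0))
```

Retries and back-off after HTTP 429 responses add to these estimates.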

References

  • Song, Y., Song, Y., Pfister, T., & Yoon, J. (2026). PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing. arXiv preprint arXiv:2604.05018. — Section 4 Step 3 (Literature Review Agent), Appendix D.3 (Citation Verification).
  • Semantic Scholar API documentation: https://api.semanticscholar.org/