Wikipedia makes it easy to extract page content without scraping its web pages. There are three useful methods: raw wikitext via action=raw on the standard index.php entry point (/w/index.php), full HTML via the REST API (/api/rest_v1/), and plain text via the TextExtracts module of the MediaWiki Action API (/w/api.php).
Method 1: Raw wikitext via URL
Add action=raw to the query string of any Wikipedia article URL and you get the raw wikitext, the markup source Wikipedia stores internally:
# recommended: handles special characters in the title reliably
https://en.wikipedia.org/w/index.php?title=PAGE_TITLE&action=raw
# alternative: matches the browser URL, but special characters in the title can break it
https://en.wikipedia.org/wiki/PAGE_TITLE?action=raw
Replace PAGE_TITLE with the article title as it appears in the URL. For example, to get the raw source of the Python (programming language) article:
https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=raw
You can open this directly in a browser, or fetch it with curl:
curl "https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=raw"
The response is raw wikitext, which includes wiki markup like [[links]], {{templates}}, and ==headings==.
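For scripted use, the same request can be sketched in Python. This is a minimal illustration rather than an official client: the function names raw_wikitext_url and fetch_raw_wikitext and the User-Agent string are made up for this example, and it assumes the index.php form of the URL shown above. Passing the title through urlencode is what keeps characters such as + or & from breaking the query string.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

INDEX_PHP = "https://en.wikipedia.org/w/index.php"


def raw_wikitext_url(title: str) -> str:
    """Build the action=raw URL for an article title."""
    # Wikipedia URLs use underscores where the displayed title has spaces.
    params = {"title": title.replace(" ", "_"), "action": "raw"}
    # urlencode percent-encodes characters such as "+", "&" and "(".
    return f"{INDEX_PHP}?{urlencode(params)}"


def fetch_raw_wikitext(title: str) -> str:
    """Download the wikitext for a title (performs a network request)."""
    # Hypothetical User-Agent; replace it with one that identifies your script.
    request = Request(
        raw_wikitext_url(title),
        headers={"User-Agent": "example-script/0.1"},
    )
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling raw_wikitext_url("C++") gives a URL with title=C%2B%2B, which is the kind of title the plain browser-style URL can mishandle.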
Method 2: Wikipedia REST API
The Wikipedia REST API returns the full article as HTML:
https://en.wikipedia.org/api/rest_v1/page/html/PAGE_TITLE
For example:
curl "https://en.wikipedia.org/api/rest_v1/page/html/Python_(programming_language)"
This returns the rendered HTML of the article, which you can then parse with a tool like pup or a library like BeautifulSoup.
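As a rough sketch of the parsing step, the Python below builds the REST URL and pulls section headings out of an HTML string. It uses the standard library's html.parser instead of BeautifulSoup only to stay dependency-free. The names rest_html_url, HeadingCollector, and extract_headings are hypothetical, and the assumption that top-level section headings appear as h2 elements should be checked against the HTML you actually receive.

```python
from html.parser import HTMLParser
from urllib.parse import quote

REST_HTML_BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"


def rest_html_url(title: str) -> str:
    """Build the REST API URL for an article's HTML."""
    # The title is a single path segment, so encode everything, including "/".
    return REST_HTML_BASE + quote(title.replace(" ", "_"), safe="")


class HeadingCollector(HTMLParser):
    """Collects the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._in_h2 = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <i>) is still part of the heading.
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def extract_headings(html: str) -> list:
    """Return the text of all <h2> headings found in an HTML string."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings
```

The HTML passed to extract_headings would be the body of the curl response above; the sketch deliberately leaves the download step out so it can be tried on any saved file.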
Method 3: TextExtracts API
To get the entire article as plain text with no markup or HTML, use the TextExtracts API with explaintext=1. The command below pipes the JSON response through jq, a command-line JSON processor; install it first if you don’t have it:
curl "https://en.wikipedia.org/w/api.php?action=query&titles=Python_(programming_language)&prop=extracts&explaintext=1&format=json" | jq -r '.query.pages[].extract'
The -r flag outputs raw text instead of a quoted JSON string. The filter .query.pages[].extract navigates the response structure:
.query - top-level key
.pages - map of page objects keyed by page ID
[] - iterates over all values in that map (there is only one here)
.extract - the plain-text field
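The same navigation can be sketched in Python for cases where jq is not available. This is an illustrative equivalent of the filter, not part of the TextExtracts API itself: extracts_url and plain_text_extracts are hypothetical names, and the response shape is assumed to be the one the filter above relies on (query, then pages keyed by page ID, then extract).

```python
import json
from urllib.parse import urlencode

API_PHP = "https://en.wikipedia.org/w/api.php"


def extracts_url(title: str) -> str:
    """Build the TextExtracts query URL used in the curl command above."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
    }
    return f"{API_PHP}?{urlencode(params)}"


def plain_text_extracts(response_body: str) -> list:
    """Mirror the jq filter .query.pages[].extract on a JSON response body."""
    data = json.loads(response_body)
    pages = data["query"]["pages"]  # map of page objects keyed by page ID
    # Unlike the jq filter, pages without an "extract" field (for example a
    # missing title) are skipped here instead of producing null.
    return [page["extract"] for page in pages.values() if "extract" in page]
```

For a single-title query the returned list has one element, so plain_text_extracts(body)[0] is the article text.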
Comparison
| Goal | Method |
|---|---|
| Full article source | Raw wikitext (action=raw) |
| Full article as HTML | Wikipedia REST API |
| Full article as plain text | TextExtracts API |