Wikipedia makes it easy to extract page content without scraping the rendered site. Three methods cover most needs: raw wikitext via action=raw on the standard index.php entry point (/w/index.php), full HTML via the REST API (/api/rest_v1/), and plain text via the TextExtracts module of the MediaWiki Action API (/w/api.php).

Method 1: Raw wikitext via URL

Add action=raw to the query string of any Wikipedia article URL and you get the raw wikitext, the markup source Wikipedia stores internally:

# recommended: the title is an ordinary query parameter, so standard percent-encoding handles special characters
https://en.wikipedia.org/w/index.php?title=PAGE_TITLE&action=raw

# alternative: matches the browser URL, but a literal ? in the title must be written as %3F or it starts the query string early
https://en.wikipedia.org/wiki/PAGE_TITLE?action=raw

Replace PAGE_TITLE with the article title as it appears in the article's URL, with underscores in place of spaces. For example, to get the raw source of the Python (programming language) article:

https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=raw

You can open this directly in a browser, or fetch it with curl:

curl "https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=raw"
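For scripted use, the same URL can be built with the title percent-encoded, so that characters such as &, ? and parentheses cannot be mistaken for URL syntax. The following is a minimal Python sketch, not an official client: the helper names raw_wikitext_url and fetch_text and the User-Agent string are assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE = "https://en.wikipedia.org/w/index.php"


def raw_wikitext_url(title: str) -> str:
    """Build the action=raw URL for an article title (hypothetical helper)."""
    # Wikipedia URLs use underscores where the displayed title has spaces.
    params = {"title": title.replace(" ", "_"), "action": "raw"}
    # urlencode percent-encodes characters such as ( ) & ? in the title.
    return f"{BASE}?{urlencode(params)}"


def fetch_text(url: str) -> str:
    """Download a URL and decode the body as UTF-8 (performs a network request)."""
    # Placeholder User-Agent (an assumption): replace it with one that
    # identifies your own script and contact address.
    request = Request(url, headers={"User-Agent": "example-script/0.1 (you@example.org)"})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling fetch_text(raw_wikitext_url("Python (programming language)")) should then return the article's wikitext as a single string.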

The response is raw wikitext, which includes wiki markup like [[links]], {{templates}}, and ==headings==.
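To give a feel for working with that markup, here is a rough Python sketch that pulls link targets and section headings out of a wikitext string. It relies on two simple regular expressions and is not a full wikitext parser: templates, nested links, and other constructs are ignored. The function names and the sample text are made up for illustration.

```python
import re

# [[Target]] or [[Target|label]]; only the target part is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")
# ==Heading== through ======Heading======, each alone on its line.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)


def link_targets(wikitext: str) -> list[str]:
    """Return the target of every simple [[...]] link, in document order."""
    return [target.strip() for target in WIKILINK.findall(wikitext)]


def headings(wikitext: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs, where level is the number of = signs."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING.finditer(wikitext)]


# Made-up sample, not taken from a real article.
sample = (
    "==History==\n"
    "[[Guido van Rossum]] began work on [[Python (programming language)|Python]].\n"
    "=== Typing ===\n"
    "{{Main|Type system}}\n"
)
print(link_targets(sample))
print(headings(sample))
```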

Method 2: Wikipedia REST API

The Wikipedia REST API returns the full article as HTML:

https://en.wikipedia.org/api/rest_v1/page/html/PAGE_TITLE

For example:

curl "https://en.wikipedia.org/api/rest_v1/page/html/Python_(programming_language)"

This returns the rendered HTML of the article, which you can then parse with a tool like pup or a library like BeautifulSoup.
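As one illustration of the parsing step, the sketch below collects the text of each paragraph from an HTML string. It uses Python's standard-library html.parser so that it has no dependencies; BeautifulSoup would likely be more convenient for anything beyond this. The names rest_html_url, ParagraphText and paragraphs are hypothetical, and the URL helper assumes the title should be percent-encoded as a single path segment.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def rest_html_url(title: str) -> str:
    """Build the REST API HTML URL (hypothetical helper)."""
    # The title is a path segment here, so "/" is percent-encoded as well.
    encoded = quote(title.replace(" ", "_"), safe="")
    return "https://en.wikipedia.org/api/rest_v1/page/html/" + encoded


class ParagraphText(HTMLParser):
    """Collect the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._parts = []

    def flush(self):
        # Store the paragraph gathered so far, skipping empty ones.
        text = "".join(self._parts).strip()
        if text:
            self.paragraphs.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._parts.append(data)


def paragraphs(html: str) -> list[str]:
    """Return the text of every paragraph in the given HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    parser.flush()  # covers a final <p> that was never closed
    return parser.paragraphs
```

Text inside nested tags such as links is kept, so footnote markers and similar inline elements will appear in the output unless they are filtered separately.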

Method 3: TextExtracts API

To get the entire article as plain text with no markup or HTML, use the TextExtracts API with explaintext=1. The command below pipes the JSON response through jq, a command-line JSON processor; install it first if you don’t have it.

curl -s "https://en.wikipedia.org/w/api.php?action=query&titles=Python_(programming_language)&prop=extracts&explaintext=1&format=json" | jq -r '.query.pages[].extract'

The -s flag silences curl's progress meter, which would otherwise be printed alongside the piped output. jq's -r flag outputs raw text instead of a quoted JSON string. The filter .query.pages[].extract navigates the response structure:

  • .query - top-level key
  • .pages - map of page objects keyed by page ID
  • [] - iterates over all values in that map (there is only one here)
  • .extract - the plain-text field
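The same navigation can be sketched in Python for cases where jq is not available. This is an illustrative equivalent of the filter, not part of any Wikipedia tooling: the function names, the page ID 12345, and the sample response are invented for the example.

```python
import json
from urllib.parse import urlencode


def extracts_url(title: str) -> str:
    """Build the TextExtracts query URL used above (hypothetical helper)."""
    params = {
        "action": "query",
        "titles": title.replace(" ", "_"),
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)


def extracts(response_body: str) -> list[str]:
    """Python counterpart of the jq filter .query.pages[].extract."""
    pages = json.loads(response_body)["query"]["pages"]  # map keyed by page ID
    # Pages without an "extract" field are skipped here, whereas jq
    # would print null for them.
    return [page["extract"] for page in pages.values() if "extract" in page]


# Invented response with the same shape as the real one.
sample_body = json.dumps(
    {"query": {"pages": {"12345": {"pageid": 12345, "title": "Example", "extract": "Example text."}}}}
)
print(extracts(sample_body))
```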

Comparison

Goal                          Method
Full article source           Raw wikitext (action=raw)
Full article as HTML          Wikipedia REST API
Full article as plain text    TextExtracts API