Wikipedia Web Scraping

I saw a post some time ago on Reddit from someone who had built a website that tracks the spread of the coronavirus by pulling data from several different websites, and I wondered how he was able to grab that data. So, I tried it. It turns out you get the data by sorting through the page's HTML: you look through it and find the information you want using the id attributes that the site's creators put on their elements when they laid out the page.

I used a website whose design is extremely consistent and hasn't changed for years: Wikipedia. I first tried getting "Today's Featured Article" from the main page and printing its contents. That wasn't hard, except that it prints the HTML code rather than plain text, and I haven't tried converting it to just text yet. Next, I used Wikipedia's Special:Random link, which leads to a random page, and made the program read the heading of the article along with its main body.

Using the requests module (for downloading pages) and BeautifulSoup (for parsing the HTML and finding elements in it), plus a tutorial I found online, it wasn't too hard.

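import requests
from bs4 import BeautifulSoup


def wiki_main_page():
    # Download the main page and parse its HTML
    url = "https://en.wikipedia.org/wiki/Main_Page"
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # stop early if the request failed
    soup = BeautifulSoup(page.content, "html.parser")
    # "mp-tfa" is the id of the "Today's Featured Article" box
    results = soup.find(id="mp-tfa")
    print(results.prettify())


def wiki_random():
    # Special:Random redirects to a random article; requests follows the redirect
    url = "https://en.wikipedia.org/wiki/Special:Random"
    page = requests.get(url, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, "html.parser")
    # Article title and article body, found by their ids
    heading = soup.find(id="firstHeading")
    content = soup.find(id="mw-content-text")
    print(heading.prettify())
    print(content.prettify())


if __name__ == "__main__":
    wiki_main_page()
    wiki_random()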
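
The functions above print HTML because prettify() returns the markup itself. One way to get just the text, which I haven't wired into the scraper, is BeautifulSoup's get_text() method. The sketch below is only an illustration: the helper name element_text and the SAMPLE_HTML string are made up for this example, and it runs on a small hard-coded page rather than on Wikipedia.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Example <i>article</i></h1>
<div id="mw-content-text"><p>First paragraph.</p><p>Second   paragraph.</p></div>
</body></html>
"""


def element_text(html, element_id):
    """Return the plain text of the element with the given id, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        return None
    # get_text() drops the tags; the separator keeps adjacent pieces from running together
    text = element.get_text(separator=" ", strip=True)
    # Collapse any leftover runs of whitespace into single spaces
    return " ".join(text.split())


if __name__ == "__main__":
    print(element_text(SAMPLE_HTML, "firstHeading"))
    print(element_text(SAMPLE_HTML, "mw-content-text"))
```

In the scraper itself, the same idea would presumably mean calling heading.get_text() and content.get_text() instead of prettify(), though a real article body also contains tables, references, and edit links that would need more cleanup.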
