Link finder (Wikipedia)

Building on the web scraper I made earlier, this time I wrote a script that finds all the links on a given web page. Since Wikipedia pages contain a lot of links and I had already written a web scraping script for it, I built this one for Wikipedia too. The script gets the page as one big byte string containing all the HTML, decodes it into a regular string, and goes through it looking for: <a href=". Whenever it finds that, it takes everything between it and the next " as a link and adds it to a list. Some of the links are shortened because they point to other locations on the same website, for example: /wiki/Wikipedia:Featured_articles. For these, I take the protocol and host from the initial URL (the script stores the protocol-plus-host part in a variable called path) and put them at the front to make the full link: https://en.wikipedia.org/wiki/Wikipedia:Featured_articles. Links that start with // already include a host, so they only need the protocol in front.

import requests


def get_links():
    url = "https://en.wikipedia.org/wiki/Main_Page"

    # Protocol: everything before the first "//", e.g. "https:"
    protocol = url[:url.find("//")]
    print("protocol =", protocol)

    # Protocol plus host, with no trailing slash, e.g. "https://en.wikipedia.org"
    # (a trailing slash here would produce "//wiki/..." when joined below)
    host_end = url.find("/", len(protocol) + 2)
    if host_end == -1:
        host_end = len(url)
    path = url[:host_end]
    print("path =", path)

    page = requests.get(url)
    results = page.content.decode("utf-8")

    # Collect whatever sits between each '<a href="' and the next '"'
    marker = '<a href="'
    link_list = []
    i = results.find(marker)
    while i != -1:
        start = i + len(marker)
        end = results.find('"', start)
        if end == -1:
            # Unterminated attribute: stop instead of running off the end
            break
        link_list.append(results[start:end])
        i = results.find(marker, end)

    # Turn shortened links into full URLs
    for x in range(len(link_list)):
        linkvar = link_list[x]
        if linkvar.startswith("/wiki/") or linkvar.startswith("/w/"):
            link_list[x] = path + linkvar
        elif linkvar.startswith("//"):
            link_list[x] = protocol + linkvar
        print(link_list[x])
    return link_list


if __name__ == "__main__":
    get_links()
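
For comparison, the same idea can be sketched with Python's standard library instead of scanning the string by hand. This is only an illustration, not part of the script above: the names LinkCollector and find_links and the sample HTML are made up for the example. It assumes html.parser is acceptable for picking out <a> tags and uses urllib.parse.urljoin to fill in the missing protocol or host on shortened links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves "/wiki/...", "//host/..." and full URLs
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return every link in the HTML string as a full URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up sample standing in for a downloaded page
sample = (
    '<a href="/wiki/Wikipedia:Featured_articles">Featured</a>'
    '<a class="x" href="//commons.wikimedia.org/wiki/Main_Page">Commons</a>'
    '<a href="https://www.wikidata.org/">Wikidata</a>'
)

for link in find_links(sample, "https://en.wikipedia.org/wiki/Main_Page"):
    print(link)
```

Because the parser looks at attributes by name, this version also picks up links where href is not the first attribute of the tag, which the plain string search would skip.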
