Link finder (Wikipedia)
Building on the web scraper I made earlier, this time I wrote a script that finds all the links on a given web page. Since Wikipedia pages contain a lot of links and I'd already written a scraping script for it, I built this one for Wikipedia as well. The script fetches the page as one large byte string containing all the HTML, decodes it into a string, and scans through it looking for: <a href=". Whenever it finds that, it copies everything between it and the next " as a link. Each link found this way is added to a list. Some of the links are shortened because they point to other locations on the same website, for example: /wiki/Wikipedia:Featured_articles. For these, I took the protocol and the site address (called path in the code) from the initial URL and put them in front, which gives this link: https://en.wikipedia.org//wiki/Wikipedia:Featured_articles
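import requests

def get_links():
    url = "https://en.wikipedia.org/wiki/Main_Page"

    # Collect everything before the first "//", e.g. "https:"
    protocol = ""
    for x in range(0, len(url) - 1):
        if url[x] == "/" and url[x+1] == "/":
            break
        protocol += url[x]
    print("protocol =", protocol)

    # Collect protocol plus site address, up to and including the next "/",
    # e.g. "https://en.wikipedia.org/"
    path = protocol + "//"
    for x in range(len(path), len(url)):
        if url[x] == "/":
            path += url[x]
            break
        path += url[x]
    print("path =", path)

    page = requests.get(url)
    results = page.content.decode('utf-8')

    # Scan the HTML for every <a href=" and copy the text up to the closing quote
    link_list = []
    marker = '<a href="'
    i = 0
    while i < len(results):
        if results.startswith(marker, i):
            i += len(marker)
            link = ""
            # Stop at the closing quote, or at the end of the page if there is none
            while i < len(results) and results[i] != '"':
                link += results[i]
                i += 1
            link_list.append(link)
        i += 1

    # Turn shortened links into full URLs
    for x in range(0, len(link_list)):
        linkvar = link_list[x]
        if linkvar[0:6] == "/wiki/" or linkvar[0:3] == "/w/":
            # path ends in "/" and linkvar starts with "/",
            # so the result has a doubled slash after the site address
            link_list[x] = path + linkvar
        elif linkvar[0:2] == "//":
            link_list[x] = protocol + linkvar
        print(link_list[x])

get_links()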
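
For comparison, here is a rough sketch of the same idea using only Python's standard library instead of scanning character by character. It is not the script above, just one possible alternative: html.parser.HTMLParser picks out the <a> tags (including ones where href is not the first attribute), and urllib.parse.urljoin fills in the protocol and site address for shortened links without doubling the slash. The names LinkCollector and find_links, and the sample HTML with its example.png address, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin resolves "/wiki/...", "//host/..." and full URLs
                # against the page the link was found on
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return all link targets in html as full URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Made-up sample HTML, so the sketch runs without a network connection
sample = (
    '<a href="/wiki/Wikipedia:Featured_articles">Featured</a>'
    '<a class="x" href="//upload.wikimedia.org/example.png">Image</a>'
    '<a href="https://www.wikidata.org/">Wikidata</a>'
)

for link in find_links(sample, "https://en.wikipedia.org/wiki/Main_Page"):
    print(link)
```

To run it on a real page, the decoded HTML from requests.get(url) could be passed to find_links together with the same url.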