We have our SEO tool bot class for our Python Selenium bot, and we have defined a few functions so far. I've found that this method of expanding the class one function at a time works well for my workflow: the bot performs small actions, one chunk of code at a time. It's also a nice technique for debugging, since it lets the programmer track faulty code down to a particular function block. Once we learn to pass local variables through the class, expanding it becomes trivial.
Let's get this bot to earn its electricity: we are going to scrape some data. Since most people tend to scrape wikis, let's scrape Wikipedia (at the time of writing, I believe this is OK under their terms of service). As before, let's import the Selenium WebDriver and define our current class structure. This is what we should have so far.
<pre lang="python" line="1">
print("*" * 60)
print("MI PYTHON COM SEO TOOL BOT")
print("*" * 60)

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

class SEO_BOT(object):

    def __init__(self, browser, anon_url):
        self.browser = browser
        self.anon_url = anon_url
        self.current_url = None

    def main(self):
        pass

    def go_anon(self):
        ### SELENIUM IMPLICIT WAIT !!! IMPORTANT !!!
        self.browser.implicitly_wait(300)
        print("GETTING " + str(self.anon_url))
        ### GO TO ANON URL
        self.browser.get(self.anon_url)

    def get_location(self):
        self.current_url = self.browser.current_url
        print("*" * 60)
        print("CURRENT LOCATION:")
        print(str(self.current_url))
        print("*" * 60)

bot = SEO_BOT(webdriver.Firefox(), "https://www.kproxy.com")
bot.get_location()
bot.go_anon()
bot.get_location()
</pre>
So far this Python script invokes the Selenium WebDriver, opens an anonymous redirector (Kproxy), then gets the current URL from the WebDriver. As before, we need to define more variables to pass through the class.
<pre lang="python" line="1">
class SEO_BOT(object):

    def __init__(self, browser, anon_url, scrape_page):
        self.browser = browser
        self.anon_url = anon_url
        self.current_url = None
        self.scrape_page = scrape_page
        self.scrape_html = None
</pre>
Above we assigned two new class variables. The scrape_page variable holds the page we are going to scrape, and it is set when the class is instantiated. The scrape_html variable starts as None and will store the HTML once it has been scraped.
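This pattern of passing values through __init__ is what makes expanding the class trivial: every method can read the attributes via self without extra arguments. Here's a minimal sketch of the idea, independent of Selenium (the BotSketch class and its describe method are made up for illustration):

```python
# Minimal sketch (no Selenium required) of the pattern used above:
# values passed to __init__ become instance attributes, so every
# method in the class can read them via self.
class BotSketch(object):

    def __init__(self, anon_url, scrape_page):
        self.anon_url = anon_url        # set at instantiation
        self.scrape_page = scrape_page  # set at instantiation
        self.scrape_html = None         # filled in later by a method

    def describe(self):
        # Any method can use the attributes without extra arguments.
        return "proxy=" + self.anon_url + " target=" + self.scrape_page

bot = BotSketch("https://www.kproxy.com", "http://www.wikipedia.com")
print(bot.describe())
```

Each new function we add to SEO_BOT works the same way: it reaches for self.scrape_page or self.scrape_html instead of taking its own parameters.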
Now let's define a scrape function so we can put this bot to work. For now, to keep things simple, we are just scraping the entire HTML source; we will parse it later.
<pre lang="python" line="1">
    def scrape(self):
        self.browser.implicitly_wait(300)
        print("FINDING ELEMENTS ON " + self.scrape_page + " TO SCRAPE")
        #self.scrape_html = self.browser.page_source.encode('utf-8')
        self.scrape_html = self.browser.page_source.encode('ascii', 'ignore')
        print("SCRAPED " + str(self.scrape_page))
        print(str(self.scrape_html))
        return str(self.scrape_html)
</pre>
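We'll do real parsing in a later section, but as a quick preview, Python's standard-library HTMLParser can pull useful pieces out of raw page source like the HTML we just grabbed. The sample_html string and the TitleParser class below are made up for illustration:

```python
from html.parser import HTMLParser

# Tiny parsing preview: extract the <title> text from raw HTML.
class TitleParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for the scraped page source, made up for illustration.
sample_html = "<html><head><title>Wikipedia</title></head><body></body></html>"

parser = TitleParser()
parser.feed(sample_html)
print(parser.title)  # Wikipedia
```

In practice you would feed it the string returned by scrape() instead of sample_html.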
Sometimes the data doesn't encode cleanly, so one option is to pass 'ascii' with the 'ignore' error handler to the encode method, which silently drops any characters that can't be represented in ASCII.
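To see the difference between the two encode calls in the scrape function, here's a small standalone comparison (the page_text string is made up for illustration; real Wikipedia source behaves the same way):

```python
# Wikipedia pages often contain non-ASCII characters. Encoding to
# utf-8 keeps everything as multi-byte sequences, while 'ascii' with
# the 'ignore' error handler silently drops the offending characters.
page_text = "café naïve résumé"

# utf-8 preserves the accented characters:
print(page_text.encode('utf-8'))

# 'ascii' with 'ignore' drops them entirely:
print(page_text.encode('ascii', 'ignore'))  # b'caf nave rsum'
```

Note that 'ignore' loses data; it's a convenience for quick scraping, not something you'd want when the accented text matters.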
Since we already have a nifty anonymous-redirect function in our Python Selenium SEO tool bot, let's forward our Python scraping bot through that service. This keeps a few complications down when automated bot testing goes awry. We need to add a few things to our Selenium anonymous-redirect function. First, we need to find the element locations for our WebDriver. As before, we inspect the element with Chrome DevTools or Firebug; below are the element IDs for the Kproxy fields. We use the Selenium WebDriver find_element_by_id method to find the element, then submit our scrape-page URL to Kproxy for redirection. (Note: the find_element_by_* helpers were removed in Selenium 4 in favor of find_element(By.ID, ...), so this code targets older Selenium releases.)
<pre lang="python" line="1">
    def go_anon(self):
        ### SELENIUM IMPLICIT WAIT !!! IMPORTANT !!!
        self.browser.implicitly_wait(300)
        print("GETTING " + str(self.anon_url))
        ### GO TO ANON URL
        self.browser.get(self.anon_url)
        ### FIND ELEMENTS AND POST SCRAPE PAGE URL
        self.browser.find_element_by_id("maintextfield").clear()
        print("CLEARED maintextfield")
        self.browser.find_element_by_id("maintextfield").send_keys(self.scrape_page)
        print("POSTING " + str(self.scrape_page) + " TO " + str(self.anon_url))
        self.browser.find_element_by_id("maintextfield").submit()
        print("REDIRECTING TO " + str(self.scrape_page))
</pre>
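The clear → send_keys → submit sequence is the core of the function above. To see it without launching a browser, here's a sketch using a stand-in element object (FakeElement is made up for illustration and just records the calls; it is not part of Selenium):

```python
# A stand-in element that records calls, to illustrate the
# clear -> send_keys -> submit sequence that go_anon performs.
# (FakeElement is made up for illustration; it is not part of Selenium.)
class FakeElement(object):

    def __init__(self):
        self.calls = []

    def clear(self):
        self.calls.append("clear")

    def send_keys(self, text):
        self.calls.append("send_keys:" + text)

    def submit(self):
        self.calls.append("submit")

elem = FakeElement()
elem.clear()                                # empty the text field first
elem.send_keys("http://www.wikipedia.com")  # type the target URL
elem.submit()                               # submit the form
print(elem.calls)
```

With a real Selenium element the same three calls clear the Kproxy text field, type the target URL into it, and submit the form.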
Now the bot should go to Kproxy, get redirected to the wiki, scrape the wiki, and print the raw HTML to the terminal.
The final code for this section should look like this.
<pre lang="python" line="1">
print("*" * 60)
print("MI PYTHON COM SEO TOOL BOT")
print("*" * 60)

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

class SEO_BOT(object):

    def __init__(self, browser, anon_url, scrape_page):
        self.browser = browser
        self.anon_url = anon_url
        self.scrape_page = scrape_page
        self.scrape_html = None
        self.current_url = None

    def main(self):
        pass

    def go_anon(self):
        ### SELENIUM IMPLICIT WAIT !!! IMPORTANT !!!
        self.browser.implicitly_wait(300)
        print("GETTING " + str(self.anon_url))
        ### GO TO ANON URL
        self.browser.get(self.anon_url)
        ### FIND ELEMENTS AND POST SCRAPE PAGE URL
        self.browser.find_element_by_id("maintextfield").clear()
        print("CLEARED maintextfield")
        self.browser.find_element_by_id("maintextfield").send_keys(self.scrape_page)
        print("POSTING " + str(self.scrape_page) + " TO " + str(self.anon_url))
        self.browser.find_element_by_id("maintextfield").submit()
        print("REDIRECTING TO " + str(self.scrape_page))

    def get_location(self):
        self.current_url = self.browser.current_url
        print("*" * 60)
        print("CURRENT LOCATION:")
        print(str(self.current_url))
        print("*" * 60)

    def scrape(self):
        self.browser.implicitly_wait(300)
        print("FINDING ELEMENTS ON " + self.scrape_page + " TO SCRAPE")
        #self.scrape_html = self.browser.page_source.encode('utf-8')
        self.scrape_html = self.browser.page_source.encode('ascii', 'ignore')
        print("SCRAPING " + str(self.scrape_page))
        print(str(self.scrape_html))
        print(str(self.scrape_page) + " HAS BEEN SCRAPED")
        return str(self.scrape_html)

bot = SEO_BOT(webdriver.Firefox(), "https://www.kproxy.com", "http://www.wikipedia.com")
bot.get_location()
bot.go_anon()
bot.get_location()
bot.scrape()
</pre>
Next time, we'll actually do something with the data from our Python Selenium SEO tool bot.