I.T. Spices The LINUX Way
Python In The Shell: The STEEMIT Ecosystem – Post #114
SCRAPING ALL BLOGS USING PYTHON – LET US SCRUTINIZE THE BLOGS
Please refer to Post #110 for the complete python script and the intro of this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way
In this post we will discuss the actual examination of the blogs (known as web scraping) using Python modules designed for such automated tasks. We already have a full web URL; let us now use that URL to access each blog automatically and acquire whatever data we decide on.
Lines 66 to 72 hold the variables that matter most for a successful web scrape of each STEEMIT blog:
66
67 ###SCRAPE THE URL RIGHT AWAY
68 my_url = fullurl
69 headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}
70 open_url = requests.get(my_url, headers=headers)
71 html_url = open_url.content
72 soup = BeautifulSoup(html_url,"html.parser")
Line 68 is the full URL of the blog as discussed in the previous post; this Python program will access and literally “browse” that webpage. Imagine yourself typing a web URL to read some news, for example; the only difference here is that Python itself does the browsing automatically, and we just provide it the URL.
Line 69 is a custom User-Agent header, which “tricks” the website into treating the request as if a real person's browser sent it. The User-Agent is an identifier of sorts: as you can see, it proclaims itself to be MOZILLA (the Firefox browser) when in fact it is Python doing its dirty work. Again, I am just overstating: setting your own User-Agent is an ordinary, supported feature of the requests module.
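To see what the header on line 69 actually changes, here is a minimal sketch that builds a request without sending it and compares its User-Agent with the one requests would announce by default. The URL is only a placeholder I made up for illustration; the real script uses fullurl.

```python
import requests

# Placeholder URL for illustration only; the real script uses fullurl.
my_url = "https://example.com/some-blog-post"

# Same header as line 69 of the script.
headers = {'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'}

# What requests would announce if we did not override the header.
default_agent = requests.utils.default_user_agent()

# Build (but do not send) the request so we can inspect its headers.
prepared = requests.Request("GET", my_url, headers=headers).prepare()
sent_agent = prepared.headers["User-Agent"]

print("default:", default_agent)
print("sent:   ", sent_agent)
```

The default value starts with python-requests/, which some websites may treat differently from a regular browser; that is the reason for overriding it.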
Lines 70 and 71 are the requests Python module doing its job of actually opening the webpage using the supplied full URL and the header from line 69. Line 70 sends the request and stores the response in open_url; line 71 takes the raw content of that response (the HTML of the page, as bytes) and keeps it in html_url.
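The script as shown does not check whether the page was actually fetched. As a hedged sketch only, the same two steps could be wrapped in a small helper that gives up after a while and complains on an error response. The name fetch_html and the 10 second timeout are my own assumptions, not part of the original script.

```python
import requests


def fetch_html(url, user_agent, timeout=10):
    """Return the raw HTML bytes of url, or raise if the request failed.

    Hypothetical helper: the name and the timeout value are illustrative.
    """
    headers = {"user-agent": user_agent}
    # Same call as line 70, plus a timeout so a dead server cannot hang us.
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # Same as line 71: the body of the response as bytes.
    return response.content
```

Called with fullurl and the User-Agent string from line 69, it would return the same bytes that line 71 stores in html_url.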
Line 72 is the beautiful BeautifulSoup Python module doing its job of parsing the HTML of the web page. The raw HTML from line 71 is one long, messy blob of text; BeautifulSoup arranges it into a structured object (soup) that we can easily search and manipulate. As added info, please take note that HTML is designed to be read by computers (browsers), so we can view the BeautifulSoup module as a Computer-To-Human translator as far as the HTML code is concerned.
Hopefully I have explained it as simply as possible up to this point.
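To get a feel for what the soup object from line 72 lets us do, here is a small sketch that parses a tiny HTML snippet instead of a live page. The snippet is made up for illustration and is not the real markup of a STEEMIT blog.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for html_url; a real blog page is far longer.
sample_html = b"""
<html>
  <head><title>Sample Blog Post</title></head>
  <body>
    <h1>Hello STEEMIT</h1>
    <p>First paragraph.</p>
    <a href="https://steemit.com/@someuser">profile</a>
  </body>
</html>
"""

# Same call as line 72, using Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")

# Once parsed, pieces of the page can be picked out by tag name.
title = soup.title.string
heading = soup.find("h1").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(heading)
print(links)
```

Which tags to look for on the real blog pages depends on how STEEMIT builds its HTML, so the tag names here are only examples.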
Stay glued for more.
“Love Does Not Need A Translator As We Don’t Need Words To Express It.”