I.T. Spices The LINUX Way
Python In The Shell: The STEEMIT Ecosystem – Post #112
SCRAPING ALL BLOGS USING PYTHON – THE BLOG URL PREPS
Please refer to Post #110 for the complete python script and the intro of this series, link below:
https://steemit.com/blockchain/@lightingmacsteem/2rydxz-i-t-spices-the-linux-way
In this post we will be discussing the initial preparations needed to acquire the complete URL of a blog post. It may not be obvious until now, but we cannot scrape a website without its URL, and each blog post has one.
Lines 20 to 41 are the initial steps to acquire the blog post from which we can get the full URL of a particular blog:
20 ###GET BLOGS HERE
21 s = Steem('https://api.steemit.com')
22 baseurl = 'https://steemit.com/'
23 account = sys.argv[1]
24 start = sys.argv[2]
25 if int(start) == 10:
26     end = 11
27 elif int(start) > 10:
28     end = 10
29 elif int(start) < 10:
30     print()
31     print('START variable needs to be equal to or greater than 10 and be divisible by 10. Please '
32           'try again.......')
33     print()
34     sys.exit()
35 counter = int(start)
36 while int(counter) >= 0:
37     ###INDICATE THE COUNTER
38     print('\n' + 'COUNTER is now ' + str(counter))
39     flogs.write('\n' + 'COUNTER is now ' + str(counter))
40     for post in s.get_blog(account, int(start), int(end)):
41         post = Post(post["comment"])
Line 21 instructs the steem python module to get its blockchain data from the API address https://api.steemit.com. An API is like a gateway into the world of STEEM, and by STEEM I mean the steem blockchain where all data pertaining to it resides.
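Just as a quick illustration (a sketch only, not lines from the script; the import statement is my assumption based on the steem python module used in Post #110), the gateway setup on its own looks like this:

from steem import Steem

s = Steem('https://api.steemit.com')   # the API gateway of Line 21; every later call such as get_blog() is answered by this node
# the same object could be pointed at any other public STEEM API node instead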
Lines 22 to 24 prepare the text inputs for the next steem python command that acquires a certain blog post; it needs as its inputs (a sample shell invocation is sketched right after this list):
- A user account
- A start point (a number) indicating the position of the blog in the blockchain
- A count (a number) of decrements starting from the start point
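To make these inputs concrete, here is a hedged example of how the full script from Post #110 might be launched from the shell; the filename scrapeblogs.py is my own placeholder and the account name is just an example:

python3 scrapeblogs.py lightingmacsteem 100
# sys.argv[1] = 'lightingmacsteem' (the user account)
# sys.argv[2] = '100' (the start point; must be 10 or more and divisible by 10)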
Lines 25 to 34 check the given START value. A valid number lets the script continue; a value of exactly 10 (meaning the last 10 posts) makes sure the end count is set to 11 so that the count always exceeds the remaining blogs, which consistently retrieves all intended blog counts. A value of less than 10 cannot be processed, hence a message is displayed to at least give the executor the correct information before exiting.
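The same check can be pictured as a small standalone helper; this is a hypothetical refactor for illustration only, not code taken from the script:

import sys

def pick_end(start):
    # mirror of Lines 25 to 34: decide how many entries to request per cycle
    if start == 10:
        return 11   # last 10 posts: request 11 so the count exceeds what remains
    elif start > 10:
        return 10   # normal case: fetch in blocks of 10
    else:
        print()
        print('START variable needs to be equal to or greater than 10 and be divisible by 10. Please '
              'try again.......')
        print()
        sys.exit()

end = pick_end(100)   # returns 10, just like Lines 27 to 28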
Lines 35 to 41 are the actual lines that cycle through the counts of blogs, getting one unique blog at each cycle:
- Line 36 tells us that as long as the counter is not less than zero (greater than or equal to zero), the process of acquiring the blog posts continues
- Line 38 prints on the screen the counter currently being processed, a very useful piece of info
- Line 39 just writes into a log file whatever is displayed by Line 38
- Line 40 is the actual steem python command that acquires the blog posts using the account, the start number, and the end count
- Line 41 gets only the "comment" portion of the result from Line 40 (a sketch of how a full URL can be built from it follows this list)
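Since the whole point of this part is the URL prep, here is a hedged sketch of how the complete URL could later be assembled from each Post object of Line 41; the keys category, author and permlink are my assumptions about the Post dictionary, not lines taken from the script:

# hedged sketch only, reusing baseurl from Line 22
full_url = baseurl + post['category'] + '/@' + post['author'] + '/' + post['permlink']
print(full_url)   # e.g. https://steemit.com/<category>/@<account>/<permlink>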
These lines of code are in a loop, a while loop to be exact, which means that at the minimum count of 10 this runs quickly, but at a maximum of UNLIMITED (let us exaggerate a bit), just imagine the run if, 100 years from now, a certain user has already posted 500 thousand blogs.
So you get the idea now of how long it will take to scrape all blogs once the blockchain data of STEEMIT gets very huge.
As a side note, it may be worth mentioning my previous blog posts where I lay out a design concept for how the blockchain system can be expanded to LIMITLESS proportions, and securely. Links below if you want to review them:
https://steemit.com/witness-category/@lightingmacsteem/68lq5m-everything-under-the-stars
https://steemit.com/witness-category/@lightingmacsteem/zclks-everything-under-the-stars
Actually, that is how big data operates; we should expect all data written now to still be there a thousand years or more (maybe limitless years) from now, if we really want to be like God.
Just overstating, my bad.
“One Person, Limitless Possibilities, Every Second……. Not What, But Who Can Even Process That?”