Adapting the Google PageRank Algorithm for Steem Content Discovery

in #steemdev, 6 years ago (edited)



When Google started to dominate the search engine scene in the late 90s, one contributing factor was their new PageRank algorithm. There are many explanations of how it works, including here and here on Steemit and on Wikipedia, so I won't go through it again.

Anyway, for a while I've been considering how a similar algorithm might be employed on our open social network to help us discover good content without utilising votes, which have been largely corrupted by the use of bidbots. Having spent the last few days reconfiguring some of the data that I collect with the Sincerity project, I now have a very basic experimental alternative trending view here:

https://steem-sincerity.dapptools.info/s/api/trending-1

It looks at the resteems given to posts, the stake of the account resteeming and the number of resteems that account is making. So, for example, somebody who resteems 100 posts a day is diluting their resteem 'votes' so they aren't worth much, and resteems from an account with 100,000 SP are given more weight (though not proportionally more) than those from an account with 100 SP.
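
To make that weighting concrete, here is a minimal Python sketch of this kind of scoring. It is an illustration only, not the Sincerity code: the function name `resteem_scores`, the `log10` stake curve (one possible way to give more stake more weight "though not proportionally more") and the divide-by-resteem-count dilution are all assumptions, as are the sample accounts and post IDs.

```python
import math
from collections import Counter, defaultdict


def resteem_scores(resteems, stake_sp):
    """Score posts from (account, post_id) resteem pairs seen in one time window.

    resteems: list of (account, post_id) tuples.
    stake_sp: dict mapping account -> Steem Power (SP).
    """
    # How many resteems each account made in the window.
    resteems_per_account = Counter(account for account, _ in resteems)

    scores = defaultdict(float)
    for account, post_id in resteems:
        # Assumed sub-linear stake weight: log10(1 + SP).
        stake_weight = math.log10(1.0 + stake_sp.get(account, 0.0))
        # Dilution: an account's weight is shared across all of its resteems.
        scores[post_id] += stake_weight / resteems_per_account[account]
    return dict(scores)


# Hypothetical data: one large and one small account resteem a single post
# each, while a third large account resteems 100 different posts.
resteems = [("whale", "post-a"), ("minnow", "post-b")]
resteems += [("spammer", f"post-{i}") for i in range(100)]
stake_sp = {"whale": 100_000, "minnow": 100, "spammer": 100_000}

scores = resteem_scores(resteems, stake_sp)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[:3])
```

With these made-up numbers the single resteem from the 100,000 SP account ranks first, the 100 SP account's resteem second, and each of the 100 diluted resteems scores one hundredth of what that account's single resteem would have been worth.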

EDIT: Having tried a simpler stake-weighted resteem view, I thought I'd include it here too:
https://steem-sincerity.dapptools.info/s/api/trending-2
(this actually seems to work better at the moment, but I hope to refine the work over the coming days)


I'm collecting resteems and also follows into the Sincerity database, so depending on support this might develop into something quite interesting, but it is a very basic interface and formula at the moment. I'd appreciate any feedback on the quality (or otherwise) of the results, as it's all fairly subjective.


This is really clever, Andy. I've been reading the #nobidbots and #nobidbot tags instead of Trending, but if we have an alternative that captures all posts, and ranks them by resteems by heavy accounts with infrequent resteems, that should give us a more organic Trending page; at least for a while.

Thanks! I'm quite pleased with the results given how little data and code is actually used for this. I think there's quite a lot of scope for further improvement.

What language is the code written in? I'd love to check it out. Have you considered making it available under an open source license?

It's Python code. I will probably open source it once it's a bit more developed.

It might be challenging to weed out competitions which require 'resteem to enter'. If a steemian enters only one of those a week and has a reasonable balance, and if there are many such steemians playing, that'll bump the comp up your list quite heavily.

That's true. It also benefits from being unknown at the moment. Only if I can get it to a point where a significant number of people use it will we see how robust it is against abuse.

This sounds like a great idea. Most people with high SP only tend to resteem content they genuinely want to share with their followers.

Look forward to your updates on this project

@kabir88

Great work. I think more forays into search engine research and development should help a lot in connecting content consumers with content creators. And the way you are referencing PageRank is a promising start.

Hopefully more developers look into solving the current problems of content discovery as I feel that a good solution to the problem would be more valuable to the ecosystem than another generic application. Steem is like the internet prior to its indexing. There's so much potential in finding ways to map and span the network in new and innovative ways.

Thanks. I totally agree. The Hivemind DB should provide better indexing, and help a bit with effective front-ends when it arrives too, but we don't know when that will be.

Very interesting. I suspect that it might be thrown off a bit by accounts that use bid bots and artificially inflate their reputation and SP, but I don't think it will be significantly affected unless someone created a botnet that bought votes on their posts and resteemed to the botnet.

It doesn't use account reputations, so there shouldn't be a problem there. Quite a few accounts using bidbots for profit will be powering down their SP, but yes, that could be a problem. The algorithm is somewhat resistant to network attacks by being stake-weighted.

Oh...I guess I misread then. Still quite interested in how it turns out! Good luck!

Sounds interesting... I remember well being a "real" content creator at a time when Matt Cutts was the "anti-spam God" at Google and they were constantly tweaking rankings.

If you want to be ahead of the curve here, the very beginnings of "spun content" are starting to show up here... and there has to be some way to detect it.

I like the idea of a "clean" trending based around authentic user reputations and some kind of "engagement score," as well.

That's clever; however, wouldn't people just start making bots that resteem instead of upvoting? Or they'd have bots for both.

As someone with under 100 SP who is still trying to grow my account, I would struggle to get noticed and be pushed further down into the darkness of nothingness.

I think we need to host powerup sessions or something where smaller active users can gain bonus Steem or SP; this would help grow the platform.

Also it's relatively easy to burn through your upvotes.

Minnows should be able to upvote more posts handing out more 1c.

Once you reach a certain limit, those upvote curations should reduce, making Steemit more equitable and more democratic socialist in application.

The more resteems an account does, the weaker these are in the algorithm, so resteem bots should be less of a problem than it seems.

Google's page rank algorithm also went one step further and added weights if the page which had the back links (and hence the ranking) was itself back linked by some other page.

For example, consider a page P0 which is backlinked by P1. Let us say that P1 contains around 20 outgoing links. If P1 is itself backlinked by P2, then P0 would get additional weight to the tune of 1/20 of what P1 gains from P2.

In our case, since a user cannot be resteemed and only posts can, is it worth doing the following:

  1. figure out if the resteemer's blog is referred to in any post (@xxx). in this case, we can do what was done by Google within a 7 day window
  2. if account U1 makes a post and U2 resteems that post, find out how many of U2's followers resteemed U1's post. correct the weight based on that metric
  3. of course combine 1&2 with comments made, upvotes on those comments and replies to those comments!
  4. what about external readers of resteemed articles. does that not carry weight as well?

Thanks for the input. I'm actually currently using the Google PageRank algorithm on accounts and then getting the appropriate resteemed post (this part needs work though). I'll be refining this soon, along similar lines to what you're proposing.

Since readers leave no trace on the blockchain, there's no information that we can use for that.
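
As a rough illustration of PageRank applied to accounts, here is a small Python sketch. It is not the Sincerity implementation: it assumes that a resteem counts as a link from the resteemer to the post's author, uses the textbook power-iteration form of PageRank with a damping factor of 0.85, and then scores each post by summing the rank of the accounts that resteemed it. The function names, the post-scoring step and the sample data are all hypothetical.

```python
from collections import defaultdict


def account_pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed account graph.

    edges: iterable of (source, target) pairs; here, source resteemed target.
    """
    edges = list(edges)
    nodes = sorted({account for edge in edges for account in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by accounts with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in out_links.items():
            for target in targets:
                # Each account splits its rank equally among its out-links.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


def post_scores(resteems, rank):
    """Assumed scoring step: sum the rank of every account that resteemed a post."""
    scores = defaultdict(float)
    for resteemer, _author, post_id in resteems:
        scores[post_id] += rank[resteemer]
    return dict(scores)


# Hypothetical resteems as (resteemer, author, post_id).
resteems = [
    ("bob", "alice", "alice/post-1"),
    ("carol", "alice", "alice/post-1"),
    ("alice", "dave", "dave/post-1"),
    ("erin", "bob", "bob/post-1"),
]
rank = account_pagerank((r, a) for r, a, _ in resteems)
scores = post_scores(resteems, rank)
print(sorted(scores, key=scores.get, reverse=True))
```

Because each account splits its rank equally among the accounts it links to, an account that resteems many different authors passes less weight to each of them, which is the same dilution effect described in the post.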

This is interesting! Have you thought about incorporating the number of comments into the algo? The idea is that the more valuable the post, the more comments it generates. But maybe that's not really true because of the spammy comments :/ tipuvote! 5

Thanks, yeah, that might be helpful. As with competitions that require people to 'resteem the post', some seem to ask that people 'comment on this post'. These kinds of entry requirements make things a bit more difficult than they would otherwise be.

I probably need to include several of the metrics that I have been collecting/calculating.

Great article andybets,
Will stick around for more details and comments to learn more about the formula.

This thing right here is what we need on steemit !!