RE: [Programming] Beginning the Search for Discovery
What you're saying makes sense. I understand that you don't want to base the recommendations on other users.
I do think you need a specific way to characterize your feature vector, though. N-grams are one way, but they can blow up the dimensionality really quickly. You'll also have to think about how you want to handle images and other media. Do you also want to include some features representing the profile of the person voting?
I would also be careful about creating your own filter bubble. Maybe create two models: one for maximizing the expected value of a recommendation and another for maximizing the maximum value of a set of recommendations. You could then mix core recommendations with discovery.
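To make the "mix core recommendations with discovery" idea concrete, here is a minimal sketch in Python. It is only one possible reading of the suggestion: it assumes every candidate post already has a single relevance score, takes the top scorers as the core picks, and fills the remaining slots with random lower-ranked posts as a crude stand-in for a second, discovery-oriented model. The function name and parameters are hypothetical.

```python
import random


def mix_recommendations(scored, n_core, n_discovery, seed=None):
    """Return the top-scoring items followed by a few random picks from the rest.

    scored: list of (item_id, score) pairs (assumed to be precomputed).
    n_core: how many top-scoring items to keep.
    n_discovery: how many extra items to sample from the remainder.
    seed: optional seed so the random picks are reproducible.
    """
    # Rank candidates from highest to lowest score.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    core = [item for item, _ in ranked[:n_core]]
    rest = [item for item, _ in ranked[n_core:]]

    # Sample the discovery picks without replacement from what is left.
    rng = random.Random(seed)
    discovery = rng.sample(rest, min(n_discovery, len(rest)))
    return core + discovery


if __name__ == "__main__":
    posts = [("a", 0.9), ("b", 0.8), ("c", 0.1), ("d", 0.05)]
    print(mix_recommendations(posts, n_core=2, n_discovery=1, seed=0))
```

Uniform random sampling is the simplest possible discovery policy; a real second model would presumably be smarter about which off-profile posts it surfaces.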
I understand that these suggestions only add complexity to any system you build so I only offer them as potential improvements.
In any case, I'd be happy to help you test and give feedback for what you build.
I've used N-grams before pretty successfully, and I'm really fond of the fact that they care nothing about source language or even source format if you filter the input properly. In this case, however, I'm just going with a bag-of-words solution, which is essentially a giant dictionary mapping each word to its frequency. That should work well enough for what I'm doing right now. I can always change the processing methodology to output a different vectorized descriptor if I really want to.
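As a rough illustration of the "giant dictionary with word frequency" idea, a bag of words can be sketched in a few lines of Python. The tokenizer below (lowercase, then runs of ASCII letters, digits, and apostrophes) is an assumption made for the example, not the actual processing used in the prototype.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of lowercase ASCII letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def bag_of_words(text):
    """Map each token in the text to the number of times it occurs."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    bag = bag_of_words("The cat sat on the mat. The end.")
    print(bag.most_common(3))
```

A Counter only stores the words that actually occur, so the representation is sparse by construction even when the overall vocabulary is huge.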
The actual dimensionality doesn't matter in this particular case because I have more than enough horsepower to throw at the problem and I'm working with sparse matrices anyway.
I'm stripping out images, HTML in general, URLs, pretty much anything that's not text. Those features are simply not interesting to me. The only features that are important about the person voting are the features from the things that they're voting for.
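A minimal sketch of that kind of cleanup, using only the Python standard library, might look like the following. It assumes the post body arrives as an HTML string and that URLs start with http:// or https://; the names strip_to_text and URL_RE are hypothetical, and Markdown-specific syntax is not handled here.

```python
import re
from html.parser import HTMLParser

# Assumption: anything starting with http:// or https:// counts as a URL.
URL_RE = re.compile(r"https?://\S+")


class _TextOnly(HTMLParser):
    """Collect only the text nodes, ignoring tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags. Note that the contents of
        # <script> and <style> elements also arrive here, so a stricter
        # filter would need to track and skip those.
        self.parts.append(data)


def strip_to_text(raw):
    """Drop HTML tags (including images) and URLs, then collapse whitespace."""
    parser = _TextOnly()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    text = " ".join(parser.parts)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    raw = '<p>Hello <img src="x.png"> world</p> see https://example.com/a now'
    print(strip_to_text(raw))
```

Image tags disappear entirely because they contain no text nodes, which matches the goal of keeping nothing but the text.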
I want a filter bubble. That's the whole point. The vector space described by the up votes is going to be bulbous enough that effectively any sort of distance measure that involves it is going to involve some slop. That's more than enough to keep fuzzy match diversity fairly high – unless someone is extremely specific about what they vote for, in which case who am I to tell them what they want?
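One way to picture matching against the space described by the up votes is the sketch below. Everything specific in it is an assumption made for illustration: word counts are held in Counter objects, the up-voted posts are collapsed into a single summed profile (which flattens the "bulbous" region into one direction), and cosine similarity stands in for whichever distance measure ends up being used.

```python
import math
from collections import Counter


def upvote_profile(upvoted_bags):
    """Sum the word counts of every up-voted post into one profile."""
    profile = Counter()
    for bag in upvoted_bags:
        profile.update(bag)
    return profile


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    profile = upvote_profile([
        Counter({"steem": 2, "python": 1}),
        Counter({"python": 3, "sparse": 1}),
    ])
    candidate = Counter({"python": 1, "matrices": 1})
    print(round(cosine_similarity(candidate, profile), 3))
```

A candidate that shares only some words with the profile still gets a non-zero score, which is where the slop in the match comes from.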
Again, the idea is to get away from the idea that anyone else has the right, ability, or insight to tell you what you like. You have emitted signals. Lots of them, actually, if you look at the features of things that you have voted up. A system shouldn't second guess you and your preferences.
I'm a technophile but I don't adhere to the cult of the machine. A discovery tool should do just that, and in this case specifically be a discovery tool for filtering the bloody firehose of posts which are created moment to moment on the Steem blockchain.
This is a long way from any sort of testing or feedback. Right now it's in the early prototype, feeling around the model, looking for sharp edges sort of place.