Natural Language Text Analysis for January 2018
Abstract
This is the third and final part of a Natural Language Text Analysis. It is a follow-on to prior work that analyzed text and emoji. This analysis differs in that it focuses on natural language word-cluster analysis, seeking markers of successful or high-quality posts.
Scope
We will be considering data for the month of January 2018 and will include foreign languages and character sets. We exclude the following multimedia categories:
Music & Video | Photography | Memes |
---|---|---|
dtube, youtube, music | photography, colorchallenge, architecturalphotography, vehiclephotography, photofeed, photo | dmania, decentmemes, meme |
The dataset includes 1,203,022 posts from 58,846 categories (excluding those above) and 110,474 distinct authors.
Tools
The analysis will be performed in R using only open-source tools (particularly the quanteda package) and on a 10-year-old MacBook(!).
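As a rough illustration, the category exclusion described under Scope might look something like this in R. This is a sketch only; the `posts` data frame and its `category`/`author` column names are assumptions, not the original script.

```r
# Hypothetical exclusion of the multimedia categories listed under Scope
excluded <- c("dtube", "youtube", "music",
              "photography", "colorchallenge", "architecturalphotography",
              "vehiclephotography", "photofeed", "photo",
              "dmania", "decentmemes", "meme")

posts <- posts[!(posts$category %in% excluded), ]

nrow(posts)                      # 1,203,022 posts remain
length(unique(posts$category))   # 58,846 categories
length(unique(posts$author))     # 110,474 distinct authors
```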
Top 10 most popular categories by Post count
Rank | Category | Total Posts | Avg Votes | Authors |
---|---|---|---|---|
1 | life | 59,087 | 11.327619 | 16,804 |
2 | bitcoin | 37,777 | 11.784737 | 10,059 |
3 | news | 30,257 | 5.156856 | 3,973 |
4 | kr | 29,771 | 13.549830 | 3,282 |
5 | spanish | 29,626 | 29.367886 | 5,621 |
6 | cryptocurrency | 28,918 | 11.535791 | 8,443 |
7 | art | 28,389 | 12.981331 | 6,979 |
8 | steemit | 28,140 | 16.085821 | 10,653 |
9 | food | 26,751 | 11.792718 | 7,894 |
10 | introduceyourself | 21,460 | 13.443290 | 16,079 |
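For reference, a table like the one above could be produced with a dplyr aggregation along these lines. Again, `category`, `author`, and `net_votes` are assumed column names; this is not the original script.

```r
library(dplyr)

# Hypothetical aggregation behind the Top 10 table
top10 <- posts %>%
  group_by(category) %>%
  summarise(total_posts = n(),
            avg_votes   = mean(net_votes),
            authors     = n_distinct(author)) %>%
  arrange(desc(total_posts)) %>%
  slice_head(n = 10)
```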
Vote Statistical Summary
Standard Deviation
46.51859
Quantiles
0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|
-93 | 1 | 1 | 2 | 3 | 4 | 6 | 8 | 12 | 23 | 4023 |
Summary
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-93.00 | 2.00 | 4.00 | 12.87 | 10.00 | 4023.00 |
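These statistics can be reproduced with base R, assuming the per-post vote count is stored in a `net_votes` column (an assumed name):

```r
# Vote distribution statistics (net_votes is an assumed column name)
sd(posts$net_votes)                                  # standard deviation
quantile(posts$net_votes, probs = seq(0, 1, 0.1))    # deciles
summary(posts$net_votes)                             # five-number summary plus mean
```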
Subsetting Best & Worst Posts
Here we subset posts by votes (specifically, net_votes per post) and identify the authors of those posts. We will subset at the bottom 10th percentile (< 1 vote) and the top 90th percentile (> 23 votes).
Illustrated here are the top and bottom 30-user cohorts from each subset.
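A minimal sketch of that subsetting, using the decile cut-offs from the quantile table above (column and object names are assumptions):

```r
# 10th and 90th percentile thresholds of net votes per post
lo_cut <- quantile(posts$net_votes, 0.10)   # 1 vote
hi_cut <- quantile(posts$net_votes, 0.90)   # 23 votes

worst_posts <- posts[posts$net_votes < lo_cut, ]   # bottom subset (< 1 vote)
best_posts  <- posts[posts$net_votes > hi_cut, ]   # top subset (> 23 votes)
```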
Lexical Diversity
There is some evidence to suggest that vocabulary size is correlated with intelligence. In this study we equate vocabulary size with Lexical Diversity to test the hypothesis that smart users produce high-performing, high-quality content.
In Natural Language Processing, sentences are tokenized into individual words. Strictly speaking, a lexical token is a character string and may not be an actual word; e.g. "Ah-ha!" would be a valid token, and so would a single Chinese character.
In this analysis we use the count of distinct tokens per author as a proxy for Lexical Diversity. We also use the terms "word" and "token" interchangeably.
This chart illustrates Global Lexical Diversity for all Posts, showing users with a lexical diversity greater than 300.
This is an arbitrary cut-off. The average four-year-old native English speaker knows approximately 5,000 distinct words. I am intentionally setting the cut-off well below that average to include non-native English speakers.
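One way this per-author measure could be computed with quanteda is to pool each author's tokens and count the distinct types. This is a sketch rather than the original code, and the `text`/`author` field names are assumptions.

```r
library(quanteda)

# Build a corpus and tokenize (field names are assumptions)
corp <- corpus(posts, text_field = "text")
toks <- tokens(corp, remove_punct = TRUE)

# Pool tokens by author and count distinct token types per author
author_toks <- tokens_group(toks, groups = docvars(toks, "author"))
lex_div     <- ntype(author_toks)

# Authors above the arbitrary 300-type cut-off
high_lexdiv <- sort(lex_div[lex_div > 300], decreasing = TRUE)
```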
We can now compare these authors with high lexical diversity (big vocabularies and presumably smart) to the authors of the high- and low-performing posts (our subsets).
Highest Voted Authors
Our top performing user by vote count (@haejin) doesn't appear in our list of users with high lexical diversity (the smart people with big vocabularies). This doesn't suggest he isn't smart, just that he uses a narrow vocabulary. This user publishes specialized, technical content on Elliott Wave Analysis.
The following 14 authors from the Top Performing Post subset also appear in the High Lexical Diversity group. These users have large vocabularies and high votes.
Rank | Author | Tokens | Total Votes |
---|---|---|---|
1 | @glenalbrethsen | 1991 | 23 |
2 | @amf6 | 1381 | 91 |
3 | @karyroa | 1224 | 40 |
4 | @aqiel | 864 | 37 |
5 | @orianandreina18 | 748 | 28 |
6 | @justyy | 689 | 5602 |
7 | @gexi | 664 | 83 |
8 | @abialfatih | 551 | 30 |
9 | @svitlaangel | 551 | 23 |
10 | @lorenitaarmy | 517 | 3513 |
11 | @mellisaramirez | 451 | 39 |
12 | @pataty69 | 410 | 52 |
13 | @meidy | 407 | 48 |
14 | @michaelizer | 347 | 164 |
Lowest Voted Authors
We also have six users with low-performing posts but high lexical diversity. These users have low-performing posts yet large vocabularies, which suggests that being smart and having a big vocabulary is not, by itself, an indicator of how well your content will perform.
Rank | Author | Tokens | Total Votes |
---|---|---|---|
1 | @ddd67 | 2551 | 0 |
2 | @oneness | 2284 | 0 |
3 | @karyroa | 1224 | 0 |
4 | @shemzy | 453 | 0 |
5 | @meidy | 407 | 0 |
6 | @mhmtbhtyr | 315 | 0 |
Visual Comparison
Lining the plots up, we observe the Top Performing cohort (middle chart) contains considerably more authors with Lexical Diversity greater than 300. We're seeing more users in the Top Performing cohort using large vocabularies. While we cannot draw firm conclusions, it would suggest there are more smart people in the top performing cohort.
Word Frequency
To analyze word frequency we must coerce our post data into a Corpus and then into a Document-Feature Matrix (DFM). Creating the Corpus and DFMs takes approximately 45 minutes of processing time.
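The pipeline has roughly the following shape; this is a sketch assuming a `text` field, and the actual script may differ.

```r
library(quanteda)

# Corpus -> tokens -> document-feature matrix
corp <- corpus(posts, text_field = "text")
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)
post_dfm <- dfm(toks)
```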
During this process, I removed the following stop-words. Stop-Words are discussed in earlier posts. This step is considered data preprocessing or cleansing.
Stop Word List
"steem","steemit","steemian","steemians","resteem","upvote","upvotes","post","SBD","SP","jpeg","jpg","png","www","com","td","re","nbsp","p","li","br","strong","quote","s3","amazonaws'com","steemit'com","steemitimages'com","img","height","width","src","center","em","html","de","href","hr","blockquote","h1","h2","h3","960","720","div","en","que","la","will","y","el","https","http","do","does","did","has","have","had","is","am","are","was","were","be","being","been","may","must","might","should","could","would","shall","will","can","un","get","alt","_blank","i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it","its","itself", "they","them","their","theirs","themselves","what", "which","who","whom","this","that","these","those","am","is","are","was","were", "be","been","being","have","has","had", "having","do","does","did","doing","would", "should","could","ought","i'm","you're","he's", "she's","it's","we're","they're","i've","you've", "we've","they've","i'd","you'd","he'd","she'd", "we'd","they'd","i'll","you'll","he'll","she'll", "we'll","they'll","isn't","aren't","wasn't","weren't", "hasn't","haven't","hadn't","doesn't","don't","didn't", "won't","wouldn't","shan't","shouldn't","can't","cannot", "couldn't","mustn't","let's","that's","who's","what's", "here's","there's","when's","where's","why's","how's", "a","an","the","and","but","if", "or","because","as","until","while","of", "at","by","for","with","about","against", "between","into","through","during","before","after", "above","below","to","from","up","down", "in","out","on","off","over","under", "again","further","then","once","here","there", "when","where","why","how","all","any", "both","each","few","more","most","other", "some","such","no","nor","not","only", "own","same","so","than","too","very"
Sparse Word Removal
As part of data preparation for illustration we also remove sparse terms. This is known as trimming the DFM of sparse terms. It leaves us with words (terms, tokens) that are used more than 10 times and appear in more than 25% of posts.
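Assuming the stop-word list above is stored in a character vector `custom_stopwords`, the removal and trimming steps might look like this. The exact `dfm_trim()` arguments have changed across quanteda versions, so treat this as a sketch rather than the original call.

```r
# Drop the custom stop-words from the DFM
post_dfm <- dfm_remove(post_dfm, pattern = custom_stopwords)

# Trim sparse terms: keep features used more than 10 times
# and appearing in more than 25% of posts
post_dfm <- dfm_trim(post_dfm,
                     min_termfreq = 10,
                     min_docfreq  = 0.25,
                     docfreq_type = "prop")
```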
Top Performer Cohort
When we examine the most frequently used words in the Top Performing Cohort, we again observe high lexical diversity. This word cloud illustrates words used more than 5,000 times across all Top Voted posts.
These are the Top 10 tokens in this wordcloud. I have chosen not to remove numbers at this point.
image | one | 1 | like | time | just | also | 2 | people | new |
---|---|---|---|---|---|---|---|---|---|
101916 | 94650 | 91955 | 83957 | 80139 | 70767 | 67883 | 67275 | 65488 | 54249 |
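A hedged sketch of how this word cloud and frequency table could be produced for the top-performing subset follows. In recent quanteda releases the plotting functions live in the quanteda.textplots companion package; `best_posts` and `custom_stopwords` are the objects assumed in the earlier sketches.

```r
library(quanteda)
library(quanteda.textplots)

# DFM for the top-performing subset, following the same pipeline as above
top_corp <- corpus(best_posts, text_field = "text")
top_toks <- tokens(top_corp, remove_punct = TRUE, remove_symbols = TRUE, remove_url = TRUE)
top_dfm  <- dfm_remove(dfm(top_toks), pattern = custom_stopwords)

# Word cloud of features used more than 5,000 times across the Top Voted posts
textplot_wordcloud(top_dfm, min_count = 5000)

# Ten most frequent features, matching the table above
topfeatures(top_dfm, n = 10)
```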
Bottom Performer Cohort
By comparison, far fewer words in the bottom-performing posts are used 5,000 or more times. The lexical diversity is much lower.
The Top 10 most frequently used words in this cohort appear to be a subset of those used in the Top Performer Cohort (above). In other words, a similar core vocabulary is used, but less often and in the context of a much narrower overall vocabulary.
like | one | 1 | new | now | also | just | time | people | 2 |
---|---|---|---|---|---|---|---|---|---|
90445 | 76413 | 72181 | 72170 | 69627 | 66945 | 62857 | 61338 | 54410 | 50133 |
Topic Models
Latent Dirichlet Allocation (LDA) is a generative statistical model used to identify groups of similar words across documents. I'm using it here in an attempt to identify Topics or Themes.
This is as much art as science, as it requires manual tweaking of the algorithm's parameters. The algorithm is computationally intensive and takes a long time to run on my crappy MacBook, so I have not invested a lot of time seeking the optimum set of parameters.
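One common route is to convert the quanteda DFM and fit the model with the topicmodels package. The settings below (k = 5, Gibbs sampling, seed, iteration count) are illustrative assumptions, not the tuned values used here, and `top_dfm` is the top-performer DFM sketched earlier.

```r
library(quanteda)
library(topicmodels)

# Convert the DFM of the top-performing subset and fit a 5-topic LDA model
lda_input <- convert(top_dfm, to = "topicmodels")
lda_model <- LDA(lda_input, k = 5,
                 method  = "Gibbs",
                 control = list(seed = 42, iter = 500))

# Top 10 terms per topic, as shown in the tables below
terms(lda_model, 10)
```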
Top Performer, Top 5 Topics
Rank | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|---|
1 | people | one | source | image | 1 |
2 | time | also | new | watch | 2 |
3 | life | like | follow | 2017 | 3 |
4 | just | first | 2018 | part | 4 |
5 | know | even | use | made | 5 |
6 | day | way | also | 2018 | 10 |
7 | see | well | come | used | 20 |
8 | like | much | world | today | 7 |
9 | go | just | year | day | 6 |
10 | now | time | time | long | 2018 |
Bottom Performer, Top 5 Topics
Rank | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
---|---|---|---|---|---|
1 | time | one | also | 1 | like |
2 | people | first | new | 2 | now |
3 | make | see | one | 3 | just |
4 | good | people | make | new | new |
5 | know | just | first | first | make |
6 | just | know | time | make | first |
7 | like | make | people | time | one |
8 | see | like | like | one | know |
9 | one | good | 3 | like | time |
10 | first | time | 2 | good | 3 |
Lexical Dispersion Plots
Using the most frequently used terms from above, we can examine how they're used in the top-performing posts, by author. This plot illustrates how individual authors use particular high-frequency words.
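A sketch of how such a plot can be drawn with quanteda's x-ray plot, using a few of the high-frequency terms from the tables above. `top_toks` is the tokens object for the top-performing subset built earlier, and in recent releases `textplot_xray()` lives in quanteda.textplots.

```r
library(quanteda)
library(quanteda.textplots)

# Lexical dispersion (x-ray) plot for a few high-frequency terms
textplot_xray(
  kwic(top_toks, pattern = "image"),
  kwic(top_toks, pattern = "time"),
  kwic(top_toks, pattern = "people")
)
```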
Hierarchical Clusters & Dendrograms
Finally, we perform a hierarchical cluster analysis using an agglomeration method called "ward.D". We are examining similar clusters across posts from authors in each subset. I have attempted to bin the clusters into 9 arbitrary groups.
We are suggesting the users in each red box are using statistically similar vocabulary.
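The clustering step might look roughly like this: group the top-performer DFM by author, compute pairwise distances, run Ward clustering, and draw 9 boxes around the resulting groups. The distance measure and proportional weighting are assumptions, not confirmed choices from the original script.

```r
library(quanteda)
library(quanteda.textstats)

# Group the top-performer DFM by author and normalise for document length
author_dfm <- dfm_group(top_dfm, groups = docvars(top_dfm, "author"))
author_dfm <- dfm_weight(author_dfm, scheme = "prop")

# Euclidean distances between authors, Ward agglomeration, 9 cluster boxes
d  <- as.dist(textstat_dist(author_dfm, method = "euclidean"))
hc <- hclust(d, method = "ward.D")

plot(hc, cex = 0.6)
rect.hclust(hc, k = 9, border = "red")
```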
Conclusions
This analysis only scratches the surface and is severely limited by my time and crappy laptop. However, it suggests that authors with high lexical diversity (the potentially smarter ones) tend to gain more upvotes.
Readers should note some acknowledged confounding factors, such as:
- Some posts are duplicated and translated into two or more languages. This doubles or trebles their token count.
- I was unable to account for the effects of Resteeming posts of others.
- Some content is straight language translations of other work (e.g. Utopian.io translations)
- There are several kanji characters that each equate to an English word. This distorts the token count in favor of Asian-language speakers.