How to make a word-cloud with R?


Hello everybody, this is my first post here.

One of my first assignments was to make a word-cloud using R. At the beginning I had no idea what R was, so I had to research it. There are many pages that explain how to do this, but I tried to bring them together into something more focused and different.

The first thing you need is, obviously, R. You can download it from its official page, the R Project. It is lightweight and easy to configure.

The second step (and maybe the most important) is to get your data. At this point you have to decide where your data will come from: a known database, a web page, or even a file. In our case, we’ll use a link to a New York Times post to generate the word-cloud.


By default R doesn’t have all the libraries and packages that we need, but you can import them, which is very easy.

So, type the following instructions in the R console:

install.packages("tm")           # For text mining
install.packages("SnowballC")    # For word stemming
install.packages("wordcloud")    # To generate the word-cloud
install.packages("RColorBrewer") # To add colors to the word-cloud
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
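Calling install.packages() every time downloads packages again even when they are already there. Here is a minimal sketch, assuming you only want to install what is missing. The name `missing_packages` is a hypothetical helper of mine, not part of any package, and the `interactive()` guard is my own choice so that nothing is downloaded from a script by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when the package can be loaded, FALSE otherwise.
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tm", "SnowballC", "wordcloud", "RColorBrewer")
to_install <- missing_packages(needed)

# Only contact CRAN from an interactive session, and only if something is missing.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

After this runs, the library() calls work the same way as before.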

Now that all the needed packages are installed, we can write some code.

First, we have to read the text that we are going to plot and put it in a variable (remember, we are using the link mentioned earlier).

 

filePath <- "http://www.nytimes.com/interactive/2016/03/17/science/pluto-images-charon-moons-new-horizons-flyby.html?rref=collection%2Fsectioncollection%2Fspace&action=click&contentCollection=space&region=rank&module=package&version=highlights&contentPlacement=1&pgtype=sectionfront"
text <- readLines(filePath) # R is case-sensitive: the function is readLines, not ReadLines
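Because the link points to a web page, the lines that come back are raw HTML, so tag names can end up among the most frequent words. One possible workaround, as a rough sketch only: strip the markup with regular expressions before building the corpus. `strip_html` is a hypothetical helper name, the sample lines are made up, and regular expressions are a crude way to handle HTML, so a real parser would be more reliable.

```r
# Hypothetical helper: crude HTML cleanup with regular expressions.
strip_html <- function(lines) {
  # Join the lines so tags that span several lines can still match.
  page <- paste(lines, collapse = " ")
  # Drop <script> and <style> blocks together with their contents.
  page <- gsub("(?is)<(script|style)[^>]*>.*?</\\1>", " ", page, perl = TRUE)
  # Drop every remaining tag.
  page <- gsub("<[^>]+>", " ", page)
  # Drop HTML entities such as &amp; or &#39;.
  page <- gsub("&[a-zA-Z#0-9]+;", " ", page)
  # Collapse runs of whitespace and trim the ends.
  trimws(gsub("\\s+", " ", page))
}

# Made-up sample standing in for the downloaded page.
sample_lines <- c(
  "<html><head><style>p {color: red;}</style></head>",
  "<body><p>Pluto &amp; Charon</p>",
  "<script>var x = 1;</script></body></html>"
)
print(strip_html(sample_lines))
```

Note that the helper returns a single string, so a corpus built from it has one document instead of one per line. The total word frequencies are not affected by that.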

 

In this step we load the data as a corpus and put it into a variable named “docs” (you can use any name you like).

docs <- Corpus(VectorSource(text))

If you print the “docs” variable (you can also use inspect(docs)), you’ll get:

 

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 954

Now we are going to replace some specific characters.

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

This function tells R to replace every match of a given pattern with a space. We apply it to the following characters:

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
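To see what these replacements do without building a corpus, here is a small sketch using a plain function on a single string. `to_space` is a hypothetical plain-R counterpart of toSpace, and the sample string is made up.

```r
# Plain-function version of toSpace, for experimenting outside tm.
to_space <- function(x, pattern) gsub(pattern, " ", x)

sample_text <- "new/horizons@nasa|pluto"  # made-up sample string

step1 <- to_space(sample_text, "/")
step2 <- to_space(step1, "@")
# "|" means "or" in a regular expression, so it has to be escaped as "\\|".
step3 <- to_space(step2, "\\|")

print(step3)
```

The escaping is the reason the third call uses "\\|" while the first two patterns are written as they are.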

It is time to clean our text. We do this in order to get a more focused word-cloud.

docs <- tm_map(docs, content_transformer(tolower))      # Change the text to lowercase
docs <- tm_map(docs, removeNumbers)                     # Remove all the numbers
docs <- tm_map(docs, removeWords, stopwords("english")) # Remove common words like "the", "and", etc.
docs <- tm_map(docs, removePunctuation)                 # Remove punctuation
docs <- tm_map(docs, stripWhitespace)                   # Remove needless spaces
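If you want to check what the cleaning does to a sentence, the same steps can be imitated in plain R on one string. This is only a sketch: `clean_text` is a hypothetical helper, the stop word list is a tiny made-up one (the real stopwords("english") list is much longer), and the regular expressions are my approximation of what the tm functions do.

```r
# Hypothetical helper that imitates the cleaning steps on a character vector.
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                  # lowercase
  x <- gsub("[[:digit:]]+", "", x)                 # remove numbers
  # Remove whole stop words only (\\b marks a word boundary).
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:punct:]]+", "", x)                 # remove punctuation
  # Collapse repeated spaces; trimming the ends is an extra step added here.
  trimws(gsub("\\s+", " ", x))
}

small_stop_words <- c("the", "and", "of", "a", "in")  # made-up short list

cleaned <- clean_text("The 5 moons of Pluto, and Charon!", small_stop_words)
print(cleaned)
```

Only the words that carry meaning are left, which is what we want to count in the next step.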

Now we’ll build a term-document matrix, which contains the frequency of each word in each document.

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
head(d, 10)
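The steps above are easier to follow on a tiny matrix. Here is a sketch with a hand-made matrix (the terms and counts are invented): each row is a word, each column is a document, and rowSums() adds up how many times a word appears in total.

```r
# Hand-made stand-in for as.matrix(dtm): rows are terms, columns are documents.
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("pluto", "charon", "moon"),
                            Docs = c("1", "2", "3")))

v <- sort(rowSums(m), decreasing = TRUE)     # total count per word, largest first
d <- data.frame(word = names(v), freq = v)   # same shape as the "d" used for the word-cloud

print(d)
```

In this toy example "pluto" appears 3 times in total, "charon" 2 times and "moon" once, so that is the order of the rows in d.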

The last step is to generate the word-cloud. (In this case the word-cloud contains HTML tags. Don’t worry, this happens because the text is taken from a web page.)

wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words=20, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))


We can change the max.words parameter to show more or fewer words in our word-cloud.
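Before plotting, it can help to preview which words a given min.freq and max.words would keep. The sketch below is an approximation written in plain R: `preview_words` is a hypothetical helper, the data frame is made up, and the way wordcloud itself handles ties may differ.

```r
# Made-up frequency table with the same columns as "d".
d_demo <- data.frame(word = c("pluto", "charon", "moon", "nasa", "ice"),
                     freq = c(9, 6, 4, 2, 1),
                     stringsAsFactors = FALSE)

# Hypothetical helper: keep words with at least min.freq occurrences,
# then keep at most max.words of the most frequent ones.
preview_words <- function(d, min.freq = 1, max.words = 20) {
  kept <- d[d$freq >= min.freq, , drop = FALSE]
  kept <- kept[order(-kept$freq), , drop = FALSE]
  head(kept, max.words)
}

print(preview_words(d_demo, min.freq = 2, max.words = 3))
```

With these made-up numbers, only the three most frequent words remain.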

If you want, you can save the word-cloud as a PDF. You only have to add a few more lines.

wd <- getwd() # Returns an absolute file path representing the current working directory of the R process
adir <- "/wcloud.pdf" # Assign a name to the PDF file
dir <- paste(wd, adir, sep="") # Just put together the path and the file's name to build the whole path
pdf(dir) # Pass the complete path to the pdf function
cloud <- wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words=20, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2")) # Plot the wordcloud
dev.off() # Close the PDF device so the file is actually written
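A PDF opened with pdf() is only complete once the device is closed with dev.off(). Here is a small sketch of a wrapper that always closes it, even if the plot fails. `save_pdf` is a hypothetical helper, and the example draws a simple placeholder plot into a temporary folder instead of the word-cloud so that it runs without the extra packages.

```r
# Hypothetical helper: open a PDF device, draw, and always close the device.
save_pdf <- function(path, draw) {
  pdf(path)
  on.exit(dev.off(), add = TRUE)  # runs when the function exits, even after an error
  draw()
  invisible(path)
}

# Placeholder example: a simple plot written to a temporary folder.
out <- file.path(tempdir(), "wcloud-demo.pdf")
save_pdf(out, function() plot(1:10, main = "placeholder plot"))

print(file.exists(out))
```

For the word-cloud, the second argument would be a function that calls wordcloud() with the same parameters as before, and the path could be file.path(getwd(), "wcloud.pdf").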

That’s all. I hope this post is useful to you.

See you soon.

José Ángel Sosa Martinez

Software Development Intern

Computer Systems Engineer.

 
