R Tutorial : Counting words


Want to learn more? Take the full course at https://learn.datacamp.com/courses/topic-modeling-in-r at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---

To fit a topic model, we must prepare a document-term matrix that contains counts of word occurrences in documents. In this lesson, we will cover how to build one using the tidytext and dplyr packages.

In text processing, the process of splitting text into smaller pieces is referred to as tokenization. In our case we will split text into words, but in general tokens can be sequences of characters or sequences of words.

The tidytext package has a function unnest_tokens() that performs tokenization.

The function takes a column from a table, splits its contents into words, and, by default, drops the original text column and converts the output to lower case.


We have a data frame named "book". It has two columns: chapter and text.

We call unnest_tokens(), instructing that the column with tokens should be named "word" and that column "text" should be dropped.

We get back a table in which each word is in its own row.
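As a minimal sketch: the lesson only tells us that "book" has chapter and text columns, so the contents below are hypothetical stand-ins.

```r
library(dplyr)
library(tidytext)

# A hypothetical stand-in for the lesson's "book" data frame
book <- data.frame(
  chapter = c(1, 1, 2, 2),
  text = c("Night is dark.", "It is quiet.",
           "The sun comes up.", "The day starts."),
  stringsAsFactors = FALSE
)

# Split "text" into words: the output column is named "word",
# the "text" column is dropped, and words are lower-cased
tidy_book <- book %>%
  unnest_tokens(word, text)

head(tidy_book)
```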

We will use the count() function from the dplyr package to obtain frequencies of words within chapters.

This function, essentially, groups the rows by chapter and word, and returns the number of rows in each group. This corresponds to the number of times a specific word occurs in a specific chapter.

The result is a table with one row for each combination of chapter and word. For example, the word "is" occurs twice in chapter 1.
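A sketch of the counting step, using a small hypothetical "book" data frame (the lesson does not show its actual text):

```r
library(dplyr)
library(tidytext)

# Hypothetical sample data
book <- data.frame(
  chapter = c(1, 1, 2, 2),
  text = c("Night is dark.", "It is quiet.",
           "The sun comes up.", "The day starts."),
  stringsAsFactors = FALSE
)

word_counts <- book %>%
  unnest_tokens(word, text) %>%
  count(chapter, word)   # one row per chapter-word pair; the count is in column n

word_counts
# with this sample text, "is" occurs twice in chapter 1, so its row has n = 2
```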

Once we have the counts, we often will want to examine the top words, for example, the top 10.

This can be done by grouping the rows by chapter, sorting the rows within each group in order of descending counts, and then realizing that the rank of a word is equal to its row number.

The most frequent word will be in row 1, the second most frequent in row 2, and so on. dplyr has a function row_number() that returns the row number, so all we need to do is filter on the condition that the row number is less than a threshold value.

Here is an example of getting the top two words from each chapter. We use arrange() to sort the rows within each group, wrapping the count column n in a call to desc() to sort in descending order.

Then we filter, inside each group, to keep only rows whose number is less than 3.
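These steps can be sketched as follows, again with hypothetical data. Note that in recent versions of dplyr, arrange() ignores grouping unless you pass .by_group = TRUE.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample data
book <- data.frame(
  chapter = c(1, 1, 2, 2),
  text = c("Night is dark.", "It is quiet.",
           "The sun comes up.", "The day starts."),
  stringsAsFactors = FALSE
)

top_words <- book %>%
  unnest_tokens(word, text) %>%
  count(chapter, word) %>%                # counts arrive sorted by chapter, then word
  group_by(chapter) %>%
  arrange(desc(n), .by_group = TRUE) %>%  # sort within each chapter; ties keep alphabetical order
  filter(row_number() < 3) %>%            # keep rows 1 and 2 of each group
  ungroup()

top_words
```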

Note that for chapter 2, the second word we got is "comes". Its count is 1, and there were other words with the same count; which word we got is therefore also determined by alphabetical order.

You may be familiar with the function cast(), which is used to transform a table from one format into another.

We need to transform the table with counts into a document-term matrix, dtm for short. In a dtm, rows correspond to documents and columns to words, or terms. In our case, each chapter will be a document.

The tidytext package has a function cast_dtm() that transforms a table in tidy format into a document-term matrix.

It accepts a table with counts, and needs to know which column corresponds to the document ID, which to the term, and which to the count value.

Here is an example. You should recognize most of this script. Everything until cast_dtm() is the code that creates a tidy table with word counts.

We add the call to cast_dtm() and get back a document-term matrix. It is stored in a special format, as a so-called sparse matrix.

We can examine its contents by converting it to a regular matrix using function as.matrix().
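Putting the whole pipeline together as a sketch, with hypothetical data (cast_dtm() builds a DocumentTermMatrix from the tm package, so tm must be installed):

```r
library(dplyr)
library(tidytext)

# Hypothetical sample data
book <- data.frame(
  chapter = c(1, 1, 2, 2),
  text = c("Night is dark.", "It is quiet.",
           "The sun comes up.", "The day starts."),
  stringsAsFactors = FALSE
)

dtm <- book %>%
  unnest_tokens(word, text) %>%
  count(chapter, word) %>%
  cast_dtm(document = chapter, term = word, value = n)

dtm              # stored as a sparse DocumentTermMatrix
as.matrix(dtm)   # regular matrix: one row per chapter, one column per word
```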

It's time to practice!

