R language natural language processing (NLP): sentiment analysis news text data

Original link: tecdat.cn/?p=19095

Original source: Tuoduan Data Tribe Official Account

This article conducts sentiment analysis of text content in R. The implementation draws on various existing dictionaries; in addition, you can create custom dictionaries. Custom dictionaries are built with LASSO regularization as a statistical method for selecting relevant words. Finally, all methods are evaluated and compared.

Introduction

Sentiment analysis is a core research branch of natural language processing (NLP), computational linguistics, and text mining. It refers to methods for extracting subjective information from text documents. In other words, it extracts the positive or negative polarity of expressed opinions. Sentiment analysis is also referred to as opinion mining (Pang and Lee 2008).

Application in research

Recently, sentiment analysis has received widespread attention (K. Ravi and Ravi 2015; Pang and Lee 2008), which we will discuss below. Current research in finance and the social sciences uses sentiment analysis to understand human decision-making based on textual material. This has immediate implications for practitioners and researchers in these fields: researchers can use R to extract the textual components relevant to readers and test their hypotheses on this basis, while practitioners can measure which wording actually matters to their readers and improve their writing accordingly (Pröllochs, Feuerriegel, and Neumann 2015). In the two case studies below, we demonstrate the added benefits in finance and the social sciences.

Applications

Several applications demonstrate the use of sentiment analysis in organizations and businesses:

  • Finance:  Investors in financial markets consult textual information in the form of financial news disclosures before trading stocks. Interestingly, they rely not only on quantitative figures but also on qualitative information, such as tone and sentiment (Henry 2008; Loughran and McDonald 2011; Tetlock 2007), which strongly affects stock prices. By using sentiment analysis, automated traders can analyze the sentiment conveyed in financial disclosures in order to make investment decisions.
  • Marketing:  Marketing departments are usually interested in tracking brand image. To this end, they collect a large number of user opinions from social media and evaluate personal feelings about brands, products and services.
  • Rating and review platforms:  Rating and review platforms serve a valuable function by collecting users' ratings of, or preferences for, certain products and services. Here, large amounts of user-generated content (UGC) can be processed automatically and the knowledge gained from it put to use. For example, one can determine which reviews convey positive or negative opinions, and even automatically verify their credibility.

Sentiment analysis methods

As sentiment analysis is applied to a wide range of fields and text sources, research has devised various methods of measuring sentiment. A recent literature review (Pang and Lee 2008) provides a comprehensive, domain-independent survey.

On the one hand, machine learning methods are the first choice when high predictive performance is the goal. However, machine learning often acts as a black box, making results difficult to interpret. On the other hand, dictionary-based methods use lists of positive and negative words; the occurrences of these words are then combined into a single sentiment score. As a result, the underlying decisions become traceable, and researchers can understand the factors that lead to a specific sentiment score.
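The core of a dictionary-based method can be sketched in a few lines of base R. This is a toy illustration, not the package's implementation: the word lists are made up here, and the score is simply the difference between positive and negative matches, normalized by document length.

```r
# Toy dictionary-based scorer: score = (#positive - #negative) / #tokens.
# The word lists below are illustrative, not a real sentiment dictionary.
positive_words <- c("good", "great", "like")
negative_words <- c("bad", "terrible")

score_sentiment <- function(text) {
  # Lowercase and split on non-letters to obtain tokens
  tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]
  (sum(tokens %in% positive_words) - sum(tokens %in% negative_words)) / length(tokens)
}

score_sentiment("That book is great")        # > 0, i.e. positive
score_sentiment("The service is terrible")   # < 0, i.e. negative
```

Because every term's contribution is visible, such a score is fully traceable, which is exactly the interpretability advantage described above.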

In addition, SentimentAnalysis allows the generation of customized dictionaries. These are tailored to a specific domain, improve predictive performance compared with off-the-shelf dictionaries, and remain fully interpretable. The details of this method can be found in Pröllochs, Feuerriegel, and Neumann (2018).

To perform sentiment analysis, the running text must first be converted into a machine-readable format. This is achieved by a series of preprocessing operations: the text is tokenized into single words, followed by common preprocessing steps: stop word removal, stemming, removal of punctuation marks, and lowercase conversion. These operations are performed by default in SentimentAnalysis, but can be adjusted to personal needs.
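The preprocessing pipeline just described can be sketched in base R. This is a simplified illustration, not the package's internal code: stemming is omitted for brevity, and the stop-word list is a tiny made-up sample.

```r
# Minimal preprocessing sketch: lowercase, strip punctuation,
# tokenize, and drop stop words (stemming omitted for brevity).
preprocess <- function(text, stopwords = c("the", "is", "a", "this")) {
  text <- tolower(text)                     # lowercase conversion
  text <- gsub("[[:punct:]]", " ", text)    # remove punctuation marks
  tokens <- unlist(strsplit(text, "\\s+"))  # tokenize on whitespace
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% stopwords]            # stop word removal
}

preprocess("This is a great football match!")
# remaining tokens: "great" "football" "match"
```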

 

Short demonstration

# Analyze the polarity (positive/negative) of a single sentence
convertToDirection(analyzeSentiment("Yes, this is a great football match for the German team!")$SentimentQDAP)
## [1] positive
## Levels: negative positive
# Create a string vector of documents
documents <- c("Wow, I really like the new light saber!",
               "That book is great.",
               "R is a great language.",
               "The service in this restaurant is terrible.",
               "This is neither positive nor negative.",
               "The waiter forgot my dessert - what a terrible service!")
# Analyze sentiment
sentiment <- analyzeSentiment(documents)
# Extract dictionary-based sentiment according to the QDAP dictionary
sentiment$SentimentQDAP
## [1] 0.3333333 0.5000000 0.5000000 -0.3333333 0.0000000 -0.4000000
# View the sentiment direction (i.e. positive, neutral, or negative)
convertToDirection(sentiment$SentimentQDAP)
## [1] positive positive positive negative neutral negative
## Levels: negative neutral positive
response <- c(+1, +1, +1, -1, 0, -1)
compareToResponse(sentiment, response)
## WordCount SentimentGI NegativityGI
## cor -0.18569534 0.990011498 -9.974890e-01
## cor.t.statistic -0.37796447 14.044046450 -2.816913e+01
## cor.p.value 0.72465864 0.000149157 9.449687e-06
## lm.t.value -0.37796447 14.044046450 -2.816913e+01
## r.squared 0.03448276 0.980122766 9.949843e-01
## RMSE 3.82970843 0.450102869 1.186654e+00
## MAE 3.33333333 0.400000000 1.100000e+00
## Accuracy 0.66666667 1.000000000 6.666667e-01
## Precision NaN 1.000000000 NaN
## Sensitivity 0.00000000 1.000000000 0.000000e+00
## Specificity 1.00000000 1.000000000 1.000000e+00
## F1 0.00000000 0.500000000 0.000000e+00
## BalancedAccuracy 0.50000000 1.000000000 5.000000e-01
## avg.sentiment.pos.response 3.25000000 0.333333333 8.333333e-02
## avg.sentiment.neg.response 4.00000000 -0.633333333 6.333333e-01
## PositivityGI SentimentHE NegativityHE
## cor 0.942954167 0.4152274 -0.083045480
## cor.t.statistic 5.664705543 0.9128709 -0.166666667
## cor.p.value 0.004788521 0.4129544 0.875718144
## lm.t.value 5.664705543 0.9128709 -0.166666667
## r.squared 0.889162562 0.1724138 0.006896552
## RMSE 0.713624032 0.8416254 0.922958207
## MAE 0.666666667 0.7500000 0.888888889
## Accuracy 0.666666667 0.6666667 0.666666667
## Precision NaN NaN NaN
## Sensitivity 0.000000000 0.0000000 0.000000000
## Specificity 1.000000000 1.0000000 1.000000000
## F1 0.000000000 0.0000000 0.000000000
## BalancedAccuracy 0.500000000 0.5000000 0.500000000
## avg.sentiment.pos.response 0.416666667 0.1250000 0.083333333
## avg.sentiment.neg.response 0.000000000 0.0000000 0.000000000
## PositivityHE SentimentLM NegativityLM
## cor 0.3315938 0.7370455 -0.40804713
## cor.t.statistic 0.7029595 2.1811142 -0.89389841
## cor.p.value 0.5208394 0.0946266 0.42189973
## lm.t.value 0.7029595 2.1811142 -0.89389841
## r.squared 0.1099545 0.5432361 0.16650246
## RMSE 0.8525561 0.7234178 0.96186547
## MAE 0.8055556 0.6333333 0.92222222
## Accuracy 0.6666667 0.8333333 0.66666667
## Precision NaN 1.0000000 NaN
## Sensitivity 0.0000000 0.5000000 0.00000000
## Specificity 1.0000000 1.0000000 1.00000000
## F1 0.0000000 0.3333333 0.00000000
## BalancedAccuracy 0.5000000 0.7500000 0.50000000
## avg.sentiment.pos.response 0.2083333 0.2500000 0.08333333
## avg.sentiment.neg.response 0.0000000 -0.1000000 0.10000000
## PositivityLM RatioUncertaintyLM SentimentQDAP
## cor 0.6305283 NA 0.9865356369
## cor.t.statistic 1.6247248 NA 12.0642877257
## cor.p.value 0.1795458 NA 0.0002707131
## lm.t.value 1.6247248 NA 12.0642877257
## r.squared 0.3975659 NA 0.9732525629
## RMSE 0.7757911 0.9128709 0.5398902495
## MAE 0.7222222 0.8333333 0.4888888889
## Accuracy 0.6666667 0.6666667 1.0000000000
## Precision NaN NaN 1.0000000000
## Sensitivity 0.0000000 0.0000000 1.0000000000
## Specificity 1.0000000 1.0000000 1.0000000000
## F1 0.0000000 0.0000000 0.5000000000
## BalancedAccuracy 0.5000000 0.5000000 1.0000000000
## avg.sentiment.pos.response 0.3333333 0.0000000 0.3333333333
## avg.sentiment.neg.response 0.0000000 0.0000000 -0.3666666667
## NegativityQDAP PositivityQDAP
## cor -0.944339551 0.942954167
## cor.t.statistic -5.741148345 5.664705543
## cor.p.value 0.004560908 0.004788521
## lm.t.value -5.741148345 5.664705543
## r.squared 0.891777188 0.889162562
## RMSE 1.068401367 0.713624032
## MAE 1.011111111 0.666666667
## Accuracy 0.666666667 0.666666667
## Precision NaN NaN
## Sensitivity 0.000000000 0.000000000
## Specificity 1.000000000 1.000000000
## F1 0.000000000 0.000000000
## BalancedAccuracy 0.500000000 0.500000000
## avg.sentiment.pos.response 0.083333333 0.416666667
## avg.sentiment.neg.response 0.366666667 0.000000000
## WordCount SentimentGI NegativityGI PositivityGI
## Accuracy 0.6666667 1.0000000 0.66666667 0.6666667
## Precision NaN 1.0000000 NaN NaN
## Sensitivity 0.0000000 1.0000000 0.00000000 0.0000000
## Specificity 1.0000000 1.0000000 1.00000000 1.0000000
## F1 0.0000000 0.5000000 0.00000000 0.0000000
## BalancedAccuracy 0.5000000 1.0000000 0.50000000 0.5000000
## avg.sentiment.pos.response 3.2500000 0.3333333 0.08333333 0.4166667
## avg.sentiment.neg.response 4.0000000 -0.6333333 0.63333333 0.0000000
## SentimentHE NegativityHE PositivityHE
## Accuracy 0.6666667 0.66666667 0.6666667
## Precision NaN NaN NaN
## Sensitivity 0.0000000 0.00000000 0.0000000
## Specificity 1.0000000 1.00000000 1.0000000
## F1 0.0000000 0.00000000 0.0000000
## BalancedAccuracy 0.5000000 0.50000000 0.5000000
## avg.sentiment.pos.response 0.1250000 0.08333333 0.2083333
## avg.sentiment.neg.response 0.0000000 0.00000000 0.0000000
## SentimentLM NegativityLM PositivityLM
## Accuracy 0.8333333 0.66666667 0.6666667
## Precision 1.0000000 NaN NaN
## Sensitivity 0.5000000 0.00000000 0.0000000
## Specificity 1.0000000 1.00000000 1.0000000
## F1 0.3333333 0.00000000 0.0000000
## BalancedAccuracy 0.7500000 0.50000000 0.5000000
## avg.sentiment.pos.response 0.2500000 0.08333333 0.3333333
## avg.sentiment.neg.response -0.1000000 0.10000000 0.0000000
## RatioUncertaintyLM SentimentQDAP NegativityQDAP
## Accuracy 0.6666667 1.0000000 0.66666667
## Precision NaN 1.0000000 NaN
## Sensitivity 0.0000000 1.0000000 0.00000000
## Specificity 1.0000000 1.0000000 1.00000000
## F1 0.0000000 0.5000000 0.00000000
## BalancedAccuracy 0.5000000 1.0000000 0.50000000
## avg.sentiment.pos.response 0.0000000 0.3333333 0.08333333
## avg.sentiment.neg.response 0.0000000 -0.3666667 0.36666667
## PositivityQDAP
## Accuracy 0.6666667
## Precision NaN
## Sensitivity 0.0000000
## Specificity 1.0000000
## F1 0.0000000
## BalancedAccuracy 0.5000000
## avg.sentiment.pos.response 0.4166667
## avg.sentiment.neg.response 0.0000000

A set of preprocessing operations from text mining is performed: each document is tokenized, and the input is finally transformed into a document-term matrix.
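A document-term matrix simply records how often each term occurs in each document. A minimal base-R construction is sketched below (the tm package does this for real corpora, adding weighting schemes and sparse storage); the example documents are made up.

```r
# Two toy documents
docs <- c("this is good", "this is not good")

# Tokenize each document and build a shared vocabulary
token_list <- lapply(docs, function(d) unlist(strsplit(d, " ")))
vocab <- sort(unique(unlist(token_list)))

# Rows = documents, columns = terms, cells = term counts
dtm <- t(sapply(token_list, function(tok) table(factor(tok, levels = vocab))))
rownames(dtm) <- docs
dtm
```

The resulting 2 x 4 matrix is the machine-readable representation on which all sentiment scoring operates.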

 

Input

SentimentAnalysis provides an interface that accepts several input formats, including:

  • String vectors.
  • DocumentTermMatrix and TermDocumentMatrix objects as implemented in the tm package (Feinerer, Hornik, and Meyer 2008).
  • Corpus objects as implemented in the tm package (Feinerer, Hornik, and Meyer 2008).

We provide examples below.

String vector

documents <- c("This is good", "This is not good", "It's somewhere in between")
convertToDirection(analyzeSentiment(documents)$SentimentQDAP)
## [1] positive negative neutral
## Levels: negative neutral positive

Document-term matrix

corpus <- VCorpus(VectorSource(documents))
convertToDirection(analyzeSentiment(corpus)$SentimentQDAP)
## [1] positive negative neutral
## Levels: negative neutral positive

Corpus object

## [1] positive negative neutral
## Levels: negative neutral positive

The package can also work directly with a document-term matrix, so you can apply custom preprocessing operations from the start, after which the sentiment scores are calculated. For example, you can swap the default stop-word list for one from another source.

Dictionaries

Three different types of dictionaries can be distinguished. They store different data, which ultimately controls which sentiment analysis methods can be applied. The dictionary types are as follows:

  • SentimentDictionaryWordlist
     Contains a list of words belonging to a category.
  • SentimentDictionaryBinary
     Two word lists are stored, one for positive entries and one for negative entries.
  • SentimentDictionaryWeighted
     Stores individual sentiment scores for words.

Word-list dictionary

# Word-list dictionary with a single set of entries
d <- SentimentDictionaryWordlist(c("uncertain", "possible", "likely"))
summary(d)
## Dictionary type: word list (single set)
## Total entries: 3

Binary dictionary

d <- SentimentDictionaryBinary(c("increase", "rise", "more"), c("fall", "decline"))
summary(d)
## Dictionary type: binary (positive/negative)
## Total entries: 5
## Positive entries: 3 (60%)
## Negative entries: 2 (40%)

Weighted dictionary

d <- SentimentDictionaryWeighted(c("increase", "decrease", "exit"), c(+1, -1, -10), rep(NA, 3))
summary(d)
## Dictionary type: weighted (words with individual scores)
## Total entries: 3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -3.333333
## Median: -1
## Min: -10
## Max: 1
## Standard deviation: 5.859465
## Skewness: -0.6155602
d <- SentimentDictionary(c("increase", "decrease", "exit"), c(+1, -1, -10), rep(NA, 3))
summary(d)
## Dictionary type: weighted (words with individual scores)
## Total entries: 3
## Positive entries: 1 (33.33%)
## Negative entries: 2 (66.67%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -3.333333
## Median: -1
## Min: -10
## Max: 1
## Standard deviation: 5.859465
## Skewness: -0.6155602

Dictionary generation

Dictionary generation requires the dependent variable in the form of a vector, together with the number of times each word appears in each document. The method then estimates a linear model with an intercept and coefficients. The estimation is based on LASSO regularization, which performs variable selection: it sets certain coefficients to exactly zero. The remaining words can then be sorted by polarity according to their coefficients.
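The key property exploited here is that LASSO sets some coefficients to exactly zero. That behavior comes from the soft-thresholding operation at the heart of LASSO solvers; the base-R sketch below illustrates it on made-up coefficients (it is not the estimator the package uses, which fits a full regularized regression).

```r
# Soft-thresholding: the operation inside LASSO coordinate descent
# that produces exact zeros. Coefficients with |beta| <= lambda are dropped.
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

# Hypothetical unpenalized coefficients for four words
beta_ols <- c(good = 0.9, bad = -0.8, the = 0.05, ok = -0.02)
soft_threshold(beta_ols, lambda = 0.1)
# "good" and "bad" survive (shrunk toward zero); "the" and "ok" become exactly 0
```

This is why the generated dictionary contains only the relevant words: everything below the regularization threshold is eliminated rather than merely down-weighted.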

# Create a string vector of documents
documents <- c("This is a good thing!",
               "This is a very good thing!",
               "It's ok.",
               "This is a bad thing.",
               "This is a very bad thing.")
response <- c(1, 0.5, 0, -0.5, -1)
# Generate a dictionary with LASSO regularization
dict <- generateDictionary(documents, response)
dict
## Type: weighted (words with individual scores)
## Intercept: 5.55333e-05
## -0.51 bad
## 0.51 good
summary(dict)
## Dictionary type: weighted (words with individual scores)
## Total entries: 2
## Positive entries: 1 (50%)
## Negative entries: 1 (50%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: -5.251165e-05
## Median: -5.251165e-05
## Min: -0.5119851
## Max: 0.5118801
## Standard deviation: 0.7239821
## Skewness: 0

There are several fine-tuning options. By simply changing a parameter, you can replace LASSO with an elastic net model.

Finally, you can save dictionaries with write() and reload them later with read().

Evaluation

Finally, routines allow one to inspect the generated dictionary further. On the one hand, summary() provides a simple overview; on the other hand, kernel density estimation can visualize the distribution of positive and negative words.
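The kernel-density view can be reproduced with base R's density() on the word scores of a weighted dictionary; the scores below are illustrative, not taken from a real dictionary.

```r
# Illustrative word scores from a weighted dictionary
scores <- c(0.9, 0.5, 0.4, -0.3, -0.6, -1.0)

# Kernel density estimate of the score distribution
d <- density(scores)
plot(d, main = "Distribution of dictionary scores")
```

A bimodal shape in such a plot indicates clearly separated positive and negative vocabulary.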

## Comparing: wordlist vs weighted
##
## Total unique words: 4213
## Matching entries: 2 (0.0004747211%)
## Entries with same classification: 0 (0%)
## Entries with different classification: 2 (0.0004747211%)
## Correlation between scores of matching entries: 1
## $totalUniqueWords
## [1] 4213
##
## $totalSameWords
## [1] 2
##
## $ratioSameWords
## [1] 0.0004747211
##
## $numWordsEqualClass
## [1] 0
##
## $numWordsDifferentClass
## [1] 2
##
## $ratioWordsEqualClass
## [1] 0
##
## $ratioWordsDifferentClass
## [1] 0.0004747211
##
## $correlation
## [1] 1
## Dictionary
## cor 0.94868330
## cor.t.statistic 5.19615237
## cor.p.value 0.01384683
## lm.t.value 5.19615237
## r.squared 0.90000000
## RMSE 0.23301039
## MAE 0.20001111
## Accuracy 1.00000000
## Precision 1.00000000
## Sensitivity 1.00000000
## Specificity 1.00000000
## F1 0.57142857
## BalancedAccuracy 1.00000000
## avg.sentiment.pos.response 0.45116801
## avg.sentiment.neg.response -0.67675202

The following example demonstrates how the generated dictionary can be used to predict the sentiment of out-of-sample data. The prediction performance is then evaluated by comparing it with the built-in dictionaries.

test_documents <- c("This is neither a good thing nor a bad thing",
                    "What a great idea!",
                    "Good")
pred <- predict(dict, test_documents)
## Dictionary
## cor 5.922189e-05
## cor.t.statistic 5.922189e-05
## cor.p.value 9.999623e-01
## lm.t.value 5.922189e-05
## r.squared 3.507232e-09
## RMSE 8.523018e-01
## MAE 6.666521e-01
## Accuracy 3.333333e-01
## Precision 0.000000e+00
## Sensitivity NaN
## Specificity 3.333333e-01
## F1 0.000000e+00
## BalancedAccuracy NaN
## avg.sentiment.pos.response 1.457684e-05
## avg.sentiment.neg.response NaN

## WordCount SentimentGI NegativityGI
## cor -0.8660254 -0.18898224 0.18898224
## cor.t.statistic -1.7320508 -0.19245009 0.19245009
## cor.p.value 0.3333333 0.87896228 0.87896228
## lm.t.value -1.7320508 -0.19245009 0.19245009
## r.squared 0.7500000 0.03571429 0.03571429
## RMSE 1.8257419 1.19023807 0.60858062
## MAE 1.3333333 0.83333333 0.44444444
## Accuracy 1.0000000 0.66666667 1.00000000
## Precision NaN 0.00000000 NaN
## Sensitivity NaN NaN NaN
## Specificity 1.0000000 0.66666667 1.00000000
## F1 0.0000000 0.00000000 0.00000000
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response 2.0000000 -0.16666667 0.44444444
## avg.sentiment.neg.response NaN NaN NaN
## PositivityGI SentimentHE NegativityHE
## cor -0.18898224 -0.18898224 NA
## cor.t.statistic -0.19245009 -0.19245009 NA
## cor.p.value 0.87896228 0.87896228 NA
## lm.t.value -0.19245009 -0.19245009 NA
## r.squared 0.03571429 0.03571429 NA
## RMSE 0.67357531 0.67357531 0.8164966
## MAE 0.61111111 0.61111111 0.6666667
## Accuracy 1.00000000 1.00000000 1.0000000
## Precision NaN NaN NaN
## Sensitivity NaN NaN NaN
## Specificity 1.00000000 1.00000000 1.0000000
## F1 0.00000000 0.00000000 0.0000000
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response 0.27777778 0.27777778 0.0000000
## avg.sentiment.neg.response NaN NaN NaN
## PositivityHE SentimentLM NegativityLM
## cor -0.18898224 -0.18898224 0.18898224
## cor.t.statistic -0.19245009 -0.19245009 0.19245009
## cor.p.value 0.87896228 0.87896228 0.87896228
## lm.t.value -0.19245009 -0.19245009 0.19245009
## r.squared 0.03571429 0.03571429 0.03571429
## RMSE 0.67357531 1.19023807 0.60858062
## MAE 0.61111111 0.83333333 0.44444444
## Accuracy 1.00000000 0.66666667 1.00000000
## Precision NaN 0.00000000 NaN
## Sensitivity NaN NaN NaN
## Specificity 1.00000000 0.66666667 1.00000000
## F1 0.00000000 0.00000000 0.00000000
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response 0.27777778 -0.16666667 0.44444444
## avg.sentiment.neg.response NaN NaN NaN
## PositivityLM RatioUncertaintyLM SentimentQDAP
## cor -0.18898224 NA -0.18898224
## cor.t.statistic -0.19245009 NA -0.19245009
## cor.p.value 0.87896228 NA 0.87896228
## lm.t.value -0.19245009 NA -0.19245009
## r.squared 0.03571429 NA 0.03571429
## RMSE 0.67357531 0.8164966 1.19023807
## MAE 0.61111111 0.6666667 0.83333333
## Accuracy 1.00000000 1.0000000 0.66666667
## Precision NaN NaN 0.00000000
## Sensitivity NaN NaN NaN
## Specificity 1.00000000 1.0000000 0.66666667
## F1 0.00000000 0.0000000 0.00000000
## BalancedAccuracy NaN NaN NaN
## avg.sentiment.pos.response 0.27777778 0.0000000 -0.16666667
## avg.sentiment.neg.response NaN NaN NaN
## NegativityQDAP PositivityQDAP
## cor 0.18898224 -0.18898224
## cor.t.statistic 0.19245009 -0.19245009
## cor.p.value 0.87896228 0.87896228
## lm.t.value 0.19245009 -0.19245009
## r.squared 0.03571429 0.03571429
## RMSE 0.60858062 0.67357531
## MAE 0.44444444 0.61111111
## Accuracy 1.00000000 1.00000000
## Precision NaN NaN
## Sensitivity NaN NaN
## Specificity 1.00000000 1.00000000
## F1 0.00000000 0.00000000
## BalancedAccuracy NaN NaN
## avg.sentiment.pos.response 0.44444444 0.27777778
## avg.sentiment.neg.response NaN NaN

Preprocessing

If necessary, a preprocessing stage tailored to specific needs can be implemented, such as the function ngram_tokenize(), which extracts n-grams from the corpus.

tdm <- TermDocumentMatrix(corpus,
                          control=list(wordLengths=c(1, Inf),
                                       tokenize=function(x) ngram_tokenize(x, char=FALSE, ngmin=1, ngmax=2)))
## Dictionary type: weighted (words with individual scores)
## Total entries: 7
## Positive entries: 4 (57.14%)
## Negative entries: 3 (42.86%)
## Neutral entries: 0 (0%)
##
## Details
## Average score: 5.814314e-06
## Median: 1.602469e-16
## Min: -0.4372794
## Max: 0.4381048
## Standard deviation: 0.301723
## Skewness: 0.00276835
dict
## Type: weighted (words with individual scores)
## Intercept: -5.102483e-05
## -0.44 bad
## -0.29 very bad
## 0.29 good

Performance optimization

To speed up the computation, the analysis can be restricted to a single sentiment rule, so that only the desired measure is calculated. For example (following the package documentation), only the Loughran-McDonald dictionary is evaluated here:

sentiment <- analyzeSentiment(documents, rules=list("SentimentLM"=list(ruleSentiment, loadDictionaryLM())))
sentiment

## SentimentLM
## 1 0.5
## 2 0.5
## 3 0.0
## 4 -0.5
## 5 -0.5

Language support and scalability

The package can be adapted for use with other languages. To do this, two points need to be changed:

  • Preprocessing:  Use the parameter language="..." to perform all preprocessing operations for the given language.
  • Dictionary:  You can use the included dictionary generation method, which can automatically generate a dictionary of positive and negative words for a given language.

The following example works with German text. Finally, we run the sentiment analysis.

documents <- c("Das ist ein gutes Resultat", "Das Ergebnis war schlecht")
sentiment <- analyzeSentiment(documents, language="german")
sentiment
## GermanSentiment
## 1 0.0
## 2 -0.5
## [1] positive negative
## Levels: negative positive

Similarly, a dictionary with custom sentiment scores can be used.

woorden <- c("goed", "slecht")
scores <- c(0.8, -0.5)
## DutchSentiment
## 1 -0.5

 

Example

We use the Reuters crude oil news included in the tm package.

# Analyze sentiment
sentiment <- analyzeSentiment(crude)
# Count the number of positive and negative news releases
table(convertToDirection(sentiment$SentimentLM))
##
## negative positive
## 16 4
# News items with the highest and lowest sentiment
crude[[which.max(sentiment$SentimentLM)]]$meta$heading

## [1] "HOUSTON OIL <HO> RESERVES STUDY COMPLETED"

crude[[which.min(sentiment$SentimentLM)]]$meta$heading

## [1] "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
# View summary statistics of the sentiment variable
summary(sentiment$SentimentLM)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.08772 -0.04366 -0.02341 -0.02953 -0.01375 0.00000
# Visualize the distribution of the standardized sentiment variable
hist(scale(sentiment$SentimentLM))

# Compute the correlation between the dictionaries
cor(sentiment[, c("SentimentLM", "SentimentHE", "SentimentQDAP")])
## SentimentLM SentimentHE SentimentQDAP
## SentimentLM 1.0000000 0.2769878 0.4769730
## SentimentHE 0.2769878 1.0000000 0.6141075
## SentimentQDAP 0.4769730 0.6141075 1.0000000
# Crude oil news between 1987-02-26 and 1987-03-02
plotSentiment(sentiment$SentimentLM)

plotSentiment(sentiment$SentimentLM, x=date, cumsum=TRUE)

Word counting

Counting words:

# Count words (without stop words)
countWords(documents)
## WordCount
## 1 3
# Count all words (including stop words)
countWords(documents, removeStopwords=FALSE)
## WordCount
## 1 4

 

References

Feinerer, Ingo, Kurt Hornik, and David Meyer. 2008. "Text Mining Infrastructure in R." Journal of Statistical Software 25(5): 1-54.

Tetlock, Paul C. 2007. "Giving Content to Investor Sentiment: The Role of Media in the Stock Market." Journal of Finance 62(3): 1139-68.

