Perfectionism in the Public Domain: a Natural Language Processing Approach
Aug 5, 2018
Mike Page
29 minute read


1. INTRODUCTION


1.1 Introduction to the problem

Recent meta-analytical evidence demonstrates that levels of perfectionism in Western populations have increased linearly over the past three decades (Curran & Hill, 2017). This rise has coincided with rapid growth in the number of research articles investigating the outcomes, processes, and characteristics associated with perfectionism since the introduction of the first multidimensional perfectionism measures in the early 1990s. Despite a growing body of perfectionism literature in the academic domain, little is known (from the perspective of the academic) about how perfectionism is reported in the public domain. It is important that academic research is accurately translated and disseminated to the broader public; nonetheless, the extent to which this holds true in the realm of perfectionism is unknown.

1.2 Perfectionism

Broadly defined, perfectionism is a multidimensional personality trait consisting of two higher-order dimensions: perfectionistic strivings and perfectionistic concerns (Stoeber & Otto, 2006). Perfectionistic strivings capture the setting of high performance standards and self-oriented strivings for perfection, whereas perfectionistic concerns capture negative reactions to imperfections and mistakes, and the fear of negative social appraisal (Gotwals, Stoeber, Dunn, & Stoll, 2012). These two dimensions form part of a hierarchical model, a heuristic representative of the range of different models that exist (Hill, 2016). Multiple reviews support a typical pattern of findings: perfectionistic strivings are associated with adaptive outcomes, processes, and characteristics, whereas perfectionistic concerns are associated with maladaptive outcomes, processes, and characteristics (e.g., Gotwals et al., 2012; Stoeber, 2011; Stoeber & Otto, 2006). One may therefore expect public reporting of perfectionism to reflect this pattern of findings.

1.3 Framing the problem

An assessment of the meaning and social understanding of perfectionism, both in regards to whether it is perceived as a positive and/or negative trait (in line with the research on perfectionistic strivings and perfectionistic concerns), and as inferred from the other words and topics it coalesces with, would contribute to research in this area. Fortunately, a wealth of natural language processing tools are at the researcher’s disposal to answer such questions, including sentiment analysis and machine learning methods such as topic modelling. The purpose of this project is, therefore, to employ a variety of natural language processing techniques to better understand how perfectionism is reported in the public domain.

2. DATA WRANGLING


2.1 NewsRiver API client

To obtain data, an API client for the NewsRiver API was built using the ‘httr’ package. Code for the API client (alongside code for the whole project) can be found in the appendix; where relevant, code is also included in the text. The NewsRiver API was selected for its large library of news sources and broad range of programmable search parameters. Accordingly, the API client was designed to accept a variety of search parameters, from search dates to title and text keyword searches (among others):

# Set date range for api search

search_dates <- seq(as.Date("2017-07-01"), as.Date("2018-07-01"), by = "months")

# Set query parameters to be called

query <- sprintf('title:("perfectionism" OR "perfect") AND text:"perfectionism" AND language:en AND discoverDate:[%s TO %s]', search_dates, search_dates %m+% months(1))

2.2 Wrangling JSON files

The API returned a JSON file, which was parsed as text and stored in a data frame object (in this instance, a tibble). The returned tibble contained 19 variables, many of which held metadata and non-essential information (e.g., read time, website icon URL, etc.), and many of these non-essential variables contained large quantities of NA values. Consequently, four key variables of interest (title, text, date, and website) were selected in line with the aims of the project. The website and article publication date variables were renamed, and the article publication date variable was parsed into date format:

# Return error and stop execution if a json is not returned

if (http_type(resp) != "application/json") {
  stop("API did not return json", call. = FALSE)
}

# Return error if there is a http error, else parse the content from the json
# file and store the title, text, date, and website in a tibble

if (http_error(resp) == TRUE) {
  warning("The request failed")
} else {
  news_tbl <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE) %>%
    as_tibble()

  if (nrow(news_tbl) != 0) {
    news_tbl <- news_tbl %>%
      mutate(date = as.Date(discoverDate), website = website.domainName) %>%
      select(title, text, date, website)
  }
}

2.3 Cleaning variables

The next step in the data wrangling process involved cleaning the selected variables. The transformed tibble was searched for NA values; the website variable was found to contain three. As the website variable was not needed for all analyses, these observations were kept in order to maintain sample size.

After parsing the JSON files, many non-ASCII unicode characters (e.g., curly apostrophes, as in “i\u2019m”) were found in the text and title variables and were transliterated to ASCII characters using the ‘stringi’ package:

perf_news %<>% mutate(text = stringi::stri_trans_general(text, "latin-ascii"), title = stringi::stri_trans_general(title, "latin-ascii"))

Next, the tibble was searched for duplicate observations. Strings in the title column were transformed to lower case, and duplicate observations were then removed using the distinct function from ‘dplyr’:

perf_news %<>% mutate(title = str_to_lower(title))
perf_news %<>% distinct(title, .keep_all = TRUE)

Finally, the tibble was manually inspected for errors, and the erroneous observations were removed:

perf_news <- perf_news %>%
  filter(!(perf_news$title == perf_news$title[52])) %>% 
  filter(!(perf_news$title == perf_news$title[3]))

2.4 Tidy text

For the majority of analyses in this project, a tidy data structure was used. A tidy data structure is one where each variable is a column, each observation is a row, and each type of observational unit is a table (Wickham, 2014). Accordingly, the cleaned tibbles above were tokenised into a ‘tidy text’ format, with one word per row as specified in Silge & Robinson (2017):

tidy_news <- perf_news %>% unnest_tokens(word, text)

Finally, stop words (e.g., “the”, “to”, “of”, etc.) were removed:

tidy_news <- tidy_news %>% anti_join(stop_words)

3. DATA SETS

The wrangled data set, tidy_news, contained 18,160 observations, split across 74 news articles from a variety of sources. Each article contained the word ‘perfectionism’ or ‘perfect’ in the title and ‘perfectionism’ in the text on at least one occasion, as specified in the search parameters in section 2.1. The date range of the articles was from 2017-11-22 to 2018-07-03 (these limits were imposed by the API). A subset of the data can be seen in the table below:

title date website word
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com mental
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com illness
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com claims
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com lives
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com music
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com helpline
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com seeks
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com change
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com jess
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com cornelius
perfectionism and poverty: why musicians struggle with mental health 2018-06-20 theguardian.com named

It should be noted that although the tidy_news data set formed the basic data structure on which an array of analyses were performed, the data set was further wrangled into several other forms, including a tidy data set tokenised by sentence and a document-term matrix (among others). In the interest of conciseness, examples of these data sets have been omitted from this report; however, code for these transformations can be found in the appendix.

4. ANALYSES


4.1 Frequency distributions

In order to understand the underlying structure of the data, several exploratory data analyses were performed.

Term frequency distribution: the frequency of each term in each article was calculated and then divided by the total number of terms in that article (i.e., term frequency), as can be seen in Figure 1.
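
The calculation, condensed from the appendix code:

# Count each word within each article, then compute article totals

words_news <- perf_news %>% 
  unnest_tokens(word, text) %>% 
  count(title, word, sort = TRUE)

words_total <- words_news %>% 
  group_by(title) %>% 
  summarise(total = sum(n))

words_news <- left_join(words_news, words_total)

# Plot the distribution of term frequency (n/total) for each article

ggplot(words_news, aes(n/total, fill = title)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.025)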

Figure 1: Term frequency across all articles.

The distribution in Figure 1 demonstrates a large positive skew, as would be expected in a corpus of natural language such as the one in this study; some types of words, such as articles (e.g., ‘the’), appear far more frequently than others.

Zipf’s law: to further explore the term frequency distribution in Figure 1, term frequency was plotted against frequency rank (on log base 10 scales). To calculate frequency rank, terms were grouped by title and ranked by row number (i.e., by descending frequency). Additionally, a linear model was fitted to further examine the underlying data structure: the log base 10 of term frequency was regressed against the log base 10 of rank, and the model coefficients were used to plot a reference line (Figure 2).
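
The ranking and model fit, condensed from the appendix:

# Rank terms within each article by descending frequency

freq_by_rank <- words_news %>% 
  group_by(title) %>% 
  mutate(rank = row_number(), `term frequency` = n/total)

# Fit log10(term frequency) against log10(rank); the coefficients give the reference line in Figure 2

lm(log10(`term frequency`) ~ log10(rank), data = freq_by_rank)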

Figure 2: Zipf’s law applied to the data.

The relationship between term frequency and rank in Figure 2 approximates Zipf’s law (a form of power law): term frequency is inversely proportional to rank (i.e., the most frequent word occurs approximately twice as often as the second most frequent word, and so on). The middle section of the rank range in Figure 2 has a gradient of -0.76, as determined by the fitted linear model, demonstrating that the data in this study do not strictly abide by Zipf’s law (a perfect case of Zipf’s law would yield a gradient of -1). Furthermore, there are small deviations at both the higher and lower ranked words, meaning the data set contains fewer rare and common words than a power law would typically predict. Nonetheless, such deviations are not uncommon in many kinds of natural language (Silge & Robinson, 2017). Collectively, Figures 1 and 2 confirm that the data in this project represent a typical corpus of natural language.

4.2 Word frequencies

To understand the basic text composition of the corpus of news articles, a list of custom stop words was removed from the data (see appendix), and the most common words were found (Figure 3).
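
The custom stop word list and frequency plot, condensed from the appendix:

# Add perfectionism terms and author names to the standard stop word list

custom_stop_words <- bind_rows(tibble(word = c("perfect", "perfection", "perfectionism",
                                               "perfectly", "perfectionist", "perfectionists",
                                               "curran", "thomas", "andy", "hill"),
                                      lexicon = "custom"), stop_words)

# Plot words appearing more than 50 times

tidy_news %>%
  anti_join(custom_stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 50) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "Word Frequency (n)", x = NULL)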

Figure 3: The most common words across all articles.

The most common words match those one would expect in a corpus of language discussing perfectionism. For example, words such as ‘expectations’, ‘standards’, and ‘pressure’ appear; these words are commonly used in the academic domain to reflect the high expectations perfectionists place upon themselves. Other words such as ‘students’, ‘college’, and ‘university’ also appear, which is again unsurprising given that a large body of academic research has been conducted in university samples. This basic frequency analysis indicates that the vocabulary of this corpus parallels that of the academic domain. Figure 4 explores the most common words further (with word size proportional to frequency):

Figure 4: A word cloud of the most frequent perfectionism terms used across all articles.

4.3 Sentiment Analyses

Sentiment time analysis: having established that the data represented a typical corpus of natural language, and contained words in line with those one may expect in a corpus of text examining perfectionism, a sentiment analysis over time was performed. This was done in order to explore the underlying polarity of the data (i.e., is perfectionism reported in a positive or negative fashion?) and to see whether this has changed over time. To achieve this, articles were first tokenised by sentence and each sentence was assigned a number. The sentences were then tokenised by word. To compute a sentiment score for each article, individual word sentiment scores were summed within each sentence, and the sentence scores were then summed within each article. Analysing sentiment in sentence units, as was done here, better captures the structure of natural language than simpler unigram methods such as summing individual word sentiment scores across a whole text (Silge & Robinson, 2017). In order to provide a multifaceted sentiment analysis, three sentiment lexicons (AFINN, Bing, and NRC) were used to determine the sentiment of each article. For each lexicon, sentiment was split into negative and positive classifications, in line with the theory on perfectionistic strivings and perfectionistic concerns, as can be seen in Figure 5.
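
The sentence-level scoring for the AFINN lexicon, condensed from the appendix (the Bing and NRC pipelines follow the same pattern):

# Tokenise by sentence (tracking sentence number), then by word

tidier_news <- perf_news %>% 
  unnest_tokens(sentence, text, token = "sentences") %>% 
  group_by(title) %>% 
  mutate(sentence_number = row_number()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) %>% 
  anti_join(custom_stop_words)

# Sum word scores within sentences, then sentence scores within articles

afinn <- tidier_news %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(title, sentence_number) %>% 
  mutate(sentiment = sum(score)) %>%
  select(date, title, sentence_number, sentiment) %>% 
  distinct() %>% 
  group_by(title) %>% 
  mutate(sent_sum = sum(sentiment)) %>% 
  ungroup() %>% 
  select(date, title, sent_sum) %>% 
  distinct()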

Figure 5: Sentiment analysis over time.

The data in Figure 5 reveal several noteworthy features. Firstly, the frequency of publications over time remains consistent, with no time period demonstrating a noticeable increase in publication frequency. Secondly, there appear to be no clear trends in sentiment over time (i.e., the distribution of sentiment scores remains consistent). Finally, the plots reveal that articles report perfectionism in both a positive and negative manner. Under the AFINN and Bing lexicons, perfectionism is most frequently reported in a negative fashion, whereas the NRC lexicon demonstrates the opposite pattern. One reason for the observed difference in sentiment scores between lexicons may be the higher ratio of positive to negative words in the NRC lexicon in comparison to the AFINN and Bing lexicons. The table below demonstrates this difference between the Bing and NRC lexicons:

sentiment     n  lexicon
negative   3324  nrc
positive   2312  nrc
negative   4782  bing
positive   2006  bing
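
A sketch of how these counts can be derived (this snippet is not part of the appendix code; it assumes the lexicons are retrieved via tidytext’s get_sentiments()):

# Count positive and negative entries in each lexicon

bind_rows(get_sentiments("nrc") %>% mutate(lexicon = "nrc"),
          get_sentiments("bing") %>% mutate(lexicon = "bing")) %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(sentiment, lexicon)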

Frequent positive and negative words: the most frequent positive and negative words across all texts were found using the Bing sentiment lexicon, as shown in Figure 6.
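
Condensed from the appendix:

# Count each word's contribution to positive and negative sentiment

bing_word_counts <- tidier_news %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE)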

Figure 6: The most frequent positive and negative terms contributing to sentiment.

Similar to the term frequency analysis conducted earlier, many predictable terms appear. For example, of the most frequent negative terms, ‘failure’, ‘mistakes’, and ‘unrealistic’ are commonly used in the academic literature to denote the unrealistic expectations perfectionists impose upon themselves, and their associated fear of failure and mistakes. This also extends to terms such as ‘hard’, ‘unhealthy’, and ‘depression’, terms used in the academic literature to denote the experiences perfectionists typically encounter. In contrast, among the most frequent positive terms, many appear unexpected. Some terms, such as ‘positive’ and ‘healthy’, do parallel those used in the academic literature (denoting perfectionistic strivings); however, terms such as ‘love’, ‘worth’, and ‘happy’ are not common occurrences in academic texts. Indeed, one would expect literature discussing perfectionism to discuss low self-worth and a deficit of happiness and love, reflecting the relationship and life difficulties associated with perfectionism (Dunkley et al., 2003; Hewitt & Flett, 2002). One explanation for the occurrence of unexpected words is the use of unigrams to determine sentiment: a statement such as ‘not happy’ reflects a negative sentiment, yet yields a positive score under the methods employed above (i.e., ‘not’ is discarded as a stop word and ‘happy’ is treated as positive). One way to overcome and further explore this is through the use of bigrams.

4.4 Bigrams

Bigram sentiment time analysis: in order to account for any negation words in the data set and their effect on sentiment scores, a further sentiment analysis using bigrams was performed. First, the perf_news data set was tokenised by bigrams while keeping track of sentence number. Then, a sentiment analysis was performed by reversing the sentiment score of any bigrams whose first word matched a list of negation words (‘not’, ‘never’, ‘no’, ‘without’). As in the unigram sentiment analysis conducted previously, individual word sentiment scores were summed within each sentence, and the sentence scores were then summed within each article to achieve a sentiment score for each article, as can be seen in Figure 7.
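
The negation handling, condensed from the appendix:

# Tokenise by bigrams (tracking sentence number) and split into word pairs

bigrams_separated <- perf_news %>%
  unnest_tokens(sentence, text, token = "sentences") %>% 
  group_by(title) %>% 
  mutate(sentence_number = row_number()) %>% 
  ungroup() %>% 
  unnest_tokens(bigram, sentence, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# Reverse the AFINN score of any word preceded by a negation word

negation_words <- c("not", "never", "no", "without")

bigrams_afinn <- bigrams_separated %>% 
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  mutate(score = ifelse(word1 %in% negation_words, -score, score))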

Figure 7: Sentiment analysis over time using bigrams.

The data in Figure 7 demonstrate that, after controlling for negation words, the articles still maintained a split across both positive and negative sentiments. However, in this instance, there was a heavier weighting towards negative sentiment (as would be expected after controlling for negation words).

Bigram frequency: having tokenised the data set by bigrams, a frequency count of the most common bigrams was computed to further explore the underlying structure of the data, as can be seen in the table below:
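
The counting step, condensed from the appendix:

# Count bigrams after removing stop words from either position

bigram_counts <- perf_news %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>% 
  count(word1, word2, sort = TRUE)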

word1          word2       n
college        students   45
mental         health     32
social         media      29
socially       prescribed 20
psychological  bulletin   18
thomas         curran     17
depressive     symptoms   14
john           university 14
st             john       14
york           st         14

The data in the above table demonstrate that the public reporting of perfectionism is in line with the latest research in the academic domain. For instance, the most frequent bigrams closely match a recent publication by Curran and Hill (2017) in Psychological Bulletin, in which ‘social media’, ‘socially prescribed (perfectionism)’, and ‘mental health’ are discussed in the context of ‘college students’. (Note that the bigrams ‘york st’, ‘st john’, and ‘john university’ are fragments of the four-word phrase ‘york st john university’.)

Bigram igraph: to visualise the relationships among bigrams simultaneously, the data were arranged into a network of connected nodes, as can be seen in Figure 8.
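
Condensed from the appendix (edge transparency denotes bigram frequency):

# Build the graph from bigrams occurring more than five times

bigram_graph <- bigram_counts %>% 
  filter(n > 5) %>% 
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()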

Figure 8: The most frequent bigrams visualised as a network of connected nodes.

4.5 Pairwise correlations

Pairwise correlations: to determine which words various perfectionism terms (e.g., perfect, perfectionist, etc.) coalesce with, pairwise correlations were calculated using the phi coefficient. This identified how much more likely it was that a word would appear in conjunction with a perfectionism term, or that neither appeared, than that one appeared without the other (Figure 9).
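
Condensed from the appendix (tidy_news is equivalent here, as correlations are computed across article titles):

# Keep reasonably common words and compute pairwise phi coefficients across articles

word_cors <- tidy_news %>% 
  group_by(word) %>% 
  filter(n() >= 20) %>% 
  pairwise_cor(word, title, sort = TRUE)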

Figure 9: Pairwise correlations with various perfectionism terms.

The words correlated with perfectionism terms mirror those found in the frequency analysis conducted previously. These words include ‘mistakes’, ‘fear’, ‘failure’, and ‘expectations’, all words in the arsenal of the academic perfectionism lexicon. Moreover, the terms correlated with ‘perfect’ and ‘perfectionists’ are in keeping with this theme, albeit they are less frequently employed in the academic literature. For instance, the terms ‘pregnancy’, ‘person’, ‘kids’, and ‘week’ appear infrequently in the academic domain, although a small handful of articles examining these topics does exist (e.g., Macedo et al., 2009).

4.6 Topic models

Model fit: one question of interest is what topics perfectionism coalesces with in the collected news articles. To answer such a question, the data must be divided into a set of natural groups. One method for achieving this is topic modelling, a method of unsupervised classification of documents. In topic modelling, each document is treated as a mixture of topics, and each topic as a mixture of words. This prevents documents being categorised into discrete groups and allows content to overlap in a way that represents the structure of natural language. Consequently, topic models were fitted to the data. First, a document-term matrix was calculated. In a document-term matrix, each row represents one document (in this case an article), each column represents one term (in this case a word), and each value contains the number of appearances of that term in the document. As most terms do not appear in most documents, document-term matrices are usually implemented as sparse matrices, as can be seen below:

## <<DocumentTermMatrix (documents: 74, terms: 5226)>>
## Non-/sparse entries: 13002/373722
## Sparsity           : 97%
## Maximal term length: 26
## Weighting          : term frequency (tf)
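
The matrix above was cast from the tidy data set as follows (appendix):

# Cast word counts per article into a document-term matrix

news_dtm <- tidy_news %>% 
  anti_join(custom_stop_words) %>%
  count(title, word) %>% 
  cast_dtm(title, word, n)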

After the document-term matrix was calculated, the ‘ldatuning’ package was used to determine the optimum number of topics (k). The results of this optimisation process can be observed in Figure 10.
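
Condensed from the appendix:

# Evaluate four tuning metrics for k = 2 to 50 topics

lda_fit <- FindTopicsNumber(news_dtm,
                            topics = seq(from = 2, to = 50, by = 1),
                            metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
                            method = "Gibbs", control = list(seed = 77))

FindTopicsNumber_plot(lda_fit)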

Figure 10: Topic model fit according to the number of topics.

The results in Figure 10 indicate that the optimum number of topics (k) occurs in the range of 6-15, as indicated by the extrema of the four tuning metrics. Subsequently, topic models were fitted incrementally across this range and qualitatively evaluated at each stage. In this instance, the optimum number of topics occurred at k = 9.

Word-topic probabilities: having established the optimum number of topics (k = 9), topic models were then fitted using latent Dirichlet allocation (a generative statistical model). For each term in each document, the model computes, among other metrics, the probability of that term being generated from a given topic (known as the beta probabilities). Using these probabilities, the most common words within topics were extracted and plotted, as demonstrated in Figure 11.
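
Condensed from the appendix:

# Fit the LDA model and extract the per-topic-per-word (beta) probabilities

perf_lda <- LDA(news_dtm, k = 9, control = list(seed = 1234))
perf_topics <- tidy(perf_lda, matrix = "beta")

# Keep the five most probable terms per topic for plotting

perf_top_terms <- perf_topics %>% 
  group_by(topic) %>% 
  top_n(5, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta)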

Figure 11: Word-topic probabilities.

As can be seen in Figure 11, some topics overlap. For example, topics three, five, and seven contain at least one of the terms ‘life’ and ‘people’ in their top two terms, whereas topics four, seven, and nine contain the terms ‘kids’ or ‘parents’. Nonetheless, there are meaningful differences between the nine topics, with some in keeping with the perfectionism themes highlighted so far. For example, topic two is unique in its use of terms such as ‘college’ and ‘students’ and reflects the academic research of Curran and Hill (2017) highlighted previously. Other topics are also distinct, but their relevance to perfectionism is less clear. Topic one, for example, contains key terms such as ‘code’, ‘music’, and ‘time’, terms not frequently employed in the academic literature. That is not to say that all discussions of perfectionism in the public domain should exactly mirror those in the academic domain; rather, it is interesting to note the diversity of topics discussed outside of the academic setting, as exemplified in this topic model.

5. SUMMARY


Three key themes emerged in the data:

  1. The most frequently used words across all articles closely mirror those used in the academic literature. This is true across both the unigram and bigram term frequencies. There is a general discussion around the themes of ‘pressure’ and ‘standards’ in ‘universities’ and ‘college’, all themes researched in great depth in the academic setting. This provides preliminary evidence that perfectionism is, at least in part, being discussed in the public domain in a manner that reflects the broader body of academic research.

  2. The sentiment analyses revealed two important features of the data. Firstly, perfectionism is discussed in both a positive and negative fashion. This dichotomy closely mirrors the large body of research in the academic domain investigating perfectionistic strivings and perfectionistic concerns, respectively. Secondly, in general, there is a heavier weighting towards the more negative aspects of perfectionism, and this effect held true even after controlling for negation words (e.g., not, never, etc.). Interestingly, this pattern of findings also mirrors that in the academic domain: while there is evidence to suggest that perfectionism is associated with some positive outcomes, processes, and characteristics, there is a larger body of evidence indicating that perfectionism has a greater association with negative outcomes, processes, and characteristics (Hill, 2014).

  3. The topic model analysis revealed a large diversity in topics across the articles. This implies that in some instances, perfectionism is being talked about in settings that typically fall outside of academic research. This may suggest that in some instances, perfectionism is being reported in the public domain in a manner not in keeping with the broader scientific evidence. However, this assertion requires careful scrutiny, as the topics require contextualising within each article to understand their significance, and the impact of any misreporting would require further analysis.

6. RECOMMENDATIONS


  1. The NewsRiver API imposed limits on article retrieval dates, so future research should endeavor to investigate perfectionism over a larger date range. One method to achieve this would be to build a web crawler in R that scrapes news texts from a list of provided URLs (see the first sketch following this list).

  2. In addition to negation words, natural language often contains valence shifters (i.e., amplifiers (intensifiers), adversative conjunctions, etc.) which affect the polarity of text sentiment. Future research may use a package such as ‘sentimentr’ to account for these additional valence shifters and increase the validity of the text sentiment (see the second sketch following this list).

  3. The data in this study were limited to articles published in English. While the vast majority of academic literature is published in English, the findings of this study cannot be extended outside of English-speaking, Western populations. This is because societal and cultural factors affect the implicit meaning behind psychological constructs such as perfectionism (Hui & Triandis, 1989). Therefore, it is recommended that future research investigates the sentiment and topics of perfectionism in other languages.
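
For recommendation 1, a minimal sketch of such a scraper, assuming the ‘rvest’ package (not used elsewhere in this project) and that each target page holds its body text in <p> tags; scrape_article and urls are hypothetical names:

library(rvest)

# Hypothetical helper: scrape the title and body text of a single article URL

scrape_article <- function(url) {
  page <- read_html(url)
  tibble(title = html_text(html_nodes(page, "h1"))[1],
         text = paste(html_text(html_nodes(page, "p")), collapse = " "),
         website = domain(url))  # domain() from the 'urltools' package
}

# scraped_news <- map_dfr(urls, scrape_article), where urls is a character vector of article URLs

For recommendation 2, a minimal sketch assuming the ‘sentimentr’ package (not part of this project’s code):

library(sentimentr)

# sentiment_by() scores sentences while adjusting for valence shifters
# (negators, amplifiers, de-amplifiers, and adversative conjunctions)

article_sentiment <- sentiment_by(get_sentences(perf_news$text))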

7. REFERENCES


Curran, T., & Hill, A. P. (2017, December 28). Perfectionism Is Increasing Over Time: A Meta-Analysis of Birth Cohort Differences From 1989 to 2016. Psychological Bulletin. Advance online publication. http://dx.doi.org/10.1037/bul0000138.

Dunkley, D. M., Zuroff, D. C., & Blankstein, K. R. (2003). Self-critical perfectionism and daily affect: dispositional and situational influences on stress and coping. Journal of Personality and Social Psychology, 84, 234-252.

Gotwals, J. K., Stoeber, J., Dunn, J. G., & Stoll, O. (2012). Are perfectionistic strivings in sport adaptive? A systematic review of confirmatory, contradictory, and mixed evidence. Canadian Psychology, 53, 263-279.

Hewitt, P. L., & Flett, G. L. (2002). Perfectionism and stress processes in psychopathology. In G.L. Flett & P.L. Hewitt (Eds.), Perfectionism: Theory, Research, and Treatment (pp. 255-284). Washington, DC: American Psychological Association.

Hill, A. P. (2014). Perfectionistic strivings and the perils of partialling. International Journal of Sport & Exercise Psychology, 12, 302-315.

Hill, A.P. (2016). Conceptualizing perfectionism: An overview and unresolved issues. In A.P. Hill (Ed.), The Psychology of Perfectionism in Sport, Dance and Exercise (pp. 3-30). London: Routledge.

Hui, C., & Triandis, H. C. (1989). Effects of culture and response format on extreme response style. Journal of Cross-Cultural Psychology, 20, 296-309.

Macedo, A., Bos, S. C., Marques, M., et al. (2009). Perfectionism dimensions in pregnancy: A study in Portuguese women. Archives of Women's Mental Health, 12, 43. https://doi.org/10.1007/s00737-008-0042-5

Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. CA: O'Reilly Media.

Stoeber, J. (2011). The dual nature of perfectionism in sports: Relationships with emotion, motivation, and performance. International Review of Sport and Exercise Psychology, 4(2), 128-145.

Stoeber, J., & Otto, K. (2006). Positive conceptions of perfectionism: Approaches, evidence, challenges. Personality and Social Psychology Review, 10, 295-319.

Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.

8. APPENDIX


Documented code:

# --------------------
# LIBRARIES
# --------------------

# Load libraries

library(tidyverse)
library(httr)
library(jsonlite)
library(xml2)
library(urltools)
library(lubridate)
library(magrittr)
library(tidytext)
library(tidyr)
library(wordcloud)
library(reshape2)
library(igraph)
library(ggraph)
library(widyr)
library(topicmodels)
library(ldatuning)

# --------------------
# DATA WRANGLING
# --------------------

# Create NewsRiver API client

# Create a function to retrieve data based on query parameters. See 'API REFERENCE' on newsriver.io for a list of query parameters. The function returns a tibble containing the title, text, date, and website relating to the query parameters.

newsriver_api <- function(query) {
  
  # Set rate limit and show progress. The API rate limit is 25 calls per window per API token. The rate limiting window is 15 minutes long.
  
  p$tick()$print()
  Sys.sleep(37)
  
  # Create URL encoded base to be passed custom query parameters
  
  url_base <- "https://api.newsriver.io/v2/search" %>% 
    param_set("query", url_encode(query)) %>% 
    param_set("sortBy", "_score") %>% 
    param_set("sortOrder", "DESC") %>% 
    param_set("limit", "100")
  
  # Make GET request
  
  resp <- GET(url_base, ua, add_headers(Authorization = api_token))
  
  # Return error and stop execution if a json is not returned
  
  if (http_type(resp) != "application/json") {
    stop("API did not return json", call. = FALSE)
  }
  
  # Return error if there is a http error, else parse the content from the json file and store the title, text, date, and website in a tibble
  
  if (http_error(resp) == TRUE) {
    warning("The request failed")
  } else {
    news_tbl <- fromJSON(content(resp, as = "text", encoding = "UTF-8"), flatten = TRUE)  %>%
      as_tibble()
    
    if (nrow(news_tbl) != 0) {
      news_tbl <- news_tbl %>% mutate(date = as.Date(discoverDate), website = website.domainName) %>% select(title, text, date, website)
    }
    
    news_tbl
  }
}

# Set user agent and api token

ua <- user_agent("insert user agent here")
api_token <- "insert token here"
  
# Set date range for api search
  
search_dates <- seq(as.Date("2017-07-01"), as.Date("2018-07-01"), by = "months")

# Set query parameters to be called

query <- sprintf('title:("perfectionism" OR "perfect") AND text:"perfectionism" AND language:en AND discoverDate:[%s TO %s]', search_dates, search_dates %m+% months(1))

# Initialise progress bar

p <- progress_estimated(length(search_dates))

# Call newsriver_api function over the vector of query parameters and return a tibble

perf_news <- map_dfr(query, newsriver_api)

# Remove unicode characters from title and text

perf_news %<>% mutate(text = stringi::stri_trans_general(text, "latin-ascii"), title = stringi::stri_trans_general(title, "latin-ascii"))

# Convert titles to lower case to enable duplicate detection

perf_news %<>% mutate(title = str_to_lower(title))

# Remove duplicates

perf_news %<>% distinct(title, .keep_all = TRUE)

# Inspect each article and remove any errors or unrelated texts (the hard-coded row indices below refer to the tibble before filtering)

perf_news <- perf_news %>%
  filter(!(perf_news$title == perf_news$title[52])) %>% 
  filter(!(perf_news$title == perf_news$title[3]))

# Unnest perf_news text column

tidy_news <- perf_news %>% unnest_tokens(word, text)

# Remove stop words

tidy_news <- tidy_news %>% anti_join(stop_words)

# --------------------
# DATA ANALYSIS
# --------------------

# Examine the distribution of words in the data set. First, calculate the term frequency of each term in each article, then divide by the total number of terms in that article. Finally, plot the results.

words_news <- perf_news %>% 
  unnest_tokens(word, text) %>% 
  count(title, word, sort = TRUE) %>% 
  ungroup()

words_total <- words_news %>% 
  group_by(title) %>% 
  summarise(total = sum(n))

words_news <- left_join(words_news, words_total)

ggplot(words_news, aes(n/total, fill = title)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.025) +
  theme(strip.text = element_text(size = 7))

# To further examine the structure of the data, fit and plot a linear model to freq_by_rank to demonstrate Zipf's law has been maintained.

freq_by_rank <- words_news %>% 
  group_by(title) %>% 
  mutate(rank = row_number(), `term frequency` = n/total)

# The printed coefficients supply the intercept and slope for the reference line below

lm(log10(`term frequency`) ~ log10(rank), data = freq_by_rank)

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = title)) +
  geom_abline(intercept = -1.1029, slope = -0.7564, color = "black", linetype = 2) +
  geom_line(size = 0.5, alpha = 0.5, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()

# Create custom stop words

custom_stop_words <-  bind_rows(tibble(word = c("perfect", 
                                                "perfection", 
                                                "perfectionism", 
                                                "perfectly", 
                                                "perfectionist", 
                                                "perfectionists", 
                                                "curran", 
                                                "thomas", 
                                                "andy", 
                                                "hill"), 
                                       lexicon = c("custom")), stop_words)

# Find and plot the most common words in tidy_news after applying custom stop words

# As a bar chart:

tidy_news %>%
  anti_join(custom_stop_words) %>% 
  count(word, sort = TRUE) %>% 
  filter(n > 50) %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Word Frequency (n)")

# As a word cloud:

tidy_news %>%
  anti_join(custom_stop_words) %>% 
  count(word) %>% 
  with(wordcloud(word, n, max.words = 100, colors= c("steelblue1","steelblue2","steelblue3","steelblue")))

# Create a new data set called tidier_news which is tokenized by word, but keeps track of sentence number

tidier_news <- perf_news %>% 
  unnest_tokens(sentence, text, token = "sentences") %>% 
  group_by(title) %>% 
  mutate(sentence_number = row_number()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) %>% 
  anti_join(custom_stop_words)

# Perform sentiment analyses using three different sentiment lexicons (AFINN, Bing, and NRC). Compute sentiment in sentence units by summing individual word sentiment scores across each sentence.

# AFINN

afinn <- tidier_news %>% 
  inner_join(get_sentiments("afinn")) %>%
  group_by(title, sentence_number) %>% 
  mutate(sentiment = sum(score)) %>%
  select(date, title, sentence_number, sentiment) %>% 
  distinct() %>% 
  group_by(title) %>% 
  mutate(sent_sum = sum(sentiment)) %>% 
  ungroup() %>% 
  select(date, title, sent_sum) %>% 
  distinct()

# Bing

bing <- tidier_news %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(date, title, sentence_number, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  group_by(title) %>% 
  mutate(sent_sum = sum(sentiment)) %>% 
  ungroup() %>% 
  select(date, title, sent_sum) %>% 
  distinct()

# NRC

nrc <- tidier_news %>% 
  inner_join(get_sentiments("nrc")) %>% 
  filter(sentiment %in% c("positive", "negative")) %>%
  count(date, title, sentence_number, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>% 
  group_by(title) %>% 
  mutate(sent_sum = sum(sentiment)) %>% 
  ungroup() %>% 
  select(date, title, sent_sum) %>% 
  distinct()

# Plot all three sentiment analyses on one graph

bind_rows(afinn %>% mutate(method = "AFINN"), bing %>% mutate(method = "Bing et al."), nrc %>% mutate(method = "NRC")) %>% 
  ggplot(aes(date, sent_sum, fill = method)) +
  geom_col(position = position_dodge(0.5), show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  labs(x = "Date", y = "Sentiment Score")

# Find the most common positive and negative words

bing_word_counts <- tidier_news %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()

# Plot the most common positive and negative words

# As a bar chart:

bing_word_counts %>% 
  group_by(sentiment) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(word = reorder(word, n)) %>% 
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "contribution to sentiment", x = NULL) +
  coord_flip()

# As a word cloud:

tidier_news %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  acast(word ~ sentiment, value.var = "n", fill = 0) %>% 
  comparison.cloud(colors = c("gray8", "darkorange"), max.words = 100)

# Unnest tokens by bigrams keeping track of sentence number

bigram_news <- perf_news %>%
  unnest_tokens(sentence, text, token = "sentences") %>% 
  group_by(title) %>% 
  mutate(sentence_number = row_number()) %>% 
  ungroup() %>% 
  unnest_tokens(bigram, sentence, token = "ngrams", n = 2)

# Separate bigrams into two columns, "word1", and "word2".

bigrams_separated <- bigram_news %>% 
  separate(bigram, c("word1", "word2"), sep = " ")

# Use bigrams to perform sentiment analyses by reversing the sentiment score of negated words

AFINN <- get_sentiments("afinn")

negation_words <- c("not", "never", "no", "without")

bigrams_afinn <- bigrams_separated %>% 
  filter(!word1 %in% custom_stop_words$word) %>% 
  filter(!word2 %in% custom_stop_words$word) %>% 
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(score = ifelse(word1 %in% negation_words, -score, score))

bigrams_afinn_sentiment <- bigrams_afinn %>%
  group_by(title, sentence_number) %>% 
  mutate(sentiment = sum(score)) %>%
  select(date, title, sentence_number, sentiment) %>% 
  distinct() %>% 
  group_by(title) %>% 
  mutate(sent_sum = sum(sentiment)) %>% 
  ungroup() %>% 
  select(date, title, sent_sum) %>% 
  distinct()

ggplot(bigrams_afinn_sentiment, aes(date, sent_sum, fill = date)) +
  geom_col(position = position_dodge(0.7), width = 2, show.legend = FALSE)

# Count most frequent bigrams not keeping track of sentence

bigram_counts <- perf_news %>% 
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
  separate(bigram, c("word1", "word2"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word) %>% 
  count(word1, word2, sort = TRUE)

# Count most frequent trigrams not keeping track of sentence

perf_news %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>% 
  count(word1, word2, word3, sort = TRUE)


# Create igraph object for the most frequent bigrams. The transparency of links denotes how common or rare the bigram is

bigram_graph <- bigram_counts %>% 
  filter(n > 5) %>% 
  graph_from_data_frame()

set.seed(1234)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE, arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
 
# Calculate pairwise correlations of words using the phi coefficient. This identifies how much more likely it is that either both word X and Y appear, or neither do, than that one appears without the other.

word_cors <- perf_news %>% 
  unnest_tokens(sentence, text, token = "sentences") %>% 
  group_by(title) %>% 
  mutate(sentence_number = row_number()) %>% 
  ungroup() %>%
  unnest_tokens(word, sentence) %>% 
  filter(!word %in% stop_words$word) %>% 
  group_by(word) %>% 
  filter(n() >= 20) %>% 
  pairwise_cor(word, title, sort = TRUE)

# Plot words most correlated with different perfectionism terms

word_cors %>%
  filter(item1 %in% c("perfect", "perfection", "perfectionism", "perfectly", "perfectionist", "perfectionists")) %>% 
  group_by(item1) %>%
  top_n(6) %>%
  ungroup() %>% 
  group_by(item1, item2) %>%                  
  arrange(desc(correlation)) %>%                
  ungroup() %>%
  mutate(item2 = factor(paste(item2, item1, sep = "__"), levels = rev(paste(item2, item1, sep = "__")))) %>%
  ggplot(aes(item2, correlation, fill = item1)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ item1, scales = "free") +
  coord_flip() +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab(NULL)

# Create Document Term Matrix

news_dtm <- tidy_news %>% 
  anti_join(custom_stop_words) %>%
  count(title, word) %>% 
  cast_dtm(title, word, n)

# Select the number of topics (k) for the LDA model using the 'ldatuning' package.

lda_fit <- FindTopicsNumber(news_dtm,
                           topics = seq(from = 2, to = 50, by = 1),
                           metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
                           method = "Gibbs", control = list(seed = 77), mc.cores = 2L, verbose = TRUE)

# Plot the metrics and find the extrema to determine the optimal k

FindTopicsNumber_plot(lda_fit)

# Fit topic models using latent Dirichlet allocation

perf_lda <- LDA(news_dtm, k = 9, control = list(seed = 1234))

# Extract the per-topic-per-word probabilities (beta).

perf_topics <- tidy(perf_lda, matrix = "beta")

# Find most common terms for each topic.

perf_top_terms <-  perf_topics %>% 
  group_by(topic) %>% 
  top_n(5, beta) %>% 
  ungroup() %>% 
  arrange(topic, -beta)

perf_top_terms %>% 
  group_by(topic, term) %>%                  
  arrange(desc(beta)) %>%                
  ungroup() %>%
  mutate(term = factor(paste(term, topic, sep = "__"), levels = rev(paste(term, topic, sep = "__")))) %>%
  mutate(term = reorder(term, beta)) %>% 
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_discrete(labels = function(x) gsub("__.+$", "", x)) +
  xlab(NULL)

# Extract the per-document-per-topic probabilities (gamma).

perf_documents <- tidy(perf_lda, matrix = "gamma")


# Find which words in each document were assigned to which topic.

assignments <- augment(perf_lda, data = news_dtm)