Title: Download and Load Various Text Datasets
Version: 0.4.5
Description: Provides a framework to download, parse, and store text datasets on the disk and load them when needed. Includes various sentiment lexicons and labeled text data sets for classification and analysis.
License: MIT + file LICENSE
URL: https://emilhvitfeldt.github.io/textdata/, https://github.com/EmilHvitfeldt/textdata
BugReports: https://github.com/EmilHvitfeldt/textdata/issues
Imports: fs, rappdirs, readr, tibble
Suggests: covr, knitr, rmarkdown, testthat (>= 2.1.0)
VignetteBuilder: knitr
Encoding: UTF-8
RoxygenNote: 7.3.1.9000
Collate: 'cache_info.R' 'dataset_ag_news.R' 'dataset_dbpedia.R' 'dataset_imdb.R' 'dataset_sentence_polarity.R' 'dataset_trec.R' 'embedding_glove.R' 'lexicon_nrc_vad.R' 'lexicon_nrc_eil.R' 'lexicon_nrc.R' 'lexicon_bing.R' 'lexicon_loughran.R' 'lexicon_afinn.R' 'download_functions.R' 'info.R' 'load_dataset.R' 'printer.R' 'process_functions.R' 'textdata-package.R'
NeedsCompilation: no
Packaged: 2024-05-28 20:14:47 UTC; emilhvitfeldt
Author: Emil Hvitfeldt
Maintainer: Emil Hvitfeldt <emilhhvitfeldt@gmail.com>
Repository: CRAN
Date/Publication: 2024-05-28 20:40:03 UTC
textdata: Download and Load Various Text Datasets
Description
Provides a framework to download, parse, and store text datasets on the disk and load them when needed. Includes various sentiment lexicons and labeled text data sets for classification and analysis.
Author(s)
Maintainer: Emil Hvitfeldt <emilhhvitfeldt@gmail.com> (ORCID)
Other contributors:
Julia Silge <julia.silge@gmail.com> (ORCID) [contributor]
See Also
Useful links:
https://emilhvitfeldt.github.io/textdata/
https://github.com/EmilHvitfeldt/textdata
Report bugs at https://github.com/EmilHvitfeldt/textdata/issues
List folders and their sizes in cache
Description
This function returns a tibble with the names and sizes of all folders in the specified directory. Defaults to textdata's default cache directory.
Usage
cache_info(dir = NULL)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
Value
A tibble with 2 variables:
- name
Name of the folder
- size
Size of the folder
Examples
## Not run:
cache_info()
## End(Not run)
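A minimal sketch of summarizing the cache with the returned tibble, assuming at least one dataset has already been downloaded; the name and size columns are those described under Value.

library(textdata)

info <- cache_info()
info            # one row per cached folder
sum(info$size)  # combined size of all cached folders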
Catalogue of all available data sources
Description
Catalogue of all available data sources
Usage
catalogue
Format
An object of class data.frame with 15 rows and 8 columns.
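For orientation, the catalogue can be inspected directly from the console using only base R:

library(textdata)

nrow(catalogue)   # 15 available data sources
names(catalogue)  # the 8 metadata columns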
AG's News Topic Classification Dataset
Description
The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples, for a total of 120,000 training samples and 7,600 testing samples. Version 3, updated 09/09/2015.
Usage
dataset_ag_news(
dir = NULL,
split = c("train", "test"),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- split
Character. Return training ("train") data or testing ("test") data. Defaults to "train".
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
The classes in this dataset are
World
Sports
Business
Sci/Tech
Value
A tibble with 120,000 or 7,600 rows for "train" and "test" respectively and 3 variables:
- class
Character, denoting the news class
- title
Character, title of article
- description
Character, description of article
Source
http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html
https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
See Also
Other topic:
dataset_dbpedia()
,
dataset_trec()
Examples
## Not run:
dataset_ag_news()
# Custom directory
dataset_ag_news(dir = "data/")
# Deleting dataset
dataset_ag_news(delete = TRUE)
# Returning filepath of data
dataset_ag_news(return_path = TRUE)
# Access both training and testing dataset
train <- dataset_ag_news(split = "train")
test <- dataset_ag_news(split = "test")
## End(Not run)
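A short sketch of working with the returned tibble, using the class column listed under Value; the data is downloaded on first use.

library(textdata)

train <- dataset_ag_news(split = "train")
table(train$class)  # 30,000 articles in each of the 4 classes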
DBpedia Ontology Dataset
Description
The DBpedia ontology classification dataset contains 560,000 training samples and 70,000 testing samples drawn from 14 non-overlapping classes in DBpedia (40,000 training and 5,000 testing samples per class).
Usage
dataset_dbpedia(
dir = NULL,
split = c("train", "test"),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- split
Character. Return training ("train") data or testing ("test") data. Defaults to "train".
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
The classes are
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork
Value
A tibble with 560,000 or 70,000 rows for "train" and "test" respectively and 3 variables:
- class
Character, denoting the class
- title
Character, title of article
- description
Character, description of article
Source
https://papers.nips.cc/paper/5782-character-level-convolutional-networks-for-text-classification.pdf
https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
See Also
Other topic:
dataset_ag_news()
,
dataset_trec()
Examples
## Not run:
dataset_dbpedia()
# Custom directory
dataset_dbpedia(dir = "data/")
# Deleting dataset
dataset_dbpedia(delete = TRUE)
# Returning filepath of data
dataset_dbpedia(return_path = TRUE)
# Access both training and testing dataset
train <- dataset_dbpedia(split = "train")
test <- dataset_dbpedia(split = "test")
## End(Not run)
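Similarly, a quick sketch verifying the per-class balance described above, using the class column from the Value section:

library(textdata)

test <- dataset_dbpedia(split = "test")
table(test$class)  # 5,000 test samples in each of the 14 classes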
IMDB Large Movie Review Dataset
Description
The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg).
Usage
dataset_imdb(
dir = NULL,
split = c("train", "test"),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- split
Character. Return training ("train") data or testing ("test") data. Defaults to "train".
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included and there are an even number of reviews > 5 and <= 5.
When using this dataset, please cite the ACL 2011 paper:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
Value
A tibble with 25,000 rows and 2 variables:
- sentiment
Character, denoting the sentiment
- text
Character, text of the review
Source
http://ai.stanford.edu/~amaas/data/sentiment/
Examples
## Not run:
dataset_imdb()
# Custom directory
dataset_imdb(dir = "data/")
# Deleting dataset
dataset_imdb(delete = TRUE)
# Returning filepath of data
dataset_imdb(return_path = TRUE)
# Access both training and testing dataset
train <- dataset_imdb(split = "train")
test <- dataset_imdb(split = "test")
## End(Not run)
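A sketch checking the label balance described above; the sentiment column name is assumed to be lowercase, as in the package's other labeled datasets.

library(textdata)

train <- dataset_imdb(split = "train")
table(train$sentiment)  # 12,500 "neg" and 12,500 "pos" reviews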
v1.0 sentence polarity dataset
Description
5,331 positive and 5,331 negative processed sentences/snippets. Introduced in Pang/Lee ACL 2005. Released July 2005.
Usage
dataset_sentence_polarity(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
Citation info:
This data was first used in Bo Pang and Lillian Lee, “Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.”, Proceedings of the ACL, 2005.
@InProceedings{pang05,
  author    = {Bo Pang and Lillian Lee},
  title     = {Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales},
  booktitle = {Proceedings of the ACL},
  year      = 2005
}
Value
A tibble with 10,662 rows and 2 variables:
- text
Sentences or snippets
- sentiment
Indicator for sentiment, "neg" for negative and "pos" for positive
Source
https://www.cs.cornell.edu/people/pabo/movie-review-data/
Examples
## Not run:
dataset_sentence_polarity()
# Custom directory
dataset_sentence_polarity(dir = "data/")
# Deleting dataset
dataset_sentence_polarity(delete = TRUE)
# Returning filepath of data
dataset_sentence_polarity(return_path = TRUE)
## End(Not run)
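A small sketch confirming the even split and snippet lengths, using the text and sentiment columns from the Value section:

library(textdata)

sp <- dataset_sentence_polarity()
table(sp$sentiment)      # 5,331 "neg" and 5,331 "pos"
summary(nchar(sp$text))  # snippet lengths in characters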
TREC dataset
Description
The TREC dataset is a dataset for question classification consisting of open-domain, fact-based questions divided into broad semantic categories. It has both a six-class (TREC-6) and a fifty-class (TREC-50) version. Both have 5,452 training examples and 500 test examples, but TREC-50 has finer-grained labels. Models are evaluated based on accuracy.
Usage
dataset_trec(
dir = NULL,
split = c("train", "test"),
version = c("6", "50"),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- split
Character. Return training ("train") data or testing ("test") data. Defaults to "train".
- version
Character. Version 6 ("6") or version 50 ("50"). Defaults to "6".
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
The classes in TREC-6 are
ABBR - Abbreviation
DESC - Description and abstract concepts
ENTY - Entities
HUM - Human beings
LOC - Locations
NUM - Numeric values
The classes in TREC-50 can be found at https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html.
Value
A tibble with 5,452 or 500 rows for "train" and "test" respectively and 2 variables:
- class
Character, denoting the class
- text
Character, question text
Source
https://cogcomp.seas.upenn.edu/Data/QA/QC/
https://trec.nist.gov/data/qa.html
See Also
Other topic:
dataset_ag_news()
,
dataset_dbpedia()
Examples
## Not run:
dataset_trec()
# Custom directory
dataset_trec(dir = "data/")
# Deleting dataset
dataset_trec(delete = TRUE)
# Returning filepath of data
dataset_trec(return_path = TRUE)
# Access both training and testing dataset
train_6 <- dataset_trec(split = "train")
test_6 <- dataset_trec(split = "test")
train_50 <- dataset_trec(split = "train", version = "50")
test_50 <- dataset_trec(split = "test", version = "50")
## End(Not run)
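A sketch comparing the label granularity of the two versions, using the class column from the Value section:

library(textdata)

train_6  <- dataset_trec(version = "6")
train_50 <- dataset_trec(version = "50")
length(unique(train_6$class))   # 6 coarse classes
length(unique(train_50$class))  # 50 fine-grained classes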
Global Vectors for Word Representation
Description
The GloVe pre-trained word vectors provide word embeddings created using varying numbers of tokens.
Usage
embedding_glove6b(
dir = NULL,
dimensions = c(50, 100, 200, 300),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
embedding_glove27b(
dir = NULL,
dimensions = c(25, 50, 100, 200),
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
embedding_glove42b(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
embedding_glove840b(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- dimensions
A number indicating the dimensionality of the vectors to include. One of 50, 100, 200, or 300 for glove6b, or one of 25, 50, 100, or 200 for glove27b.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
Citation info:
@inproceedings{pennington2014glove,
  author    = {Jeffrey Pennington and Richard Socher and Christopher D. Manning},
  title     = {GloVe: Global Vectors for Word Representation},
  booktitle = {Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2014},
  pages     = {1532--1543},
  url       = {http://www.aclweb.org/anthology/D14-1162}
}
Value
A tibble with 400k, 1.9m, 2.2m, or 1.2m rows (one row for each unique token in the vocabulary, for glove6b, glove42b, glove840b, and glove27b respectively) and the following variables:
- token
An individual token (usually a word)
- d1, d2, etc.
The embeddings for that token.
Source
https://nlp.stanford.edu/projects/glove/
References
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
Examples
## Not run:
embedding_glove6b(dimensions = 50)
# Custom directory
embedding_glove42b(dir = "data/")
# Deleting dataset
embedding_glove6b(delete = TRUE, dimensions = 300)
# Returning filepath of data
embedding_glove840b(return_path = TRUE)
## End(Not run)
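A minimal sketch of looking up embeddings and comparing two words with cosine similarity; it assumes the 50-dimensional 6B vectors and the token/d1..d50 layout described under Value.

library(textdata)

glove <- embedding_glove6b(dimensions = 50)

# Pull one embedding as a numeric vector; drop the token column
vec <- function(w) unlist(glove[glove$token == w, -1])

# Cosine similarity between two word vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

cosine(vec("king"), vec("queen"))  # related words score high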
AFINN-111 dataset
Description
AFINN is a lexicon of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words have been manually labeled by Finn Årup Nielsen in 2009-2011.
Usage
lexicon_afinn(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
This dataset is the newest version with 2,477 words and phrases.
Citation info:
This dataset was published in Finn Årup Nielsen (2011), “A new Evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages (2011) 93-98.
@article{nielsen11,
  author        = {Finn Årup Nielsen},
  title         = {A new Evaluation of a word list for sentiment analysis in microblogs},
  journal       = {CoRR},
  volume        = {abs/1103.2903},
  year          = {2011},
  url           = {http://arxiv.org/abs/1103.2903},
  archivePrefix = {arXiv},
  eprint        = {1103.2903},
  biburl        = {https://dblp.org/rec/bib/journals/corr/abs-1103-2903},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}
Value
A tibble with 2,477 rows and 2 variables:
- word
An English word
- score
Indicator for sentiment: integer between -5 and +5
See Also
Other lexicon:
lexicon_bing()
,
lexicon_loughran()
,
lexicon_nrc()
,
lexicon_nrc_eil()
,
lexicon_nrc_vad()
Examples
## Not run:
lexicon_afinn()
# Custom directory
lexicon_afinn(dir = "data/")
# Deleting dataset
lexicon_afinn(delete = TRUE)
# Returning filepath of data
lexicon_afinn(return_path = TRUE)
## End(Not run)
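A minimal sketch of scoring a sentence by summing the AFINN values of matched words, with naive lowercase tokenization; the column name follows the Value section above (some releases name it value rather than score).

library(textdata)

afinn <- lexicon_afinn()

# naive tokenization: lowercase, split on non-letters
words <- strsplit(tolower("what a wonderful and superb day"), "[^a-z]+")[[1]]

# sum the -5..+5 valence of matched words
sum(afinn$score[afinn$word %in% words])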
Bing sentiment lexicon
Description
General purpose English sentiment lexicon that categorizes words in a binary fashion, either positive or negative.
Usage
lexicon_bing(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
Citation info:
This dataset was first published in Minqing Hu and Bing Liu, “Mining and summarizing customer reviews.”, Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), 2004.
@inproceedings{Hu04,
  author    = {Hu, Minqing and Liu, Bing},
  title     = {Mining and Summarizing Customer Reviews},
  booktitle = {Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  series    = {KDD '04},
  year      = {2004},
  isbn      = {1-58113-888-1},
  location  = {Seattle, WA, USA},
  pages     = {168--177},
  numpages  = {10},
  url       = {http://doi.acm.org/10.1145/1014052.1014073},
  doi       = {10.1145/1014052.1014073},
  acmid     = {1014073},
  publisher = {ACM},
  address   = {New York, NY, USA},
  keywords  = {reviews, sentiment classification, summarization, text mining}
}
Value
A tibble with 6,787 rows and 2 variables:
- word
An English word
- sentiment
Indicator for sentiment: "negative" or "positive"
Source
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
See Also
Other lexicon:
lexicon_afinn()
,
lexicon_loughran()
,
lexicon_nrc()
,
lexicon_nrc_eil()
,
lexicon_nrc_vad()
Examples
## Not run:
lexicon_bing()
# Custom directory
lexicon_bing(dir = "data/")
# Deleting dataset
lexicon_bing(delete = TRUE)
# Returning filepath of data
lexicon_bing(return_path = TRUE)
## End(Not run)
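A sketch of the binary structure and simple word lookup; "great" and "awful" are assumed to be among the lexicon's entries.

library(textdata)

bing <- lexicon_bing()
table(bing$sentiment)                       # split of the 6,787 words
bing[bing$word %in% c("great", "awful"), ]  # look up individual words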
Loughran-McDonald sentiment lexicon
Description
English sentiment lexicon created for use with financial documents. This lexicon labels words with six possible sentiments important in financial contexts: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous".
Usage
lexicon_loughran(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
Citation info:
This dataset was published in Loughran, T. and McDonald, B. (2011), “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance, 66: 35-65.
@article{loughran11,
  author  = {Loughran, Tim and McDonald, Bill},
  title   = {When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks},
  journal = {The Journal of Finance},
  volume  = {66},
  number  = {1},
  pages   = {35-65},
  doi     = {10.1111/j.1540-6261.2010.01625.x},
  url     = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.2010.01625.x},
  eprint  = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1540-6261.2010.01625.x},
  year    = {2011}
}
Value
A tibble with 4,150 rows and 2 variables:
- word
An English word
- sentiment
Indicator for sentiment: "negative", "positive", "litigious", "uncertainty", "constraining", or "superfluous"
Source
https://sraf.nd.edu/loughranmcdonald-master-dictionary/
See Also
Other lexicon:
lexicon_afinn()
,
lexicon_bing()
,
lexicon_nrc()
,
lexicon_nrc_eil()
,
lexicon_nrc_vad()
Examples
## Not run:
lexicon_loughran()
# Custom directory
lexicon_loughran(dir = "data/")
# Deleting dataset
lexicon_loughran(delete = TRUE)
# Returning filepath of data
lexicon_loughran(return_path = TRUE)
## End(Not run)
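A sketch of pulling one sentiment category for use as a word filter, using the columns listed under Value:

library(textdata)

loughran <- lexicon_loughran()
table(loughran$sentiment)  # words per financial sentiment category
head(loughran$word[loughran$sentiment == "uncertainty"])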
NRC word-emotion association lexicon
Description
General purpose English sentiment/emotion lexicon. This lexicon labels words with ten possible sentiments or emotions: "negative", "positive", "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or "trust". The annotations were manually done through Amazon's Mechanical Turk.
Usage
lexicon_nrc(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
Citation info:
This dataset was published in Saif Mohammad and Peter Turney. (2013), “Crowdsourcing a Word-Emotion Association Lexicon.” Computational Intelligence, 29(3): 436-465.
@article{mohammad13,
  author  = {Mohammad, Saif M. and Turney, Peter D.},
  title   = {Crowdsourcing a Word–Emotion Association Lexicon},
  journal = {Computational Intelligence},
  volume  = {29},
  number  = {3},
  pages   = {436-465},
  doi     = {10.1111/j.1467-8640.2012.00460.x},
  url     = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-8640.2012.00460.x},
  eprint  = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-8640.2012.00460.x},
  year    = {2013}
}
Value
A tibble with 13,901 rows and 2 variables:
- word
An English word
- sentiment
Indicator for sentiment or emotion: "negative", "positive", "anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or "trust"
Source
http://saifmohammad.com/WebPages/lexicons.html
See Also
Other lexicon:
lexicon_afinn()
,
lexicon_bing()
,
lexicon_loughran()
,
lexicon_nrc_eil()
,
lexicon_nrc_vad()
Examples
## Not run:
lexicon_nrc()
# Custom directory
lexicon_nrc(dir = "data/")
# Deleting dataset
lexicon_nrc(delete = TRUE)
# Returning filepath of data
lexicon_nrc(return_path = TRUE)
## End(Not run)
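A sketch illustrating why the lexicon has more rows than unique words: a single word can carry several associations, one row each ("abandon" is assumed to be among the entries).

library(textdata)

nrc <- lexicon_nrc()
nrc[nrc$word == "abandon", ]  # one row per associated sentiment/emotion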
NRC Emotion Intensity Lexicon (aka Affect Intensity Lexicon) v0.5
Description
General purpose English sentiment/emotion lexicon. The NRC Affect Intensity Lexicon is a list of English words and their associations with four basic emotions (anger, fear, sadness, joy).
Usage
lexicon_nrc_eil(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
For a given word and emotion X, the scores range from 0 to 1. A score of 1 means that the word conveys the highest amount of emotion X. A score of 0 means that the word conveys the lowest amount of emotion X.
License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
Citation info:
Details of the lexicon are in this paper: Word Affect Intensities. Saif M. Mohammad. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018), May 2018, Miyazaki, Japan.
@inproceedings{LREC18-AIL,
  author    = {Mohammad, Saif M.},
  title     = {Word Affect Intensities},
  booktitle = {Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC-2018)},
  year      = {2018},
  address   = {Miyazaki, Japan}
}
Value
A tibble with 5,814 rows and 3 variables:
- term
An English word
- score
Value between 0 and 1
- AffectDimension
Indicator for sentiment or emotion: "anger", "fear", "sadness", or "joy"
Source
https://saifmohammad.com/WebPages/AffectIntensity.htm
See Also
Other lexicon:
lexicon_afinn()
,
lexicon_bing()
,
lexicon_loughran()
,
lexicon_nrc()
,
lexicon_nrc_vad()
Examples
## Not run:
lexicon_nrc_eil()
# Custom directory
lexicon_nrc_eil(dir = "data/")
# Deleting dataset
lexicon_nrc_eil(delete = TRUE)
# Returning filepath of data
lexicon_nrc_eil(return_path = TRUE)
## End(Not run)
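A sketch of ranking terms within one affect dimension, using the term/score/AffectDimension columns from the Value section:

library(textdata)

eil <- lexicon_nrc_eil()
anger <- eil[eil$AffectDimension == "anger", ]
head(anger[order(-anger$score), ])  # the most intense anger terms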
The NRC Valence, Arousal, and Dominance Lexicon
Description
The NRC Valence, Arousal, and Dominance (VAD) Lexicon includes a list of more than 20,000 English words and their valence, arousal, and dominance scores. For a given word and a dimension (V/A/D), the scores range from 0 (lowest V/A/D) to 1 (highest V/A/D). The lexicon with its fine-grained real-valued scores was created by manual annotation using best–worst scaling. The lexicon is markedly larger than any of the existing VAD lexicons, and its authors show that the ratings obtained are substantially more reliable than those in existing lexicons.
Usage
lexicon_nrc_vad(
dir = NULL,
delete = FALSE,
return_path = FALSE,
clean = FALSE,
manual_download = FALSE
)
Arguments
- dir
Character, path to directory where data will be stored. If NULL, user_cache_dir will be used to determine path.
- delete
Logical, set TRUE to delete dataset.
- return_path
Logical, set TRUE to return the path of the dataset.
- clean
Logical, set TRUE to remove intermediate files. This can greatly reduce the size. Defaults to FALSE.
- manual_download
Logical, set TRUE if you have manually downloaded the file and placed it in the folder designated by running this function with return_path = TRUE.
Details
License required for commercial use. Please contact Saif M. Mohammad (saif.mohammad@nrc-cnrc.gc.ca).
Citation info:
Details of the NRC VAD Lexicon are available in this paper:
Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words. Saif M. Mohammad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, July 2018.
@inproceedings{vad-acl2018,
  title     = {Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words},
  author    = {Mohammad, Saif M.},
  booktitle = {Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL)},
  year      = {2018},
  address   = {Melbourne, Australia}
}
Value
A tibble with 20,007 rows and 4 variables:
- word
An English word
- Valence
valence score of the word
- Arousal
arousal score of the word
- Dominance
dominance score of the word
Source
https://saifmohammad.com/WebPages/nrc-vad.html
See Also
Other lexicon:
lexicon_afinn()
,
lexicon_bing()
,
lexicon_loughran()
,
lexicon_nrc()
,
lexicon_nrc_eil()
Examples
## Not run:
lexicon_nrc_vad()
# Custom directory
lexicon_nrc_vad(dir = "data/")
# Deleting dataset
lexicon_nrc_vad(delete = TRUE)
# Returning filepath of data
lexicon_nrc_vad(return_path = TRUE)
## End(Not run)
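A sketch of finding words at the extremes of one dimension; column names follow the Value section above (capitalization may differ by release).

library(textdata)

vad <- lexicon_nrc_vad()
vad[which.min(vad$Valence), ]  # lowest-valence word
vad[which.max(vad$Valence), ]  # highest-valence word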
Internal Functions
Description
These functions are not intended to be used directly by users.
Usage
load_dataset(
data_name,
name,
dir,
delete,
return_path,
clean,
clean_manual = NULL,
manual_download
)