Datasets for Natural Language Processing


Written by Aayush Saini · 4 minute read · Jun 30, 2020 . Datasets, 64

Whether you are a beginner or a professional, practice makes perfect, so here are the top 10 datasets for Natural Language Processing, suitable for beginners and professionals alike.

  1. IMDB Dataset of 50K Movie Reviews
  2. The Chars74K dataset
  3. Enron Email Dataset 0.5M messages
  4. Yelp Open Dataset An all-purpose dataset for learning
  5. Free Spoken Digit Dataset (FSDD) 2,000 recordings
  6. Web data: Amazon reviews (34,686,770 Reviews)
  7. Wikipedia Links Dataset (13 million Documents)
  8. The Blog Authorship Corpus (681,288 Posts)
  9. SMS Spam Collection v. 1 (5,574 SMS)
  10. A Million News Headlines

1. IMDB Dataset of 50K Movie Reviews

The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The goal is to predict which reviews are positive and which are negative, using either classical classification or deep learning algorithms. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.
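To make the binary sentiment task concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier using only the Python standard library. The four training reviews are made up for illustration and are not taken from the dataset; the labels "pos" and "neg" and all function names are assumptions of this sketch.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(examples):
    """Count words per label. `examples` is a list of (text, label) pairs."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            score += math.log((counts[token] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up toy reviews standing in for the real training split.
train = [
    ("A wonderful, moving film with great acting", "pos"),
    ("I loved every minute, truly great", "pos"),
    ("Terrible plot and awful acting", "neg"),
    ("Boring, awful, a waste of time", "neg"),
]
model = train_naive_bayes(train)
print(predict(model, "great film, loved it"))
print(predict(model, "awful and boring"))
```

On the real data you would train on the 25,000 training reviews and measure accuracy on the 25,000 test reviews; a library classifier or a neural model would normally replace this hand-written one.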

No. of Rows:- 50K

Get Dataset


2. The Chars74K dataset

Character recognition is a classic pattern recognition problem on which researchers have worked since the early days of computer vision.

No. of Rows:- 74K


In this dataset, symbols used in both English and Kannada are available. This gives a total of over 74K images (which explains the name of the dataset).

Get Dataset


3. Enron Email Dataset 0.5M messages

It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
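As a rough sketch of how a single message could be read, the example below parses one email with Python's standard `email` package. It assumes each message is stored as a plain-text file with standard mail headers; the sample message, addresses, and the `parse_message` helper are invented for illustration.

```python
from email import message_from_string

# A made-up message in the usual "headers, blank line, body" layout.
raw_message = """\
Message-ID: <example-1@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700
From: alice@example.com
To: bob@example.com
Subject: Meeting notes

Here are the notes from today's meeting.
"""

def parse_message(raw):
    """Return the main headers and the body of a single-part text message."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

parsed = parse_message(raw_message)
print(parsed["subject"])
```

To process the whole corpus you would walk the per-user folders and call a function like this on each file's contents.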

No. of Rows:- 0.5M

Get Dataset


4. Yelp Open Dataset An all-purpose dataset for learning

The Yelp dataset is available as JSON files. Use it to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps. Each file is composed of a single object type, with one JSON object per line.

This dataset has 6 files:

  1. business.json
  2. review.json
  3. user.json
  4. checkin.json
  5. tip.json
  6. photo.json
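Because each file holds one JSON object per line, it can be read line by line rather than loaded as a single document. The sketch below shows the idea on an in-memory sample; the field names (`review_id`, `stars`, `text`) and the sample values are assumptions for illustration, so check the dataset documentation for the actual schema.

```python
import io
import json

def read_json_lines(handle):
    """Yield one dict per non-empty line of a JSON-per-line stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# In-memory stand-in for a file such as review.json (made-up records).
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great tacos"}\n'
    '{"review_id": "r2", "stars": 2, "text": "Slow service"}\n'
)

reviews = list(read_json_lines(sample))
average_stars = sum(r["stars"] for r in reviews) / len(reviews)
print(len(reviews), average_stars)
```

With a real file you would pass `open("review.json", encoding="utf-8")` instead of the sample, and iterate without calling `list()` if the file is too large to hold in memory.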

For More Information About this Dataset Click Here

Get Dataset


5. Free Spoken Digit Dataset (FSDD) 2,000 recordings

This is a simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end. FSDD is an open dataset available to everyone. It contains 2,000 recordings of English pronunciations from 4 speakers.
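A quick way to inspect a recording is Python's standard `wave` module. The sketch below builds a short synthetic tone in memory so that it runs without the dataset; the mono 16-bit format and the helper names are assumptions of this example, and only the 8 kHz sample rate comes from the description above.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 8000  # 8 kHz, as stated in the dataset description

def make_tone(seconds=0.25, freq=440.0):
    """Build an in-memory mono 16-bit WAV that stands in for a real recording."""
    n_frames = int(SAMPLE_RATE * seconds)
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE))
        for i in range(n_frames)
    ]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{n_frames}h", *samples))
    buffer.seek(0)
    return buffer

def describe_wav(handle):
    """Return the sample rate, channel count and duration of a WAV stream."""
    with wave.open(handle, "rb") as wav:
        rate = wav.getframerate()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": wav.getnframes() / rate,
        }

info = describe_wav(make_tone())
print(info)
```

For a real recording, pass the path of a downloaded WAV file to `describe_wav` instead of the synthetic buffer.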

No. of Rows:- 2000

Get Dataset


6. Web data: Amazon Reviews

This dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The dataset is distributed in ZIP format.

No. of Reviews:- 34,686,770

No. of Users:- 6,643,669

Time Period:- Jun 1995 - Mar 2013

Get Dataset


7. Wikipedia Links Dataset (13 million Documents)

The Wikipedia links (WikiLinks) data consists of web pages that contain at least one hyperlink that points to English Wikipedia. The data set was obtained by iterating over Google's web index. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of that entity. We have done some filtering to ensure that the anchor text can be a mention of the entity that it links to (e.g., we remove anchors such as "click here").
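The idea of treating anchor text as a mention of a Wikipedia entity, and dropping generic anchors, can be sketched with the standard `html.parser` module. This is an illustration of the concept, not the pipeline used to build the dataset; the stop list, the URL prefix check, and the sample page are assumptions.

```python
from html.parser import HTMLParser

GENERIC_ANCHORS = {"click here", "here", "link"}  # illustrative stop list (assumption)
WIKI_PREFIX = "https://en.wikipedia.org/wiki/"

class MentionParser(HTMLParser):
    """Collect (entity, anchor text) pairs for links pointing to English Wikipedia."""

    def __init__(self):
        super().__init__()
        self.mentions = []
        self._target = None  # entity of the link currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith(WIKI_PREFIX):
                self._target = href[len(WIKI_PREFIX):]
                self._text = []

    def handle_data(self, data):
        if self._target is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._target is not None:
            anchor = "".join(self._text).strip()
            # Keep the mention only if the anchor text is not a generic phrase.
            if anchor and anchor.lower() not in GENERIC_ANCHORS:
                self.mentions.append((self._target, anchor))
            self._target = None

page = (
    '<p>The <a href="https://en.wikipedia.org/wiki/Python_(programming_language)">'
    'Python language</a> is popular. For details '
    '<a href="https://en.wikipedia.org/wiki/Guido_van_Rossum">click here</a>. '
    '<a href="https://example.com/">Unrelated link</a></p>'
)
parser = MentionParser()
parser.feed(page)
parser.close()
print(parser.mentions)
```

Only the first link survives: the second is filtered out as a generic anchor and the third does not point to Wikipedia.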

Number of Documents: 13 million

Number of mentions: 59 million

Get Dataset


8. The Blog Authorship Corpus (681,288 Posts)

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
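The per-person averages follow directly from the totals, as this short check shows. The word count is given only as "over 140 million", so 140,000,000 is used here as a lower bound.

```python
posts = 681_288
bloggers = 19_320
words = 140_000_000  # "over 140 million", so treat this as a lower bound

posts_per_blogger = posts / bloggers
words_per_blogger = words / bloggers

# Rounded averages, for comparison with the figures quoted in the text.
print(round(posts_per_blogger, 1), round(words_per_blogger))
```

Both values come out close to the quoted "approximately 35 posts and 7250 words per person".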

No. of Rows:- 681,288

Get Dataset


9. SMS Spam Collection v. 1 

The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
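A first step with this kind of data is usually to load the labeled messages and check the class balance. The sketch below assumes one message per line with the label and the text separated by a tab; verify this layout against the downloaded file. The three sample messages are invented.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the data file (made-up messages).
sample = io.StringIO(
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tYou have won a prize! Reply now to claim.\n"
    "ham\tCall me when you get home.\n"
)

def load_messages(handle):
    """Return a list of (label, text) pairs from a tab-separated stream."""
    # QUOTE_NONE keeps quotation marks inside messages as ordinary characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]

messages = load_messages(sample)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

Knowing the ham/spam ratio up front helps when choosing an evaluation metric, since plain accuracy can be misleading when one class dominates.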

No. of Rows:- 5,574

Get Dataset


10. A Million News Headlines

This dataset contains news headlines published over a period of seventeen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation). Agency site: http://www.abc.net.au
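Since the headlines span many years, a simple first exploration is counting headlines per year. The sketch below assumes a CSV with a `publish_date` column in `YYYYMMDD` form and a `headline_text` column; these column names and the sample rows are assumptions, so adjust them to match the downloaded file.

```python
import csv
import io
from collections import Counter

# In-memory stand-in for the CSV file (made-up headlines).
sample = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,council approves new park plan\n"
    "20030220,rain eases drought fears\n"
    "20040105,local team wins final\n"
)

def headlines_per_year(handle):
    """Count rows per year, taking the year from the first four date characters."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[row["publish_date"][:4]] += 1
    return counts

per_year = headlines_per_year(sample)
print(sorted(per_year.items()))
```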

Get Dataset


Thanks for reading.
