Datasets for Natural Language Processing

Whether you are a beginner or a professional, practice makes perfect, so here are the top 10 datasets for Natural Language Processing.
- IMDB Dataset of 50K Movie Reviews
- The Chars74K dataset
- Enron Email Dataset 0.5M messages
- Yelp Open Dataset An all-purpose dataset for learning
- Free Spoken Digit Dataset (FSDD) 2,000 recordings
- Web data: Amazon reviews (34,686,770 Reviews)
- Wikipedia Links Dataset (13 million Documents)
- The Blog Authorship Corpus (681,288 Posts)
- SMS Spam Collection v. 1 (5,574 SMS)
- A Million News Headlines
1. IMDB Dataset of 50K Movie Reviews
The IMDB dataset contains 50K movie reviews for natural language processing or text analytics. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms. It is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.
No. of Rows:- 50K
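Before training a classifier, it helps to check how the labels are distributed. The following is a minimal sketch, assuming the reviews come as a CSV file with `review` and `sentiment` columns; those column names and the sample rows are assumptions for illustration, so adjust them to the copy you download.

```python
import csv
import io
from collections import Counter

def count_labels(csv_file, label_column="sentiment"):
    """Count how many rows carry each label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)

# Tiny in-memory stand-in for the real file (made-up rows).
sample = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
    '"I loved every minute.",positive\n'
)
label_counts = count_labels(sample)
print(label_counts)
```

With the real data, pass a file opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.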
2. The Chars74K dataset
Character recognition is a classic pattern recognition problem on which researchers have worked since the early days of computer vision. This dataset covers symbols used in both English and Kannada, for a total of over 74K images (which explains the name of the dataset).
No. of Rows:- 74K
3. Enron Email Dataset 0.5M messages
It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
No. of Rows:- 0.5M
4. Yelp Open Dataset An all-purpose dataset for learning
The Yelp dataset is available as JSON files. Use it to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps. Each file is composed of a single object type, with one JSON object per line.
This dataset has 6 files:
- business.json
- review.json
- user.json
- checkin.json
- tip.json
- photo.json
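Because each file holds one JSON object per line, it can be read line by line without loading the whole file into memory. Here is a minimal sketch of that pattern; the field names (`review_id`, `stars`, `text`) and the two records are illustrative assumptions, not taken from the dataset documentation.

```python
import io
import json

def iter_json_lines(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Two made-up records standing in for review.json.
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great tacos."}\n'
    '{"review_id": "r2", "stars": 2, "text": "Slow service."}\n'
)
reviews = list(iter_json_lines(sample))
average_stars = sum(r["stars"] for r in reviews) / len(reviews)
print(average_stars)
```

For a real file, iterate over `open(path, encoding="utf-8")` and process records as they are yielded rather than collecting them in a list.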
5. Free Spoken Digit Dataset (FSDD) 2,000 recordings
This is a simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginning and end. FSDD is an open dataset available to everyone. It contains 2,000 recordings of English pronunciations from 4 speakers.
No. of Rows:- 2000
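A quick sanity check on any recording is to read its sample rate and duration. The sketch below uses Python's standard `wave` module and, since no real file is bundled here, first synthesizes a half-second tone at 8 kHz as a stand-in; the tone, its length, and the 16-bit mono format are assumptions for illustration only.

```python
import io
import math
import struct
import wave

def wav_info(wav_file):
    """Return (sample_rate_hz, duration_seconds) for a WAV file or file-like object."""
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / rate

# Build a half-second 440 Hz tone at 8 kHz as a stand-in for a real recording.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 16-bit samples
    out.setframerate(8000)   # 8 kHz, matching the dataset's stated rate
    samples = (int(12000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(4000))
    out.writeframes(b"".join(struct.pack("<h", s) for s in samples))
buffer.seek(0)

rate, duration = wav_info(buffer)
print(rate, duration)
```

`wav_info` also accepts a path string, so it can be pointed directly at a downloaded recording.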
6. Web data: Amazon Reviews
This dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The dataset is distributed in ZIP format.
No. of Reviews:- 34,686,770
No. of Users:- 6,643,669
Time Period:- Jun 1995 - Mar 2013
7. Wikipedia Links Dataset (13 million Documents)
The Wikipedia links (WikiLinks) data consists of web pages that contain at least one hyperlink that points to English Wikipedia. The data set was obtained by iterating over Google's web index. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of that entity. We have done some filtering to ensure that the anchor text can be a mention of the entity that it links to (e.g., we remove anchors such as "click here").
Number of Documents: 13 million
Number of mentions: 59 million
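The anchor filtering described above can be illustrated with a small sketch. This is not the filter the dataset authors used; the list of generic anchors and the sample links are hypothetical, chosen only to show the idea of dropping anchors that cannot be a mention of the linked entity.

```python
# Anchor texts that say nothing about the linked entity (illustrative list).
GENERIC_ANCHORS = {"click here", "here", "link", "read more"}

def keep_mention(anchor_text):
    """Return True if the anchor text could plausibly name an entity."""
    normalized = " ".join(anchor_text.lower().split())
    return bool(normalized) and normalized not in GENERIC_ANCHORS

# Hypothetical (anchor text, Wikipedia page) pairs.
links = [
    ("Barack Obama", "Barack_Obama"),
    ("click here", "Python_(programming_language)"),
    ("the Big Apple", "New_York_City"),
    ("  Here ", "Kannada"),
]
mentions = [(text, target) for text, target in links if keep_mention(text)]
print(mentions)
```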
8. The Blog Authorship Corpus (681,288 Posts)
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
No. of Rows:- 681,288
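The per-person averages follow directly from the totals quoted above, as this short check shows. The word total is stated only as "over 140 million", so the words-per-blogger figure computed here is a lower bound.

```python
bloggers = 19_320
posts = 681_288
words_lower_bound = 140_000_000  # the corpus has "over 140 million" words

posts_per_blogger = posts / bloggers
words_per_blogger = words_lower_bound / bloggers

print(round(posts_per_blogger, 1))  # about 35 posts per person
print(round(words_per_blogger))     # roughly 7,250 words per person
```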
9. SMS Spam Collection v. 1
The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
No. of Rows:- 5,574
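A first step with this collection is to split each record into its label and message text. The sketch below assumes one message per line in the form `label<TAB>message`; that layout and the sample messages are assumptions for illustration, so verify them against the file you download.

```python
from collections import Counter

def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        label, message = line.split("\t", 1)
        pairs.append((label, message))
    return pairs

# Made-up messages in the assumed layout.
sample_lines = [
    "ham\tAre we still on for lunch?\n",
    "spam\tWINNER! Claim your free prize now.\n",
    "ham\tCall me when you land.\n",
]
messages = parse_sms_lines(sample_lines)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

Splitting on the first tab only keeps any tab characters inside the message text intact.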
10. A Million News Headlines
This dataset contains news headlines published over a period of seventeen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation). Agency site: http://www.abc.net.au