
Whether you are a beginner or a professional, practice makes perfect, so we are back with the top 10 datasets for Natural Language Processing for beginners and professionals alike.
The IMDB dataset has 50K movie reviews for natural language processing or text analytics. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.
No. of Rows:- 50K
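To make the binary sentiment task concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier in plain Python. The four toy reviews and the function names `train_naive_bayes` and `predict` are made up for illustration and are not part of the IMDB dataset. In practice you would train on the 25,000 labeled training reviews and most likely use a library rather than hand-rolled code.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of reviews
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # skip words never seen during training
                # Laplace (add-one) smoothing avoids log(0)
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-ins for real reviews (illustrative only)
toy_reviews = [
    ("a wonderful moving film", "pos"),
    ("brilliant acting and a great story", "pos"),
    ("a dull boring mess", "neg"),
    ("terrible script and awful acting", "neg"),
]
model = train_naive_bayes(toy_reviews)
prediction = predict(model, "great story and wonderful acting")
print(prediction)
```

Whitespace splitting is the simplest possible tokenizer; real reviews usually need punctuation handling and HTML cleanup before counting words.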
Character recognition is a classic pattern recognition problem on which researchers have worked since the early days of computer vision.
No. of Rows:- 74K

The Chars74K dataset includes symbols used in both English and Kannada, giving a total of over 74K images (which explains the name of the dataset).
The Enron email dataset contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
No. of Rows:- 0.5M
The Yelp dataset is available as JSON files. Use it to teach students about databases, to learn NLP, or as sample production data while you learn how to make mobile apps. Each file is composed of a single object type, with one JSON object per line.
This dataset has 6 files.
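Because each file stores one JSON object per line, it can be read line by line without loading everything into memory. The sketch below assumes that layout; the sample records and their field names (`review_id`, `stars`, `text`) are hypothetical stand-ins, so check the actual files for the real schema.

```python
import io
import json


def read_json_lines(stream):
    """Yield one dict per non-empty line of a JSON-per-line stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample standing in for one of the dataset files
sample = io.StringIO(
    '{"review_id": "r1", "stars": 5, "text": "Great tacos"}\n'
    '{"review_id": "r2", "stars": 2, "text": "Slow service"}\n'
)
records = list(read_json_lines(sample))
average_stars = sum(r["stars"] for r in records) / len(records)
print(len(records), average_stars)
```

For a real file, replace the `io.StringIO` sample with `open(path, encoding="utf-8")`, where `path` points to whichever file you downloaded.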
This is a simple audio/speech dataset consisting of recordings of spoken digits in WAV files at 8 kHz. The recordings are trimmed so that they have near-minimal silence at the beginnings and ends. The Free Spoken Digit Dataset (FSDD) is an open dataset. It contains 2,000 recordings of English pronunciations from 4 speakers.
No. of Rows:- 2000
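A quick way to sanity-check a recording is to read its header with Python's standard `wave` module. The sketch below builds a synthetic half-second 8 kHz tone in memory as a stand-in for a real FSDD file (the tone and the helper name `wav_info` are illustrative assumptions), then reads back the sample rate and duration.

```python
import io
import math
import struct
import wave


def wav_info(stream):
    """Return basic properties of a WAV file or file-like object."""
    with wave.open(stream, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


# Synthetic stand-in: 4000 samples of a 440 Hz tone, mono, 16-bit, 8 kHz
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(8000)
    samples = [int(10000 * math.sin(2 * math.pi * 440 * n / 8000))
               for n in range(4000)]
    out.writeframes(struct.pack("<%dh" % len(samples), *samples))

buffer.seek(0)
info = wav_info(buffer)
print(info)
```

To inspect a real recording, pass the path of a downloaded WAV file to `wav_info` instead of the in-memory buffer.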
This dataset consists of reviews from Amazon. The data span a period of 18 years, including ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plaintext review. The dataset is distributed in ZIP format.
No. of Reviews:- 34,686,770
No. of Users:- 6,643,669
Time Period:- Jun 1995 - Mar 2013
The Wikipedia links (WikiLinks) data consists of web pages that contain at least one hyperlink that points to English Wikipedia. The data set was obtained by iterating over Google's web index. We treat each page on Wikipedia as representing an entity (or concept or idea), and the anchor text as a mention of that entity. We have done some filtering to ensure that the anchor text can be a mention of the entity that it links to (e.g., we remove anchors such as "click here").
Number of Documents: 13 million
Number of mentions: 59 million
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus incorporates a total of 681,288 posts and over 140 million words - or approximately 35 posts and 7250 words per person.
No. of Rows:- 681,288
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of one collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
No. of Rows:- 5,574
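As a first step with this collection, you can load the labeled messages and check the class balance. The sketch below assumes a tab-separated layout with the label first and the message second, one message per line. Verify this against the file you download. The three sample messages are invented for illustration.

```python
import csv
import io
from collections import Counter


def load_sms(stream):
    """Parse 'label<TAB>message' lines into (label, message) tuples."""
    # QUOTE_NONE keeps quote characters inside messages untouched
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Invented sample standing in for the real collection
sample = io.StringIO(
    "ham\tOk, see you at 5\n"
    "spam\tWINNER!! Claim your free prize now\n"
    "ham\tCall me when you're home\n"
)
messages = load_sms(sample)
label_counts = Counter(label for label, _ in messages)
print(label_counts)
```

Checking the class balance first matters because spam datasets are usually skewed toward ham, so plain accuracy can look better than the classifier really is.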
This dataset contains news headlines published over a period of seventeen years, sourced from the reputable Australian news source ABC (Australian Broadcasting Corporation). Agency site: http://www.abc.net.au
Thanks for reading!



