Datasets for Machine Learning
Amazon Product Reviews Dataset
Amazon Review Data (2018) is a large-scale dataset containing over 233 million customer reviews from Amazon products between 1996 and 2018. It includes detailed review information such as ratings, review text, helpfulness votes, and timestamps, along with rich product metadata like brand, price, category, and images. This dataset supports various tasks in natural language processing, sentiment analysis, and recommendation systems.
- Type: Real
- Task: Classification, Regression
- Attributes: Multivariate, Sequential, Temporal aspect
- Year: 2018
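As a rough sketch of how review records could feed a sentiment analysis task, the snippet below parses a few JSON-lines records and maps star ratings to coarse sentiment labels. The field names (`asin`, `overall`, `reviewText`), the sample records, and the rating thresholds are assumptions made for illustration; check the dataset's own documentation for the actual schema.

```python
import json

# Hypothetical sample records, one JSON object per line (not real data).
SAMPLE_LINES = [
    '{"asin": "B000001", "overall": 5.0, "reviewText": "Works great."}',
    '{"asin": "B000002", "overall": 3.0, "reviewText": "It is okay."}',
    '{"asin": "B000003", "overall": 1.0, "reviewText": "Broke in a week."}',
]


def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a coarse label (thresholds are an assumption)."""
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"


def load_labeled_reviews(lines):
    """Parse JSON-lines records into (review text, sentiment label) pairs."""
    pairs = []
    for line in lines:
        record = json.loads(line)
        pairs.append((record["reviewText"], rating_to_sentiment(record["overall"])))
    return pairs


labeled = load_labeled_reviews(SAMPLE_LINES)
```

The resulting pairs could then be passed to any text classifier; for a regression framing, the raw rating would be kept as the target instead of the label.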
Ozone Level Detection Dataset
This collection includes two ground ozone level data sets: the eight-hour peak set (eighthr.data) and the one-hour peak set (onehr.data). The data were collected from 1998 to 2004 in the Houston, Galveston, and Brazoria area.
- Type: Real
- Task: Classification
- Attributes: Multivariate, Sequential, Time-Series
- Year: 2008
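A minimal sketch of reading a file such as eighthr.data for classification might look like the following. It assumes comma-separated rows with a date in the first column, numeric features in the middle, a binary class label last, and `?` marking missing values; the sample rows are invented for illustration, so verify the layout against the dataset's documentation.

```python
import csv
import io

# Hypothetical rows in the assumed layout: date, features..., class label.
SAMPLE = """1/1/1998,0.8,?,2.4,0
1/2/1998,2.8,3.2,?,0
1/3/1998,2.9,2.8,2.6,1
"""


def parse_rows(text):
    """Parse rows into (date, features, label); '?' becomes None."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        date, *values, label = fields
        features = [None if v == "?" else float(v) for v in values]
        rows.append((date, features, int(float(label))))
    return rows


def column_means(rows):
    """Mean of each feature column, ignoring missing entries."""
    means = []
    for j in range(len(rows[0][1])):
        present = [r[1][j] for r in rows if r[1][j] is not None]
        means.append(sum(present) / len(present) if present else None)
    return means


def impute_missing(rows):
    """Replace missing values with the column mean (one simple strategy)."""
    means = column_means(rows)
    return [
        (date, [means[j] if v is None else v for j, v in enumerate(features)], label)
        for date, features, label in rows
    ]


rows = parse_rows(SAMPLE)
imputed = impute_missing(rows)
```

Mean imputation is only one option; because the data are a time series, forward-filling from the previous day may be a better fit in practice.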
Bank Transaction Fraud Detection
At LOL Bank Pvt. Ltd., ensuring the safety and integrity of financial transactions is a top priority. With the growing number of online transactions and digital banking activities, fraudulent transactions have become a significant threat to both the bank and its customers. Fraudulent activities, such as unauthorized account access, identity theft, and suspicious transaction patterns, cause financial losses and damage customer trust.
- Type: Multivariate
- Task: Classification, Regression, Clustering
- Attributes: Real
- Year: 2024
YouTube Trending Video Dataset (updated daily)
YouTube maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments and likes). Note that they’re not the most-viewed videos overall for the calendar year.”
- Type: Multivariate
- Task: Classification, Regression, Clustering
- Attributes: Real
- Year: 2021
Covid-19 Case Surveillance Public Use Dataset
The COVID-19 case surveillance system database includes individual-level data reported to U.S. states and autonomous reporting entities, including New York City and the District of Columbia (D.C.), as well as U.S. territories and states. On April 5, 2020,
- Type: Multivariate, Data-Generator
- Task: Relational-Learning
- Attributes: Real
- Year: 2020
US Election 2020
The US Election 2020 dataset contains 864 instances and 52 attributes, focusing on the presidential race at the county level. It includes real-valued multivariate data for classification and regression tasks, with no missing values. The dataset provides insights into voting patterns and election trends across the U.S.
- Type: Multivariate
- Task: Classification, Regression
- Attributes: Real
- Year: 2020
Forest Fires Dataset
Forest fires are a major environmental issue, causing economic and ecological damage while endangering human lives. Fast detection is a key element in controlling such a phenomenon. One way to achieve this is to use automatic tools based on local sensors, such as those provided by meteorological stations.
- Type: Multivariate
- Task: Regression
- Attributes: Real
- Year: 2008
Mobile Robots Dataset
The Mobile Robots dataset, published in 1995, contains sensor data from a mobile robot for classification tasks. It includes categorical, integer, and real attributes with no missing values. The dataset is used for learning concepts from robotic sensor data and was contributed by researchers from the University of Dortmund, Germany.
- Type: Domain-Theory
- Task: Classification
- Attributes: Categorical, Integer, Real
- Year: 1995
Safety Helmet Detection
Improve work safety by detecting the presence of people and safety helmets. To import a dataset, install MakeML. You can train an Object Detection neural network in a few clicks using this dataset.
- Type: Image
- Task: Classification, Regression, Clustering
- Attributes: Real
- Year: 2020
All Space Missions from 1957
This dataset contains information about space missions since they began in 1957. It has 9 columns (6 string, 2 integer, 1 decimal). The data was scraped from https://nextspaceflight.com/launches/past/?page=1 and covers all space missions since the start of the Space Race (1957).
- Type: Multivariate
- Task: Classification, Clustering, Causal-Discovery
- Attributes: Real
- Year: 2020
OSIC Pulmonary Fibrosis Progression Dataset
The Open Source Imaging Consortium (OSIC) is proud to partner with Kaggle to host the first-ever computational challenge for interstitial lung diseases: The OSIC Pulmonary Fibrosis Progression Challenge. A $55,000 prize will be offered to the Kaggle investigator(s) who devises the highest performing algorithm.
- Type: Image
- Task: Classification, Regression, Clustering
- Attributes: Real
- Year: 2019
Wine Quality Dataset
The dataset contains chemical measurements of wine. It has 4898 instances with 14 variables each. The dataset suits both classification and regression tasks and can be used to train models that predict wine quality.
- Type: Multivariate
- Task: Classification, Regression
- Attributes: Real
- Year: 2009
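The same quality score can serve either task, as the sketch below illustrates: kept as a number it is a regression target, and binned it becomes a class label. The sample scores, the label names, and the threshold of 7 are assumptions for illustration, not part of the dataset.

```python
def to_class_label(quality, threshold=7):
    """Bin a numeric quality score into two classes (threshold is an assumption)."""
    return "good" if quality >= threshold else "not_good"


# Hypothetical quality scores standing in for the dataset's target column.
scores = [5, 6, 7, 8, 4]

# Regression framing: predict the score itself.
regression_targets = [float(q) for q in scores]

# Classification framing: predict the binned label.
classification_targets = [to_class_label(q) for q in scores]
```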
Google Audio Dataset
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories.
- Type: Multivariate
- Task: Relational-Learning
- Attributes: Real
- Year: 2017
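To illustrate what a hierarchical graph of event categories means in practice, the sketch below walks a tiny made-up ontology and lists every category beneath a given root. The entry layout (`id`, `name`, `child_ids`) and all of the sample categories are assumptions for illustration; consult the AudioSet documentation for the actual ontology format.

```python
# Hypothetical miniature ontology; real identifiers and names will differ.
ONTOLOGY = [
    {"id": "music", "name": "Music", "child_ids": ["instrument", "genre"]},
    {"id": "instrument", "name": "Musical instrument", "child_ids": ["guitar"]},
    {"id": "genre", "name": "Music genre", "child_ids": []},
    {"id": "guitar", "name": "Guitar", "child_ids": []},
]


def descendants(ontology, root_id):
    """Return names of all categories below root_id in depth-first order."""
    by_id = {node["id"]: node for node in ontology}
    seen = []
    stack = list(reversed(by_id[root_id]["child_ids"]))
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            # A category reachable through several parents is listed once.
            continue
        seen.append(node_id)
        stack.extend(reversed(by_id[node_id]["child_ids"]))
    return [by_id[node_id]["name"] for node_id in seen]


music_categories = descendants(ONTOLOGY, "music")
```

A traversal like this is useful when a model should be trained on a broad class, since clips labeled with any descendant category can be grouped under the parent.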
Iris flower dataset
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.
- Type: Multivariate
- Task: Classification
- Attributes: Real
- Year: 1988
Artificial Characters Dataset
This database was artificially generated using a first-order theory that describes the structure of ten capital letters of the English alphabet, together with a random-choice theorem prover that accounts for heterogeneity in the instances.
- Type: Multivariate
- Task: Classification
- Attributes: Categorical, Integer, Real
- Year: 1992