When it comes to AI, can we ditch the datasets? – MIT News
Huge amounts of data are needed to train machine-learning models to perform image classification tasks, such as identifying damage in satellite photos following a natural disaster. However, these data are not always easy to come by. Datasets may cost millions of dollars to generate, if usable data exist in the first place, and even the best datasets often contain biases that negatively impact a model's performance.
To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine-learning model without a dataset. Instead, the method uses a special type of machine-learning model to generate extremely realistic synthetic data, which can then train another model for downstream vision tasks.
Their results show that a contrastive representation learning model trained using only these synthetic data is able to learn visual representations that rival or even outperform those learned from real data.
This special machine-learning model, known as a generative model, requires far less memory to store or share than a dataset. Using synthetic data also has the potential to sidestep some concerns around privacy and usage rights that limit how some real data can be distributed. A generative model could also be edited to remove certain attributes, like race or gender, which could address some biases that exist in traditional datasets.
"We knew that this method should eventually work; we just needed to wait for these generative models to get better and better. But we were especially pleased when we showed that this method sometimes does even better than the real thing," says Ali Jahanian, a research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.
Jahanian wrote the paper with CSAIL grad students Xavier Puig and Yonglong Tian, and senior author Phillip Isola, an assistant professor in the Department of Electrical Engineering and Computer Science. The research will be presented at the International Conference on Learning Representations.
Generating synthetic data
Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing. The training process involves showing the generative model millions of images that contain objects in a particular class (like cars or cats), and then it learns what a car or cat looks like so it can generate similar objects.
Essentially by flipping a switch, researchers can use a pretrained generative model to output a steady stream of unique, realistic images that are based on those in the model's training dataset, Jahanian says.
But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can imagine how a car would look in different situations (situations it did not see during training) and then output images that show the car in unique poses, colors, or sizes.
Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different.
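To make the idea of similar and different pairs concrete, here is a minimal sketch of one common contrastive objective, an InfoNCE-style loss. The article does not say which loss the researchers used, so the loss choice, the function names, the temperature value, and the tiny three-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor embedding.

    The loss is small when the anchor is more similar to its positive
    view than to any of the negatives, and large otherwise.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidate logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(
        sum(math.exp(value - max_logit) for value in logits)
    )
    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - logits[0]


# Hypothetical embeddings, made up for illustration only.
anchor = [1.0, 0.0, 0.0]
similar_view = [0.9, 0.1, 0.0]
unrelated = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Correct pairing: the similar view is treated as the positive.
good_loss = info_nce_loss(anchor, similar_view, unrelated)
# Wrong pairing: an unrelated embedding is treated as the positive.
bad_loss = info_nce_loss(anchor, unrelated[0], [similar_view, unrelated[1]])

print(good_loss < bad_loss)
```

Training a contrastive model amounts to adjusting the encoder that produces these embeddings so that losses of this kind go down across many pairs.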
The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains.
"This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations," he says.
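The pairing of a generator with a contrastive learner can be sketched roughly as follows. The article does not describe how the different views are actually produced, so this sketch assumes one plausible mechanism: sampling a latent vector and nudging it slightly to get nearby views of the same underlying object. The `toy_generator` function is a hypothetical stand-in for a real pretrained generative model, and all names and parameter values are illustrative.

```python
import random


def toy_generator(latent):
    """Hypothetical stand-in for a pretrained generative model.

    Maps a latent vector to a fake "image", here just a list of numbers.
    """
    return [round(2.0 * z + 1.0, 4) for z in latent]


def sample_views(generator, latent_dim=4, num_views=2, noise_scale=0.1, rng=None):
    """Return several views generated from slightly perturbed copies of
    a single randomly drawn latent vector (an assumed mechanism)."""
    if rng is None:
        rng = random.Random()
    base = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    views = []
    for _ in range(num_views):
        # A small random nudge keeps the view close to the base object.
        nearby = [z + rng.gauss(0.0, noise_scale) for z in base]
        views.append(generator(nearby))
    return views


rng = random.Random(0)
# Each batch entry holds views of one object. A contrastive learner would
# treat views within an entry as positives and views from other entries
# as negatives.
batch = [sample_views(toy_generator, rng=rng) for _ in range(3)]
for views in batch:
    print(views)
```

Because the generator can be sampled indefinitely, a loop like this never runs out of fresh training pairs, which is the property the researchers go on to study.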
Even better than the real thing
The researchers compared their method to several other image classification models that were trained using real data and found that their method performed as well as, and sometimes better than, the other models.
One advantage of using a generative model is that it can, in theory, create an infinite number of samples. So, the researchers also studied how the number of samples influenced the model's performance. They found that, in some instances, generating larger numbers of unique samples led to additional improvements.
"The cool thing about these generative models is that someone else trained them for you. You can find them in online repositories, so everyone can use them. And you don't need to intervene in the model to get good representations," Jahanian says.
But he cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren't properly audited.
He and his collaborators plan to address those limitations in future work. Another area they want to explore is using this technique to generate corner cases that could improve machine-learning models. Corner cases often can't be learned from real data. For instance, if researchers are training a computer vision model for a self-driving car, real data wouldn't contain examples of a dog and his owner running down a highway, so the model would never learn what to do in this situation. Generating that corner-case data synthetically could improve the performance of machine-learning models in some high-stakes situations.
The researchers also want to continue improving generative models so they can compose images that are even more sophisticated, he says.
This research was supported, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.