How Machine Learning Learns from Data — and Why Errors Still Happen
6 min read / 2026-06-19
Machine learning is the technology that lets computers find patterns in huge amounts of data and make predictions — but it can still produce wrong answers, especially when the data it learns from is incomplete or biased.
What machine learning means
Machine learning (ML) is a type of computer program that improves at a task by studying examples, rather than following instructions written out step by step. Think of it like this: instead of a teacher telling you exactly how to recognise a dog, you look at thousands of photos of dogs and non-dogs until you can spot the difference yourself. ML systems do the same thing — they look at large amounts of data, find patterns, and use those patterns to make future predictions or estimates. The more relevant data they study, the better they usually get.
How it works in practice
A machine learning model goes through three main stages.
1. Training: The model is fed a large dataset, for example thousands of satellite images of factories, and told the correct answer for each one (say, how much pollution each factory produced, measured on the ground). It adjusts its internal settings until its guesses match the known answers as closely as possible.
2. Testing: The model is then tried on new examples it has never seen, to check whether it learned a real pattern or just memorised the training data.
3. Deployment: Once it performs well enough, the model is used to estimate answers for situations where no direct measurement exists, such as estimating emissions from a remote industrial site that has no pollution sensor.
In climate science, ML models use satellite images, atmospheric readings, and industrial records to estimate how much greenhouse gas a facility releases.
A simple example
Imagine you want to guess the marks a student will score in an exam. You collect data on past students: how many hours they studied, how many mock tests they took, and their final scores. You feed this to an ML model. It finds that study hours and mock tests are strong clues, and it builds a formula. Now, for a new student, you enter their study hours and mock test results, and the model predicts a likely score. This works well if your past data is accurate and covers many different types of students. But if all your past data came from one school in one city, the model might predict poorly for students from a different background. The same problem appears in emissions tracking: if a model is trained mostly on well-monitored factories in wealthy countries, it may give bad estimates for factories in regions with fewer ground sensors.
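The exam example can be tried out with invented numbers. The sketch below is only an illustration under stated assumptions: both "schools" are simulated, the score formulas are made up, and the helper names (`simulate_school`, `fit_linear`, `mean_abs_error`) are hypothetical. It fits a simple formula to students from one school, then checks it on a second school where the same study habits lead to different scores.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate_school(n, base, per_hour, per_mock):
    """Invented students: (study_hours, mock_tests, final_score)."""
    rows = []
    for _ in range(n):
        hours = random.uniform(2, 10)
        mocks = random.randint(0, 5)
        score = base + per_hour * hours + per_mock * mocks + random.gauss(0, 2)
        rows.append((hours, mocks, score))
    return rows

def fit_linear(rows):
    """Least-squares fit of score = a + b*hours + c*mocks (normal equations)."""
    n = 3
    A = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for hours, mocks, score in rows:
        x = (1.0, hours, mocks)  # leading 1.0 is the intercept term
        for i in range(n):
            rhs[i] += x[i] * score
            for j in range(n):
                A[i][j] += x[i] * x[j]
    # Solve the 3x3 system by Gaussian elimination with partial pivoting.
    M = [A[i] + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def mean_abs_error(weights, rows):
    a, b, c = weights
    return sum(abs(a + b * h + c * m - s) for h, m, s in rows) / len(rows)

# School A supplies the training data; school B has a different background.
school_a = simulate_school(200, base=35, per_hour=4, per_mock=3)
school_b = simulate_school(200, base=25, per_hour=3, per_mock=2)

weights = fit_linear(school_a[:150])
error_same_school = mean_abs_error(weights, school_a[150:])
error_other_school = mean_abs_error(weights, school_b)

print(f"error on unseen students from school A: {error_same_school:.1f} marks")
print(f"error on students from school B: {error_other_school:.1f} marks")
```

In this simulation the formula does well on unseen students from the school it learned from and noticeably worse on the second school, which is the same effect described above for factories in regions with fewer ground sensors.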
Why errors can creep in
ML models are only as good as the data they learn from. Common sources of error include:
- Gaps in training data: If certain regions, industries, or time periods are underrepresented, the model fills in gaps with guesses that may be far off.
- Measurement uncertainty: Satellite sensors have limits. Cloud cover, atmospheric interference, or sensor drift can distort readings.
- Model assumptions: Every ML model makes simplifying assumptions. If reality is more complex than those assumptions, predictions drift from the truth.
- Lack of ground truth: In many developing regions, there are few on-the-ground sensors to check whether satellite-based estimates are correct.
Researchers and engineers try to reduce these errors through validation (comparing model outputs against real measurements) and by publishing uncertainty ranges so users know how confident to be in any given estimate.
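Validation and uncertainty ranges can be illustrated with a small sketch. Everything here is assumed for the example: the ten estimate and measurement pairs are made up, and taking percentiles of past errors is just one simple way to build a range, not a description of what any particular research team does.

```python
from statistics import mean, quantiles

# Hypothetical validation set: (model estimate, ground measurement) pairs in
# arbitrary emission units. These numbers are invented for illustration.
pairs = [
    (102.0, 98.0), (55.0, 60.0), (210.0, 201.0), (80.0, 83.0), (150.0, 149.0),
    (33.0, 30.0), (120.0, 131.0), (95.0, 92.0), (175.0, 170.0), (64.0, 72.0),
]

# Residual = what was measured minus what the model estimated.
residuals = [measured - estimated for estimated, measured in pairs]

bias = mean(residuals)                             # systematic over/under-estimate
mean_abs_error = mean([abs(r) for r in residuals])  # typical size of a miss

# The 10th and 90th percentiles of past residuals give a rough 80% range.
deciles = quantiles(residuals, n=10)  # nine cut points
low_offset, high_offset = deciles[0], deciles[-1]

def estimate_with_range(estimate):
    """Attach a rough uncertainty range to a new model estimate."""
    return estimate + low_offset, estimate, estimate + high_offset

low, mid, high = estimate_with_range(100.0)
print(f"bias: {bias:+.1f}, mean absolute error: {mean_abs_error:.1f}")
print(f"estimate: {mid:.0f} (roughly {low:.0f} to {high:.0f})")
```

With only ten validation points the range is crude. It also only describes sites similar to those in the validation set, so it says little about regions where no ground measurements exist.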
What to remember
Machine learning is a powerful tool that allows scientists to estimate things — like factory emissions worldwide — that would be impossible to measure one by one. But no ML model is perfect. Every estimate carries some uncertainty, and good science makes that uncertainty visible rather than hiding it. Independent validation (checking the model's answers against real-world data) is what keeps ML results trustworthy. When a model is used to make big decisions — like where to send climate funding or which countries to hold accountable for pollution — the quality of the data and the honesty about uncertainty matter enormously.
Key words
Machine learning
A type of computer program that learns to make predictions by finding patterns in large sets of examples, rather than following fixed rules.
Training data
The collection of examples — with known correct answers — that a machine learning model studies to learn its patterns.
Validation
The process of checking a model's outputs against real-world measurements to see how accurate it is.
Uncertainty
A measure of how far a model's estimate might be from the true value; good science always reports this alongside any prediction.
Key facts
- The term "machine learning" was popularised in 1959 by Arthur Samuel, who built a program that learned to play checkers by playing thousands of games against itself.
- Satellite-based ML models can analyse images of tens of thousands of industrial sites in the time it would take human inspectors to visit just a handful.
- A model trained on data from one region can be significantly less accurate when applied to a different region; researchers call this "distribution shift".
- The National Institute of Standards and Technology (NIST) publishes guidelines for evaluating AI and ML systems, including how to measure and report uncertainty in model outputs.
- UNESCO's Recommendation on the Ethics of AI, adopted in 2021, calls for AI systems used in high-stakes decisions to be transparent, auditable, and tested for accuracy across different populations and contexts.
Why it matters
When ML models are used to guide billion-dollar climate decisions, small errors in training data or model design can steer money and accountability in the wrong direction, so understanding how these models work helps everyone ask better questions about results.
Sources
- National Institute of Standards and Technology (NIST) — AI Risk Management Framework
- UNESCO — Recommendation on the Ethics of Artificial Intelligence (2021)
- ScienceDaily — research coverage of machine learning in environmental science


