Building ML models when data is scarce
- Samvar Shah
- 2 days ago
- 2 min read

Most machine learning advice assumes we have lots of data. But in the real world we often don’t. Maybe the system is new, experiments are expensive, or data collection takes years.
So what do we do when “just get more data” isn’t an option? (This is exactly what happened when I was building an ML model for Rogan Kala, where the limited availability of data samples was a major constraint.)
To overcome this, we can combine the data we do have with math, physics, and process knowledge.
Use Math to Shrink the Problem
Bayesian models with informative priors
Probabilistic models with explicit structure
Math helps by reducing the number of degrees of freedom. We’re telling the model, “The answer probably looks like this,” instead of letting it guess wildly.
Bayesian methods are especially powerful here. They let us encode expertise (“this parameter should be positive,” “this effect is small”) as priors and update them as data arrives.
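To make this concrete, here is a minimal sketch of a Bayesian regression with informative priors using PyMC. The dataset, variable names, and prior values are all hypothetical, purely for illustration of how a tight prior lets a handful of points produce a usable estimate.

```python
import numpy as np
import pymc as pm

# Hypothetical tiny dataset: 8 observations of an input x and a response y.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.1, 1.9, 2.4, 3.2, 3.3, 4.1, 4.4, 5.2])

with pm.Model():
    # Informative priors: we "tell" the model the slope is positive
    # and roughly around 1, and that the noise is small.
    slope = pm.TruncatedNormal("slope", mu=1.0, sigma=0.5, lower=0.0)
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    noise = pm.HalfNormal("noise", sigma=0.5)

    # Likelihood: the observed data updates the priors.
    pm.Normal("y_obs", mu=intercept + slope * x, sigma=noise, observed=y)

    # Posterior sampling; with priors this tight, 8 points go a long way.
    trace = pm.sample(1000, tune=1000, chains=2)

print(trace.posterior["slope"].mean())
```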
Use Physics (or First Principles) as the Model Backbone
If the system follows known physical laws, we can:
Start with equations (e.g., conservation laws, kinematics, thermodynamics)
Use ML to learn only what the equations can’t capture
This approach is often called:
Physics-informed ML
Hybrid modeling
Here, ML fills in the gaps: unknown coefficients, delays, or unmodeled interactions. Physics constrains the solution space so that even small datasets can be enough.
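As a sketch of what hybrid modeling can look like, assume a hypothetical cooling process governed by Newton’s law of cooling. The physics gives the backbone, and a small regressor learns only the residual the equation misses. All values below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Physics backbone: Newton's law of cooling with assumed constants.
T_env, T0, k = 25.0, 90.0, 0.08   # ambient temp, start temp, cooling rate

def physics_model(t):
    return T_env + (T0 - T_env) * np.exp(-k * t)

# Small measured dataset: time in minutes, observed temperature.
t_obs = np.array([1, 3, 5, 8, 12, 18, 25, 35], dtype=float)
T_obs = np.array([85.1, 79.2, 74.0, 67.3, 59.8, 51.0, 43.5, 36.2])

# ML learns only the residual the physics misses
# (e.g., drafts, sensor bias), not the whole curve.
residual = T_obs - physics_model(t_obs)
residual_model = GradientBoostingRegressor(n_estimators=50, max_depth=2)
residual_model.fit(t_obs.reshape(-1, 1), residual)

def hybrid_predict(t):
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return physics_model(t) + residual_model.predict(t.reshape(-1, 1))

print(hybrid_predict([10.0, 20.0]))
```

Because the regressor only has to capture a small correction, it needs far fewer samples than a model asked to learn the entire temperature curve from scratch.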
Use Process Knowledge and Rules
In many domains, there’s a well-understood process:
Manufacturing steps
Medical workflows
These can be encoded as:
Rule-based systems
Causal graphs
Simulation models
ML is then used inside the process. For example, a small classifier might estimate a risk score at one step, while the rest of the decision-making follows fixed logic. This vastly reduces how much data we need because the model isn’t responsible for end-to-end behavior.
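Here is a minimal sketch of that pattern, with made-up sensor readings and thresholds: a small classifier supplies a risk score at one step, and hard-coded rules handle the rest of the decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny labeled dataset: [vibration, temperature] readings and defect labels.
X = np.array([[0.2, 60], [0.3, 65], [0.8, 80], [0.9, 85],
              [0.4, 70], [0.7, 78], [0.1, 58], [0.85, 82]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

risk_model = LogisticRegression().fit(X, y)

def inspect_part(vibration, temperature, pressure_ok):
    # Fixed rule: a failed pressure check rejects the part outright.
    if not pressure_ok:
        return "reject"
    # ML only supplies a risk score at this single step.
    risk = risk_model.predict_proba([[vibration, temperature]])[0, 1]
    # Fixed thresholds encode process knowledge, not learned behavior.
    if risk > 0.7:
        return "reject"
    if risk > 0.3:
        return "manual inspection"
    return "accept"

print(inspect_part(0.6, 75, pressure_ok=True))
```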
What do you think? Is this feasible? Any examples where you might have tried this? Would love to know your thoughts.