Building ML models when data is scarce
- Samvar Shah
- 2 days ago
- 2 min read

Most machine learning advice assumes we have lots of data. But in the real world we often don’t. Maybe the system is new, experiments are expensive, or data collection takes years.
So what do we do when “just get more data” isn’t an option? (This is exactly what happened when I was building an ML model for Rogan Kala, where the limited availability of data samples was a major constraint.)
To overcome this, we can combine the data we do have with math, physics, and process knowledge.
Use Math to Shrink the Problem
Bayesian models with informative priors
Probabilistic models with explicit structure
Math helps by reducing the number of degrees of freedom. We’re telling the model, “The answer probably looks like this,” instead of letting it guess wildly.
Bayesian methods are especially powerful here. They let us encode expertise (“this parameter should be positive,” “this effect is small”) as priors and update them as data arrives.
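To make this concrete, here is a minimal sketch of a Bayesian regression with informative priors using PyMC. The dataset, variable names, and prior values are all hypothetical, purely for illustration of how a tight prior lets a handful of points produce a usable estimate.

```python
import numpy as np
import pymc as pm

# Hypothetical tiny dataset: 8 observations of an input x and a response y.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([1.1, 1.9, 2.4, 3.2, 3.3, 4.1, 4.4, 5.2])

with pm.Model():
    # Informative priors: we "tell" the model the slope is positive
    # and roughly around 1, and that the noise is small.
    slope = pm.TruncatedNormal("slope", mu=1.0, sigma=0.5, lower=0.0)
    intercept = pm.Normal("intercept", mu=0.0, sigma=1.0)
    noise = pm.HalfNormal("noise", sigma=0.5)

    # Likelihood: the observed data updates the priors.
    pm.Normal("y_obs", mu=intercept + slope * x, sigma=noise, observed=y)

    # Posterior sampling; with priors this tight, 8 points go a long way.
    trace = pm.sample(1000, tune=1000, chains=2)

print(trace.posterior["slope"].mean())
```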
Use Physics (or First Principles) as the Model Backbone
If the system follows known physical laws, we can:
Start with equations (e.g., conservation laws, kinematics, thermodynamics)
Use ML to learn only what the equations can’t capture
This approach is often called:
Physics-informed ML
Hybrid modeling
Here, ML fills in the gaps: unknown coefficients, delays, or unmodeled interactions. Physics constrains the solution space so that even small datasets can be enough.
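As a sketch of what hybrid modeling can look like, assume a hypothetical cooling process governed by Newton’s law of cooling. The physics gives the backbone, and a small regressor learns only the residual the equation misses. All values below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Physics backbone: Newton's law of cooling with assumed constants.
T_env, T0, k = 25.0, 90.0, 0.08   # ambient temp, start temp, cooling rate

def physics_model(t):
    return T_env + (T0 - T_env) * np.exp(-k * t)

# Small measured dataset: time in minutes, observed temperature.
t_obs = np.array([1, 3, 5, 8, 12, 18, 25, 35], dtype=float)
T_obs = np.array([85.1, 79.2, 74.0, 67.3, 59.8, 51.0, 43.5, 36.2])

# ML learns only the residual the physics misses
# (e.g., drafts, sensor bias), not the whole curve.
residual = T_obs - physics_model(t_obs)
residual_model = GradientBoostingRegressor(n_estimators=50, max_depth=2)
residual_model.fit(t_obs.reshape(-1, 1), residual)

def hybrid_predict(t):
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return physics_model(t) + residual_model.predict(t.reshape(-1, 1))

print(hybrid_predict([10.0, 20.0]))
```

Because the regressor only has to capture a small correction, it needs far fewer samples than a model asked to learn the entire temperature curve from scratch.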
Use Process Knowledge and Rules
In many domains, there’s a well-understood process:
Manufacturing steps
Medical workflows
These can be encoded as:
Rule-based systems
Causal graphs
Simulation models
ML is then used inside the process. For example, a small classifier might estimate a risk score at one step, while the rest of the decision-making follows fixed logic. This vastly reduces how much data we need because the model isn’t responsible for end-to-end behavior.
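Here is a minimal sketch of that pattern, with made-up sensor readings and thresholds: a small classifier supplies a risk score at one step, and hard-coded rules handle the rest of the decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny labeled dataset: [vibration, temperature] readings and defect labels.
X = np.array([[0.2, 60], [0.3, 65], [0.8, 80], [0.9, 85],
              [0.4, 70], [0.7, 78], [0.1, 58], [0.85, 82]])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

risk_model = LogisticRegression().fit(X, y)

def inspect_part(vibration, temperature, pressure_ok):
    # Fixed rule: a failed pressure check rejects the part outright.
    if not pressure_ok:
        return "reject"
    # ML only supplies a risk score at this single step.
    risk = risk_model.predict_proba([[vibration, temperature]])[0, 1]
    # Fixed thresholds encode process knowledge, not learned behavior.
    if risk > 0.7:
        return "reject"
    if risk > 0.3:
        return "manual inspection"
    return "accept"

print(inspect_part(0.6, 75, pressure_ok=True))
```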
What do you think? Is this feasible? Any examples where you might have tried this? Would love to know your thoughts.