IDK how it applies to LLMs, but the original meaning was a change in a distribution over time. Say you had some model-based app trained on American English, and slowly more and more American Spanish speakers adopt your app: the training set distribution and the actual usage distribution drift apart.
In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
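For what it's worth, here's a rough sketch of how you could put a number on that kind of drift, assuming you can tag each training example and each live request with a language label. The labels, the counts, and the choice of Jensen-Shannon divergence are all my own assumptions for illustration, not something from the project being discussed.

```python
import math
from collections import Counter


def distribution(labels):
    """Turn a list of category labels into a {label: probability} dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two categorical
    distributions: 0 means identical, 1 means no overlap at all."""
    keys = set(p) | set(q)
    # Mixture distribution halfway between p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Made-up numbers: the training set is mostly English, live traffic shifts.
train = ["en"] * 95 + ["es"] * 5
month_1 = ["en"] * 90 + ["es"] * 10
month_6 = ["en"] * 55 + ["es"] * 45

p_train = distribution(train)
drift_1 = js_divergence(p_train, distribution(month_1))
drift_6 = js_divergence(p_train, distribution(month_6))

print(f"drift after 1 month:  {drift_1:.4f}")
print(f"drift after 6 months: {drift_6:.4f}")
```

The score only tells you the input mix has moved. It says nothing about accuracy by itself, so you'd still want to re-evaluate on a sample of recent traffic once it crosses whatever threshold you pick.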
Interesting approach. I've been particularly interested in tracking, and being able to understand, whether adding skills or tweaking prompts is making things better or worse.
Anyone know of any other similar tools that allow you to track across harnesses, while coding?
Running evals as a solo dev is too cost-prohibitive, I think.
This project is somewhat unconventional in its approach, but that might help it reveal issues that are masked in typical benchmark datasets.