NanoGPT Slowrun: Language Modeling with Limited Data, Infinite Compute

(qlabs.sh)

140 points | by sdpmas 11 hours ago

12 comments

linolevan 6 hours ago
There was this very interesting paper out of Stanford this last September about pretraining under the unlimited compute but limited data paradigm[0]. Pretty much exactly the same thing but with ~200M training tokens instead.
[0] https://www.alphaxiv.org/abs/2509.14786
[-]
- sdpmas 6 hours ago
  yeah, we do incorporate some of the findings from the paper in our repo! like aggressive regularization and ensembling.
  [-]
  - _0ffh 4 hours ago
    I see you already mention diffusion - iirc there was a result not too long ago that diffusion models keep improving with more epochs for longer than AR models do.
    [-]
    - sdpmas 4 hours ago
      diffusion is promising, but still an open question how much data efficient they are compared to AR. in practice, you can also train AR forever with high enough regularization, so let's see.
      [-]
      - _0ffh 4 hours ago
        Yes, it could go either way of course.
        Still, just for reference, here's the paper I remembered: https://arxiv.org/pdf/2507.15857
        [-]
        sdpmas 4 hours ago
        thanks, here's another one: https://arxiv.org/abs/2511.03276
bee_rider 5 hours ago
> Directions we think are wide open
> Second-order optimizers and natural gradient methods
Do second order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).
[-]
- sdpmas 4 hours ago
  yes! typically the optimizer that trains faster also get better data efficiency. it maybe not be absolutely true, but that has been my observation so far. also see https://arxiv.org/pdf/2510.09378 for second-order methods.
  [-]
  - vladf 2 hours ago
    That still looks like a “converge faster” paper.
    https://arxiv.org/abs/2006.10732
    The above provides a nuanced theoretical view. GD inductive bias is probably better unless your model is misspecified
  - alyxya 3 hours ago
    Fundamentally I don't believe second-order methods get better data efficiency by itself, but changes to the optimizer can because the convergence behavior changes. ML theory lags behind the results in practice.
shubhamintech 2 hours ago
The ensemble diversity point is underrated. Most teams pick one architecture and ship it, so the finding that architectural variation beats random seeds is interesting but hard to act on in practice. The more useful takeaway: low-data regimes expose every bad design decision you normally paper over with more tokens. It's basically a forcing function for understanding what actually drives model quality vs. what's just scale noise.
kseniamorph 7 hours ago
Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
[-]
- timshel1 6 hours ago
  Modded-nanogpt is also much more data efficient than vanilla napogpt, even if some of the individual optimizations trade off higher throughput for worse data efficiency.
  [-]
  - sdpmas 6 hours ago
    yes, agreed, modded-nanogpt is already a data-efficient variant of original nanogpt. just that the kinds of algorithms it allows are somewhat constrained because it optimizes for wall clock time.
archermarks 9 hours ago
Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset? i.e. instead of generalizing you lean more toward memorization? Obviously you leave out a validation set but since you're meta-optimizing the model itself by its performance on the validation dataset you're still at risk of over-fitting.
[-]
- sdpmas 9 hours ago
  yes, good point. right now, it's somewhat hard to overfit because the meta-optimization extracts tiny bits of information. but over time, we will switch the validation set to some other random subset of the FineWeb or even entirely OOD datasets!
lzaborowski 9 hours ago
I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.
If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
navvyeanand 9 hours ago
Amazing job!
refulgentis 5 hours ago
This looks awesome!!! I’m curious on the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but that also seems wrong, because I assume we could just keep increasing the number of different models you train to get a win.
[-]
- sdpmas 4 hours ago
  no ensembling means train 8 models and during inference avg logits of all 8 models to make a prediction.
suddenlybananas 10 hours ago
Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and see how this challenge differs.
[-]
- sdpmas 10 hours ago
  hey, it's Samip (behind the Slowrun repo). yeah that's a fair point, we will mention them in the blog. but there are a couple of major differences: 1. our emphasis is on using more compute to get better data efficiency. this is important because there are lots of hacky chances that will get lower loss, but when compared to general methods that leverage a lot of compute, they don't do so well. and you can already see how this emphasis on compute leads to different methods to BabyLM! 2. our reasoning behind the repo is not anything to do with how much data a child sees. and our dataset is not tailored towards that either. it's simple pretraining on random subset of the internet. we know there are better training algorithms that get lower loss on that data, and we are finding those.
  [-]
  - soraki_soladead 10 hours ago
    also, BabyLM is more of a conference track / workshop than an open-repo competition which creates a different vibe
aplomb1026 4 hours ago
[dead]
riajain2525 8 hours ago
[flagged]
STARGA 5 hours ago
[flagged]
[-]
- whimsicalism 5 hours ago
  really no shame in comments like these?
  [-]
  - devinplatt 5 hours ago
    It seems like best etiquette would be to have a username with "bot" in it and include something in the post explicitly indicating it's a bot (e.g. a signature).
    This isn't even a new problem where a good cultural solution hasn't been figured out yet. Reddit has had bot etiquette for years.
  - noahbp 4 hours ago
    In case anyone doubts it's AI-written: https://www.pangram.com/history/b7433cbe-08e7-43fe-9a32-3e43...