1 comment

  • TotallyNotOla 1 hour ago
    ML training workflows store checkpoints as full model files, even when most of the model hasn’t changed. I wanted to see what happens if you treat checkpoints as structured objects instead of opaque blobs, and deduplicate at the tensor level.

    A few things surprised me:

    - Delta compression mostly doesn’t work during training (deltas can be larger than the original)

    - File-level deduplication (e.g. DVC) doesn’t capture most of the redundancy

    - Almost all storage savings come from exact tensor identity, not partial overlap

    For things like warm-start tree models and transfer learning, this ends up working really well. Curious if anyone has seen different behavior with larger models or different chunk sizes.
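    To make the tensor-level idea concrete, here's a minimal sketch of content-hash dedup (my own hypothetical illustration, not the actual implementation): each tensor is stored once, keyed by a hash of its raw bytes, so tensors that are byte-identical across checkpoints share storage. Names like `save_checkpoint` and the in-memory `store` dict are just for the example.

    ```python
    import hashlib
    import numpy as np

    store = {}  # content hash -> raw tensor bytes (stand-in for a blob store)

    def save_checkpoint(tensors):
        """Write tensors into the store; return a manifest of name -> (hash, shape, dtype)."""
        manifest = {}
        for name, t in tensors.items():
            h = hashlib.sha256(t.tobytes()).hexdigest()
            store.setdefault(h, t.tobytes())  # dedup: identical bytes stored once
            manifest[name] = (h, t.shape, str(t.dtype))
        return manifest

    def load_checkpoint(manifest):
        """Rebuild tensors from the store using the manifest."""
        return {name: np.frombuffer(store[h], dtype=dtype).reshape(shape)
                for name, (h, shape, dtype) in manifest.items()}

    # Two checkpoints where one layer is unchanged (e.g. a frozen embedding):
    ckpt1 = {"embed": np.ones((4, 4)), "head": np.zeros((2, 2))}
    ckpt2 = {"embed": np.ones((4, 4)), "head": np.full((2, 2), 0.5)}
    m1 = save_checkpoint(ckpt1)
    m2 = save_checkpoint(ckpt2)
    print(len(store))  # 3 unique tensors stored, not 4
    ```

    This only captures exact identity, which matches the observation above: partial overlap between updated tensors doesn't compress well, so hashing whole tensors gets you most of the win with very little machinery.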