Running Out of Disk Space in Production

(alt-romes.github.io)

52 points | by romes 3 days ago

4 comments

  • flanfly 1 hour ago
    A neat trick I was told is to always have ballast files on your systems. Just a few GiB of zeros that you can delete in cases like this. This won't fix the problem, but will buy you time and free space for stuff like lock files so you can get a working system.
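
    A minimal sketch of the ballast idea, assuming a POSIX-like system. The function names, path, and size are hypothetical. It writes real zero bytes instead of calling truncate(), since a sparse file would reserve no blocks. On filesystems with transparent compression, a file of zeros may occupy almost no space, so treat this as an illustration rather than a guarantee.

```python
import os


def create_ballast(path: str, size_bytes: int, chunk_size: int = 1 << 20) -> int:
    """Write size_bytes of zero bytes to path and return the bytes written.

    The data is written explicitly so the filesystem allocates blocks for it;
    truncate() alone would typically create a sparse file.
    """
    chunk = bytes(chunk_size)  # a reusable buffer of zeros
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_size, size_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data to disk before returning
    return written


def release_ballast(path: str) -> None:
    """Delete the ballast file to free its space; ignore it if already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

    Usage would look like create_ballast("/var/ballast.bin", 2 * 1024**3) at provisioning time (hypothetical path and size) and release_ballast("/var/ballast.bin") when the disk fills.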
    • throw0101d 5 minutes ago
      > A neat trick I was told is to always have ballast files on your systems.

      ZFS has a "reservation" mechanism that's handy:

      > The minimum amount of space guaranteed to a dataset, not including its descendants. When the amount of space used is below this value, the dataset is treated as if it were taking up the amount of space specified by refreservation. The refreservation reservation is accounted for in the parent datasets' space used, and counts against the parent datasets' quotas and reservations.

      * https://openzfs.github.io/openzfs-docs/man/master/7/zfsprops...

      Quotas prevent users/groups/directories (ZFS datasets) from using too much space, but reservations ensure that particular areas always have a minimum amount set aside for them.

    • dspillett 24 minutes ago
      Similarly, I always leave some space unallocated on LVM volume groups. It means that I can temporarily expand a volume easily if needed.

      It also leaves some space unused to help out the wear-levelling on the SSDs underlying the RAID array that is the PV¹ for LVM. I'm not 100% sure this is needed any more², but I've not looked into that sufficiently, so until I do I'll keep the habit.

      --------

      [1] If there are multiple PVs, from different drives/arrays, in the VG, then you might need to manually skip a bit on each one, because LVM will naturally fill one before using the next. Just allocate a small LV specially on each and don't use it. You can remove one or all of them and add the extents to the full LV if/when needed. Giving it a useful name also reminds you why that bit of space is carved out.

      [2] drives under-allocate by default IIRC

    • fifilura 1 hour ago
      I did this too, but I also zipped the file. Turns out it had a great packing ratio!
      • saagarjha 1 hour ago
        Personally I just keep the file on a ramdisk so you can avoid having to fetch it from slow storage
    • Chaosvex 18 minutes ago
      Similar to the old game development trick of hiding some memory away and then freeing it up near the end of development when the budget starts getting tight.
    • bombcar 13 minutes ago
      Some filesystems can be unable to delete a file if full. Something to be a bit worried about.
    • jaapz 1 hour ago
      Love the simplicity and pragmatism of this solution
    • ninalanyon 1 hour ago
      This is why I never empty the Rubbish Bin/Trash Can on my Linux laptop until the disk fills.
    • omarqureshi 1 hour ago
      Surely a 50% warning alarm on disk usage covers this without manual intervention?
      • theshrike79 1 hour ago
        Depends. A Kubernetes container might have only a few megabytes of disk space, because it shouldn't need it.

        Except that one time when .NET decides that the incoming POST is over some magic limit and it doesn't do the processing in-memory like before, but instead has to write it to disk, crashing the whole pod. Fun times.

        Also my Unraid NAS has two drives in "WARNING! 98% USED" alert state. One has 200GB of free space, the other 330GB. Percentages in integers don't work when the starting number is too big :)
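
        A rough sketch of one way around the integer-percentage problem: alert only when usage is high in both relative and absolute terms. The function names and both thresholds are hypothetical, and this is not how Unraid or any particular monitoring tool works. For scale, a 10 TB disk at 98% used still has 200 GB free.

```python
import shutil


def disk_alert(total_bytes: int, free_bytes: int,
               max_used_pct: float = 95.0,
               min_free_bytes: int = 50 * 10**9) -> bool:
    """Return True only if the disk is both proportionally full and
    short on free space in absolute terms.

    Both thresholds are illustrative defaults, not recommendations.
    """
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    return used_pct >= max_used_pct and free_bytes < min_free_bytes


def check_path(path: str = "/") -> bool:
    """Apply disk_alert to the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return disk_alert(usage.total, usage.free)
```

        With these defaults, a 10 TB drive with 200 GB free stays quiet, while a 100 MB container volume with 2 MB free triggers the alert.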

      • jcims 1 hour ago
        If the alarms are reliably configured, confirmed to be working, low noise enough to be actioned, etc etc.

        And of course there's nothing to say that both of these things can't be done simultaneously.

      • dspillett 1 hour ago
        If the alarm works. And if it is actioned, not just snoozed too many times or dismissed entirely.

        Defence in depth is a good idea: proper alarms, and a secondary measure in case they don't have the intended effect.

        • n4r9 9 minutes ago
          Surely there are pitfalls either way. A ballast file can be deleted too readily, or someone could forget to re-add it.
        • pixl97 54 minutes ago
          Alarms are great, but when something goes wrong SSDs can fill up amazingly fast!
    • testplzignore 1 hour ago
      Would another way be to drop the reserved space (typically 1% to 5% on an ext file system)?
      • bombcar 13 minutes ago
        Reserved space doesn't protect you against root, who is often the user to blame for the last used MB.
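
        To see how much space is held back this way, a small sketch using os.statvfs (Unix only). It assumes the gap between the blocks free in total and the blocks available to unprivileged users corresponds to the root-reserved area, which is how ext filesystems usually report it. The function names are hypothetical.

```python
import os


def reserved_from_counts(f_bfree: int, f_bavail: int, f_frsize: int) -> int:
    """Bytes that are free on the filesystem but not available to
    unprivileged users, computed from statvfs block counts."""
    return (f_bfree - f_bavail) * f_frsize


def reserved_bytes(path: str = "/") -> int:
    """Query the filesystem containing path and return its reserved bytes."""
    st = os.statvfs(path)
    return reserved_from_counts(st.f_bfree, st.f_bavail, st.f_frsize)
```

        A process running as root can write into that gap, so a root-owned runaway process can still fill the disk completely.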
  • entropie 51 minutes ago
    > I rushed to run du -sh on everything I could, as that’s as good as I could manage.

    I recently came across gdu [1] and have installed/used it on every machine since then.

    [1]: https://github.com/dundee/gdu
