16 comments

  • kreelman 5 hours ago
    This is very much worth watching. It is a tour de force.

    Laurie does an amazing job of reimagining Google's strange job optimisation technique (for jobs running on hard disk storage) that uses 2 CPUs to do the same job. The technique simply takes the result of the machine that finishes it first, discarding the slower job's results... It seems expensive in resources, but it works and allows high priority tasks to run optimally.

    Laurie re-imagines this process but for RAM!! In doing this she needs to deal with Cores, RAM channels and other relatively undocumented CPU memory management features.

    She was even able to work out various undocumented CPU/RAM settings by using her tool to find where timing differences exposed various CPU settings.

    She's turned "Tailslayer" into a lib now, available on Github, https://github.com/LaurieWired/tailslayer

    You can see her having so much fun, doing cool victory dances as she works out ways of getting around each of the issues that she finds.

    The experimentation, explanation and graphing of results is fantastic. Amazing stuff. Perhaps someone will use this somewhere?

    As mentioned in the YT comments, the work done here is probably a Master's degree's worth of work, experimentation and documentation.

    Go Laurie!

    • throwaway81523 1 hour ago
      This is a 54 minute video. I watched about 3 minutes and it seemed like some potentially interesting info wrapped in useless visuals. I thought about downloading and reading the transcript (that's faster than watching videos), but it seems to me that it's another video that would be much better as a blog post. Could someone summarize in a sentence or two? Yes we know about the refresh interval. What is the bypass?

      Update: found the bypass via the youtube blurb: https://github.com/LaurieWired/tailslayer

      "Tailslayer is a C++ library that reduces tail latency in RAM reads caused by DRAM refresh stalls.

      "It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules, using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton. Once the request comes in, Tailslayer issues hedged reads across all replicas, allowing the work to be performed on whichever result responds first."

      • svrtknst 3 minutes ago
        Unnecessarily negative imo.

        I like the video because I can't read a blog post in the background while doing other stuff, and I like Gadget Hackwrench narrating semi-obscure CS topics lol

      • fc417fc802 7 minutes ago
        > using (undocumented!) channel scrambling offsets that works on AMD, Intel, and Graviton

        Seems odd to me that all three architectures implement this yet all three leave it undocumented. Is it intended as some sort of debug functionality or what?

      • satvikpendem 1 hour ago
        Just use the Ask button on YouTube videos to summarize, that's what it's for.
    • gopalv 3 hours ago
      >> It replicates data across multiple, independent DRAM channels with uncorrelated refresh schedules

      This is the sort of thing which was done before in a world where there was NUMA, but that is easy. Just task-set and mbind your way around it to keep your copies in both places.

      The crazy part of what she's done is how to determine that the two copies don't get hit by refresh cycles at the same time.

      Particularly by experimenting on something proprietary like Graviton.

      • rockskon 3 hours ago
        She determines that by having three copies. Or four. Or eight.

        Tis just probabilities and unlikelihood of hitting a refresh cycle across that many memory channels all at once.

        • GeneralMayhem 1 hour ago
          Right, but the impressive part is finding addresses that are actually on different memory channels.
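
A back-of-the-envelope sketch of the probability argument in this subthread, assuming each channel is stalled by refresh for a fraction `p_stall` of the time and that the channels' refresh schedules are independent. The 5% figure and the function name are illustrative assumptions, not measurements from the video or part of the Tailslayer API.

```python
def prob_all_replicas_stalled(p_stall: float, replicas: int) -> float:
    """Chance that a read finds every replica's channel mid-refresh.

    Assumes independent (uncorrelated) refresh schedules, so the
    per-channel probabilities simply multiply.
    """
    return p_stall ** replicas


if __name__ == "__main__":
    # Illustrative assumption: each channel is busy refreshing 5% of the time.
    for n in (1, 2, 3, 4, 8):
        print(f"{n} replica(s): {prob_all_replicas_stalled(0.05, n):.3e}")
```

The multiplication only holds if the copies really sit on different channels, which is why locating such addresses is the hard part.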
    • 100ms 1 hour ago
      > Google's strange job optimisation technique (for jobs running on hard disk storage)

      Can you give more context on this? Opus couldn't figure out a reference for it

      • why_only_15 47 minutes ago
        This is a quite old technique. The idea, as I understood it, was that lots of data at Google was stored in triplicate for reliability purposes. Instead of fetching one, you fetched all three and then took the one that arrived first. Then you sent UDP packets cancelling the other two. For something like search where you're issuing hundreds of requests that have to resolve in a few hundred milliseconds, this substantially cut down on tail latency.
        • yvdriess 40 minutes ago
          Tournament parallelism is the technical term IIRC.
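
A minimal sketch of the replicated-fetch idea described above: send the same request to every replica, take whichever answers first, and make a best-effort attempt to cancel the rest. The replica function, its simulated delays, and the payload strings are hypothetical stand-ins; real systems send cancellation messages over the network, which this sketch does not model.

```python
import concurrent.futures
import time


def fetch_from_replica(replica_id: int, delay_s: float) -> str:
    """Hypothetical replica read: sleeps to simulate latency."""
    time.sleep(delay_s)
    return f"payload-from-replica-{replica_id}"


def hedged_fetch(delays_s: list[float]) -> str:
    """Issue the same read to all replicas and return the first result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(delays_s))
    futures = [
        pool.submit(fetch_from_replica, i, d) for i, d in enumerate(delays_s)
    ]
    done, not_done = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED
    )
    for future in not_done:
        # Best effort only: a read that is already running cannot be cancelled.
        future.cancel()
    pool.shutdown(wait=False)  # do not block on the slower replicas
    return next(iter(done)).result()


if __name__ == "__main__":
    # Replica 1 is the fastest in this simulated run.
    print(hedged_fetch([0.3, 0.01, 0.2]))
```

The slower requests still consume resources until they finish, which is the cost the thread keeps coming back to.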
      • tastroder 33 minutes ago
    • ufocia 4 hours ago
      I like the video, but this is hardly groundbreaking. You send out two or more messengers hoping at least one of them will get there on time.
      • rcbdev 3 hours ago
        Yeah. These are literally just mainframe techniques from yesteryear.
        • actionfromafar 18 minutes ago
          Almost everything "new" was invented by IBM it seems like. And it goes by a completely different name there. It's still nice to rediscover what they knew.
      • npunt 4 hours ago
        and dropbox was just rsync
      • UltraSane 3 hours ago
        The clever part is figuring out what RAM is controlled by which controllers.
        • saidnooneever 5 minutes ago
          everyone says this but no one says why it was clever. i find her videos have cool results but i cant have patience for them usually because its recycled old stuff (can be cool but its not ground breaking).

          there is a ton of info you can pull from: smbios, acpi, msrs, cpuid etc. etc. about cpu/ram topology and connectivity, latencies etc etc.

          isnt the info on what controllers/ram relationships exists somewhere in there provided by firmware or platform?

          i can hardly imagine it is not just plainly in there with the plethora of info in there...

          theres srat/slit/hmat etc. in acpi, then theres MSRs with info (amd expose more than intel ofc, as always) and then there is registers on memory controller itself as well as socket to socket interconnects from upi links..

          its just a lot of reading and finding bits here n there. LLms are actually really good at pulling all sorts of stuff from various 6-10k page documents if u are too lazy to dig yourself -_-

  • foltik 5 hours ago
    Love the format, and super cool to see a benchmark that so clearly shows DRAM refresh stalls, especially avoiding them via reverse engineering the channel layout! Ran it on my 9950X3D machine with dual-channel DDR5 and saw clear spikes from 70ns to 330ns every 15us or so.

    The hedging technique is a cool demo too, but I’m not sure it’s practical.

    At a high level it’s a bit contradictory; trying to reduce the tail latency of cold reads by doubling the cache footprint makes every other read even colder.

    I understand the premise is “data larger than cache” given the clflush, but even then you’re spending 2x the memory bandwidth and cache pressure to shave ~250ns off spikes that only happen once every 15us. There’s just not a realistic scenario where that helps.

    Especially HFT is significantly more complex than a huge lookup table in DRAM. In the time you spend doing a handful of 70ns DRAM reads, your competitor has done hundreds of reads from cache and a bunch of math. It’s just far better to work with what you can fit in cache. And to shrink what doesn’t as much as possible.
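
A rough worked estimate of the trade-off described above, using the figures reported in this comment (70ns baseline, 330ns spikes, roughly one spike every 15us). The model is an assumption: reads arrive uniformly at random, and a read that lands inside the stall window waits for the remainder of it. The function name is hypothetical.

```python
def mean_refresh_penalty_ns(base_ns: float, spike_ns: float, period_ns: float) -> float:
    """Average extra latency per read caused by refresh stalls.

    Assumed model: one stall of (spike_ns - base_ns) per period, reads arrive
    uniformly, and a read that hits the stall waits for half of it on average.
    """
    stall_ns = spike_ns - base_ns      # length of the refresh window
    p_hit = stall_ns / period_ns       # chance a read lands in the window
    mean_wait_ns = stall_ns / 2        # average remaining wait if it does
    return p_hit * mean_wait_ns


if __name__ == "__main__":
    penalty = mean_refresh_penalty_ns(base_ns=70, spike_ns=330, period_ns=15_000)
    print(f"mean penalty per read: {penalty:.2f} ns")  # about 2.25 ns
```

Under these assumptions the average read gains only a couple of nanoseconds, so any benefit shows up in the tail percentiles rather than in the mean.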

    • Lramseyer 1 hour ago
      Another point about HFT - They're mostly using FPGAs (some use custom silicon) which means that they have much tighter control over how DRAM is accessed and how the memory controller is configured. They could implement this in hardware if they really need to, but it wouldn't be at the OS level.
    • formerly_proven 1 hour ago
      On most RAM tREF can be increased a lot from the default, at least if kept somewhat cool.
  • sbiru93 9 minutes ago
    Doesn't doing this halve the computing power? I don't know this world at all, is that acceptable?
  • yalogin 27 minutes ago
    This is a cool idea, very well put through for everyone to understand such an esoteric concept.

    However I wonder if the core idea itself is useful or not in practice. With modern memory there are two main aspects it makes worse. First is cost: it needs to double the memory used for the same compute. With memory costs already soaring this is not good. The other main issue is throughput; I haven’t put enough thought into that yet, but it feels like it requires more orchestration and increases costs there too.

  • bronlund 1 hour ago
    She could probably have been stinking rich on this work alone, but instead she just put it up on Github. Kudos to Laurie.
    • larodi 57 minutes ago
      She probably is already stinking rich, or at least rich enough. Beyond a certain point, though, research and knowledge seem more interesting than riches, particularly if you consider yourself a researcher. Otherwise, perhaps, she would be doing the same in business and be Ellona or something. Thank God she does not; on the contrary, she is an inspiration to so many people, young and adult. Kudos!
  • mzajc 5 hours ago
  • rkagerer 3 hours ago
    Halfway through this great video and I have two questions:

    1) Can we take this library and turn it into a generic driver or something that applies the technique to all software (kernel and userspace) running on the system? i.e. If I want to halve my effective memory in order to completely eliminate the tail latency problem, without having to rewrite legacy software to implement this invention.

    2) What model miniature smoke machine is that? I instruct volunteer firefighters and occasionally do scale model demos to teach ventilation concepts. Some research years back led me to the "Tiny FX" fogger which works great, but it's expensive and this thing looks even more convenient.

    • lauriewired 2 hours ago
      1. not that I can think of, due to the core split. It really has to be independent cores racing independent loads. anything clever you could do with kernel modules, page-table-land, or dynamically reacting via PMU counters would likely cost microseconds...far larger than the 10s-100s of nanoseconds you gain.

      what I wished I had during this project is a hypothetical hedged_load ISA instruction. Issue two requests to two memory controllers and drop the loser. That would let the strategy work on a single thread! Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation. But, you’d have to convince Intel/AMD/someone else :)

      2. It’s called a “smokeninja”. Fairly popular in product photography circles, it’s quite fun!

      • rkagerer 1 hour ago
        > Or, even better, integrating the behavior into the memory controller itself, which would be transparent to all software without recompilation.

        Yeah it would be neat to just flip a BIOS switch and put your memory into "hedge" mode. Maybe one day we'll have an open source hardware stack where tinkerers can directly fiddle with ideas like this. In the meantime, thanks for your extensive work proving out the concept and sharing it with the world!

      • solstice 1 hour ago
        Is there a reason you can think of why AMD, Intel etc. would not want to do this?

        Really enjoyed the video and feel that I (not being in the IT industry) better understand CPUs and RAM now.

    • hawk_ 2 hours ago
      > halve my effective memory in order to completely eliminate the tail latency problem,

      Wouldn't you have a tail latency problem on the write side though if you just blindly apply it everywhere? As in, unless all the replicas are done writing you can't proceed.
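
A small sketch of the asymmetry raised here, assuming a hedged read completes when the fastest replica answers while a replicated write completes only when the slowest replica has been updated. The latency values are illustrative and the function names are hypothetical.

```python
def hedged_read_latency_ns(replica_latencies_ns: list[float]) -> float:
    """A hedged read is done as soon as the fastest replica responds."""
    return min(replica_latencies_ns)


def replicated_write_latency_ns(replica_latencies_ns: list[float]) -> float:
    """A write must reach every replica, so the slowest one sets the pace."""
    return max(replica_latencies_ns)


if __name__ == "__main__":
    # Illustrative case: one channel is mid-refresh, the other is not.
    latencies = [70.0, 330.0]
    print("read :", hedged_read_latency_ns(latencies))
    print("write:", replicated_write_latency_ns(latencies))
```

With more replicas the read side gets better and the write side gets worse, since the chance that at least one channel is refreshing grows with the replica count.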

    • imp0cat 3 hours ago
      Brio 33884. It has a tiny ultrasonic humidifier in there.
  • josalhor 1 hour ago
    I haven't had time to see the whole thing yet, but I'm quite surprised this yielded good results. If this works I would have expected CPU implementations to do some optimization around this by default given the memory latency bottleneck of the last 1.5 decades. What am I missing here?
    • formerly_proven 1 hour ago
      Turning on mirroring does this for the low, low price of doubling your RAM cost.
  • boznz 4 hours ago
    Should say DRAM, SRAM does not have this.
    • guenthert 18 minutes ago
      Indeed. And only for certain DRAM refresh strategies. I mean, it's at least conceivable that a memory management system responsible for the refresh notices that a given memory location is requested by the cache and then fills the cache during the refresh (which afaiu reads the memory) or -- simpler to implement perhaps -- delays the refresh by a μs allowing the cache-fill to race ahead.

      (seems that in the earlier submission, https://news.ycombinator.com/item?id=47680023, jeffbee hinted that IBM zEnterprise is doing something to that effect)

      That said, I'm not convinced that this is a big issue in practice. If you really care about performance, you have to avoid cache misses.

  • dinkumthinkum 3 hours ago
    This is an unreasonably good video. Hopefully, it inspires others to see we can still think hard and critically about technical things.
    • deathanatos 1 hour ago
      Yeah, wow, the comments weren't kidding. This'll probably be the best video I watch all month, at least, if not more. I would have said what she was trying to do was "impossible" (had I not seen the title and figured … well … she posted the video) and right about when I was thinking that she got me with:

      > Hold on a second. That's a really bad excuse. And technology never got anywhere by saying I accept this and it is what it is.

  • rcbdev 3 hours ago
    Am I the only one who feels the comments here don't sound organic at all?
    • tredre3 2 hours ago
      No, I felt the same way; they're exactly like the usual LLM bot comment where an LLM recaps the OP and ends with a platitude or witty encouragement.

      But all the accounts are old/legit so I think that you and me have just become paranoid...

    • guenthert 5 minutes ago
      No, something is funny here. In the previous submission (https://news.ycombinator.com/item?id=47680023) the only (competently) criticizing comment (by jeffbee) was downvoted into oblivion/flagged.
    • isoprophlex 2 hours ago
      You're absolutely right
    • silisili 2 hours ago
      You're absolutely right to call this out. No humans, no emotion, no real comments - just LLM slop.

      In all seriousness, agreed. The top comment at time of this writing seems like a poor summarizing LLM treating everything as the best thing since sliced bread. The end result is interesting, but neither this nor Google invented the technique of trying multiple things at once as the comment implies.

    • Alifatisk 2 hours ago
      I don’t see anything unusual
  • rationalist 5 hours ago
    [dead]
  • dragonsenseiguy 2 days ago
    [flagged]