Investigating Split Locks on x86-64

(chipsandcheese.com)

50 points | by ingve 3 days ago

3 comments

  • sidkshatriya 20 minutes ago
    This article seems relevant to me for the following scenario:

    - You have faulty software (e.g. games) that happen to have split locks

    AND

    - You have DISABLED split lock detection and "mitigation", which would otherwise have hugely penalised the thread in question (making the split lock painfully evident to that program, so it is effectively forced to be fixed).

    AND

    - You want to see which CPU does best in this scenario

    In other words, you just assume the CPU will take the bus lock penalty and continue WITHOUT the culprit thread being actively throttled by the OS.

    In the normal case, IIUC Linux should helpfully throttle the thread so the rest of the system is not affected by the bus lock. In this benchmark the assumption is that the thread will NOT be throttled by Linux, via the appropriate setting (the split_lock_detect kernel parameter).

    So to be honest I don't see the merit of this study. This study is essentially how fast is your interconnect so it can survive bad software that is allowed to run untrammelled.

    On aarch64 the thread would simply be killed. It's possible to do the same on modern AMD / Intel also OR simply throttle the thread so that it does not cause problems via bus locks that affect other threads -- none of these are done in this benchmark.

    • VorpalWay 2 minutes ago
      > So to be honest I don't see the merit of this study. This study is essentially how fast is your interconnect so it can survive bad software that is allowed to run untrammelled.

      It seems like a worthwhile study if you want to know what CPU to buy to play specific old games that use bus locks. Games that will never be fixed.

  • anematode 4 hours ago
    Cool investigation. This part perplexes me, though:

    > Games have apparently been using split locks for quite a while, and have not created issues even on AMD’s Zen 2 and Zen 5.

    For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....

    • toast0 4 hours ago
      > For the life of me I don't understand why you'd ever want to do an atomic operation that's not naturally aligned, let alone one split across cache lines....

      I assume they force-packed their structure and it's poorly aligned. But x86 doesn't fault on unaligned access, and Windows doesn't detect and punish split locks, so while you'd probably get better performance with proper alignment, it might not be a meaningful improvement on the majority of the machines running the program.

      • anematode 4 hours ago
        Ah, that's a great hypothesis. I wonder, then, how it works with x86 emulation on ARM. IIRC, atomic ops on ARM fault if the address isn't naturally aligned... but I guess the runtime could intercept that and handle it slowly.
        • omcnoe 2 hours ago
          ARM macs apparently have some kind of specific handling in place for this when a process is running with x86_64 compatibility, but it’s not publicly documented anywhere that I can see.
        • BobbyTables2 4 hours ago
          An emulated x86 atomic instruction wouldn’t need to use atomic instructions on ARM.
          • dooglius 3 hours ago
            Why not?
            • MBCook 2 hours ago
              They don’t have to match.

              As an example, consider a divide instruction. A machine without an FPU can emulate a machine that has one. It will legitimately have to run hundreds or thousands of instructions to emulate a single divide instruction, so it will certainly take longer.

              That's OK; it just means the emulation is slower doing that than something like add, which the host has a native instruction for. In 'emulator time' you still only ran one instruction. That world is still consistent.

              • anematode 2 hours ago
                ? That's not how Windows on ARM emulation works. It uses dynamic JIT translation from x86 to ARM. When the compiler sees, e.g., lock add [mem], reg, presumably it'll emit an ldadd, but that will have different semantics if the operand is misaligned.
  • strstr 1 hour ago
    Split locks are weird. It's never been obvious to me why you'd want to do them unless you are on a small-core-count system. When split lock detection rolled out for Linux, it massacred perf for some games (which were probably min-maxing single-core perf and didn't care about noisy-neighbor effects).

    Frankly, I’m surprised split lock detection is enabled anywhere outside of multi-tenant clouds.