How Cloudflare responded to the “Copy Fail” Linux vulnerability

(blog.cloudflare.com)

65 points | by mobeigi 6 hours ago

12 comments

sammy2255 4 hours ago
Any Cloudflare employees reading this, your network map has a few PoPs missing from it https://www.cloudflare.com/network/ notably, Perth (PER) Australia. Hobart (HBA) Australia. Wellington (WLG), New Zealand. Christchurch (CHC), New Zealand. Nausori (SUV), Fiji.
skinfaxi 5 hours ago
Would love to learn more about their internal behavioural detection program.
> One of the first things our security team did was confirm that our existing endpoint detection would catch this exploit. Our servers run behavioral detection that continuously monitors process execution patterns. It doesn't rely on knowing about specific vulnerabilities; it watches for anomalous behavior across the fleet.
- CGamesPlay 5 hours ago
  Would certainly be interesting to learn more about. A simple check: allowlist of known "processes that run as root". Any new process shows up, something happened.
  - jeffbee 5 hours ago
    Based on what? Proc title?
    - CGamesPlay 5 hours ago
      Proc title is very easily forged (without root even). Obviously a real privileged process could modify the kernel and do whatever it wants, but if I were trying to detect this I would start with /proc/$id/exe.
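A minimal sketch of the allowlist idea combined with the /proc/$id/exe check, for illustration only. The allowlist contents and the function name are hypothetical, and as the replies in this thread point out, the exe link is only an indicator: it can be sidestepped with LD_PRELOAD, ptrace, or code mapped into anonymous memory.

```python
import os

# Hypothetical allowlist; a real fleet would generate this from its own inventory.
ROOT_EXE_ALLOWLIST = {"/usr/lib/systemd/systemd", "/usr/sbin/sshd", "/usr/sbin/cron"}


def unexpected_root_processes(allowlist, proc_root="/proc", uid=0):
    """Return (pid, exe) pairs for processes owned by `uid` whose exe is not allowlisted."""
    findings = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        try:
            if os.stat(pid_dir).st_uid != uid:
                continue
            exe = os.readlink(os.path.join(pid_dir, "exe"))
        except OSError:
            # The process exited, has no exe link (kernel thread), or we lack permission.
            continue
        if exe not in allowlist:
            findings.append((int(entry), exe))
    return findings


if __name__ == "__main__":
    for pid, exe in unexpected_root_processes(ROOT_EXE_ALLOWLIST):
        print(f"unexpected root process: pid={pid} exe={exe}")
```

Reading every exe link generally requires running the check as root, so on an unprivileged account the result will be incomplete.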
      - Retr0id 4 hours ago
        /proc/pid/exe is also easily forged, without root. For example you can do LD_PRELOAD=evil.so /bin/foo on any dynamic executable, or spawn /bin/foo unmodified and inject code via ptrace or /proc/pid/mem.
        I have a fileless, execless copyfail exploit that works by injecting shellcode directly into systemd's pid 1. (I should probably publish it at some point...)
        jeffbee 4 hours ago
        Yeah the whole system is based on the ability of one task to apparently become another task, that's how Unix works. So the indicators in /proc are just that: indicative at best.
        There's no reason the task should even be assumed to be executing code in a file. A process can map code into anonymous memory and continue executing there without even branching. Again this is considered a feature of the system rather than a flaw.
      - jeffbee 5 hours ago
        Maybe, but there's a prctl to change that reference which a root process can use.
    - dboreham 4 hours ago
      They might just compute a hash over the binary, or the code space in memory.
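A rough sketch of the "hash over the binary" idea, assuming SHA-256 and a known-good digest recorded ahead of time (both are assumptions, not anything the article states). It covers only the on-disk file; hashing the code space in memory is a different and harder problem, and this check would not notice code injected into a running process.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_matches(path, expected_hex):
    """Compare a file's digest against a known-good value (hypothetical inventory entry)."""
    return sha256_of_file(path) == expected_hex.lower()
```

Passing a path such as /proc/1234/exe hashes the file the process was actually started from, even if that file has since been replaced or deleted on disk.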
    - parliament32 5 hours ago
      It's curious they're just "monitoring" rather than preventing.
      In a serious environment you'd run IPE with dm-verity/fs-verity to ensure binaries are whitelisted and integrity-checked at every execution.
      - staticassertion 4 hours ago
        lol no one does that (edit: or, rather, that is extremely uncommon, even in "serious" environments, for a ton of reasons).
        parliament32 3 hours ago
        Look at the FedRAMP requirements around integrity protection, then look at how massive the list of compliant products is. I promise, pretty much everyone in regulated environments is doing this. It's so prevalent that Azure is even pushing a turnkey solution for k8s https://learn.microsoft.com/en-us/azure/aks/use-azure-linux-...
        staticassertion 1 hour ago
        Nothing about fedramp requires that you enable any of the features you're talking about. Linking to a public preview of an Azure product that doesn't even run with enforcement on is not great supporting evidence.
        jeffbee 3 hours ago
        If you have much experience with fedramp, and it sounds like you do, perhaps you might agree that it is a huge list of things that superficially indicate doing something, without actually doing anything. As the documentation for IPE freely admits, it has no protective benefits because it is unaware of anonymous executable regions.
        parliament32 3 hours ago
        It sure has limitations, but "no protective benefits" is pretty wrong. In a real world example, if your containerized application has an RCE, you're preventing the attacker from executing binaries they tampered with or down/up-loaded. Combined with minimal distroless containers, it's a very effective attack surface reduction strategy, and works much better than the legacy scan-occasionally integrity-checking methods (rkhunter et al).
- staticassertion 4 hours ago
  Syscalls and kernel module loading can both be logged, I assume that's sufficient here.
  - skinfaxi 4 hours ago
    Yes, but I am interested in hearing about Cloudflare's implementation, how they scale it to their whole fleet, and what kinds of heuristics they are using to classify behavior as anomalous.
- mobeigi 4 hours ago
  I'd very much like to learn more about this too, deserves its own blog post.
srcreigh 4 hours ago
It’s fascinating that they already had a system that could identify the exploit at runtime. How can I learn more about that?
mkj 4 hours ago
If they're already running a custom Linux kernel build, why did they have AF_ALG enabled? Seems the perfect situation to limit features to only those actually being used.
- computerfriend 4 hours ago
  In the article they explain that some of their services use it.
  - mixdup 42 minutes ago
    And also, as part of this, they have learned the lesson the parent comment is getting at: they called out that they are going to review their deployments and make sure there are no unused modules being deployed.
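For anyone wanting to check their own hosts, here is a small probe for whether AF_ALG sockets can be created at all. It is a sketch under assumptions: the function name is made up, and it only tests socket creation, not whether any particular algorithm type is usable. Note that on a kernel where AF_ALG is built as a module, attempting to create the socket may itself cause the module to be loaded.

```python
import socket


def af_alg_available():
    """Return True if this host lets us create an AF_ALG socket."""
    family = getattr(socket, "AF_ALG", None)  # absent on non-Linux builds of Python
    if family is None:
        return False
    try:
        sock = socket.socket(family, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # The kernel lacks AF_ALG support, or policy blocks the socket family.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_ALG available:", af_alg_available())
```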
electra2012 3 hours ago
> Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line.
Hopefully a wake-up call to those who believe, as Canonical and Red Hat would want you to, that older distro LTS kernels are getting all the security fixes.
cluckindan 2 hours ago
Has anyone figured out whether this CVE was intentional?
PunchyHamster 4 hours ago
for us it was
* Get the list of modules from Puppet's facts and confirm the module isn't used anywhere (it wasn't)
* `install algif_aead /bin/false` in /etc/modprobe.d/disable-algif.conf
* Run a check using the exploit code to confirm it is no longer working
I imagine CF runs more stuff that could use it, but apparently it's not an often-used API.
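The first two steps above can be sanity-checked with something like the following sketch. The helper names are hypothetical, the parsing is deliberately simple (it does not treat `-` and `_` in module names as equivalent the way modprobe does), and it is no substitute for the third step of actually running the exploit check.

```python
def loaded_modules(proc_modules_text):
    """Parse the contents of /proc/modules into a set of module names."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


def is_install_blocked(modprobe_conf_text, module):
    """True if the config maps `install <module>` to /bin/false or /bin/true."""
    for line in modprobe_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 3 and fields[0] == "install" and fields[1] == module:
            if fields[2] in ("/bin/false", "/bin/true"):
                return True
    return False


if __name__ == "__main__":
    with open("/proc/modules") as f:
        print("algif_aead loaded:", "algif_aead" in loaded_modules(f.read()))
    # Path taken from the comment above; adjust to wherever your config lives.
    with open("/etc/modprobe.d/disable-algif.conf") as f:
        print("install blocked:", is_install_blocked(f.read(), "algif_aead"))
```

An `install` override only stops future loads through modprobe. If the module is already loaded it stays loaded until it is removed or the host reboots, which is why checking /proc/modules as well is useful.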
jmclnx 4 hours ago
> Linux kernel build based on the community's Long-Term Support (LTS)
CopyFail only highlights why companies want LTS. If there was a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.
The corporate mindset is usually "never upgrade unless there is new hardware needed or critical software failure". All CopyFail did was reinforce that mindset.
I wonder if CopyFail will cause enterprises to put pressure on the Linux Foundation to maintain an "ultra LTS" that is supported for 20 years?
- PunchyHamster 4 hours ago
  > CopyFail only highlights why companies want LTS. If there was a supported kernel built prior to 2017, most large companies would still be on that version, avoiding this issue altogether.
  Sadly, that's not really how it works for, say, Red Hat. They routinely backport features while keeping whatever "stable" number is on the kernel. We even had the displeasure of them backporting a bug... the same bug, to 2 different RHEL versions.
- tempest_ 3 hours ago
  The longer you wait the more painful the switch will eventually be.
  - em-bee 7 minutes ago
    for the kernel? hardly. only if the kernel breaks userspace. which it shouldn't.
dboreham 5 hours ago
The "Hunting for Exploitation" section is unclear to me: "The exploit leaves a distinctive trace in kernel logs when it runs." Hmm. Wouldn't a system with a compromised kernel also log exactly what the attacker wanted logged?
- cube00 4 hours ago
  I guess the hope is the kernel has been able to successfully transmit that log message to the immutable central logging infra before it gets compromised.
  Although, given the tendency for endpoint logging agents to buffer messages to reduce their network chattiness, I do wonder if a fast-acting exploit could dump that buffer before it manages to be transmitted.
  I don't think any of the agents are complex enough to immediately transmit permission elevation log messages over the regular background noise.
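To make the buffering concern concrete, here is a toy sketch of an agent that batches ordinary messages but ships immediately when a message looks urgent. Everything here is hypothetical: the class name, the `send` callback, and the `is_urgent` predicate are illustrative and not based on any real logging agent or on what Cloudflare runs.

```python
class BufferedShipper:
    """Batch log messages, but flush at once when a message is classified urgent."""

    def __init__(self, send, is_urgent, max_batch=100):
        self._send = send            # callable that transmits a list of messages
        self._is_urgent = is_urgent  # hypothetical predicate, e.g. matches privilege changes
        self._max_batch = max_batch
        self._buffer = []

    def log(self, message):
        self._buffer.append(message)
        # Urgent messages skip the wait, and take the pending batch with them.
        if self._is_urgent(message) or len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

The design choice is that an urgent message flushes everything queued before it, so the context leading up to the event leaves the host in the same transmission. It narrows the window an attacker has to tamper with the buffer but does not close it.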
- QuantumNoodle 2 hours ago
  Also, 48 hours prior to the disclosure is a very narrow window. I wonder if their logs don't go back further or if there was another reason to look back only two days.
- rithdmc 4 hours ago
  The attack itself creates the logs, which - reading between the lines - are shipped to a central log server. A compromised server might not send any new indicators to the logs, but existing logs moved off device would still be available.
  I'd like to know what those distinctive traces are, which is also missing :(
- PunchyHamster 4 hours ago
  Your exploit would have to get root and kill/exploit the logging daemon near instantly, else the log will already be sent to remote before you can change it locally
john_strinlai 5 hours ago
this is a technical dive into how cloudflare responded, not a confirmation that they responded
for whatever reason, unknown to me, hn automatically strips "how" from the start of titles. i cant remember ever seeing a title where this was an improvement.
- dang 1 hour ago
  Of course you can't, because the cases it improves don't get noticed, while the remainder stick out like sore thumbs.
- gamegoblin 4 hours ago
  I learned a few years ago that HN also editorializes by dropping "world's" from titles
  Before: Teens break record for world's longest kickball game
  After: Teens break record for longest kickball game
  - Velocifyer 4 hours ago
    I do actually agree with that change.
    - gamegoblin 3 hours ago
      It occasionally leads to kinda ambiguous headlines, e.g.
      "China opens world's longest undersea tunnel"
      vs
      "China opens longest undersea tunnel"
      It's a little unclear if it's the longest undersea tunnel in the world, or just in China
    - jmalicki 3 hours ago
      It doesn't give enough recognition to the true longest game of space kickball.
  - buredoranna 4 hours ago
    ... what a world.
- dpoloncsak 4 hours ago
  Interestingly, there's a current post on the front page with "How" at the start of the title.
  > https://news.ycombinator.com/item?id=48018715 "How do I inform Windows that I’m writing a binary file?"
  I wonder if it ending in a '?' has anything to do with it?
  edit: Upon review, at the time of posting it was actually on the 2nd page
  - john_strinlai 4 hours ago
    not sure about that specific case or if '?' has anything to do with it, but there is a short editing window where the submitter can re-add the "how" or whatever back in
  - GavinAnderegg 4 hours ago
    I’ve been hit by this when posting links. If you edit the post, you can re-add the stripped word and it will stay. “Why” is another that is often stripped.
- varun_ch 4 hours ago
  I'm yet to see a good example of the title stripping, at least for "how" and "how to" (although perhaps this is survivorship bias).
- trollbridge 5 hours ago
  Starting a title with “How” is standard clickbait.
  - gilrain 4 hours ago
    Starting a sentence with “How” is standard English, too.
  - Goronmon 5 hours ago
    If we are taking that attitude why not go all the way?
    Titles are standard clickbait.
    - miki123211 4 hours ago
      With LLMs, you could actually do anti-clickbait titles. Extract the article text with something like r.jina.ai, and ask an LLM to generate a ~80-character summary that explains the main point of the article for people too busy to read it.
      I do think this would genuinely be useful.
      - senko 4 hours ago
        You're absolutely right! (errm...oops....anyways...)
        The fact that LLMs usually generate anodyne summaries is actually a benefit here.
        I used my website-to-markdown tool[0] to get the text, piped the output to claude -p and got a pretty decent "Patching Copy Fail at scale: how bpf-lsm bought us time before the kernel reboot" result.
        [0] https://markshot.dev
      - john_strinlai 4 hours ago
        back in my day, people just used the thing that rattles around inside their skull for such tasks
        senko 4 hours ago
        To do that, you need to read the article first, which is the point of click-bait titles. The point of the defense is to avoid exposing your neurons to that stuff.
        john_strinlai 4 hours ago
        i would hope that people are reading articles first and submitting them to hn because they are interesting, rather than submitting articles to hn blindly.
        senko 4 hours ago
        I agree with you on that, but that just holds true (we hope) for the OP.
        HN already editorializes the title, to help everyone other than the OP (not all people agree over what's interesting to them). Now we're just arguing over the degree.
cube00 4 hours ago
> At the time of the "Copy Fail" disclosure, the majority of our infrastructure was running the 6.12 LTS version
That could be as low as 50.1%, I wish they'd provide an actual percentage.