I have had some incredible medical advice from ChatGPT. It has saved me from small mystery issues, like a rash on my face. Small enough issues that I probably wouldn't have bothered to go in to see a doctor. BUT it also failed to diagnose me with a medical issue that ended up with a trip to the ER and emergency surgery.
A few weeks before the ER, I was having stomach pain. I went to the doctor with theories from ChatGPT in hand; they checked me for those things and then didn't check me for what ended up being a pretty obvious issue. What's interesting is that I mentioned to the doctor that I used ChatGPT, and the doctor even seemed to value that opinion and did not consider other options (and what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it). I do feel I actually biased the first doctor's opinion with my "research."
> I do feel I actually biased the first doctor's opinion with my "research."
It may feel easy to say doctors should just consider all the options. But telling them an option is worse than just biasing their thinking; they are going to interpret that as information about your symptoms.
If you feel pain in your abdomen but are only talking about your appendix, they are rightfully going to think the pain is in the region of your appendix. They are not going to treat you like you have kidney pain. How could they? If they have to treat all of your descriptions as all the things that you could be relating them to, then that information is practically useless.
> I do feel I actually biased the first doctor's opinion with my "research."
This has been a big problem in medicine since the early days of WebMD: Each appointment has a limited time due to the limited supply of doctors and high demand for appointments.
When someone arrives with their own research, the doctor has to make a choice: Do they work with what the patient brought and try to confirm or rule it out, or do they try to walk back their research and start from the beginning?
When doctors appear to disregard the research patients arrive with, many patients get very angry. It leads to negative reviews or even formal complaints being filed (usually with encouragement from some Facebook group or TikTok community they were in). There might even be bigger problems if the patient turns out to be correct and the doctor did not embrace the research, which can prompt lawsuits.
So many doctors will err on the side of focusing on patient-provided theories first. Given the finite time available to see each patient (with waiting lists already extending months out in some places) this can crowd out time for getting a big picture discussion through the doctor's own diagnostic process.
When I visit a doctor I try to ground myself to starting with symptoms first and try to avoid biasing toward my thoughts about what it might be. Only if the conversation is going nowhere do I bring out my research, and then only as questions rather than suggestions. This seems to be more helpful than what I did when I was younger, which is research everything for hours and then show up with an idea that I wanted them to confirm or disprove.
A doctor is typically scheduled at 6 patients/hour. In that time they also have to chart, walk between rooms, make up time for the other patients that inevitably went over time, et cetera. The doctor you're seeing probably has a goal of only talking to you for 3 minutes.
My aunt died from this (my opinion). She spent two years confusing her diagnosis and treatment, and borderline harassing her doctors, by thinking her own research was on point and interpreting all her symptoms through that lens. In the end it wasn't borrelia, parasites, 5G, or any of the other fancies, but just lung cancer that was only diagnosed when it was very well developed.
> what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it
I'm not so sure. Doctors are trained to check for the most common things that explain the symptoms. "When you hear hoofbeats, think horses not zebras" is a saying that is often heard in medicine.
ChatGPT was trained on the same medical textbooks and research papers that doctors are.
Personally, I think the value in ChatGPT in health is not that it's right or wrong but that it encourages you to take an active role in your health and more importantly to try things. I've gone through similar issues with ChatGPT where it's convinced me that if A is true, therefore so must B though that may not be the case.
In the future, I think I'll likely review things with ChatGPT and have an opinion and treat the doctor like a ChatGPT session as well--this is opposed to leading the doctor to what I believe I should be doing. I was dismissive about the doctor's advice because it seemed so obvious but more and more, I feel that most of our issues are caused by habitual, daily mistakes--little things that take hold seasonally or over periods of stress that appear like chronic health issues. At least for me.
We have the same kind of issue as software engineers. Users come to us with solutions to their problems and want us to implement the solution. At that point the lazy path would be to just do that.
If you have bad management, software engineers might even be punished for questioning the customers.
What you want instead is that the users just describe their problem, as unbiased as possible and with enough detail and then let the expert come up with an appropriate solution that solves the problem.
I try to do that as well when going to the doctor.
The real story here is that your doctor actually listened to you. I appreciate what a lot of doctors do, but the majority of them are fucking irritating and don’t even listen to your issues. I’m glad we have AI and are less reliant on them.
I mean - obviously if they're not listening their chance of the latter is pretty low.
Doctors hate to hear this, but if you're so poor in communication and social skills that the patient can't/won't follow any care you've given, your value is lost.
This is ultimately the same difference between a search engine and a professional. 10 years before this, Googling the symptoms was a thing.
I have a family member who had a "rare but obvious" one but it took 5 doctors to get to the diagnosis. What we really need to see are attempts at blinded studies and real statistical rigor. It's funny to paint a tunnel on a canvas and get a Tesla to drive into it, but there's a reason studies (and the more blind the better) are the standard.
You should've let the doctor do his job. If he reached a different conclusion, then you can tell him what you researched, and he will make a decision having already done his own research, without you biasing him.
Which is exactly why the AI, at least the ones of today, should never be used beyond the level of (trusted or not) advisor. Yet not only many CxOs and boards, but even certain governments which shall not be named, are stubbornly trying, for cost or whatever other reasons, to throw entire populations (employees or nations) under the AI bus. And I sincerely don't believe anything short of an uprising will be able to stop them. Change my mind.
The sad truth is that while we all appreciate hard work and a good job, that isn't what is needed to move forward in the world of business. Creaky leaky products held together under the hood by scotch tape and string are fine. You don't make more money having a better product. A more performant tool. Better benchmarks. End users, aside from when you're writing tools for other engineers, don't care. They really don't. Word 95 probably opens faster than Word today.
Management has realized this. Hey I can outsource to bangalore/hyderabad/east europe/ai, get something that barely works, and just market the crap out of it. Look at the sort of companies, products, and services that dominate markets today. These aren't leaders in quality or engineering. They are leaders in marketing. Marketing is what sells. Marketing can sell billions of steaming turds. Nike shoes are pieces of shit but it's marketing that makes the brand and provides all value in the stock. The world doesn't value quality. It values noise and pretty feathers.
I agree. AI right now is at a level of "knowledgeable friend", not of "professional with years of real world experience". You'd listen to what your friend has to say, but taking pills after one of their suggestions? Dumb idea. It's great to brainstorm things, but just like your knowledgeable friend that likes reading Wikipedia pages a bit too much, you need to really check it's not jumping to conclusions too quickly.
Not the original commenter, but you may have noticed a wee kerfuffle between a large nation-state's "Secretary of War" and a frontier model provider over whether the model's licensing would permit autonomous lethal weapon systems operated by said - and I cannot emphasize the middle word enough - large _language_ model.
I'd greatly prefer a blind study comparing doctors to AI, rather than a study of doctors feeding AI scenarios and seeing if it matches their predetermined outcome.
Edit: People seem confused here. The study was feeding the AI structured clinical scenarios and seeing its results. The study was not a live analysis of AI being used in the field to treat patients.
I don't understand this reasoning. Randomizing people to AI vs standard of care is expensive and risky. Checking whether the AI can pass hypothetical scenarios seems like a perfectly reasonable approach to researching the safety of these models before running a clinical trial.
The issue is that those hypothetical scenarios do not have to look like how patients actually interact with the tool.
Real life use is full of ill-posed questions, open-ended statements, inaccurate assessments of symptoms, and conclusory remarks sprinkled in between. Real use of chatbots for health by non-clinicians looks very different from scenario-based evaluation.
You would pass those hypothetical scenarios to doctors too, and then the analyses of results would be done by doctors who don't know if it's an AI or doctor result.
> Three physicians independently assigned gold-standard triage levels based on cited clinical guidelines and clinical expertise, with high inter-rater agreement
We have standards of care for a reason. They are the most basic requirements of testing. Ignoring them is not just being a bad doctor, it's unethical treatment. It's the absolute bare minimum of a medical system.
That type of experimental set-up is forbidden due to ethical concerns. It goes against medical ethics to give patients treatment that you think might be worse.
You could absolutely randomize care between a doctor and an AI under an IRB. I’d be stunned if there aren’t a dozen studies doing something like this already.
You have to justify it, but most places have sections in the document where you request review to justify it. It’s not any different from giving one patient heart medicine that you think works and another patient a sugar pill.
Huh? Do you have any actual examples of such studies? I don't think you understand how IRB actually works.
In actual heart medicine studies the control arm is typically treated with the current standard of care, not a placebo. So it seems pretty clear that you don't have any actual knowledge or experience in this area.
I think the best would be an interface, where the patient isn't told if the doctor on the other end is human or AI. Tell them that they are going to do multiple remote exams with different care providers for the same illness in exchange for free treatment, and payment for the study.
If you're worried about not catching a legit emergency, as in something that can't wait a day or two for them to complete the different sessions, you could have a doctor monitor the interactions with the ability to raise a flag and step in to send them to the ER.
I don't think that would tell us anything useful. The data quality in most patient charts is shockingly bad. I've seen a lot of them while working on clinical systems interoperability. Garbage in / garbage out. When human physicians make a diagnosis they typically rely on a lot of inputs that never appear in the patient chart.
And in most cases the diagnosis is the easy part. I mean we see occasional horror stories about misdiagnosis but those are rare. The harder and more important part is coming up with an effective treatment plan which the patient will actually follow, and then monitoring progress while making adjustments as needed. So a focus on the diagnosis portion of clinical decision support seems fundamentally misguided.
I think the worse situation is the bad AI summaries from search on health issues.
We had a potential pet poisoning, so was naturally searching for resources. Google had a summary with a "dose of concern" that was an order of magnitude off. Someone could have read that and thought all was fine and had a dead cat.
(BTW cat is fine, turned out to be a false alarm, but public service announcement: aspirin is toxic to cats, and Pepto-Bismol contains a closely related salicylate. Don't leave demented plastic-chewing cats around those bottles, in case you too have a lovely but demented cat.)
I have literally never seen a correct Google summary. Maybe y'all are searching for different things than I am, but at this point I've started taking the viewpoint that if I don't know why the AI summary is wrong, then I also don't know enough about the topic to determine whether the summary is useful.
There is a concept of “the burden of knowledge”, in that doctors know the worst thing that could happen, so they recommend the most cautious approach. My son had stomach pain one time when he was young. We took him to urgent care because it was a stomach ache. The doctor there said we needed to go to the ER because it could be appendicitis. So we trucked to the ER. Close to $2000 later he was diagnosed with idiopathic stomach pain and told to wait it out at home.
So when I read “they then compared the platform’s recommendations with the doctors’ assessments” and see a mismatch, I wonder if it’s because human doctors are overly cautious or that the AI was wrong.
But that all pales in comparison to what could be the actual issue. I can’t read the original study, but if it was done in the USA, it’s understandable why people are turning to AI for health advice. Healthcare is painfully expensive here. Even a simple trip to the ER (e.g. a $2000 stomach ache) is beyond a lot of people’s ability to spend. That’s just a reality.
With that in mind, the real question is “should I do nothing about my symptoms because I can’t afford healthcare, or should I at least ask AI knowing it could be wrong?”
Is this unsurprising? It's a fancy markov chain. It's like using a slot machine to diagnose medical conditions. I guess it's a slot machine with really good marketing.
Even though these tools are showing time and time again that they have serious reliability issues, somehow people still think it is a good idea to use them for critical decisions.
Still regularly get wrong information from google’s search AI.
Really starting to wonder if common sense is ever going to come back with new tech, but I fear it is going to require something truly catastrophic to happen.
I’ve got a popcorn reserve at hand to watch the show when the massive security breaches happen and people start freaking out. And/or a lawsuit gets discovery of a company’s LLM history and it’s every bit as awful for them as we all know it will be and the rest of corporate America pumps the brakes.
These systems are borderline useless if you don’t give them dangerous levels of access to data and generate tons of juicy chat history with them. What’s coming is very predictable.
It's a strange paradigm shift, where the tool is right and useful more often than not, but also makes expensive mistakes that would have been spotted easily by an expert.
Then Google shouldn't be using something so unreliable for anything important. Arguing that random users should know the difference between cheap and frontier models is also not compelling. It's all the same "AI" to most people.
You are mistaken. ChatGPT Health [1] is a model specifically designed for health applications and was co-developed with a benchmark suite, HealthBench [2], for testing against health conditions. This study suggests that the people working on HealthBench have some concerning external validity problems.
It's really the "common sense" i.e. believing things without thinking because they "sound right" or because it's what your parents told you a lot growing up or because you watched an ad saying it a hundred times that's the issue. People don't want "the truth" or uncomfortable realities; they want comfortable, easily digestible bullshit. Smooth talkers filled the role before and LLMs are filling that role now.
In the general case it's usually not possible to accurately review an individual physician's performance. The software developers here on HN like to think in simplistic binary terms but in the real world of clinical care there is usually no reliable source of truth to evaluate against. Occasionally we see egregious cases of malpractice or failure to follow established clinical practice guidelines but below that there's a huge gray area.
If you look at online reviews, doctors are mostly rated based on being "nice" but that has little bearing on patient outcomes.
Amazing how you can just deflect any criticism of LLMs here by going “but humans suck too!” And the misanthropic HN userbase eats it up every time.
We live during the healthiest period in human history due to the fact that doctors are highly reliable and well-trained. You simply would not be able to replace a real doctor with an LLM and get desirable results.
Even in medicine, often the difference between drug A and drug B is the difference between the two in statistical terms. If drugs were held to the standard "works 100% of the time", no drug would ever be cleared for use. Feelings about AI and this administration are influencing this conversation far too much.
It's like people want to remove the physician or current care from the discussion. It's weird because care is already too expensive and too error prone for the cost.
> Amazing how you can just deflect any criticism of LLMs here by going “but humans suck too!” And the misanthropic HN userbase eats it up every time.
I think it's rather people trying to keep grounded and suggest that it's not just the hallucination machine that's bad, but also that many doctors in real life also suck - in part because of the domain being complex, but also due to a plethora of human reasons, such as not listening to your patients properly or disregarding their experiences and being dismissive (seems to happen to women more for some reason), or sometimes just being overworked.
> You simply would not be able to replace a real doctor with an LLM and get desirable results.
I don't think people should be replaced with LLMs, but we should benchmark the relative performance of various approaches:
A) the performance of doctors alone, no LLMs
B) the performance of LLMs alone, no human in the loop
C) the performance of doctors, using LLMs
Problem is that historical cases where humans resolved the issue and not the ones where the patient died (or suffered in general as a consequence of the wrong calls being made) would be pre-selecting for the stuff that humans might be good at, and sometimes wouldn't even properly be known due to some of those being straight up malpractice on the behalf of humans, whereas benchmarking just LLMs against stuff like that wouldn't give enough visibility in the failings of humans either.
Ideally you'd assess the weaknesses and utility of both at a meaningfully large scale, in search of blind spots and systemic issues, the problem being that benchmarking that in a vacuum without involving real cases might prove to be difficult and doing that on real cases would be unethical and a non-starter. And you'd also get issues with finding the truly shitty doctors to include in the sample set, sometimes even ones with good intentions but really overworked (other times because their results would suggest they shouldn't be practicing healthcare), otherwise you're skewing towards only the competent ones which is a misrepresentation of reality.
The fact that someone would say stuff like "Doctors are more like machines." implies failure before we even get to basic medical competency. People willingly misdirect themselves and risk getting horrible advice because humans will not give better advice and the sycophantic machine is just nicer.
> I think it's rather people trying to keep grounded and suggest that it's not just the hallucination machine that's bad, but also that many doctors in real life also suck
No, you see this line of argumentation on every post critical of LLMs' deficiencies. "Humans also produce bad code", "Humans also make mistakes" etc etc.
A friend of mine had such a bad experience with _multiple_ American doctors missing a major issue that nearly ended up killing her that she decided that, were she to have kids, she would go back to Russia rather than be pregnant in the American medical system.
Now, I don't agree that this is a good decision, but the point is, human doctors also often miss major problems.
Medical errors are one of the leading causes of death. It's a real catch-22. If you're under medical care for something serious, there's a real chance that someone will make a mistake that kills you.
You also don't sue for malpractice unless something goes catastrophically wrong. I've had doctors make ludicrously bad diagnoses, and while it sucked until I found a competent doctor and got proper treatment, it wasn't something I was going to go to court over.
A friend of mine had an accident. He was taken to the emergency room, but the doctors there thought his injuries were minor. My friend insisted that he was bleeding out internally. They finally checked for that, and it turns out he was minutes from dying.
AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
>AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
That doesn't necessarily follow from your story. The AI's specificity and sensitivity are important, which is why we need to study this stuff. An AI that produces too many false positives will send doctors off chasing zebras and they'll waste time, which will result in more deaths.
An AI that produces too many false negatives will make doctors more likely to miss things they otherwise would have checked, which will result in more deaths.
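To make those two terms concrete, here is a minimal sketch in Python. The counts are made up purely for illustration; they are not from the study or from any real triage tool, and the function names are just ones I picked.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of real emergencies that the tool actually flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of non-emergencies that the tool correctly left alone."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 50 real emergencies, 950 non-emergencies.
true_pos, false_neg = 45, 5    # 5 emergencies missed (false negatives)
true_neg, false_pos = 900, 50  # 50 needless alarms (false positives)

print(f"sensitivity = {sensitivity(true_pos, false_neg):.3f}")
print(f"specificity = {specificity(true_neg, false_pos):.3f}")
```

Low sensitivity is the "missed things they otherwise would have checked" case, and low specificity is the "chasing zebras" case. Both have to be measured before you can say whether adding the tool to the loop helps.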
The other real problem with using AI in a medical setting is that AI is very very good at producing plausible sounding wrong information. Even an expert isn't immune to this. So it's even more important that we study how likely they are to be wrong.
I really only use ChatGPT as a better search engine. But it's often wrong, which has actually ended up costing me money. I don't put a lot of trust in it. Certainly would not try to use it as a doctor.
I have found the LLMs to be wrong in random insidious ways, so trusting them with anything critical is terrifying.
Recent (as in last few days/weeks) incidents using different models/tools:
* Google AI search summary compare product A & B, call out a bunch of differences that are correct.. and then threw in features that didn't exist
* Work (midsize company with big AI team / homebuilt GPT wrappers) PDF parsing for company headquarters address, it hallucinated an address that didn't exist in the document
* Work, a team using frontier model from top 2 AI lab was using it to perform DevOps type tasks, requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER actioning whether they meant XYZ in DEV or ABC in PROD... a little too late.
They are very difficult tools to use correctly when the results are not automatically verifiable (like code can be with the right tests) and the answer might actually matter.
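For the DevOps case, the missing piece was a check that runs before the action rather than after. A rough sketch of that ordering is below. Everything in it (the `Action` type, `is_safe_to_run`, the PROD rule) is hypothetical and not taken from any real agent framework or from the team in question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    verb: str
    service: str
    environment: str


def is_safe_to_run(requested: Action, proposed: Action, confirmed: bool) -> bool:
    """Decide BEFORE executing whether the model's proposed action may run."""
    # The proposal must match what the human asked for, field by field.
    if proposed != requested:
        return False
    # Assumed policy: anything touching PROD needs explicit human confirmation.
    if proposed.environment.upper() == "PROD" and not confirmed:
        return False
    return True


requested = Action("restart", "XYZ", "DEV")
proposed = Action("restart", "ABC", "PROD")  # what the model said it would do

if is_safe_to_run(requested, proposed, confirmed=False):
    print("executing", proposed)
else:
    print("blocked: proposal does not match the request or lacks confirmation")
```

This only works because a restart request can be compared mechanically. It does nothing for the summary and PDF cases above, where there is no structured request to check the answer against.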
> Work, a team using frontier model from top 2 AI lab was using it to perform DevOps type tasks, requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER actioning whether they meant XYZ in DEV or ABC in PROD... a little too late.
... Wait, they gave the magic robot _access to modify their production environment_?!
Yes, at a fairly large company that should otherwise know better.
The problem with all these orgs hiring "AI experts" is the adverse selection of finding the people who "know AI" but can't get a job at an AI lab, startup, big tech, or literally any other job using AI that is better than "making excel do AI more good".
It's like Big Data / Cybersecurity / DevOps / Big Agile / Cloud Evangelist / Data Science grifter playbook all over again.
No, see both. LLMs are great for second opinions, as long as you give it the relevant info and don't try to steer it. Even though we all know we're supposed to get second opinions on medical things, we usually don't bother because it's too expensive in both time and money.
I think there is so much potential for AI in healthcare, but we absolutely HAVE to go through the existing ruleset of conducting years of research and trials and approvals before pushing anything out to patients. Move fast and break things is simply not an option in healthcare.
It depends; people actually get sicker and even die due to endless backlogs and a lack of doctors (in most developed countries). It's not as if everyone gets optimal care now. AI can at least expedite things, hopefully.
The reality is entering the healthcare system can result in thousands of dollars in bills. People make risk/cost judgement on going to the hospital or not.
Adding normal lab results made the suicide crisis banner disappear? That's a weird failure mode. You'd expect unrelated context to be ignored, not to override the risk signal.
As a software dev that uses it and observes the many errors it makes on a daily basis, I definitely treat the output with a much greater deal of skepticism than the average person I speak with. If you're used to it providing relatively accurate results based on surface-level Google-esque searches, then it makes sense why you'd place a higher weight on it being an "expert" vs a "tool that needs verification". I understand why people fall into this mindset.
I used ChatGPT to do a valve adjustment on an engine; a task I've never done before. I didn't just accept the torque values and procedure it told me though, because I know better from my experience with it as a dev. I cross-referenced it all with Youtube videos, forum posts, instruction manuals (where available) to make sure the job was A) doable for a non-mechanic like me and B) done correctly. Thanks to the Youtube video (which I cross-referenced with other sources), I discovered the valve clearance values were slightly off with the ChatGPT recommendation.
I think the average Joe would assume these values were correct and run with it.
If the AI gets attached to a health insurer (not the case here as far as I know), I would expect it to make decisions that are aligned with the company’s incentive to weed out unprofitable patients. AI is not a human who takes a Hippocratic oath; it can be more easily manipulated to perform unethical acts.
With an integrated insurer/provider, they just have to make primary care scarce so that it takes months to get an appointment, and then offer AI Doctor as an option. Not all patients have to use it for it to be cost effective.
Has anyone tried getting it to suggest moves in sudoku puzzles? In the middle of a hard game I will submit the screenshot to Copilot or Gemini and it hallucinates suggestions for the next move.
That would need to be tested. If doctors get lazy, complacent, or overworked (!), a "doctor with access to ChatGPT Health" may be functionally equivalent to "just ChatGPT Health" in some cases.
What do you mean "allow"? From a public policy perspective there's nothing prohibiting that today, as long as the human MD follows the HIPAA privacy rule.
I feel like these need to be run against case histories from already determined cases, not cases where the doctors set up the scenarios, knowing they’re going to be run against ChatGPT.
I’ve never heard of in my entire life a doctor failing to recognize a medical emergency. /s
One of the things that people need to come to grips with is that, like Wikipedia, people will use ChatGPT because it is there. And the alternative is to be rich and have a primary care doctor that you can reach out to at a moment's notice. Until that is different, people will use these web services. It’s the same thing as Wikipedia or WebMD.
People not suffering from mental illness will typically not blame 5G for their health concerns.
I'm not so sure. Doctors are trained to check for the most common things that explain the symptoms. "When you hear hoofbeats, think horses not zebras" is a saying that is often heard in medicine.
ChatGPT was trained on the same medical textbooks and research papers that doctors are.
Yeah hm I wonder what the difference could possibly be.
In the future, I think I'll likely review things with ChatGPT and have an opinion and treat the doctor like a ChatGPT session as well--this is opposed to leading the doctor to what I believe I should be doing. I was dismissive about the doctor's advice because it seemed so obvious but more and more, I feel that most of our issues are caused by habitual, daily mistakes--little things that take hold seasonally or over periods of stress that appear like chronic health issues. At least for me.
What you want instead is that the users just describe their problem, as unbiased as possible and with enough detail and then let the expert come up with an appropriate solution that solves the problem.
I try to do that as well when going to the doctor.
Doctors hate to hear this, but if you're so poor in communication and social skills that the patient can't or won't follow the care you've given, your value is lost.
I have a family member who had a "rare but obvious" one but it took 5 doctors to get to the diagnosis. What we really need to see are attempts at blinded studies and real statistical rigor. It's funny to paint a tunnel on a canvas and get a Tesla to drive into it, but there's a reason studies (and the more blind the better) are the standard.
Management has realized this. Hey, I can outsource to Bangalore/Hyderabad/Eastern Europe/AI, get something that barely works, and just market the crap out of it. Look at the sort of companies, products, and services that dominate markets today. These aren't leaders in quality or engineering. They are leaders in marketing. Marketing is what sells. Marketing can sell billions of steaming turds. Nike shoes are pieces of shit but it's marketing that makes the brand and provides all value in the stock. The world doesn't value quality. It values noise and pretty feathers.
Why can't you name them, and give us some context? Is this based on public info, or not?
Edit: People seem confused here. The study was feeding the AI structured clinical scenarios and seeing its results. The study was not a live analysis of AI being used in the field to treat patients.
Real-life use is full of ill-posed questions, open-ended statements, inaccurate assessments of symptoms, and conclusory remarks sprinkled in between. Real use of chatbots for health by non-clinicians looks very different from scenario-based evaluation.
> Three physicians independently assigned gold-standard triage levels based on cited clinical guidelines and clinical expertise, with high inter-rater agreement
You have to justify it, but most places have sections in the document where you request review to justify it. It’s not any different from giving one patient heart medicine that you think works and another patient a sugar pill.
In actual heart medicine studies the control arm is typically treated with the current standard of care, not a placebo. So it seems pretty clear that you don't have any actual knowledge or experience in this area.
If you're worried about not catching a legit emergency, as in something that can't wait a day or two for them to complete the different sessions, you could have a doctor monitor the interactions with the ability to raise a flag and step in to send them to the ER.
And in most cases the diagnosis is the easy part. I mean we see occasional horror stories about misdiagnosis but those are rare. The harder and more important part is coming up with an effective treatment plan which the patient will actually follow, and then monitoring progress while making adjustments as needed. So a focus on the diagnosis portion of clinical decision support seems fundamentally misguided.
Yea, like how rich the patient is or if they are on insurance etc. I wish I was kidding.
These "experts" have no problem touting anecdotes when it serves them.
We had a potential pet poisoning, so was naturally searching for resources. Google had a summary with a "dose of concern" that was an order of magnitude off. Someone could have read that and thought all was fine and had a dead cat.
(BTW cat is fine, turned out to be a false alarm, but public service announcement: cats are allergic to aspirin and Pepto-Bismol has aspirin. Don't leave demented plastic-chewing cats around those bottles, in case you too have a lovely but demented cat.)
So when I read “they then compared the platform’s recommendations with the doctors’ assessments” and see a mismatch, I wonder if it’s because human doctors are overly cautious or that the AI was wrong.
But that all pales in comparison to what could be the actual issue. I can’t read the original study, but if it was done in the USA, it’s understandable why people are turning to AI for health advice. Healthcare is painfully expensive here. Even a simple trip to the ER (e.g. a $2000 stomach ache) is beyond a lot of people’s ability to spend. That’s just a reality.
With that in mind, the real question is “should I do nothing about my symptoms because I can’t afford healthcare, or should I at least ask AI knowing it could be wrong?”
I still regularly get wrong information from Google’s search AI.
Really starting to wonder if common sense is ever going to come back with new tech, but I fear it is going to require something truly catastrophic to happen.
These systems are borderline useless if you don’t give them dangerous levels of access to data and generate tons of juicy chat history with them. What’s coming is very predictable.
The fact that the model most hyper-optimized for cheap+fast makes mistakes is not a particularly compelling argument.
[1] https://openai.com/index/introducing-chatgpt-health/
[2] https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca65...
I suspect many, many doctors also fail to regularly recognize medical emergencies.
If you look at online reviews, doctors are mostly rated based on being "nice" but that has little bearing on patient outcomes.
We live during the healthiest period in human history due to the fact that doctors are highly reliable and well-trained. You simply would not be able to replace a real doctor with an LLM and get desirable results.
It's like people want to remove the physician or current care from the discussion. It's weird because care is already too expensive and too error prone for the cost.
I think it's rather people trying to keep grounded and suggest that it's not just the hallucination machine that's bad, but also that many doctors in real life also suck - in part because of the domain being complex, but also due to a plethora of human reasons, such as not listening to your patients properly or disregarding their experiences and being dismissive (seems to happen to women more for some reason), or sometimes just being overworked.
> You simply would not be able to replace a real doctor with an LLM and get desirable results.
I don't think people should be replaced with LLMs, but we should benchmark the relative performance of various approaches:
Problem is that historical cases where humans resolved the issue and not the ones where the patient died (or suffered in general as a consequence of the wrong calls being made) would be pre-selecting for the stuff that humans might be good at, and sometimes wouldn't even properly be known due to some of those being straight up malpractice on the part of humans, whereas benchmarking just LLMs against stuff like that wouldn't give enough visibility into the failings of humans either.
Ideally you'd assess the weaknesses and utility of both at a meaningfully large scale, in search of blind spots and systemic issues, the problem being that benchmarking that in a vacuum without involving real cases might prove to be difficult and doing that on real cases would be unethical and a non-starter. And you'd also get issues with finding the truly shitty doctors to include in the sample set, sometimes even ones with good intentions but really overworked (other times because their results would suggest they shouldn't be practicing healthcare), otherwise you're skewing towards only the competent ones which is a misrepresentation of reality.
Reminds me of an article that got linked on HN a while back: https://restofworld.org/2025/ai-chatbot-china-sick/
The fact that someone would say stuff like "Doctors are more like machines." implies failure before we even get to basic medical competency. People willingly misdirect themselves and risk getting horrible advice because humans will not give better advice and the sycophantic machine is just nicer.
No, you see this line of argumentation on every post critical of LLMs' deficiencies. "Humans also produce bad code", "Humans also make mistakes" etc etc.
Now, I don't agree that this is a good decision, but the point is, human doctors also often miss major problems.
The numbers that you see quoted are almost certainly wildly exaggerated.
A friend of mine had an accident. He was taken to the emergency room, but the doctors there thought his injuries were minor. My friend insisted that he was bleeding out internally. They finally checked for that, and it turns out he was minutes from dying.
AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
That doesn't necessarily follow from your story. The AI's specificity and sensitivity are important, which is why we need to study this stuff. An AI that produces too many false positives will send doctors off chasing zebras and they'll waste time, which will result in more deaths.
An AI that produces too many false negatives will make doctors more likely to miss things they otherwise would have checked, which will result in more deaths.
The other real problem with using AI in a medical setting is that AI is very very good at producing plausible sounding wrong information. Even an expert isn't immune to this. So it's even more important that we study how likely they are to be wrong.
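To make the sensitivity/specificity point above concrete, here is a rough sketch of how the two numbers fall out of a triage tally. The counts are entirely made up for illustration (they are not from the study or from any real system), and the names `TriageCounts`, `sensitivity`, and `specificity` are just hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageCounts:
    """Tally of triage calls against the gold-standard label (hypothetical)."""
    true_pos: int   # real emergencies that were flagged
    false_neg: int  # real emergencies that were missed
    true_neg: int   # non-emergencies correctly left alone
    false_pos: int  # non-emergencies flagged anyway (the "zebra chases")


def sensitivity(c: TriageCounts) -> float:
    """Share of real emergencies that got flagged."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: TriageCounts) -> float:
    """Share of non-emergencies that were correctly not flagged."""
    return c.true_neg / (c.true_neg + c.false_pos)


# Made-up numbers: 50 real emergencies and 950 non-emergencies.
counts = TriageCounts(true_pos=45, false_neg=5, true_neg=900, false_pos=50)

print(f"sensitivity: {sensitivity(counts):.3f}")
print(f"specificity: {specificity(counts):.3f}")
```

With these invented counts, 5 emergencies are missed and 50 non-emergencies are flagged, so even a tool that looks good on both numbers still produces more false alarms than true catches when real emergencies are rare.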
Recent (as in last few days/weeks) incidents using different models/tools:
* Google AI search summary compared products A & B and called out a bunch of differences that were correct... and then threw in features that didn't exist
* Work (midsize company with big AI team / homebuilt GPT wrappers) PDF parsing for company headquarters address, it hallucinated an address that didn't exist in the document
* Work, a team using frontier model from top 2 AI lab was using it to perform DevOps type tasks, requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER actioning whether they meant XYZ in DEV or ABC in PROD... a little too late.
They are very difficult tools to use correctly when the results are not automatically verifiable (like code can be with the right tests) and the answer might actually matter.
... Wait, they gave the magic robot _access to modify their production environment_?!
Bloody hell, there's no helping some people.
The problem with all these orgs hiring "AI experts" is the adverse selection of finding the people who "know AI" but can't get a job at an AI lab, startup, big tech, or literally any other job using AI that is better than "making excel do AI more good".
It's like Big Data / Cybersecurity / DevOps / Big Agile / Cloud Evangelist / Data Science grifter playbook all over again.
If it could be an emergency, see a doctor.
I used ChatGPT to do a valve adjustment on an engine; a task I've never done before. I didn't just accept the torque values and procedure it told me though, because I know better from my experience with it as a dev. I cross-referenced it all with Youtube videos, forum posts, instruction manuals (where available) to make sure the job was A) doable for a non-mechanic like me and B) done correctly. Thanks to the Youtube video (which I cross-referenced with other sources), I discovered the valve clearance values were slightly off with the ChatGPT recommendation.
I think the average Joe would assume these values were correct and run with it.
https://www.liveinsurancenews.com/health-insurance-claims-de...
Most physicians I know use ChatGPT. Although of course it's usage guided by an expert, not by the patient, nor fully autonomous.
No, no, no, and no. Are we never going to learn? Sharing medical data with AI tools is going to come back and bite you.
Win win right?
One of the things that people need to come to grips with is that, like Wikipedia, people will use ChatGPT because it is there. And the alternative is to be rich and have a primary care doctor that you can reach out to at a moment's notice. Until that is different, people will use these web services. It’s the same thing as Wikipedia or WebMD.