News outlets are limiting the Internet Archive’s access to their journalism

(niemanlab.org)

165 points | by jaredwiener 4 hours ago

25 comments

hungryhobbit 38 minutes ago
There's an incredibly simple fix: block the archive for a week. No one is paying after a week, so you let the Archive access after that.
I don't see why every news outlet doesn't just do this.
remus 2 hours ago
That's a real shame. I am involved with some history-related projects and the number of websites that go offline is huge, and the Wayback Machine is incredibly helpful for unearthing these dead sites.
It is not hard to imagine a future in 50 years' time where a huge percentage of this content is lost forever, or at best incredibly hard to find.
[-]
- horacemorace 2 hours ago
  This future is here already; policy makers have it locked up. Any person who remembers what microfiche is understands the magnitude of this problem of not having a trustworthy public record. If we extended public policy from the library era, the Library of Congress itself would be the Internet Archive.
- AnthonyMouse 6 minutes ago
  In the walls of the cubicle there were three orifices. To the right of the speakwrite, a small pneumatic tube for written messages, to the left, a larger one for newspapers; and in the side wall, within easy reach of Winston's arm, a large oblong slit protected by a wire grating. This last was for the disposal of waste paper. Similar slits existed in thousands or tens of thousands throughout the building, not only in every room but at short intervals in every corridor. For some reason they were nicknamed memory holes. When one knew that any document was due for destruction, or even when one saw a scrap of waste paper lying about, it was an automatic action to lift the flap of the nearest memory hole and drop it in, whereupon it would be whirled away on a current of warm air to the enormous furnaces which were hidden somewhere in the recesses of the building.
wormius 2 hours ago
Ugh - our local paper used to have a wonderful archive that got limited and locked down after the pandemic. IDK if they got bought out, but it's a real shame. I think some of the problem is things that used to be public information (birthdates, families, names) in hospital admissions (for example, I found old newspaper entries of my friends' parents and my own for being "in the hospital").
I'm sure that plays a role, but still... This obviously is about cost and money making, not security as a whole (ime)
[-]
- ghaff 17 minutes ago
  A lot of those aggregated records very quickly become a very precise public record. I'm not saying if it's good or bad but a lot of people on this site probably object to having their lives be essentially an open book which is very close to being the case as soon as a relatively small number of facts are opened up.
  It's more the case when the addresses and birthdates of public figures, which are often a matter of public record, enter the picture but it's easier to find out information about a lot of people with a bit of data than most people realize if anyone really cares to investigate.
svachalek 3 hours ago
There really should be a micropayments setup on the internet that's not advertising based. Let these models pay a nickel to read the article, covered by the multi-trillion-dollar AI blank check.
[-]
- poisonfountain 2 hours ago
  Cloudflare is trying to push for that, but every time it's mentioned people complain (because they hate Cloudflare for making them wait 2s for a captcha) and nobody proposes an alternative solution. I don't think this is going to happen, unfortunately, and the internet will get siloed into oblivion.
  [-]
  - Forgeties79 1 hour ago
    My issue with Cloudflare is that if I run a VPN it randomly just locks up. It's a coin toss whether I can get through if I am routing through another country.
  - overfeed 52 minutes ago
    People don't hate micropayments because it's Cloudflare promoting them, it's because it truly is a shit idea, for many reasons.
    People would equally reject Netflix, if Netflix floated the idea of replacing the subscription model with pay-per-view micropayments.
    > ...nobody proposes an alternative solution
    Such is the human condition - some problems simply have no satisfactory solutions.
    [-]
    - HeWhoLurksLate 49 minutes ago
      I would gladly pay $4 to not have to watch twenty-seven ads before the main fight in an MMA match
      [-]
      - overfeed 37 minutes ago
        If perfect information and perfect competition were attainable, the free hand of the market would deliver such a service to you. Since it's not, you'll have to bear the dudewipes ads.
- andrepd 3 hours ago
  There's a river of cash flowing to the pockets of the wealthy and to the megalomaniac projects of hyperscalers, but not a few pennies can drip into the pockets of people providing such an important public service as journalists.
  [-]
  - b65e8bee43c2ed0 1 hour ago
    Good.
sandeepkd 2 hours ago
I think it's bound to happen, and in some ways it's a good thing to happen too. The current state of AI affairs is a lot about outrightly selling someone else's intellectual property. The short-term incentives are eroding the trust and goodwill among the natural knowledge actors.
The next natural thing to happen would be privatization or consolidation of the internet itself. It's already happening in the form of grabbing and consolidating IPv4 addresses.
[-]
- drtz 2 hours ago
  > The current state of AI affairs is a lot about outrightly selling someone else's intellectual property.
  Blocking archiving in a flailing attempt to keep AIs away is extremely shortsighted. Archiving is important for keeping historical context, especially when it comes to news and journalism.
  [-]
  - sandeepkd 2 hours ago
    There is a natural flow of information that allows the information producers to make money for their work. How do you expect the information producers to be able to continue creating information when they are not getting paid anymore?
    One possible solution that I can think of for the long-term good could be to allow archival only, with no retrieval of the latest information, at least for 6 months or a year. This should theoretically satisfy most goals.
- ronsor 2 hours ago
  [dead]
storus 34 minutes ago
Not trying to be paranoid, but losing recorded history, raw as it was originally reported, could lead to quick AI-assisted rewrites in the archives of news outlets to fit whatever narrative "du jour" is in fashion or that the powerful of the time want. We are already seeing it in new editions of some old books that suddenly omit some currently controversial topics. "History is written by the victors" could change to "history is rewritten by the (current) victors, as they see fit."
evanjrowley 2 hours ago
They should allow access after the news becomes old. That's what the archive is intended for.
[-]
- GCA10 47 minutes ago
  JSTOR does exactly this with scholarly journals, and it works out pretty well. Recent issues are accessible only to paying customers.
  Back issues (usually at least a few years old) are available via JSTOR for free in small amounts and through subscriptions for bulk users. I'm sure there's some reason to fight about the details, but from a distance it looks like a pretty good compromise.
- smith7018 1 hour ago
  Agreed. IA should take snapshots of the articles over time and then make them publicly available X months/years later. There's no reason to immediately publicly mirror the articles beyond people trying to get around paywalls.
endofreach 18 minutes ago
why not just agree on a release date? while i enjoy circumventing moneyfences, i understand the wallmakers do not. i think this would be an easy deal, if someone just laid it on the table.
acidhousemcnab 3 hours ago
Perhaps I imagined this, however some months ago on X someone pointed out a historical article on dailymail.co.uk related to Prince Philip and Epstein had been scrubbed, which likely would be intelligence or through D-Notices, but where instead of showing a 404 page would redirect to an article that was similar but benign. I checked the URL on the Wayback Machine and it turned up zero results, not even the redirected article; however, the user on X had screen grabbed the original, which everyone was reading and commenting on. As of 21st May I can't find this discussion on X and Grok denies it ever existed. This is a "maximally truth-finding" AI, so I must be mistaken. Perhaps the Internet Archive cannot be trusted, so this is why 340 local news outlets need to limit access.
[-]
- grosswait 2 hours ago
  This sounds like the beginning of a story where the next odd thing is your family and friends don’t know who you are, and no one has ever heard of you.
b00ty4breakfast 2 hours ago
Of course they are, because they are not primarily concerned with the reporting of noteworthy events. They are most worried about profit with the secondary goal of reporting but only insofar as it serves the first goal. This is a wider trend across many industries.
Obviously, a business needs to have an income, but it's becoming more common for businesses to function first and foremost as revenue generators, with the thing that enables that seen only as a means to an end. When the quality of the product/service and its function as a revenue generator diverge, the product/service will always take 2nd chair.
Maybe we could argue that the primary product is the revenue, especially when there are investors involved who are looking for big returns.
[-]
- psb5 1 hour ago
  More than even that, there is more news being generated than there are 3 inch chimp brains available to digest it all (even with AI busy summarizing everything) or act on it.
  There is no media theory of information of what happens when info explodes beyond capacity of the system to consume it. (UN report on Attention Economy says less than 1% is actually consumed by humans)
  So media orgs, instead of coming up with one, they just keep mindlessly doing what they know how to do - generate more info. Platforms and corps subsidize this activity for their own interests.
  So media orgs have no signal/warped signals of how useless what they are doing is.
- xp84 1 hour ago
  When it comes to the companies named here, I would argue that they have shown that reporting isn't even a secondary goal or a goal at all. Journalists don't even make that much money, but they've still gutted newsrooms very thoroughly. I assume that they already have people working on setting up an LLM connected to feeds of press releases, government announcements, public police crime reports, prominent social media accounts, etc. to create a repository of slop they can use (which will bear a vague resemblance to 'news') without having even one reporter employed. And then they'll try to sell access to that slop feed back to the AI vendor (which hopefully won't buy it).
- frmersdog 1 hour ago
  As good a time as any to remind people that the Southern Strategy was never really all that Southern:
  https://www.uh.edu/news-events/stories/052815watchingtvracia...
  https://www.mediamatters.org/legacy/video-what-happens-when-...
  Historically-speaking, if your local news can twist the context to make you easier to sell to (products, services, ideologies), they will do that.
arjie 1 hour ago
It's interesting how much we lost with the end of the advertising model (though likely its death would arrive with agentic access anyway). An unsurprising reaction to that was the advent of the widespread paywall. And in a world where every paywalled article on social media, including HN, is on an archived paywall-bypass site there was going to be a natural cat-and-mouse game. The distributed payment model of online advertising was surprisingly effective. No single person was worth very much but the aggregate of attention had a probabilistic conversion that enabled a sufficient ecosystem of news.
Now most of those who spend money get access to relatively good news in comparison to those who don't. The interesting thing is that if you model the utility of a customer base as trifactorial (subscriptions, ad-supported, influence-ability) and you set ad-support to near zero you're left with this situation where those with no ability to pay are now overwhelmingly useful to the website provider only as an influenceable base.
"If you're not paying, you're not the customer, you're the product", we used to say[0]. It turns out that's true, but if you can't pay by looking at ads, you will pay by the actions you take when you believe what the actual customer wants you to believe.
0: Though sometimes you do pay and you're still "the product" haha!
xp84 1 hour ago
> "as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property."
So their argument is that people who would be paying money at their paywalls, are going to IA to get their news for free? And if they can thwart those people, they'll show up and become monthly subscribers?
I am vaguely sympathetic to newspapers as a concept, though the actual owners of approximately all of them are just PE companies looking to extract maximum profit from this dying industry, not really trying to prolong their existence.
But I think everyone who is interested in subscribing to their newspapers' paywalls already has subscribed. Those of us who bypass paywalls with that archive.whatever site, or apparently IA (I have never tried it for this purpose) are doing so because there is zero chance we're going to (recurringly!) pay the asking price for some random out-of-town newspaper, The Verge, Bloomberg, whatever. It's fair game to call us immoral for that decision, but if (and it's a big if) this move prevents more people from being able to bypass a paywall, I predict zero incremental dollars will go to the news publishers.
[-]
- munk-a 1 hour ago
  IA is sort of caught in the middle of a conflict it didn't ask for, here. The same tools that allow IA to do its work are also used by Google to scrape and resell the news. There are ways to allow one usage without the other - but the simpler and more foolproof approach is to block both.
forestingfisher 1 hour ago
Of course, there are other archivers that don’t care
internet_points 1 hour ago
they should make a browser extension that lets logged in readers submit the contents of their tab to IA
flippant 3 hours ago
Apologies for the self-promo. Downvote and I'll know not to do it again.
This trend of outright banning the Internet Archive has me extremely worried. I fear a future where news articles are memoryholed, and no one can remember exactly what was reported and how sensational it all seemed.
I've been working on this project [0] for a while. Originally, I started with a tool that would allow people to snapshot webpages in their own browser, and they could selectively share their snapshots. Then by consensus, everyone could understand what exactly had changed, and they could draw their own conclusion about why.
While working on it, I realized that an authoritative answer to "what did it look like on $DATE" can't be produced by a no-name company. It's gotta be a non-commercial entity that's got a track record of integrity. The dream would be to allow MemoryHole customers to submit their snapshots to the Internet Archive (or other non-commercial entity). It's definitely a copyright nightmare - so no clue how this could work.
[0] - https://memoryhole.app
[-]
- entropie 2 hours ago
  I really like this, and it's also reasonably priced.
  Is there a way to export/download my saves in a reasonable way?
  [-]
  - flippant 2 hours ago
    Thank you! Yes, you just get a zip file with all of your saved pages.
    It looks like this:
    ├── files
    │   └── 632daffb-2f4f-4795-bb4d-3149d24f4264
    │       ├── original.html
    │       ├── readerview.html
    │       └── screenshot.png
    ├── manifest.json
    └── metadata.csv
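For anyone scripting against an export shaped like the tree above, here is a rough Python sketch that groups the files by snapshot ID. It assumes only the directory layout shown; `list_snapshots` is a made-up helper name, and nothing is assumed about what `manifest.json` or `metadata.csv` contain.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def list_snapshots(zip_path):
    """Map each snapshot ID (the folder under files/) to its file names.

    Assumption: members are laid out as files/<snapshot-id>/<file>,
    as in the tree above. Anything else in the zip is ignored.
    """
    snapshots = defaultdict(list)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            parts = PurePosixPath(name).parts
            # Keep only files/<snapshot-id>/<file>; skip directory entries
            # and top-level files such as manifest.json.
            if len(parts) == 3 and parts[0] == "files":
                snapshots[parts[1]].append(parts[2])
    return {snap_id: sorted(files) for snap_id, files in snapshots.items()}
```

Calling `list_snapshots("export.zip")` (the file name is a placeholder) would return a plain dict, which is easy to diff between two exports.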
- iamalizard 2 hours ago
  > It's definitely a copyright nightmare - so no clue how this could work.
  It could work as a decentralized free and open source system that doesn't care about copyright. Like how torrents work now, but it would be good to have it work over Tor or something. Perhaps as a DAO for the management aspect of it. I don't know how exactly. But disregarding copyright by using a centralized company is the wrong idea.
  Or you can do the lawful approach and try to work within the framework of that copyright nightmare. But "fuck copyright" is an easier path.
  [-]
  - entropie 2 hours ago
    You - as a company - can just avoid any copyright stuff when your extension saves the stuff only on the client. I see there are many other issues then.
    The torrent approach is nice. I could imagine a selfhosted way to store the data (for a group of people)
    [-]
    - flippant 2 hours ago
      > I could imagine a selfhosted way to store the data (for a group of people)
      Linkwarden does this well. You can share a collection for a small group of people.
      https://github.com/linkwarden/linkwarden
  - RobRivera 2 hours ago
    Tor is a honeypot run by government intel operations. Don't use it.
    [-]
    - fsflover 53 minutes ago
      How about I2P?
    - iamalizard 2 hours ago
      Please provide evidence for such strong claims. Otherwise it's just FUD.
jqmccleary 1 hour ago
If we don't know the past we won't know it's repeating
stronglikedan 1 hour ago
That's okay. The AI knows everything now, and forever more. Farewell IA.
phkahler 1 hour ago
Are they blocking due to AI scraping, or due to people using archive as a way around paywalls?
For the latter, archive could just limit access to stuff that's less than 7 days old.
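A minimal sketch of that kind of embargo check, assuming a 7-day window and a hypothetical `is_publicly_viewable` helper (this is an illustration, not anything the Internet Archive is known to implement): the capture is stored right away but only served once it is old enough.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: the 7-day window suggested in this comment.
EMBARGO = timedelta(days=7)


def is_publicly_viewable(
    captured_at: datetime,
    now: Optional[datetime] = None,
    embargo: timedelta = EMBARGO,
) -> bool:
    """Return True once a capture is at least `embargo` old.

    Both datetimes are expected to be timezone-aware.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - captured_at >= embargo
```

The capture itself never depends on the embargo, so nothing is lost for the long-term record; only the serving side consults the check.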
_ink_ 2 hours ago
Thanks, Big Tech!
charcircuit 2 hours ago
If the block is merely user agent based IA can spoof a different user agent to get these sites.
jmclnx 3 hours ago
Maybe they should allow the Internet Archive access to their article after a week or 2.
But I think this will hurt them more than help as time goes on. IIRC, one news org blocked free access and their revenue fell. I think that was in Australia.
But it seems they are using AI as the reason. So allowing access after a week will not prevent AI access.
But what happens if an AI company subscribes to the news site using a person's name (or a fake name)? They will still get the article and avoid hassles.
[-]
- celsoazevedo 3 hours ago
  It may be easier to convince them if the Internet Archive doesn't allow access for <period of time>. Not good for the average user now, but at least it would be archived for the future. Better than having no archive at all.
  [-]
  - fragmede 2 hours ago
    Yeah IA needs to get their heads out of their asses and just do that. It's an archive, but if it's available at the same time as it's relevant, then it's being used as alternate access.
- ranger_danger 3 hours ago
  That sounds like a good idea to me.
  One of the tests for Fair Use in the US, as I understand it, would be whether the archived work "competes" with the original.
  If people start going to IA instead to read the news, the newspaper might have a claim. But if they're doing it to get around paywalls, or purely for archival/historical/research purposes, that may be allowed.
  But the reality is such decisions are subjective and will be up to whatever judge happens to get such a case in front of them if this is challenged.
  [-]
  - PaulHoule 2 hours ago
    In general judges seem to understand that the copyright holder has some interest in these situations but do not seem to understand that the rest of the community has some rights too.
starik36 3 hours ago
https://archive.is/9X4xo
Gagarin1917 2 hours ago
Not surprising, sites like Reddit use it to get around their paywalls.
Redditors then had the gall to pretend like it wasn’t their number one use case.
fortyseven 1 hour ago
Just burn the whole fucking internet down. We can't have nice things.
picsao 1 hour ago
[dead]