My Hacker News items table in ClickHouse has 47,428,860 items, and it's 5.82 GB compressed and 18.18 GB uncompressed. What makes Parquet compression worse here, when both formats are columnar?
Sorting, compression algorithm and level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That’s one bit vs 1-4 bytes.
There is also flexibility in what you define as the dataset. Skinnier but more focused tables could save space vs a wide table that covers everything, which will probably break compressible runs of data.
Plus Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, which as you say has three main options for Parquet.
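To make the sorting and data type points concrete, here is a rough sketch using only the Python standard library. It is not Parquet or ClickHouse: zlib stands in for the real codecs, and the column contents and sizes are invented for illustration. It compares the compressed size of the same low-cardinality column unsorted vs sorted, and the raw size of a Boolean column stored as 4-byte integers vs packed one bit per value.

```python
import random
import zlib


def compressed_size(values, width):
    """Serialize ints as fixed-width little-endian bytes; return zlib-compressed size."""
    raw = b"".join(v.to_bytes(width, "little") for v in values)
    return len(zlib.compress(raw, 6))


def pack_bools(flags):
    """Pack booleans eight per byte, least significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical low-cardinality column, e.g. an item type encoded as a small int.
column = [rng.randrange(5) for _ in range(100_000)]
unsorted_size = compressed_size(column, 4)
sorted_size = compressed_size(sorted(column), 4)

# Hypothetical Boolean column: 4-byte integers vs one bit per value (uncompressed).
flags = [rng.random() < 0.5 for _ in range(100_000)]
as_int32 = len(b"".join(int(f).to_bytes(4, "little") for f in flags))
as_bits = len(pack_bools(flags))

print("unsorted:", unsorted_size, "sorted:", sorted_size)
print("bool as int32:", as_int32, "bool as bits:", as_bits)
```

The sorted column compresses far better because equal values form long runs, which is the same effect a sort key has on a columnar file. Real formats layer dictionary and run-length encodings on top, so the exact ratios will differ.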
Is it possible to download only a subset, e.g. Show HNs or HN Whoishiring? Both are very useful for classroom data science, i.e. a good set of data for students to learn the basics of data cleaning and engineering.
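If no subset download is offered, filtering after the fact is straightforward. A minimal sketch, assuming each item is a dict with `title` and `by` fields as in the public HN API (the dataset's actual column names may differ), and assuming the hiring threads can be recognised by the `whoishiring` account or a title prefix. The sample rows are made up for the example.

```python
def is_show_hn(item):
    """True for stories whose title starts with the 'Show HN:' prefix."""
    title = item.get("title") or ""
    return title.startswith("Show HN:")


def is_who_is_hiring(item):
    """True for the monthly hiring threads (assumed author or title prefix)."""
    title = item.get("title") or ""
    return item.get("by") == "whoishiring" or title.startswith("Ask HN: Who is hiring?")


# Invented sample rows, only to exercise the filters.
items = [
    {"id": 1, "by": "alice", "title": "Show HN: A tiny Parquet viewer"},
    {"id": 2, "by": "whoishiring", "title": "Ask HN: Who is hiring? (March 2026)"},
    {"id": 3, "by": "bob", "title": "Why columnar formats compress well"},
    {"id": 4, "by": "carol", "title": None},  # comments carry no title
]

show_hn_ids = [it["id"] for it in items if is_show_hn(it)]
hiring_ids = [it["id"] for it in items if is_who_is_hiring(it)]
print(show_hn_ids, hiring_ids)
```

This only selects the top-level stories. Pulling in the comments underneath them would need a second pass that follows parent links.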
> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies
The user content is supposed to be licensed only to Y Combinator and (bleah) its affiliated companies (which are many: all the startups they fund, for example).
I.e. this section is talking about additional rights to the content you post ALSO going to YC, not that YC is guaranteeing it will be the only one to hold these rights or will enforce who else should hold the rights to your publicly shared content for you.
Your submissions to, and comments you make on, the Hacker News site are not Personal Information and are not "HN Information" as defined in this Privacy Policy.
Other Users: certain actions you take may be visible to other users of the Services.
Eh, fuck that agreement. I'm kind of old school in that I believe if you put it on the internet without an auth-wall, people should be allowed to do whatever they want with it. The AI companies seem to agree.
Then again, I'm not the guy that is going to get sued...
"I'm kind of old school in that I believe if you put grass on the ground without a fence, people should be allowed to do whatever they want with it. The noblemen with a thousand cows seem to agree."
And that, my friends, is how you kill the commons - by ignoring the social context surrounding its maintenance and insisting upon the most punitive ways of avoiding abuse.
They already refuse to comply with CPRA, instead electing to replace your username with a random 6(?) character string, prefixed with `_`, if I remember correctly.
I know, because I've been here since maybe 2015 or so, but this account was created in 2019.
So any PII you have mentioned in your comments is permanent on Hacker News.
I would appreciate it if they gave users the ability to remove all of their personal data, but in correspondence and in writing here on Hacker News itself, Dan has suggested that they value the posterity of conversations over the law.
> At midnight UTC, the entire current month is refetched from the source as a single authoritative Parquet file, and today's individual 5-minute blocks are removed from the today/ directory.
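For anyone trying to picture the rollover in that quote, here is a hedged sketch of the file handling only. Everything in it is hypothetical: the `midnight_rollover` and `fetch_month` names, the `<month>.parquet` naming, and the placeholder bytes standing in for a real Parquet file. It is not the archive's actual code.

```python
import tempfile
from pathlib import Path


def midnight_rollover(root, month, fetch_month):
    """Replace the month file with a fresh fetch, then clear today's blocks.

    fetch_month is a caller-supplied function returning the month's file bytes.
    Returns the month file path and the number of block files removed.
    """
    root = Path(root)
    month_file = root / f"{month}.parquet"
    tmp = root / f"{month}.parquet.tmp"
    tmp.write_bytes(fetch_month(month))  # write fully before swapping in
    tmp.replace(month_file)              # rename over the old month file

    removed = 0
    today = root / "today"
    if today.is_dir():
        for block in sorted(today.glob("*.parquet")):
            block.unlink()
            removed += 1
    return month_file, removed


# Demo in a scratch directory with three fake 5-minute blocks.
root = Path(tempfile.mkdtemp())
(root / "today").mkdir()
for name in ("0000.parquet", "0005.parquet", "0010.parquet"):
    (root / "today" / name).write_bytes(b"block")

month_file, removed = midnight_rollover(root, "2026-03", lambda m: b"placeholder month bytes")
print(month_file.name, removed)
```

Note that the month file is overwritten from the source rather than merged from the blocks, so whatever the source no longer returns at midnight is not carried forward.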
> The archive currently spans from 2006-10 to 2026-03-16 23:55 UTC, with 47,358,772 items committed.
That’s more than 5 minutes ago by a day or two. No big deal, but a little bit depressing this is still how we do things in 2026.
https://www.ycombinator.com/legal/
Mods, enforce your license terms, you're playing fast and loose with the law (GDPR/CPRA)
If it's owned by you and only licensed by HN shouldn't you be the one enforcing it?
> ... a nonexclusive
That said, there are "no scraping" and "commercial use restricted" carve-outs for the content on HN. Which honestly is bullshit.
I agree. It's the owners of the sites that have to follow rules, not us.
Wouldn't that lose deleted/moderated comments?