12 comments

  • Xiaoher-C 2 hours ago
    The compile-time lineage part is the most interesting bit to me. A lot of “data lineage” tools feel like archaeology after the fact: parse logs, reconstruct what probably happened, then hope it matches reality.

    Having the compiler know “this column flows into these downstream models” before execution changes the workflow quite a bit. It makes refactors and masking policies much less scary.

    Do you expose any kind of “lineage diff” between branches? For example: this PR changes the downstream impact of `customer.email` from A/B/C to A/B/D. That would be useful in code review.

    • thelittlenag 1 hour ago
      Same. I worked on an in-house product many years ago now where lineage and provenance were the entire point. Really cool to see this!
  • ramon156 5 hours ago
    If your introduction message already includes a bunch of uncurated claims and LLM smells, then what does that say about the code I'm about to run?
    • hugocorreia90 4 hours ago
      Yeah, fair pushback, and yes, the intro was AI-assisted. Marketing is not my strength, nor am I a native English speaker. I built this in about a month with heavy LLM tooling, and the seed comment is part of that. I'm not going to pretend otherwise.

      The code is what it is. `cargo test --workspace` runs across 19 crates. CI on 5 platforms (macOS ARM/Intel, Linux x86/ARM, Windows). JSON output schemas are codegen-checked in CI so docs can't drift from the binary.

      If you want to skip the marketing copy and look at engine reasoning instead: PR #240 (audit trail), #241 (column classification + masking), #270 (failed-source surfacing in discover).

      I'd rather hear "the code is bad" than "the post sounds AI-written".

      • DoctorOW 2 hours ago
        > I'd rather hear "the code is bad" than "the post sounds AI-written".

        Of course you would. Reading through and judging the quality of AI output is the largest amount of effort in a world where you can get everything else by prompting. Please internalize this: If you want to be respected you will have to put in effort yourself. There is no way around this.

        • hugocorreia90 32 minutes ago
          I truly appreciate your feedback and it's definitely a lesson learned for me. As I said to cmrdporcupine, "The only reason for passing my replies through AI was just because it's my first time posting here and opening a side-project of mine publicly."

          All the engine architecture decisions are mine, though, and this project came about to solve a real problem in a data pipeline that serves multiple clients, connectors, producers, etc.

      • austinthetaco 2 hours ago
        This comment itself is likely written by AI by the sounds of it. It may be worth your time writing it out in your own words in your native language and then finding a competent translation tool to translate your words.
      • FrustratedMonky 2 hours ago
        Not sure why you are downvoted here.

        A lot of side projects, hobby projects, etc. are all using AI tools now. Same for marketing: every sales/marketing firm is using AI. So why criticize this guy in particular?

        AI is pervasive, the train has left the station. So that is not a reason to criticize this project. There might be other reasons, I'm not sure, but not that an AI was used.

        • ModernMech 2 hours ago
          Because "Yeah, fair pushback" is AI smell. Either everything this person does is passed through an AI from code to blogs to even their HN comments and submissions; or they use AI so much they're starting to talk like it colloquially. Either way no one has time for that.
          • FrustratedMonky 2 hours ago
            "Yeah, fair pushback"

            Really hard to tell. Because that used to be a common phrase that real people would use.

            So now I have to change my own language in order to not appear like I'm an AI? We are getting into a weird place where humans have to act/sound increasingly 'odd' to appear not 'perfect' like an AI.

            • ModernMech 2 hours ago
              It's really not hard to tell. It's the "How do you do fellow kids" of AI-isms. The presence of "fair pushback" and a single em dash reads as 99% AI generated as far as I am concerned.

              Yes, if you don't want to sound like you're cargo culting AI, you do have to change the way you talk because people aren't going to care otherwise. At the very least just because it's boring. That's always been the nature of slang and lingo.

        • cmrdporcupine 2 hours ago
          It's really a weird world now.

          I do think the author is doing a disservice to themselves by writing the post and comments using LLM, even if the code is mostly agent built. People can tell right away, all the LLM shibboleths are there... it feels cheap. Just write naturally and then Google translate, don't let the LLM speak on your behalf.

          What's going to distinguish projects that are built this way is the ability to explain, document, support, and maintain said projects over the long term. That will be the crucible. Gone are the days of "build it and they will come", and I feel a bit sad about that.

          It's so easy to let the code grow under you, beyond what you have the capacity to explain, document, support, and maintain.

          I've got the same thing going on. Eschewing paid work and grinding 16, 17 hours a day boiling the sea to build the whole universe from scratch (also a database, but of a different sort than this project) integrating all my favourite DB research papers and ideas that I've accumulated over the last 30 years. Outperforms postgres 2-4x or more, has a battery of correctness tests, Lean proofs, benchmarks, etc. etc.

          But frankly I'd be nervous to share. Especially here. I don't even know where it ends up. Not least because if I'm doing it, so are 50 other people, probably.

          • hugocorreia90 1 hour ago
            I totally acknowledge that. The only reason for passing my replies through AI was just because it's my first time posting here and opening a side-project of mine publicly.

            All the engine architecture decisions are mine, though, and this project came about to solve a real problem I currently have at work with a zero-touch data pipeline leveraging FiveTran, Dagster, dbt and Databricks. It's a data pipeline that serves multiple agencies and data producers who work with data from more than 300 clients and multiple connectors.

            Rocky essentially grew out of all the time spent awake at night thinking about these problems and how they could be addressed differently, given that dbt doesn't suit this particular use-case well.

            I decided to open Rocky to the public for free for two simple reasons: first, it might help others, and I fulfill my ego of having built something other people like and use. Second, I'm the solo maintainer, and a project can only get proper traction if more people contribute to it.

  • mollerhoj 5 hours ago
    It's a bit confusing to claim that "The things your current stack can't give you because it doesn't own the DAG" and use Databricks as your example: Databricks includes jobs and pipelines, so it very much owns the DAG, no?
    • hugocorreia90 4 hours ago
      Fair point. Databricks owns a scheduling DAG (Workflows, DLT). What I meant by "owns the DAG" is the semantic DAG: model-to-model dependencies with column-level types that the compiler builds.

      Workflows knows task A runs before task B. Rocky knows `dim_customer.email` flows from `raw_users.email_address` through three CTEs in `stg_customers`. Different layer, same word.
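      To make the distinction concrete, here's a toy sketch of the kind of column-level graph a compiler can walk before anything executes. The column names mirror the example above; the data structure and function are illustrative only, not Rocky's internal representation.

```rust
use std::collections::{HashMap, HashSet};

// Toy column-lineage graph: edges point from a source column to the
// columns it feeds downstream.
fn downstream(graph: &HashMap<&str, Vec<&str>>, start: &str) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut stack = vec![start.to_string()];
    // Depth-first walk: collect every column reachable from `start`.
    while let Some(col) = stack.pop() {
        if let Some(targets) = graph.get(col.as_str()) {
            for &t in targets {
                if seen.insert(t.to_string()) {
                    stack.push(t.to_string());
                }
            }
        }
    }
    seen
}

fn main() {
    let mut g: HashMap<&str, Vec<&str>> = HashMap::new();
    g.insert("raw_users.email_address", vec!["stg_customers.email"]);
    g.insert("stg_customers.email", vec!["dim_customer.email"]);

    let impact = downstream(&g, "raw_users.email_address");
    // Renaming raw_users.email_address would break both downstream columns.
    assert!(impact.contains("stg_customers.email"));
    assert!(impact.contains("dim_customer.email"));
}
```

      A scheduling DAG only records that one task runs before another; a graph like this is what answers "what breaks if I rename this column" before a single query runs.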

      I'll be more careful with that framing.

  • PeterWhittaker 2 hours ago
    Congrats on the work, but have you considered another name? Naming is hard and always will be: When I first scanned the headline, my initial thought was "that's an interesting area for the Rocky Linux team to explore". After a moment, "wait, no, that's confusing, it's some other Rocky".
    • hugocorreia90 2 hours ago
      Thanks Peter. All my side-projects are named after my pets. I had a dog named Rocky, and given this project is also an underdog competing with well-established tools such as dbt and sqlmesh, I decided to keep the name when opening it to the public. But I'm happy to hear suggestions for a better name for this tool :)
      • PeterWhittaker 1 hour ago
        I love that! I am inspired to create Terry, Tizzie, Topé, Bubba, and Roxy (the three Ts are in my office right now), the last two are no longer with us but for the hole in my heart.

        I have no idea what these projects would be, but based on personalities, Roxy would chew through CPU and memory like a beaver (she loved turning large branches into small chunks), Bubba would inspire calm and peacefulness but walk into things (he was one-eyed and a little clumsy), Terry would stick like glue (an eBPF program, maybe?), Tizzie would work well most of the time then destroy your stuff (an AI agent?), and Topé would always be there, but never quite willing to participate (a bad Windows driver?).

        I don't know the area well enough to suggest an alternate name, but maybe Wiley, which is an indirect reference to Dag from Barnyard via Wile E. Coyote?

    • hobofan 59 minutes ago
      I fear that there is an even closer candidate for confusion: RocksDB
  • data_ders 1 hour ago
    hiya, anders from dbt here. cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day. was it dbt's lack of those features that inspired you to start this project? It also seems you have an aversion to Jinja, which, believe me, I get!

    FYI dbt-fusion [1] is going GA next week (though GA for Databricks will come later). Most of it is source-available and ELv2-licensed, but there are a number of crates that are Apache 2.0, namely: dbt-xdbc, dbt-adapter, dbt-auth, dbt-jinja, dbt-agate. We also have plans to OSS more as time goes on (stay tuned).

    I just wanted to call out the OSS crates in case you'd rather focus on "making your beer taste better" than have to re-build foundations. I'd love to hear if any of those crates come in handy for you (even more so if they don't work for you).

    Feel free to reach out on LinkedIn or dbt community Slack if you ever want to chat more!

    [1]: https://github.com/dbt-labs/dbt-fusion

    • hugocorreia90 39 minutes ago
      Hey Anders! Thanks a lot for dropping a comment and showing interest in Rocky. Yes! I'm not going to lie, Jinja is one of the things that gives me some itches :). But it wasn't the major reason for starting to build Rocky, though.

      It all started with the need to auto-generate dbt models from the FiveTran connections I integrate with, then having to hot-reload the code location in Dagster to discover new assets. All in a zero-touch data pipeline: FiveTran connections are discovered as they're created, and assets are materialized as these connections sync.

      Auto-generating these dbt models, keeping the manifest aligned between Dagster code-location reloads, and spinning up pods in EKS for each Dagster run that relies on these auto-generated models all have an impact on overall performance. That's true not only in production; it also affects DX in developers' local environments.

      Rocky wasn't born with a "dbt replacement" in mind at all; it was born to solve a real issue I'm facing. I made sure it integrates well with dbt, as it's in my plans to leverage the awesome work available as dbt packages for FiveTran.

      I'll definitely have a look at the crates you mentioned! Thank you!

  • hasyimibhar 7 hours ago
    Looks cool, I've been waiting for someone to build this since the dbt/SQLMesh acquisition. It would be great to have model versioning and support for ClickHouse SQL.
    • hugocorreia90 4 hours ago
      Thanks. On model versioning — what's the use case you have in mind? A few options that map to different designs:

      - dbt-style semantic-layer versions (v1/v2 of a model)
      - schema migration history
      - branch-based (Rocky already has branches + replay)

      Different design choice for each, so it helps to know which problem you're trying to solve.

      ClickHouse is tractable through the Adapter SDK without engine patching. If you can share roughly your model count and workload shape, I can put a real timeline on it. Open to community PRs too.
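      For a sense of what an adapter could look like, here's a minimal, hypothetical dialect-adapter sketch in Rust. The trait name and methods are invented for illustration; the real Adapter SDK surface may differ.

```rust
// Hypothetical adapter shape: a trait per warehouse dialect, so the
// engine stays untouched and a new backend only implements this surface.
trait Adapter {
    fn dialect(&self) -> &'static str;
    fn quote_ident(&self, ident: &str) -> String;
}

struct ClickHouseAdapter;

impl Adapter for ClickHouseAdapter {
    fn dialect(&self) -> &'static str {
        "clickhouse"
    }
    // ClickHouse quotes identifiers with backticks.
    fn quote_ident(&self, ident: &str) -> String {
        format!("`{}`", ident)
    }
}

fn main() {
    let ch = ClickHouseAdapter;
    assert_eq!(ch.dialect(), "clickhouse");
    assert_eq!(ch.quote_ident("dim_customer"), "`dim_customer`");
}
```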

  • mergisi 7 hours ago
    * * *
    • hugocorreia90 4 hours ago
      Thanks for the careful read. The "what breaks if I rename this column" question is exactly what column lineage from the compiler is meant to answer, and you said it better than I did in the post.

      On the schema-grounded AI angle: agreed. The failure mode you describe — structurally valid SQL that joins on the wrong key or aggregates at the wrong grain because the model hallucinated a relationship — is exactly what the compiler is positioned to catch. AI-generated SQL runs through the type checker before it can land, so suggestions that don't validate against the actual DAG never reach the user. NL-to-SQL tools that integrate a compile step would close exactly the gap you're pointing at.

      On your two questions:

      1. Branch isolation for stateful models — mixed answer, and worth being honest about:

         - Incremental: isolated. The watermark `state_key` includes the resolved schema, and `rocky branch create` swaps the schema prefix. So a branch run reads/writes a different redb key than main and they don't advance each other.
      
         - Snapshot: not yet. Today `rocky branch create` only writes a branch record; it doesn't copy warehouse tables. A snapshot model on a branch starts with an empty table (CREATE TABLE IF NOT EXISTS in the branch schema) and accumulates from the first branch run, with no inherited history from main. That's the gap. The fix is the next wave: native Delta SHALLOW CLONE / Snowflake zero-copy at branch creation, which gives point-in-time snapshot semantics without copy-on-write overhead.
      
      2. Cost attribution. Both bytes scanned and duration are captured per-model in the run record (`bytes_scanned` and duration on `RunRecord`). Budget gating today is on cost (USD) and duration — `max_usd` and `max_duration_ms` in `[budget]` blocks in `rocky.toml`, as independent thresholds. A direct bytes-scanned budget threshold isn't gateable today; the bytes are in the run record for analysis but you can't currently fail CI on "this run scanned more than N TB". Reasonable extension if there's demand.

         To your Snowflake point: the warehouse-size × duration credit model and the scan volume tell genuinely different stories, so they're tracked separately rather than rolled into a single number.
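      For reference, a budget block using the two thresholds named above might look like this. Only `max_usd` and `max_duration_ms` under `[budget]` in `rocky.toml` are confirmed; the values and exact placement are a sketch.

```toml
# rocky.toml — sketch; independent thresholds, either one can gate a run
[budget]
max_usd = 25.0            # fail past $25 of estimated cost
max_duration_ms = 900000  # fail past 15 minutes of runtime
```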