Making geo joins faster with H3 indexes

(floedb.ai)

152 points | by matheusalmeida 2 days ago

11 comments

  • cullenking 11 hours ago
    We do something similar for some limited geospatial search using elastic search. We make a set of h3 indexes for each of the hundreds of millions of gps recordings on our service, and store them in elastic search. Geospatial queries become full text search queries, where a point is on the line if the set of h3 indexes contains the point. You can do queries on how many cells overlap, which lets you match geospatial tracks on the same paths, and with ES coverage queries, you can tune how much overlap you want.

    Instead of using integers IDs for the hexes, we created an encoded version of the ID that has the property that removing a character gets you the containing parent of the cell. This means we can do basic containment queries by querying with a low resolution hex (short string) as a prefix query. If a gps track goes through this larger parent cell, the track will have hexes with the same prefix. You don’t get perfect control of distances because hexes have varying diameters (or rather the approximation, since they aren’t circles they are hexes), but in practice and at scale for a product that doesn’t require high precision, it’s very effective.

    I think at the end of this year we’ll have about 6tb of these hex sets in a four node 8 process ES cluster. Performance is pretty good. Also acts as our full text search. Half the time we want a geo search we also want keyword / filtering / etc on the metadata of these trips.

    Pretty fun system to build, and the concept works with a wide variety of data stores. Felt like a total hack job but it has stood the test of time.

    Thanks uber, h3 is a great library!

    • jillesvangurp 10 hours ago
      Elastisearch and Opensearch have a built in geo_shape type that is a bit more optimal for queries like this.

      Before that existed (pre 1.0 actually), I did something similar with geohashes, which are similar to h3 but based on simple string encoded quad trees. I indexed all the street segments in openstreetmap with that (~800 million at the time) and implemented a simple reverse geocoder. Worked shockingly well.

      The geo_shape type uses a bkd tree in binary format. It's heavily optimized for this type of intersects/overlaps queries at scale. Basically does the same thing but using a lot less disk space and memory. It's similar to what you would find in proper GIS databases. Elasticsearch/opensearch also support h3 and geohash grid aggregations on top of geo_shape or geo_point types.

      I'm guessing the author is using something like postgresql which of course has similar geospatial indexing support via post gis.

      • cullenking 2 hours ago
        Doesn’t meet all our product requirements unfortunately. We used returned hexes in certain queries, and we also hacked in directionality of line using least significant 12 bits of the hex (didn’t need that level of hex precision), and we are doing direction oriented matching and counting. For simpler use cases it’s definitely a better option. thanks for reminding me and other people reading my comment!
    • anacoluthe 8 hours ago
      Beware that the parent hexagon does not contain its children...
    • chrisweekly 6 hours ago
      Awesome comment, thanks for sharing the details. I love this kind of pragmatic optimization. Also, one dev's "total hack* job" [e.g. yourself, in the past] is another's stroke of genius.

      * I'd frame it as "kludge", reserving "hack" for the positive HN sense. :)

    • ajfriend 11 hours ago
      Very cool! And the prefix queries you mention are what I was trying to get at in another comment, but you explained it better :)
    • freakynit 9 hours ago
      Does this effect writes negatively?
      • cullenking 2 hours ago
        Not any differently than another indexed text field
  • jandrewrogers 12 hours ago
    There is a lot of literature on join operations using discrete global grid systems (DGGS). H3 is a widely used DGGS optimized for visualization.

    If joins are a critical performance-sensitive operation, the most important property of a DGGS is congruency. H3 is not congruent it was optimized for visualization, where congruency doesn’t matter, rather than analytical computation. For example, the article talks about deduplication, which is not even necessary with a congruent DGGS. You can do joins with H3 but it is not recommended as a general rule unless the data is small such that you can afford to brute-force it to some extent.

    H3 is great for doing point geometry aggregates. It shines at that. Not so much geospatial joins though. DGGS optimized for analytic computation (and joins by implication) exist, they just aren’t optimal for trivial visualization.

    • 0xfaded 12 hours ago
      S2 has this property https://s2geometry.io
      • jandrewrogers 11 hours ago
        Yes, S2 is a congruent DGGS. Unfortunately, it kind of straddles the analytics and visualization property space, being not particularly good at either. It made some design choices, like being projective, that limit its generality as an analytic DGGS. In fairness, its objectives were considerably more limited when it was created. The potential use cases have changed since then.
        • mdasen 11 hours ago
          Is there a congruent DGGS that you would recommend?
          • jandrewrogers 10 hours ago
            None that are well-documented publicly. There are a multitude of DGGS, often obscure, and they are often designed to satisfy specific applications. Most don’t have a public specification but they are easy to design.

            If the objective is to overfit for high-performance scalable analytics, including congruency, the most capable DGGS designs are constructed by embedding a 2-spheroid in a synthetic Euclidean 3-space. The metric for the synthetic 3-space is usually defined to be both binary and as a whole multiple of meters. The main objection is that it is not an “equal area” DGGS, so not good for a pretty graphic, but it is trivially projected into it as needed so it doesn’t matter that much. The main knobs you might care about is the spatial resolution and how far the 3-space extends e.g. it is common to include low-earth orbit in the addressable space.

            I was working with a few countries on standardizing one such design but we never got it over the line. There is quite a bit of literature on this, but few people read it and most of it is focused on visualization rather than analytic applications.

            • srean 8 hours ago
              Pointers to the literature please. I don't work in this space but love geometry.
    • ajfriend 11 hours ago
      I agree that the lack of congruency in H3 hexagons can cause weird overlaps and gaps if you plot mixed resolutions naively, but there are some workarounds that work pretty well in practice. For example, if you have mixed resolutions from compacted H3 cells but a single “logical” target resolution underneath, you can plot the coarser cells not with their native geometry, but using the outline of their children. When you do that, there are no gaps. (Totally unrelated but fun: that shape is a fractal sometimes called a "flowsnake" or a "Gosper Island" (https://en.wikipedia.org/wiki/Gosper_curve), which predates H3 by decades.)

      That said, this feels like an issue with rendering geometry rather than with the index itself. I’m curious to hear more about why you think the lack of congruency affects H3’s performance for spatial joins. Under the hood, it’s still a parent–child hierarchy very similar to S2’s — H3 children are topological rather than geometric children (even though they still mostly overlap).

      • jandrewrogers 10 hours ago
        In a highly optimized system, spatial joins are often bandwidth bound.

        Congruency allows for much more efficient join schedules and maximizes selectivity. This minimizes data motion, which is particularly important as data becomes large. Congruent shards also tend to be more computationally efficient generally, which does add up.

        The other important aspect not raised here, is that congruent DGGS have much more scalable performance when using them to build online indexes during ingestion. This follows from them being much more concurrency friendly.

        • ajfriend 9 hours ago
          I appreciate the reply! So, I might be wrong here, but I think we may be talking about two different layers. I’m also not very familiar with the literature, so I’d be interested if you could point me to relevant work or explain where my understanding is off.

          To me, the big selling point of H3 is that once you’re "in the H3 system", many operations don’t need to worry about geometry at all. Everything is discrete. H3 cells are nodes in a tree with prefixes that can be exploited, and geometry or congruency never really enter the picture at this layer.

          Where geometry and congruency do come in is when you translate continuous data (points, polygons, and so on) into H3. In that scenario, I can totally see congruency being a useful property for speed, and that H3 is probably slower than systems that are optimized for that conversion step.

          However, in most applications I’ve seen, the continuous-to-H3 conversion happens upstream, or at least isn’t the bottleneck. The primary task is usually operating on already "hexagonified" data, such as joins or other set operations on discrete cell IDs.

          Am I understanding the bottleneck correctly?

    • TacticalCoder 9 hours ago
      > If joins are a critical performance-sensitive operation, the most important property of a DGGS is congruency.

      Not familiar with geo stuff / DGGS. Is H3 not congruent because hexagons, unlike squares or triangles, do not tile the plane perfectly?

      I mean: could a system using hexagons ever be congruent?

      • urschrei 5 hours ago
        Hexagons do tile the Euclidean plane perfectly. They are the largest of the three n-gons that do so.
  • markstos 1 hour ago
    There are some competing grid systems with similar features and benefits as well.

    Notably A5, which has the property that each cell covers exactly the same area, even when stretched towards the north and south pole. Useful for certain spatial analysis where you need every cell to have the same size.

    https://a5geo.org/

  • dgsan 11 hours ago
    I don't like what scrolling this site does to my browser history.
    • nmstoker 11 hours ago
      Yes, noticed that too. Blocked me getting back to HN. Bad behaviour from the site.
  • hiddew 4 hours ago
    Also see PostGIS for Postgres systems: https://postgis.net/docs/.
  • mcherm 3 hours ago
    Why does H3 use a nonoverlapping set of hexagons? A square grid would make it even simpler and faster to calculate. I am perfectly happy to believe that a hex grid works better for some reason but what is that reason?
  • febed 11 hours ago
    Wouldn’t having a spatial index give you most of the performance gains talked about here without needing H3?
    • feverzsj 9 hours ago
      Yes. And it should be faster. They may forget to create spatial index.
      • twelvechairs 8 hours ago
        Agree with this. They are re-solving a problem that has been solved better by others before (with R-trees).

        They may well be using some data storage where spatial indexing is not possible or standard. Geoparquet is a common one now - a great format in many ways but spatial indexing isnt there.

        Postgres may be out of fashion but still an old fashioned postgis server is the simplest solution sometimes.

        • pb060 2 hours ago
          Why do you consider Postgres + PostGIS out of fashion? What are people using for spatial data these days?
  • analytically 8 hours ago

      At 500 stations:
      - H3: 218µs, 4.7KB, 109 allocs
      - Fallback: 166µs, 1KB, 37 allocs
      - Fallback is 31% faster
    
      At 1000 stations:
      - H3: 352µs, 4.7KB, 109 allocs
      - Fallback: 312µs, 1KB, 37 allocs
      - Fallback is 13% faster
    
      At 2000 stations:
      - H3: 664µs, 4.7KB, 109 allocs
      - Fallback: 613µs, 1KB, 37 allocs
      - Fallback is 8% faster
    
      At 4500 stations (real-world scale):
      - H3: 1.40ms, 4.7KB, 109 allocs
      - Fallback: 1.34ms, 1KB, 37 allocs
      - Fallback is 4% faster
    
      Conclusion: The gap narrows as station count increases. At 4500 stations they're nearly equivalent. H3 has fixed overhead (~4.7KB/109 allocs for k=2 ring), while fallback scales linearly. The crossover point where H3 wins is likely
      around 10-20K entries.
  • galkk 7 hours ago
    Ohh, every geo join/spatial thing with picture that consists of those small cells over map is such pet peeve of mine. Facebook marketplace, craigslist, tinder, any app with “proximity search”.

    No, this city isn’t 4 miles from my city. There is a literal lake between us. It’s 10+ miles.

    Please, invent something, do precompute, but just avoid naive-ish searches.

    • boxed 5 hours ago
      Is this related to the article?
      • cyanydeez 5 hours ago
        Hes just angry hes not a crow.
  • mgaunard 7 hours ago
    Why doesn't it use k-d trees or r-trees?
    • cpa 6 hours ago
      The big reason is that H3 is data independant. You put your data in predefined bins and then join on them, whereas kd/r trees depend on the data and building the trees may become prohibitive or very hard (especially in distributed systems).
      • mgaunard 6 hours ago
        Indices are meant to depend on the data yes, not exactly rocket science.

        Updating an R-tree is log(n) just like any other index.

        • vouwfietsman 5 hours ago
          I think the key is in the distributed nature, h3 is effectively a grid so can easily be distributed over nodes. A recursive system is much harder to handle that way. R-trees are great if you are OK with indexing all data on one node, which I think for a global system is a no-go.

          This is all speculation, but intuitively your criticism makes sense.

          Also, mapping 147k cities to countries should not take 16 workers and 1TB of memory, I think the example in the article is not a realistic workload.

        • cpa 4 hours ago
          To add to sibling comment, if you have streaming data you have to update the whole index every time with r/kd trees whereas with H3 you just compute the bin, O(1) instead of O(log n).

          Not rocket science but different tradeoffs, that’s what engineering is all about.

  • avereveard 9 hours ago
    nice writeup, terrible website garbling the page history

    I wonder how it compare with geohashing, I know it is not as efficient in term of partitioning and query end up weird since you need to manage decoding neighbor cells but finding all element of a cell is a "starts with" query which allow to put data effectively on most nosql databases with some sort of text sorting