Moldova broke our data pipeline

(avraam.dev)

28 points | by almonerthis 2 days ago

11 comments

  • franciscop 1 hour ago
    This very clearly seems like a bug either in their DMS script, or in the DMS job that they don't directly control, since CSV clearly allows for escaping commas (by just quoting them). Would love to see a bug report submitted upstream as well, as part of the "fix".
    • zarzavat 1 hour ago
      CSV quoting is dialect dependent. Honestly, you should just never use CSV for anything if you can avoid it. It's inferior to TSV (or better yet JSON/JSONL) and has a tendency to look like it's working while actually hiding bugs like this one.
      • j16sdiz 1 hour ago
        Most CSV dialects have no problem with double-quoted commas.

        The "dialect dependent" part is usually about escaping double quotes, newlines and line continuations.

        Not a portable format, but it is not too bad (for this use) either, considering the country list is mostly static.
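
The quoting point in this sub-thread can be illustrated with a minimal sketch using Python's standard `csv` module. It assumes the failure mode was a field containing a comma being split on every comma, which the thread implies but does not confirm; the sample rows and variable names are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the second field of the first row contains a comma.
rows = [["MD", "Moldova, Republic of"], ["DE", "Germany"]]

# Write with minimal quoting: only fields that need quotes get them.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerows(rows)
csv_text = buf.getvalue()

# Splitting on commas ignores the quotes and breaks the field apart.
naive_fields = csv_text.splitlines()[0].split(",")

# A real CSV parser honors the quotes and recovers the original rows.
parsed_rows = list(csv.reader(io.StringIO(csv_text)))

print(len(naive_fields), parsed_rows == rows)
```

Any consumer that splits on commas instead of parsing quotes would see an extra column for the Moldova row, while a quote-aware reader round-trips it unchanged.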

  • aquafox 22 minutes ago
    I really don't understand why people think it's a good idea to use CSV. In English settings, the comma can be used as the thousands separator in large numbers, e.g. 1,000,000 for one million; in German, the comma is used as the decimal separator, e.g. 1,50€ for 1 euro and 50 cents. And of course, commas can appear in free-text fields. Given all that, it is just logical to use TSV instead!
  • rglover 11 minutes ago
    Considering the scope, this could be more easily resolved by just stripping ", Republic of" from that specific string (assuming "Moldova" on its own is sufficient).
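
A minimal sketch of that narrow fix, assuming an exact-match override is acceptable and that "Moldova" alone is a sufficient name downstream. The table and function names are hypothetical, not taken from the article.

```python
# Hypothetical override table: only the exact string named in the comment.
NAME_OVERRIDES = {"Moldova, Republic of": "Moldova"}


def normalize_country(name: str) -> str:
    """Return the short form for known problem names, else the name unchanged."""
    return NAME_OVERRIDES.get(name, name)


print(normalize_country("Moldova, Republic of"))
print(normalize_country("Germany"))
```

Matching the whole string rather than stripping the suffix everywhere leaves other names that end the same way untouched.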
  • davecahill 58 minutes ago
    I was expecting a Markdown-related .md issue. :)
  • Surac 36 minutes ago
    I personally would shy away from binary formats whenever possible. For my column-based files I use TSV or the pipe character as the delimiter. Even Excel accepts these files if you include "sep=|" as the first line.
  • cyberax 17 minutes ago
    "Sanitize at the boundary"

    Ah, but what _is_ the boundary, asks Transnistria?

  • vasco 10 minutes ago
    The majority of countries' official names are in this format. We just use the short forms. "Republic of ..." is the most common formal country name: https://en.wikipedia.org/wiki/List_of_sovereign_states
  • shalmanese 1 hour ago
    Did you really name your breakaway republic Sealand'); DROP TABLE Countries;--?
  • nivertech 2 days ago
    just use TSV instead of CSV by default
  • inevletter 27 minutes ago
    Huge skill issue. Nothing to see here.