Norway's 2 petabytes of Huawei flash storage and LLM training

(blocksandfiles.com)

266 points | by rbanffy 13 hours ago

30 comments

rldjbpin 0 minutes ago
may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch.
see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to be a Babel fish already.
language is culture, so i can see the motivation behind their initiative. it must be nice to afford to do this yourself.
[1] https://www.sarvam.ai/blogs/sarvam-30b-105b
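A rough illustration of why tokenisation matters more for non-Latin scripts: a byte-level tokenizer starts from UTF-8 bytes, and Devanagari code points take three bytes each where ASCII letters take one. The helper name below is made up for this sketch and is not taken from the Sarvam post; real tokenizers merge bytes into larger units, so this only shows the starting gap.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


english = "hello"
# "namaste" in Devanagari, written with escapes so the source stays ASCII.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

if __name__ == "__main__":
    print(f"English: {utf8_bytes_per_char(english):.1f} bytes per code point")
    print(f"Hindi:   {utf8_bytes_per_char(hindi):.1f} bytes per code point")
```

A vocabulary trained mostly on English text has few merges for such byte sequences, so the same sentence tends to cost more tokens in these scripts.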
TrackerFF 12 hours ago
I'm a Norwegian, and I use the national library almost every day for searching through texts. They have truly one of the best working user interfaces (and functionality) for searching through the massive amounts of text.
- vidarh 11 hours ago
  It's really fantastic. I just wish there were fewer restrictions on the content that is accessible.
  (a lot is only accessible from Norwegian IP addresses, so it's one of the main reasons I maintain a VPN as I'm Norwegian but live in the UK; a second set is only available from the IP addresses of libraries or research institutions - still huge amounts that are generally available, though)
  - TrackerFF 1 hour ago
    My biggest gripe with it is the restrictions, indeed.
    When searching through the closed newspapers, you have to apply for access manually, which gives you 8 hours of access. Great. Only that the access is seemingly manually granted - so if you apply 16:05 on a Friday, chances are you won't get any access until 9-10 the next Monday.
    With that said, I do understand why it is like that. If people could apply via API, and get instant access, they would probably just stop buying newspaper subscriptions.
  - mettamage 3 hours ago
    Silly question but can a non-Norwegian also access it? Willing to pick up some Norwegian along the way ;-)
    - Telaneo 2 hours ago
      If you have access to a Norwegian IP, then yes.
- throwaway85825 9 hours ago
  The lack of a universal search engine is very frustrating. Why can't I search within TV subtitles?
- vintermann 4 hours ago
  Well... You realize how used you are to the basic stemming and spelling flexibility which every search engine has had since Altavista.
KeplerBoy 11 hours ago
How true is this statement: "He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language."
I thought all big players already train on basically everything remotely available to them no matter the language or quality, so his take sounds like an opinion formed in the early days of generally available LLMs.
- internet_points 4 minutes ago
  Maybe it can at least write like a Norwegian instead of just English-translated-into-Norwegian. It would be interesting to see if they try something like the experiments in https://arxiv.org/pdf/2507.22445 on it.
- WatchDog 10 hours ago
  If you want LLMs to have knowledge of the Norwegian language, wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available? Why go to the expense of training your own model, especially when it will be inferior to state of the art models.
  - black_puppydog 9 hours ago
    I task GPT/Claude with researching stuff that pertains to very specific cultural or legal aspects in French politics, on a daily basis. Even though French is a way more common language globally than Norwegian, these models still haven't figured out that, no matter the language I myself speak to them (German or English depending on my mood) their web searches need to be done in French to return reasonable results. I have to remind them every time lest they come back with "uh, didn't find anything relevant, here take some hallucinations instead."
    So, given the anglo-centrism of current models, my confidence in American providers giving any shits about non-american users/use-cases is pretty low. And lower the smaller the language community is.
    - KaiserPro 2 hours ago
      I've noticed that it also imposes american moral judgements on certain things, even though it reasons (sometimes) in the native language.
      I was trying to work out how and when to use swear words, and the relative power index of them. it translated english swear words into the target language then lectured me on not using them.
      It took a bunch of prodding for it to actually think as the target language to then get the (mostly) correct response.
      - social_quotient 43 minutes ago
        Would be curious about the model and the prompt for this.
        Not kidding at all. I had a similar issue with a project where I needed to classify images into specific demographics, and Gemini, while capable, was entirely not going to do the task… until in my JSON response I left room for it to tell me why this was not a good idea and why it was culturally insensitive. Then boom… full JSON array: hair color, eye color, skin color, fitness level, likely ethnicity, likely country of origin, and about 10 other values.
        You’re probably wondering what on earth I was working on. I was matching Ai gen headshots to Ai voices so that in an app the voice picker had human (Ai) faces.
    - RobotToaster 12 minutes ago
      Have you tried asking it to translate the prompt to French, and then feeding it the translated prompt?
    - hombre_fatal 9 hours ago
      Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
      If you’re doing this on a daily basis, then you should have an AGENTS.md that accumulates directional instructions like this.
      This is how you use the tool correctly.
      There’s this weird pattern I’ve noticed where people expect LLMs to require zero effort or proficiency on their part, and when the LLM isn’t perfect without it, of course it wasn’t; LLMs suck.
      - cubefox 3 hours ago
        > Aren’t you already using English in the LLM convo? Telling the model to use French for research or to find resources in French seems like a reasonable step.
        Most ordinary people will just use their native language and they have no way of knowing that the model always reasons in English and therefore is strongly biased toward using English search terms. So they don't know they have to remind the model to search in their local language.
      - coliveira 7 hours ago
        The issue is that French, Italian, African, Japanese people shouldn't have the inconvenience of instructing the LLM tool to get the basic facts about their own culture. They should use an LLM that has already been trained like that by default. Nobody has obligation to use a tool that thinks it is talking to an American. If I go to Google for example I want to get facts about my own country in my own language.
        [-]
        instagraham 7 hours ago
        >Nobody has obligation to use a tool that thinks it is talking to an American
        Very very emphatic agree from my end, thanks.
        TimTheTinker 5 hours ago
        > Nobody has obligation to use a tool that thinks it is talking to an American.
        Then add top-level instructions saying what country you're from, what country you live in now, and which language you speak. This isn't that hard.
        [-]
        schubidubiduba 3 hours ago
        None of that even addresses the problem described, because none of the languages you mentioned would be French in the described example.
        cortesoft 6 hours ago
        Wouldn't those people be asking the questions in their own language in the first place? The model will reply in the language you use. This thread is about people asking for information about a language that is not the one they are messaging the LLM in
        [-]
        numpad0 2 hours ago
        They always sound like an obnoxious American tourist talking through a translator, the chatbot training dataset is the same and foundation models are always built with >50% American English data for some reason.
        schubidubiduba 3 hours ago
        Even if the model will reply in my language, I often notice it searching in english. Or thinking in english. There's always something lost in translation. Sometimes it's just minor nuances. Other times it mangles the legal facts with those of other countries.
        [-]
        Schlagbohrer 1 hour ago
        This sounds like the problem of people calling "911" as the emergency number which they see in so much US-American media but which is not the emergency number in their own country.
        [-]
        skissane 1 hour ago
        I remember being bored as a teenager on a family holiday to New Zealand in the 1990s, so I went and dialled 911 from a payphone to see what would happen. I got a recorded message saying that in New Zealand, the emergency number isn’t 911, it is 111. Dialling 000 (the Australian emergency number) produced a similar recorded message.
    - andai 7 hours ago
      If you ask in French, it searches in French, right?
      I have the opposite problem, where I'll ask in English, about something in a foreign country, the results it finds will all be in that foreign language, and the LLM will switch languages and respond in that language (which I don't speak).
      So then I have to ask it "can you repeat that in English please."
      I keep waiting for the new GPT-Definitely-AGI-For-Real-This-Time to fix it but it's still there.
      - jahller 2 hours ago
        > If you ask in French, it searches in French, right?
        not necessarily. i often prompt Claude in German and then see the reasoning happening in English. of course it will eventually reply in German, but that does not mean that the tooling in the background was using German.
      - lobochrome 2 hours ago
        Same for me - I mostly ask stuff in English but sometimes add specific terms or names in Japanese as needed. My Japanese is intermediate, but it will often switch immediately and reply only and entirely in Japanese. I'm pretty sure they have a system prompt with hairline triggers for foreign languages BECAUSE of the overrepresentation of English in the training corpora.
    - bakugo 1 hour ago
      > their web searches need to be done in French to return reasonable results.
      I wonder how much of this is also just the search engine's region setting.
      It's a big problem I regularly have with Google. I almost always want English language, US-centric results, so I have my region set to the US. But occasionally I want results relevant to my actual country, and even searching in my native language usually yields much worse results than just opening an incognito tab and letting it default to my real location.
  - onion2k 7 minutes ago
    > wouldn't the most obvious thing to do be to build a good training dataset and make the dataset widely available?
    Only if you believe other people will value that enough to expend the effort necessary to use it. If you believe other people will see it as low value and ignore it then you'd be better off doing the training yourself in order to guarantee it happens.
    There's also a secondary benefit that your team doing the work will learn some useful skills while they do it.
  - a2128 9 hours ago
    What incentives does OpenAI have to make sure the AI actually works well with Norwegian beyond capturing a (small) Norwegian market? What incentives do they have to take Norwegian values into consideration, or to preserve Norwegian culture into the future? The matter is also a question of national sovereignty, so to simply release the data and nicely ask foreign companies to solve the problem for you, would be a fool's move
    - SOLAR_FIELDS 6 hours ago
      It's also a bit funny because Norway definitely has enough money to hire a team of Anthropic's best to go out there and train them a model that does whatever they want. They probably have enough money to fund their own Anthropic competitor.
      - schubidubiduba 2 hours ago
        I highly doubt that hiring people who don't even speak the language would result in a better model for Norwegian. If anything, they could pay Anthropic for some tips and tricks for training. But that does not seem necessary as Deepseek & co detail everything for free
      - joe_mamba 3 hours ago
        >They probably have enough money to fund their own Anthropic competitor.
        It's bizarre to me that Norway doesn't have a booming tech sector, with all that wealth fund acting as the biggest VC.
        They instead use their wealth fund to invest in US's tech sector. Baffling.
        [-]
        kalli 2 hours ago
        The point of the fund is to invest outside of Norway so as to avoid the Norwegian economy overheating and increasing inflation
        NorwegianDude 2 hours ago
        Considering the fact that the US is complaining about Norway putting too much money into the US market, imagine what would happen if all that money was spent in Norway. It would be chaos.
        [-]
        joe_mamba 1 hour ago
        > imagine what would happen if all that money was spent in Norway.
        It would create jobs, sovereignty, intellectual property and soft power?
        Instead it goes to strengthening the tech monopoly of a country that threatens to invade your neighbour.
        [-]
        varjag 30 minutes ago
        It was tried in the early 1980s and nearly drove every non-oil-related industry in the country extinct.
        Norway has a manpower bottleneck. The UK had spent its oil windfall domestically and it barely registered. But for a nation of then some 4 million the economy melts down with so much monetary mass.
        schubidubiduba 2 hours ago
        There's only so much you can do with 5 million people. Especially in a field where network effects and scale matter a lot.
        [-]
        joe_mamba 2 hours ago
        Finland has the same population as Norway and way less money, but has 3x the scaleups. Even bigger difference vs the Netherlands.
        Even Norway themselves admit they're the underperformers of the Nordics. https://skywlkr.no/wp-content/uploads/2019/10/TechScaleupNor...
        So blaming population is a cheap excuse that doesn't hold water. Especially since you can always import the skilled people you lack, when you have virtually unlimited money and some of the highest standards of living in the world.
  - gizajob 49 minutes ago
    Because you have so much money you don’t know what to do with it any more.
  - electroglyph 10 hours ago
    absolutely. somebody online was wanting an LLM with Georgian language support, and that's exactly what i suggested: start digitizing Georgian text.
  - blks 2 hours ago
    Because state of the art models are owned and controlled by foreign agents.
  - embedding-shape 10 hours ago
    Yeah, was about to comment that too. Instead of training a new model and new weights exclusively for Norwegian (and expecting/wanting every other small/medium-sized country to do the same), which seems infinitely harder, they could have made high quality transcriptions and translations of the stories currently described only in Norwegian into English, and made it all public. I guess there still would be a worry that it'd be counted as "less important" compared to other history, news and culture about other countries.
    - makeitdouble 6 hours ago
      > high quality transcriptions and translations of the stories currently described only in Norwegian into English
      You make it sound like an easier task than training an LLM. I'd argue it's not obvious, and would assume the contrary.
      - embedding-shape 14 minutes ago
        Yes, why wouldn't it be easier to transcribe and translate, skills humanity had for centuries, compared to LLMs that we've only learnt to build these last few years, and even require a frikken computer to do? Of course one of these is harder than the other...
    - vintermann 4 hours ago
      Copyrights and statutes don't allow them to do that. The mandate of the National Library maybe permits them to make an LLM, though (and I won't be at all surprised if someone sues them anyway).
  - vintermann 5 hours ago
    Permissions, probably. Copyrights and statutes. Knowing the librarians, unfortunately the prestige of their job is more vested in denying you access than giving you access.
    I mean it's their job to give people access to information, and they certainly do, but the mark of a professional, in their eyes, is guarding information. It's much more embarrassing for them professionally to give too much access than too little.
    LLM training gives them a "respectable" way of bypassing that and giving the world their information (which, in fairness, they probably all really want to do if they could).
    - vasco 4 hours ago
      If they wanted to, they all have scanners and access to information on how to create torrents. Setting the information free isn't complicated, so it'd seem most of them do not want to.
      - vintermann 4 hours ago
        Where do you seed a 60 petabyte torrent? I'm sure some choice cuts of what individuals feel is important have made it to Anna's, but I don't think refusal to go on a full data liberation spree is evidence they don't care.
  - _cs2017_ 9 hours ago
    > Why go to the expense...
    Answer: idiocy of decision makers and the desire to get resources by those who created the proposal.
    I assumed Scandinavia has better decision processes but apparently I was wrong.
- vintermann 5 hours ago
  Foreign LLMs are probably not trained on the Norwegian National Library. I regularly find things in there (with regular keyword search, for genealogy) which neither search engines nor language models know.
  Of course I then usually put the information I'm interested in somewhere AI could scrape it. But it would take a long, long time to get everything interesting out of there.
  - intronic 4 hours ago
    Yep in the article it says ..the National Library .. has the single largest digital collection of Norwegian books, newspapers, web pages .. it is entitled to receive copies of every published book and broadcasted content. Its legal deposit mandate in this area extended beyond books, as it was duty-bound to collect and preserve all of Norway’s cultural heritage .. an agreement with Norwegian newspapers permitted LLM training on copyrighted content.
    Husnes said: ”No private company has this.”
    So yeah they seem to have proprietary data...
    - pastage 2 hours ago
      > proprietary data
      It is just copyrighted data, that is harder to get a hold of. All the copies are available to anyone to use if they just read it. Copyright makes other uses complicated. I wonder if the whole Creative commons debate was a mistake, you can never fix copyright in a digital world.
- amarant 9 hours ago
  Not remotely true in my estimation. I don't really speak Norwegian, but I do speak Swedish (which means I mostly understand Norwegian as they're very similar). Every model I've tried speaking Swedish to does it perfectly. I'd be surprised if the same isn't true for Norwegian already.
  - schubidubiduba 2 hours ago
    Of course they speak Swedish. But often, they do not reason in Swedish and do not search in Swedish. Swedish makes up a tiny fraction of training data, while the vast majority is English, from the US. Which means the answers will always have a bias towards US culture, even if you ask in Swedish and the LLM answers in Swedish.
  - NorwegianDude 2 hours ago
    While Google does a good job with language support in their models, GPT-5.5 can't write proper Norwegian. It's even making up words that do not exist.
  - varjag 18 minutes ago
    Not really. For instance Facebook speech recognition models had Swedish support but no Norwegian.
  - vintermann 4 hours ago
    Does that include local distilled models? Because it didn't last time I checked for Norwegian.
  - mistrial9 6 hours ago
    different models have been very different in this way.. almost ten years ago the French made a very large effort to capture languages.. the release notes I read at the time IIRC had quite a few languages from South Asia / India, and in Africa. The language that was prominently missing was German IIRC. I cannot say for the 2025-2026 models since so much has happened.. but models are not equal.
- amelius 1 hour ago
  It's probably just an excuse to play with LLMs using big government funding :)
- orbital-decay 9 hours ago
  Current-best models are pretty fluent at major languages and cultures, so it's untrue at least for the "any" qualifier. Performance is barely affected or might be even better sometimes. However English patterns can subtly leak into native patterns of other languages. It's obviously very different for low-resource languages, but to improve them you need more data, not a new model.
  - Barrin92 8 hours ago
    >Current-best models are pretty fluent at major languages and cultures
    strong disagree on that one. As a German interacting with ChatGPT, even in German it gives me the feeling of talking to the Pluribus people, which reminds me of an anecdote of Walmart failing in Germany because people were freaked out by the constantly upbeat, smiling employees.
    Understanding a culture is a very different task than translating the syntax of a text, and these systems might be capable of syntactic fluency but they do not really understand culture. You have to metaphorically abuse these models until they stop sounding like the crossover of a HR department person and a Mormon missionary
    - varjag 13 minutes ago
      Set the personality to 'Robot', it makes the interactions so much more tolerable.
    - bblb 5 hours ago
      I'm Finnish and dear god I hate the default overtly friendly tones of LLMs. Always the first thing to tune in system prompt.
      You're a machine, stop anthropomorphizing yourself and pretending to be my best friend, and just give me the damn answer and nothing else. :D
      - ampersandwhich 2 hours ago
        I fully agree. I'm Swedish and have recently used GPT to help me draft some cover letters in Swedish. Even with all the mandatory personality tweaks and prompting, it always seems to default to highly florid and self-congratulatory Americanisms if I'm not careful. It's very subtle.
        I do understand where proponents of language equivalency are coming from. LLMs seem to be extremely good at answering simple, one-shot type questions and mechanical 'low-level' translations for most languages. I feel like as soon as you introduce complex chains of thought or multi-step cross-linguistic tasks, minor imperfections stack and become magnified, just as with coding tasks or context rot.
- alliao 8 hours ago
  yeah and alignment is all about how to be less evil which is no easy job... I can just imagine a Chinese LLM rendering 1989 Tiananmen Square as an incident orchestrated by the CIA which the CCP successfully thwarted etc etc
- intended 4 hours ago
  Quite true?
  English is ludicrously overabundant in training when compared to any other language.
  - KeplerBoy 3 hours ago
    And that's probably necessary if you want a competent model. There simply isn't much Norwegian literature on, let's say, banana farming.
- DiogenesKynikos 5 hours ago
  As the article explains, Norway's National Library has a database of practically everything published and broadcast in Norwegian going back many decades. From the way the dataset is described in the article, it does not sound like OpenAI et al. would have easy access to it in its entirety.
solenoid0937 12 hours ago
> The Olivia system is an HPE Cray Supercomputing EX system, with 448 GPUs and 64,512 CPU cores.
Training a sovereign LLM with this meager hardware as opposed to a LORA on some open source model seems like a huge mistake and a potential red flag.
There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
Which begs the question, whose money are they wasting - and why?
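For readers unfamiliar with the LoRA option mentioned here, a minimal sketch of the parameter arithmetic: LoRA freezes a weight matrix of shape (d_out, d_in) and trains only a low-rank update B @ A, with B of shape (d_out, r) and A of shape (r, d_in). The layer size and rank below are illustrative assumptions, not figures from the article or from any specific model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
D_OUT, D_IN, RANK = 4096, 4096, 8

if __name__ == "__main__":
    full = full_params(D_OUT, D_IN)
    lora = lora_params(D_OUT, D_IN, RANK)
    print(f"full fine-tune: {full:,} params per matrix")
    print(f"LoRA (r={RANK}): {lora:,} params per matrix ({lora / full:.2%})")
```

The small trainable fraction is what makes LoRA cheap on modest hardware; whether a low-rank update can absorb a whole new language is the open question raised in the replies.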
- vslira 12 hours ago
  It may not be useful to anyone outside, but it's possible that one of the goals is institutional learning (that is, embedding the knowledge in how to build LLMs in an organization).
  Even though it's nominally the national library behind this, they were probably chosen (as per the article) because they legally own and can use all NO material for this end. I'd guess researchers from related entities like unis will be involved in the process.
- speedgoose 12 hours ago
  They successfully have made PoC finetunes before, so the next step is training fully fledged LLMs.
  I don’t think they aim for anything worthwhile. The finetunes were incredibly broken. I’m guessing it’s more about having the method to do it. I’m not convinced it’s super useful but I’m not one to decide who gets to do what with the research funds.
  One finetune I tried did make fun of humans expressing their feelings in the chat. Often.
  One other finetune did hallucinate that it was a doctor and my baby had terrible diseases, every time I just wrote "hei" (with a generic neutral system prompt that likely triggered this behaviour though).
  I think Olivia is big enough for what it’s used for. In my opinion it’s better to stay up to date and not waste too much money on hardware at the moment.
  - Schlagbohrer 1 hour ago
    The article's slides mention how much of an engineering challenge it is just for them to clean their data and create new hardware and software flows to use the data for training. So perhaps it is a big learning exercise to build up institutional / national knowledge of LLM creation.
- manquer 11 hours ago
  > this meager hardware
  > they wasting - and why?
  i18n language models are not an area frontier labs are focusing a ton of resources on (certainly not in Norwegian).
  The corpus of content in Norwegian may not require very large clusters, and even if it does, this is the best the library could do; it is certainly more than anyone else is investing in Norwegian models.
  SOTA models do not have access to the quality of content that the national library does. The article mentions licensing with newspapers specifically, and the library has access to its own content archive.
  English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
  I am curious if there is published research on how well localization works with LoRA depending on how far off the target language grammar/vocabulary is from English.
  Projects like this typically have more than one objective: not only to build a SOTA model, but also to build and train foundational local talent, similar to universities launching satellites.
  - vidarh 11 hours ago
    > English and Norwegian are not closely related language families, perhaps LoRA is not best approach?
    Yes, they are. English is a West Germanic language. Norwegian is a North Germanic language. The French vocabulary in English obscures it a bit, but the two languages have similar grammar and the vocabulary has a huge number of close cognates.
    E.g. day -> dag, ship -> skip, apple -> eple, cow -> ku (which makes more sense when you pronounce them correctly out loud), bairn (child; mostly Scotland and Northern England) -> barn, hop -> hopp, yule -> jul just to give a random selection of English Germanic words.
    But more than that, the frontier models both a) know Norwegian quite well and b) certainly know German and Dutch well, and there's a continuum of language transfer around the North Sea, especially when accounting for sounds rather than modern orthography (e.g. to take a couple of examples from above: ship -> schip -> Schiff -> skib -> skip; day -> dag -> Tag -> dag). The "jump" to Dutch already weeds out most of the French. A lot of modern Norwegian orthography comes from Danish, which again shares more than modern Norwegian does with German.
    Knowing any of these helps a lot with learning Norwegian and vice versa. E.g. I'm Norwegian, I've never learnt Dutch, but I have learnt English and German, and I can read Dutch fairly well from that alone.
    - everforward 10 hours ago
      This makes me deeply curious about how LLMs understand language. Do LLMs relate cognates more than words that are dissimilar in different languages? I wonder if that plays some role in the effectiveness of tokenization.
      - vidarh 10 hours ago
        I have no idea if the similar spelling will somehow help - I used that mostly because it's a simple way of illustrating the close relationship, but I suspect you'd find that the meanings of closely related words are likely to more directly overlap.
        The grammar is perhaps more likely to help. Similar word order etc. Even weirdness like German - my only top grade on a German essay in school was one where I on purpose ignored what I thought I knew about German and tried to evoke "old fashioned" Norwegian. The result was guessing at a bunch of grammatical structures that I didn't know were valid German. Turned out I was right about most of it - century old Norwegian was far closer to century old Danish, which was a lot closer to valid German, and enough so to impress my teacher enough to overlook a number of orthographic mistakes.
        [-]
        DiogenesKynikos 4 hours ago
        The same thing works for guessing German grammar from English. The farther back you go in English, the more its grammar resembles German.
        "What sayest thou?" -> "Was sagst du?"
        In fact, for the above, you don't even have to know a single German word. You just have to know that for question words, "wh" -> "w", that the English "y" at the end of a syllable usually comes from an older Germanic "g" sound, and that "th" was replaced by "d" in German. That gets you 90% of the way from early modern English to modern German in the above example.
- hedgehog 5 hours ago
  That's enough resources to build on something like the Olmo 3 recipe but with a mix prioritizing their own data and post-training for their own tasks. If they build their own embedding model, index everything in the library, and train their model to query that data while answering historical, cultural, legal, and strategic questions from their perspective... Pretty interesting and likely useful. They won't beat Anthropic at dumping out React code but also there's no real reason to duplicate that.
- KaiserPro 2 hours ago
  > There is no way these people have the resources to train a fully fledged LLM, so claiming that is their goal makes me think they don't intend for the LLM to be useful.
  Depends on what they are doing and why. but at most big labs, only the final model training happens on the big clusters. a lot of experimentation happens on <500 gpus per dev.
  So for fast iteration, this seems fine.
  - Schlagbohrer 1 hour ago
    This is the use case for the small NVIDIA boxes that a researcher can have on their desk for $5k and do useful experiments before spending all the grant money on a huge training run for the final product.
- gerdesj 10 hours ago
  "Training a sovereign LLM with this meager hardware"
  Norway has a sovereign fund worth O[MS|Apple|etc] except it is largely in readies and not pixie dust.
  Whilst the UK frittered away North Sea oil profits, Norge squirreled them away instead.
  So, if the grand dream of LLMs and AI does actually come to some sort of fruition and not simply another case of the Emperor's New Clothes combined with some lovely tulips and a dotcom boom and bust, then Norge can simply stuff shit loads of cash into buying whatever they need. Cash is king after all.
  The beast they have described here is just a library system. I think I'd like my country's (UK) library system to have resources like that.
  I don't think you are asking the right question: When you say "meager", I see "rather impressive PoC from a well resourced organisation"
  You say tomato ...
  [-]
  - phatfish 9 hours ago
    The reason they have the largest sovereign wealth fund (aside from getting it right in the 80s, unlike the UK), is that there is quite a bit of regulation around where and how the money is invested.
    It is run to maximise growth for example, so even though Norway is way ahead with electric car usage and infrastructure (presumably because they have a climate likely to be most affected by global warming/heating) their fund still invests in fossil fuels as they are a profit/growth opportunity.
    Anyway, i don't think it's as easy as "simply stuff shit loads of cash into buying whatever they need". I believe there would be a serious political discussion needed for that to happen.
- gunalx 12 hours ago
  The largest problem is available training data actually.
  They have already done experiments with different sub-10B models, both fine-tuned and trained fully from scratch. And last I checked, the fully-from-scratch models captured the language in a better way.
- sgt 12 hours ago
  That's what they have access to right now. I am sure that will change in the future as the project progresses.
  What do you suggest, that they stop and wait until they have the right HW?
  [-]
  - NonHyloMorph 11 hours ago
    Also, it's Norway...
    "Norway's sovereign wealth fund, officially known as the Government Pension Fund Global, is the world's largest sovereign wealth fund with assets exceeding $2 trillion. Established in 1990 and managed by Norges Bank Investment Management, it was created to channel surplus petroleum revenues into long-term global investments to benefit future generations."
- kristjansson 12 hours ago
  DeepSeek claims to have trained on something like 2k H800, this is ~0.5k GH200 … it’s not nothing. Sure they’re not going to _serve_ it at scale, but that’s not the point?
  Also the line between “finetuning a base model” and “man this is a real good initialization” gets pretty blurry at scale.
  Altogether a pretty presumptuous take.
- oblio 10 hours ago
  > Which begs the question, whose money are they wasting - and why?
  Norway is better run as a country than 99% of the countries on the planet, including the one that invented current LLM tech, so I'd give them the benefit of the doubt.
- otabdeveloper4 12 hours ago
  > meager hardware
  Qwen was made on a cluster about that size.
  And this is before anybody ever thought about optimizing the training process. (Currently it's just pytorch analyst-as-coder slop, with extremely overprovisioned quantizations, etc.)
timmg 12 hours ago
I wonder if instead (or in parallel), Norway should build a set of training data and share it (for free) with all the model builders.
Seems like making the frontier models know Norwegian and their culture is a better (or additional!) way to reach the end they are going for here.
[-]
- vidarh 11 hours ago
  The frontier models know Norwegian just fine. They can also adapt to Norwegian dialects, and even ape old Norwegian fairly well.
  E.g. I had Claude describe the novel "De knyttede næver" from 1911 in Norwegian orthography ca. 1911, as it's a novel I've read, and it does a good job.
  What it lacks is an understanding of Norwegian literature, culture and history. It had to look up "De knyttede næver", which was one of the best-selling Norwegian novels around the time it was published before I'd get anything out of it (ChatGPT does better; in thinking mode in particular it gives a detailed summary).
  While not exactly well known today, the author was a prominent newspaper journalist for decades, and the novel series is well enough known that e.g. there's a Norwegian singer that took his stage name after the protagonist, and it was covered in Norwegian papers and books for decades (partly because of controversy over the author's political views and how they coloured his novels), so it does feel like a reasonable test that reveals a quite significant knowledge gap.
  I do agree with you that it'd be better if the data set from the national library was made more accessible, though it seems a major addition here is that they have a deal to train on copyrighted data locked away in their archives that they have limitations on the use of.
  But even just making the out-of-copyright data in their collections available would be a great start.
  [-]
  - e12e 10 hours ago
    Odd, I'd imagine Wikisource (in many/all languages) would be part of training data for all LLMs with SOTA ambition?
    https://no.wikisource.org/wiki/De_knyttede_n%C3%A6ver
    [-]
    - vidarh 10 hours ago
      You'd think so. It seems like there are a lot of odd gaps like that.
      I also have a favourite English language PhD thesis I ask every new model about that they still struggle to find even though there's a Wikipedia article about it that links a blog post I wrote about it.
      Anyone who thinks they've exhausted even publicly crawlable resources should ask them about some obscure stuff.
      [-]
      - thatcat 6 hours ago
        the models don't retain their full training data set
      - mistrial9 6 hours ago
        you might be surprised if you take this approach: give key words and phrases in small amounts, each sentence of a prompt building on a previous sentence. Take an example that is not very hard, like the original text of Lewis Carroll's Alice in Wonderland. Although a quick question might get things sort of wrong, or miss details, if you guide the LLM to a certain part of the story, then a certain set of characters in that part of the story, then a certain statement or dramatic moment with those characters in that part of the story, you might get very specific detail that is close to line-by-line accurate. On the other hand, if you ask a quick, ordinary question about the same part of the story without supplying context and character names, you get something equally vague. YMMV
- calgoo 1 hour ago
  Why should they share all this data with the greedy American corporations that are stealing everyone's data for their own profit? Much better to keep the legal agreement with the national institutions and possibly develop something actually useful to their own country.
  [-]
  - konschubert 21 minutes ago
    You are contradicting yourself. If you're hoarding the data for yourself you're not going to develop something useful. Sharing the data means that it will be integrated into the big LLMs, which will be useful "for their own country".
seanvk 3 hours ago
The Welsh language getting LLM training with Nemotron
https://www.bangor.ac.uk/news/2025-09-15-reaching-across-the...
rafram 6 hours ago
> Marius Husnes, the Head of IT Platform at the library (Nasjonalbiblioteket) discussed the project at Huawei’s ID Forum 2026 in Paris, saying that no commercial LLM provider was developing a local (Norwegian) language LLM. He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.
I am not overly confident that Marius Husnes knows what he’s talking about here.
[-]
- fnordpiglet 4 hours ago
  He’s right though, although it’s not entirely about the training corpus. It’s about the tokenizer that tokenizes substrings more efficiently based on a necessary bias towards a target language. English oriented LLMs are more powerful for English than other languages because the token space is more parsimonious in English language. Try any online Anthropic tokenizer that calls their api with common English words (typically one or fewer tokens) and Norwegian words - you’ll often see 2-4 tokens instead sometimes more. Some languages like Thai are at a huge disadvantage. Likewise often the corpus selection also is heavily skewed towards the target language simply because more energy is applied to sourcing written works in that language. There will also be semantic biases in the vector space due to cross influence between semantically similar embeddings between languages that create a different than cultural baseline. Finally fine tuning greatly impacts cultural expression in the LLM. None of these are trivial effects.
  There are a lot of efforts to create LLMs for dying languages and others that use cross cultural models to boost, but if your language is well literate, there’s a good reason to build a heritage LLM specific to your language and culture. Expecting OpenAI or Anthropic to prioritize your language over their target audience when a tradeoff is to be made is absurd.
  [-]
  - YetAnotherNick 3 hours ago
    Did you even try to verify your claims? I tested it on a few translations of Wikipedia articles using [1] and it takes 15-20% more tokens for Norwegian.
    English performs the best because there is more data in English and high quality sources are either only in English or there is a good translation in English.
    [1]: https://platform.openai.com/tokenizer
- chvid 5 hours ago
  When I am chatting with ChatGPT - it is fairly obvious that it is American - its native language, its style, its attitude is American - even if we chat in Danish.
  Just as we cannot rely on Netflix and HBO to produce Scandinavian TV-shows even though they might do at the moment, we need to make our own stuff in this area too.
  And over time, the technology to do this will become cheap and readily available for us to do so.
  [-]
  - anal_reactor 2 hours ago
    > And over time, the technology to do this will become cheap and readily available for us to do so.
    But then the English models will be even better and you'll be back to square one. My guess is that things are going to become more and more American. If you assume that "culture" is a resource like "microchips", then from economic point of view it makes sense to have one country specialize in producing it, and the rest just consume. This is why when you turn on the main radio station of a random country, you're so likely to hit American music.
    [-]
    - ikr678 2 hours ago
      'Only one country should export culture, for economic efficiency' is the kind of take that the Norwegians (and everyone else) would like to protect themselves from.
- isawczuk 4 hours ago
  Poland has its own LLM called Bielik. It's not only better at preserving Polish-sounding wording, it's also better at writing government documents. Why better? They ran an arena and statistically it's just better.
- KaiserPro 2 hours ago
  could you provide evidence to suggest he is wrong?
  It seems like you've made an assertion but not provided evidence. Why is it not a disadvantage to only have english LLMs?
  Can you get the nuance of Norwegian history/culture with present models?
- spiderfarmer 6 hours ago
  It sounds plausible enough to get subsidies.
- maxloh 3 hours ago
  [dead]
- idiotsecant 5 hours ago
  You're making the mistake of thinking whether he knows what he is talking about matters. He is brewing a potion. Its ingredients are a trendy term, a vaguely spooky threat and a clear, overly simplistic solution that of course he will graciously assume control of, for the good of the motherland.
  This potion is potent and you'd think it would stop working from frequent misuse but you'd be wrong!
  [-]
  - vintermann 5 hours ago
    He won't have control over it.
dmos62 1 hour ago
Huawei? You'd think that the recent European revulsion from using overseas providers would have reached Norway's public sector too.
yokoprime 3 hours ago
The wording in this article is a bit strange, why the extreme focus on the brand of storage media? Also, the term LLM seems to be used in a very broad way here, are they actually building a language model from scratch, or are they fine-tuning?
postepowanieadm 2 hours ago
Norway isn't in the EU (no restrictions on Huawei) and has cheap electricity, could become an ai powerhouse.
Levitz 12 hours ago
>As Husnes put it; Norway is a small country solving a problem every non-English-speaking nation will face: how do you build AI that reflects your language, your culture and your history? AI needs custodians, not just builders.
I'm afraid the answer is, mostly you don't.
Such a thing requires strong political will that, at least in my environment, seems basically impossible to align.
The costs are prohibitive, but beyond that, the type of person who cares about local representation like that is either completely fine with letting foreign companies implement it (after all, you can use ChatGPT in Basque if you want to) or is against the idea of AI altogether.
[-]
- ttkari 11 hours ago
  I guess it's subject to debate whether the cost indeed is prohibitive in the case of Norway. They are a small but extremely wealthy country - after all, they currently hold the equivalent of 1,5% of all the listed companies globally through the investments of their sovereign wealth fund.
- WarmWash 12 hours ago
  I'm sure if Norway approached the American labs with the goal of making curated datasets for training, they would absolutely get in the training door, and those models would likely run circles around anything that could be domestically done.
  That being said though, I can feel you cringing through the screen.
  [-]
  - Levitz 9 hours ago
    >That being said though, I can feel you cringing through the screen.
    Then I failed to express myself in writing. I'm definitely a fan of this kind of initiative and am not happy with the type of viability I think they have.
    I might very well be projecting a whole lot of local dynamics of national identity, politics and culture though.
Schlagbohrer 1 hour ago
Sapir-Whorf hypothesis but for AI
petterroea 2 hours ago
As a Norwegian I have never needed a Norwegian language model. Doing most things in Norwegian puts you at a disadvantage internationally anyways. Maybe this has value in schools, but wouldn't it just give kids more trust in relying on LLMs? My friends who work in education report that group work has become insufferable because many do not think critically and ask an LLM to verify everything. I really don't see a benefit, but maybe they will find one - that is what research is for.
I am reminded that we recently concluded that our experiment of forcing things to be digital in schools was a flop. These things have a cost if we are wrong.
dalemhurley 12 hours ago
How about that, they actually asked for permission to use data and the companies said yes.
[-]
- vintermann 3 hours ago
  I think you're required by law to let the National Library have copies of books/newspapers you publish beyond a certain scale.
arjie 12 hours ago
This can’t be right. 2 PB of flash is like $200k. It’s within reach of many individuals. Then again I guess you don’t need that much storage so maybe it is.
[-]
- devttyeu 12 hours ago
  More like $1M at current prices at this scale / level of performance.
  If you go with HDD arrays probably $50k
  [-]
  - arjie 12 hours ago
    Boy pricing is pretty nuts these days. I have half a petabyte in Seagate enterprise drives myself and I didn’t pay anything close to that to acquire it. Such a pity about the flash storage. 2 years ago we built 200 TiB or something of flash using Samsung PM1633 or something and it was a fraction of the cost per gigabyte that $1m would imply.
    [-]
    - rcbdev 3 hours ago
      We're in the boom phase of the cycle. The bust on these chips always comes.
- tjwebbnorfolk 9 hours ago
  Also my first thought: "Is that... a lot?"
  You can put 6PB (244TB * 24) into a single box these days.
- metadat 11 hours ago
  Your numbers are a little off but the point remains- 2PB is nothing, not newsworthy imo. What’s special about this?
  [-]
  - vidarh 11 hours ago
    What's special about it is not the flash but training an LLM based on the content, much of which is still in copyright and which the library has restrictions on how they are allowed to use (irrespective of the legal position of training on it) and which required an agreement with the copyright holders.
6510 3 hours ago
What is called culture here will increasingly be propaganda. It reminds me of people cheering twitter as a replacement of RSS or using facebook to communicate with your customers rather than email. You won't know which will be the winning company, don't know who might control it in the future and we can't predict what it will cost. It doesn't take much to be very annoying.
Den_VR 12 hours ago
> He asserted that any country with its own language that did not have a sovereign LLM trained in that language was at a disadvantage as a globally trained, English-speaking LLM would not know about that country’s history, news and culture that was described in the local language.
I don’t know this is true. But whatever sounds true enough and gets funding seems to be what flies these days.
[-]
- redanddead 12 hours ago
  They made the cultural case, you have no idea how strong this is in places like quebec, nordics, france, russia etc
  [-]
  - sgt 12 hours ago
    Can confirm that. Norway may have a small population, but if you live there you'll think it's truly the center of the world (aside from the US. Norwegians love America)
    [-]
    - elygre 5 hours ago
      Love America? Yes, we did.
      [-]
      - Epa095 4 hours ago
        It is, after all, a God which turned out not to be a God, but just America.
DeathArrow 3 hours ago
I thought US has already coerced most countries to not buy hardware from Huawei.
At least in my country, Chinese companies have been barred from official tenders and procurement.
kvam 12 hours ago
As a Norwegian this sounds like a mistake. Who will use this LLM? Where? For what? The underlying data could be made more easily searchable and digestible for agents in general if the goal is better knowledge of Norwegian culture.
[-]
- vidarh 11 hours ago
  I agree in principle.
  That said, they are quite limited in what they are allowed to share of in-copyright works, and nb.no is a fantastic resource as it is (though you'll need a Norwegian IP address for too much of it - it's one of the main reasons I maintain a VPN) - if they are allowed to make it accessible there, it'd be great.
  But they also have vast amounts of out-of-copyright data that I hope they'd make more easily accessible...
- dalemhurley 12 hours ago
  Hard disagree. This is the first step not the last and proves to other countries that this can be done.
  [-]
  - xadhominemx 6 hours ago
    This model is going to start miles behind the frontier and the gap will only grow.
    [-]
    - weregiraffe 4 hours ago
      Why would the gap grow? There is no more training data to acquire, frontier models are training on the entire internet. Everything from now on is just fine-tuning.
      [-]
      - kvam 2 hours ago
        Your statement assumes training data is the only thing that matters for the big players, while not considering it limiting for the small Norwegian model. That’s a fallacy.
        [-]
        - weregiraffe 2 hours ago
          Nowhere in the article does it say the Norwegian LLM will train _only_ on Norwegian data.
- spwa4 12 hours ago
  Exactly, if there's one thing transformers are good at it's translation. One I've found particularly nice: any question ChatGPT can answer in English it can answer in French. I'm assuming Norwegian too. So there's no point.
  [-]
  - sgt 12 hours ago
    There's quite a bit more to culture and language than just being able to have transformers come up with believable language and/or dialect.
  - sisve 12 hours ago
    The point is that Norway will have its own LLM and will not depend on another state or private company. The goal is not to be the best model, but to have a model that includes more Norwegian data than other LLMs and that is not skewed against other sources.
    [-]
    - kvam 2 hours ago
      But what does that give you? If the model is far less capable? What will it do for you with that Norwegian data, that a better model could not do with better search or context?
  - dalemhurley 11 hours ago
    Yes transformers are great at translation as that is their purpose.
    LLMs are not great at preserving cultural uniqueness and diversity. Take how “delve” has reentered the lexicon because the dialect of English spoken by the human assessors used in training uses “delve” a lot.
    There are a lot of benefits to training specifically for a unique culture with unique norms, to preserve the culture as we increasingly rely on LLMs.
    https://www.scientificamerican.com/article/chatgpt-is-changi...
  - vintermann 3 hours ago
    Definitively not the case in distilled models.
  - dzhiurgis 11 hours ago
    Model can speak Lithuanian too, but with a Russian accent which is a big taboo for us.
  - otabdeveloper4 12 hours ago
    They're only good at it because they were trained on massive amounts of English and French data.
    [-]
    - vidarh 11 hours ago
      Not really true.
      Both Claude and ChatGPT can translate into minor dialects of Norwegian they will have seen very few works in because very few printed works exist in them.
      E.g. I've tested both my local spoken dialect, which is rarely written, and a sociolect used by a 1970's Maoist group consisting of a few hundred people, where most of the printed material consists of novels from a couple of ex-members that became authors.
      In the latter case, it claimed to not know, but was able to get a good match from just a description.
      I also just had it ape Norwegian orthography from the 1910's by having it look up the rules and translate a text it had first translated from English to modern Norwegian, and it did just fine.
      They will have seen some work in these dialects, but mostly knowledge of related languages transfers really well (English, Dutch, German, Swedish, Danish, roughly form a continuum from least in common to most in common with modern Norwegian; they all share vocabulary and significant parts of grammar with Norwegian), and then a relatively limited exposure to Norwegian itself is sufficient to do fairly well.
      They're also really good at "style transfer" of text in the form of tweaking orthography, word order, and minor grammar changes from descriptions and examples.
      (incidentally, the latter is one way of getting an LLM to sound a lot less like an LLM)
      [-]
      - otabdeveloper4 4 hours ago
        This is all true, but I assumed the original posters were talking about cultural knowledge, not linguistic correspondences.
        To do translation well you still need cultural knowledge. (E.g. the particular modes of specific kinds of legalese, or slang and the nuances of social class, etc)
        [-]
        - vintermann 3 hours ago
          I think it's not that this knowledge isn't present in the model somewhere, but probably more that it gets killed by instruction tuning for US corporate values.
ipsum2 12 hours ago
This is how much storage the average r/datahoarder user has in their basement. Fewer than 100 hard drives.
[-]
- arjie 12 hours ago
  But not in flash. I have an appreciable fraction of that but in spinning rust.
kreyenborgi 12 hours ago
Ad for Huawei?
dzhiurgis 11 hours ago
That's about 350MB per capita. Humans can produce 2-6kb per hour. That's 13 years of non-stop typing. Wonder where it all comes from. I guess it's websites that aren't compressed / extracted.
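A quick sanity check of the figures above, as a rough sketch only. The population of about 5.6 million and the use of decimal units (1 PB = 10^15 bytes, 1 kB = 1000 bytes) are assumptions, not something stated in the comment; the function names are just illustrative.

```python
# Back-of-the-envelope check of the per-capita storage estimate.
# Assumptions (not from the thread): population of ~5.6 million, decimal units,
# and "non-stop typing" meaning 24 hours a day, 365 days a year.

PB = 10**15
MB = 10**6
KB = 10**3
HOURS_PER_YEAR = 24 * 365


def per_capita_bytes(total_bytes: float, population: int) -> float:
    """Storage share per person, in bytes."""
    return total_bytes / population


def typing_years(target_bytes: float, bytes_per_hour: float) -> float:
    """Years of continuous typing needed to produce target_bytes."""
    return target_bytes / bytes_per_hour / HOURS_PER_YEAR


if __name__ == "__main__":
    share = per_capita_bytes(2 * PB, 5_600_000)
    print(f"{share / MB:.0f} MB per capita")
    for rate_kb in (2, 3, 6):
        years = typing_years(350 * MB, rate_kb * KB)
        print(f"{rate_kb} kB/hour -> {years:.1f} years of non-stop typing")
```

Under these assumptions the 13-year figure corresponds to roughly 3 kB per hour; the quoted 2-6 kB range spans roughly 7 to 20 years.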
[-]
- vidarh 11 hours ago
  It's a legal deposit library, same as e.g. Library of Congress. Which means almost every published book, magazine, and newspaper and many other works published in Norway, as well as large collections of Norwegian works published abroad (such as thousands of Norwegian-language newspapers published by the Norwegian immigrant communities in the US) for many decades and a large proportion of the same from the last 200+ years are stored there.
  They do also crawl websites (or at least did) in the .no tld.
jauntywundrkind 12 hours ago
384 core cpu cluster? 2 petabytes?
Dell just launched a 2U that fits almost 10 petabytes in it. It's probably not 384 core capable but that is very doable right now, Epyc chips are 192 cores each! https://www.techradar.com/pro/dell-launches-record-shatterin...
[-]
- 100ms 12 hours ago
  5x 400gbit running to a 2U box whoa, the PCI lanes must have heat shielding.
  More seriously there is a sensibility limit on extreme density where it's not needed. The idea that you're just going to magically get 2 TBit/s out of those ports seems unlikely even with tweaked software, and you're stuck with a power and comms hotspot that's liable to dictate the remainder of your network design.
  At max utilisation that 2U would take 12 hours to drain, and only 12 hours assuming peak and likely unachievable throughput and the box otherwise being completely out of service. Not a great start
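  The drain-time estimate can be reproduced with a few lines, as a sketch under stated assumptions: decimal units, a capacity of 10 PB (the "almost 10 petabytes" from the parent comment, rounded up), and all five 400 Gbit/s ports saturated with no protocol overhead. The helper name is hypothetical.

```python
# Rough drain-time estimate for a dense storage box read out over its network ports.
# Assumptions (not from the thread): decimal units, 10 PB capacity,
# five 400 Gbit/s ports all running at full line rate, no protocol overhead.

PB = 10**15
GBIT = 10**9


def drain_hours(capacity_bytes: float, link_bits_per_s: float) -> float:
    """Hours needed to read the whole capacity at a sustained link rate."""
    seconds = capacity_bytes * 8 / link_bits_per_s
    return seconds / 3600


if __name__ == "__main__":
    total_rate = 5 * 400 * GBIT  # 2 Tbit/s aggregate
    print(f"{drain_hours(10 * PB, total_rate):.1f} hours to drain")
```

  With these inputs the result is a little over 11 hours, in line with the roughly 12 hours quoted; any real-world throughput below line rate only makes it longer.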
- abujazar 11 hours ago
  That's the in-house preprocessing hardware, not what they're training on.
  [-]
  - jauntywundrkind 11 hours ago
    Yes!
    It's still a weird article, to highlight a "big" storage appliance. Having all that NVMe local feels like it would be much much much much faster.
7e 13 hours ago
2 PB? They will not come close to training on that amount. Maybe years from now.
[-]
- sgt 12 hours ago
  Think they will not train on the full 2 PB but use that as the data lake to start and then apply a more targeted approach.
  [-]
  - winddude 12 hours ago
    if you read the article 2pb is available as flash storage in the data pipeline, used to dedupe, clean, normalize, etc, for training from 60pb of raw data.
- Den_VR 12 hours ago
  Could probably LoRA with that
- huflungdung 12 hours ago
  [dead]
yanhangyhy 7 hours ago
so now Huawei is not a threat to 'democracy' anymore?
[-]
- dopa42365 3 hours ago
  whenever Huawei want to buy billions of dollars worth of US licenses and stuff, they stop being a "national security threat" for a while because reasons
dakolli 10 hours ago
Even entire governments are captured by a mild LLM psychosis. Which is sad in the case of Norway. I lived in Norway for two years and always found their government to be highly rational, this is not a rational use of public funds (but I suppose they have plenty of capital).
Western society is completely captured by this form of psychosis and it's going to bite us in the a* very soon.
I firmly believe all the Boomer leaders throughout the world are being sold a bag of lies by technocrats that "AI", specifically LLMs, are going to cure disease and death and therefore they are willing to hand over all control to the technocrats. Fckin croakers at it again.
[-]
- NonHyloMorph 18 minutes ago
  I think it is highly rational. You see it from the wrong point of view. It seems to be less a short-term utilitarian project or economic endeavour than a cultural one. Think of it more in terms of applied humanities: which languages go extinct, which cultures disappear and are superseded by a monocultural globalist hegemony.
hottrends 10 hours ago
[flagged]
huss-mo 11 hours ago
[flagged]
hank808 11 hours ago
Ehhh. None of this sounds right. Translation problems maybe. Lack of technical detail understanding maybe... I don't know. Probably not news.