Interesting exchange on the use of AI coding tools (https://x.com/karpathy/status/1977758204139331904):
Curious, how much of the code did you write by hand?
Karpathy: Good question, it's basically entirely hand-written (with tab autocomplete). I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution.
people say this like it's a criticism, but damn is it ever nice to start writing a simple crud form and just have copilot autocomplete the whole thing for me.
Yep. I find the hype around AI to be wildly overblown, but that doesn’t mean that what it can do right now isn’t interesting & useful.
If you told me a decade ago that I could have a fuzzy search engine on my desktop that I could use to vaguely describe some program that I needed & it would go out into the universe of publicly available source code & return something that looks as close to the thing I’ve asked for as it can find then that would have been mindblowing. Suddenly I have (slightly lossy) access to all the code ever written, if I can describe it.
Same for every other field of human endeavour! Who cares if AI can “think” or “do new things”? What it can do is amazing & sometimes extremely powerful. (Sometimes not, but that’s the joy of new technology!)
Why do you think what you describe being excited about does not warrant the current level of AI hype? I agree with your assessment and sometimes I think there is too much cynicism and not enough excitement.
Back in the 90s you could drag and drop a vb6 applet in Microsoft word. Somehow we’ve regressed..
Edit: for the young, wysiwyg (what you see is what you get) was common for all sorts of languages from c++ to Delphi to html. You could draw up anything you wanted. Many had native bindings to data sources of all kinds. My favourite was actually HyperCard because I learned it in grade school.
Wysiwyg kind of fell apart once we had to stop assuming everyone had an 800x600 or 1024x768 screen, because what you saw was no longer what others got.
Not entirely, in these RAD tools you also had flexible layout choices and obviously you could test it for various window sizes (although the maximum was the one supported by your graphics card). Too bad many chose the lazy way and just enforced fixed window size at 800x600.
Most of the internet still assumes you're using a 96 DPI monitor. Though the rise of mobile phones has changed that, it seems like the vast majority of the content consumed on mobile lends itself to being scaled to any DPI, e.g. movies, pictures, YouTube, etc.
Before copilot what I'd do is diagnose and identify the feature that resembles the one that I'm about to build, and then I'd copy the files over before I start tweaking.
Boilerplate generation was never, ever the bottleneck.
It is, because the frontend ecosystem is not just React. There are plenty of projects where LLMs still give weird suggestions just because the app is not written in React.
I agree. I am "writing" simple CRUD apps for my own convenience and entertainment. I can use unfamiliar frameworks and languages for extra fun and education.
First, it's not cynicism but a more realistic approach than just following SV marketing blindly, and second, it's not "everything else", just GenAI, NFTs/ICOs/Web3, "Metaverse" (or Zuck's interpretation of it), self-driving cars ready today, maybe a bit of Theranos.
Go look at the comments on HN whenever someone posts about their AI coding workflow. It will be littered with negative comments that either imply or outright say that the poster is either shilling, ignorant or working only on toy examples.
The grievance attitude seems to exist in both directions and is actually what is exhausting.
> It will be littered with negative comments that either imply or outright say that the poster is either shilling, ignorant or working only on toy examples.
And they would often be right. Coupled with the fact that most of the glowing "omg I only code with AI" posts don't even try to show what code or products they are working on.
And yes, the absolute vast majority of people who are skeptical are skeptical precisely because they use these tools every day themselves.
One of my hobby projects is an esoteric game engine oriented towards expressing simulation mechanics. I simply do not use agentic tools when editing the core code for this project (mostly rust and wgsl). It always stumbles, and leaves code that I need to fix up manually, and even then feel unsure about. I've tried a few different agents, including the current top of the line. The power is just not there yet.
At the same time, these tools have helped me reduce the development time on this project by orders of magnitude. There are two prominent examples.
---
Example 1:
The first relates to internal tooling. I was debugging a gnarly problem in an interpreter. At some point I had written code to do a step-by-step dump of the entire machine state to file (in json) and I was looking through it to figure out what was going wrong.
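(For context, the dump itself was nothing fancy; something in the spirit of this hypothetical sketch, where the VM attributes are made up for illustration:)

    import json

    def dump_trace(vm, out_path="trace.json"):
        # Record the complete machine state at every step so it can be inspected later.
        steps = []
        while not vm.halted:                       # hypothetical VM interface
            steps.append({
                "pc": vm.pc,
                "registers": dict(vm.registers),
                "memory": list(vm.memory),
            })
            vm.step()
        with open(out_path, "w") as f:
            json.dump(steps, f, indent=2)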
In a flash of insight, I asked my AI service (I'll leave names out since I'm not trying to promote one over another) to build a react UI for this information. Over the course of a single day, I (definitely not a frontend dev by history) worked with it to build out a beautiful, functional, easy to use interface for browsing step-data for my VM, with all sorts of creature comforts (like if you hover over a memory cell, and the memory cell's value happens to be a valid address to another memory cell, the target memory cell gets automatically highlighted).
This single tool has reduced my debugging time from hours or days to minutes. I never would have built the tool without AI support, because I'm simply not experienced enough in frontend stuff to build a functional UI quickly.. and this thing built an advanced UI for me based on a conversation. I was truly impressed.
---
Example 2:
As part of verifying correctness for my project, I wanted to generate a set of tests that validated the runtime behaviour. The task here consists of writing a large set of reference programs, and verifying that their behaviour was identical between a reference implementation and the real implementation.
Half decent coverage meant at least a hundred or so tests were required.
Here I was able to use agentic AI to reduce the testcase construction time from a month to about a week. I asked the AI to come up with a coverage plan and write the test case ideas to a markdown file in an organized, categorized way. Then I went through each category in the test case markdown and had the AI generate the test cases and integrate them into the code.
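(Roughly, the harness boils down to something like the following sketch; the function names and the JSON program format are placeholders, not the actual project code:)

    import json
    import pathlib

    def check_case(path, run_reference, run_real):
        # Run one reference program through both implementations and compare results.
        program = json.loads(path.read_text())
        expected = run_reference(program)
        actual = run_real(program)
        assert actual == expected, f"{path.name}: {actual!r} != {expected!r}"

    def check_all(test_dir, run_reference, run_real):
        for case in sorted(pathlib.Path(test_dir).glob("*.json")):
            check_case(case, run_reference, run_real)
            print("ok", case.name)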
---
I was and remain a strong skeptic of the hype around this tech. It's not the singularity, it's not "thinking". It's all pattern matching and pattern extension, but in ways so sophisticated that it feels like magic sometimes.
But while the skeptical perspective is something I value, I can't deny that there is core utility in this tech that has a massive potential to contribute to efficiency of software development.
This is a tool that we as industry are still figuring out the shape of. In that landscape you have all sorts of people trying to evangelize these tools along their particular biases and perspectives. Some of them clearly read more into the tech than is there. Others seem to be allergically reacting to the hype and going in the other direction.
I can see that there is both noise, and fundamental value. It's worth it to try to figure out how to filter the noise out but still develop a decent sense of what the shape of that fundamental value is. It's a de-facto truth that these tools are in the future of every mainstream developer.
or a typical CRUD app architecture, or a common design pattern, or unit/integration test scaffolding, or standard CI/CD pipeline definitions, or one-off utility scripts, etc...
Like 80% of writing code is just being a glorified autocomplete, and AI is exceptional at automating those aspects. Yes, there is a lot more to being a developer than writing code, but, in those instances, AI really does make a difference in the amount of time one is able to spend focusing on domain-specific deliverables.
And even for "out of distribution" code you can still ask questions about how to do the same thing but more optimized, whether a library could help, why a piece of code is giving unexpected output, etc.
It has gotten to the point that I don't modify or write SQL. Instead I throw some schema and related queries in and use natural language to rubber duck the change, by which point the LLM can already get it right.
I work on this typed lua language in lua, and sometimes use llms to help fix internal analyzer stuff, which works maybe 30% of the time for complex issues, and sometimes not at all, but helps me find a solution in the end.
However when I ask an llm to generate my typed lua code, with examples and all, on how the syntax is supposed to be, it mostly gets it wrong.
my syntax for tables/objects is:
local x: {foo = boolean}
but an llm will most likely gloss over this and always use : instead of =
local x: {foo: boolean}
Are you using a coding agent or just an llm chat interface? Do you have a linter or compiler that will catch the misuse that you’ve hooked up to the agent?
I've dabbled with claude code in this particular project, but not much. My short experience with it is that it's slow, costly and goes off the rails easily.
I prefer to work with more isolated parts of the code. But again, I don't really know all that much about agents.
One thing I wanted to do on my project is reorganize all the tests, which sounds like an agent job. But I'd imagine I need to define some hard programmatic constraints to make sure tests are not lost or changed in the process.
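One cheap constraint (assuming a pytest-style suite; this is just a sketch of the idea, not something I've wired up) is to snapshot the set of collected test ids before letting the agent loose and assert nothing went missing afterwards:

    import subprocess

    def collected_tests() -> set[str]:
        # Ask pytest to list every test it can collect, without running them.
        out = subprocess.run(["pytest", "--collect-only", "-q"],
                             capture_output=True, text=True, check=True).stdout
        return {line.split("::")[-1] for line in out.splitlines() if "::" in line}

    before = collected_tests()
    # ... let the agent move/rename test files here ...
    after = collected_tests()
    missing = before - after
    assert not missing, f"tests lost during reorganization: {sorted(missing)}"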
I wonder if the new GenAI architecture being discussed recently, DDN (Discrete Distribution Networks), can outperform conventional architectures like GANs and VAEs. As the name suggests, it can provide a multitude of distributions for training and inference purposes [1].
[1] Show HN: I invented a new generative model and got accepted to ICLR (90 comments):
That is a good thing to hear from someone as reputable as Karpathy. The folks who think we're on the cusp of AGI may want to temper their expectations a bit.
I do love Claude Code, because one thing I periodically need to do is write some web code, which is not my favorite type of coding but happens to have incredibly good coverage in the training data. Claude is a much better web developer than I am.
But for digging into the algorithmic core of our automation tooling, it doesn't have nearly as much to work with and makes far more mistakes. Still a net win I'm happy to pay for, even if it's never anything more than my web developer slave.
100%. I find the "LLMs are completely useless" and the "LLMs will usher in a new era of messianic programming" camps to be rather reductive.
I've already built some pretty large projects [1] with the assistance of agentic tooling like Claude Code. When it comes to the more squirrely algorithms and logic, they can fall down pretty hard. But as somebody who is just dreadful at UI/UX, having it hammer out all the web dev scaffolding saves me a huge amount of time and stress.
It's just a matter of tempering one's expectations.
Hey, thank you for making this—I really enjoyed playing it and it feels like it fits the mental-reward-between-work-tasks need. It did spin up my M1's fans after a few minutes which is a rather rare occurrence, but I'm guessing that's par for the course when you're working with a bunch of video on canvas. Either way, hope I remember it the next time I'm looking for a puzzle to solve while I take a break :)
>and the "LLMs will usher in a new era of messianic programming" camps
Well, this one might still be borne out. It's just silly to think it's the case right now. Check in again in 10 years and it may be a very different story. Maybe even in 5 years.
> But for digging into the algorithmic core of our automation tooling
What I find fascinating is reading this same thing in other context like “UI guru” will say “I would not let CC touch the UI but I let it rip on algorithmic core of our automation tooling cause it is better at it than me…”
Both can be true. LLMs tend to be mediocre at (almost) everything, so they're always going to be worse than the user at whatever the user is an expert in.
I completely agree. I'm definitely not an expert web developer. I know enough to build functional tools, but it's not exactly art that I'm making. But the core of our tooling is my primary focus, I wrote it, I've spent a lot of time perfecting it. Claude can easily impress me with things like the CSS magic it weaves, because I am unsophisticated.
This makes sense, right? It's a relatively novel thing to be writing. I don't find it to be a damning remark like other comments here seem to be concluding.
If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
Yeah, if your goal is "build the tightest 8,000 line implementation of training an LLM from scratch, with a focus on both conciseness and educational value" I don't think it's particularly surprising that Claude/Codex weren't much help.
> If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
> My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it.
This is how he described vibe coding:
> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Vibe coding is clearly aimed at having fun hacking around on something that doesn’t matter, and he’s doing the opposite of that with this project. The fact that he’s not using vibe coding for something that is completely inappropriate for vibe coding is neither surprising nor a failure of vibe coding.
Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training
This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).
Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.
(Source: am experienced speedrunner who's been in these circles for a decent amount of time)
I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have?
Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. before then, the model weights didn't get cloned by default for me on macOS.
% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
--model-dir nanochat/ --prompt "who is simonw on hacker news?"
Using device: cpu
Loading model from nanochat/model_000650.pt
Loading metadata from nanochat/meta_000650.json
Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Loading model weights (this may take a minute for a 2GB model)...
Converting model to float32 for CPU...
Model loaded successfully!
Loading tokenizer...
Tokenizer loaded successfully!
Prompt: who is simonw on hacker news?
Encoded to 9 tokens
Generating...
--------------------------------------------------
who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.
In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
--------------------------------------------------
Adding on: Claude also gave me the following line which was necessary to get the model weights to download from HF. This might be obvious for anyone familiar with HF but it helped me so sharing here!
For anyone curious this is the error when running uv sync on macos,
> uv sync
Resolved 88 packages in 3ms
error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels
Also, /tmp/nanochat expects all contents from the tokenizer and chatsft_checkpoints folders.
Yeah, that's because CUDA on a Mac isn't a thing. It could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting/patching if there's no MPS version of the CUDA kernels.
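The device-selection part at least is simple; a minimal sketch of the usual fallback chain (the kernel-level porting is the real work):

    import torch

    def pick_device() -> torch.device:
        # Prefer CUDA, then Apple's MPS backend, then plain CPU.
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():
            return torch.device("mps")
        return torch.device("cpu")

    print(pick_device())  # the loaded model/tensors would then be moved with .to(device)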
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
Is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how my tokenizer actually compares to how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss aka how surprised it was by the real answer.
Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.
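Concretely, the normalization is just: convert the per-token loss from nats to bits, then divide by how many bytes those tokens cover. A small sketch (assuming you have the per-token losses and the decoded token strings):

    import math

    def bits_per_byte(token_nll_nats, token_texts):
        # token_nll_nats: negative log-likelihood per token, in nats.
        # token_texts: decoded text of each token, used to count UTF-8 bytes.
        total_bits = sum(token_nll_nats) / math.log(2)                 # nats -> bits
        total_bytes = sum(len(t.encode("utf-8")) for t in token_texts)
        return total_bits / total_bytes

    # Example: a mean loss of 3.0 nats/token at ~4.2 bytes/token
    # gives roughly 3.0 / ln(2) / 4.2 ~= 1.03 bits per byte.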
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
It absolutely requires longer training time and more compute.
Once trained, predictions need to hold through many more steps because each step processes one token. If a token early in a sentence heavily implies a token will occur later in the sentence, then that awareness needs to be maintained while processing each intermediate token, and each step is a bit lossy. The fewer steps you need to take before leveraging that knowledge, the better the prediction.
If you had infinite compute and data for training then performance would be equivalent though, I think.
Since OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token tokenizer", the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
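You can sanity-check that ~4.2 figure yourself on any text you have lying around (assumes the tiktoken package; the file path is a placeholder):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = open("sample_document.txt").read()
    tokens = enc.encode(text)
    print(len(text) / len(tokens), "characters per token")
    # With 1 character per token, the same context window holds roughly 4x less text.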
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
This weekend I just cracked into nanoGPT (https://github.com/karpathy/nanoGPT), an older but fabulous learning exercise where you build and train a crappy shakespeare GPT with ~0.8M parameters on a cpu. Results are about what you'd expect from that, they suck, but you can start to feel the magic, especially if you're not a deep learning professional and you just want to poke around and hack on it.
I started writing up a blog post on my weekend with nanoGPT but it's not done yet... Would have been great to link to here lol oh well
Absolutely, it's wildly fun to read the outputs of even a little tiny 0.8M model trained on CPU. And now I've actually got a much better understanding of the transformer architecture after playing around with it for a day. This repo is probably going to spawn some new folks to try out ideas which will turn into new researchers in the field, no doubt.
Somewhat related: I wrote up a MTG card generator based on nanoGPT a while ago that I think produces pretty good results for being 1m parameters.
The real neat thing about this is that WotC makes a few thousand new cards each year, so my training data set just grows over time and the model gets better with no effort spent on my part.
It would be interesting to come up with a use case which requires a freshly trained model and isn't just something that generic models can already do, especially with a 1M-token context window.
You can see the invention of RLHF/ChatGPT here because text generation suddenly became much more coherent and also much less interesting. You have to go back to older tech for surrealism because nobody will let you see the good stuff (the base models).
I guess I was much more interested in being able to work with an LLM to create good, synergistic Commander decks and less interested in generating custom Magic cards.
I'm sure I can dig up info on how to do this and piece it together, but I thought OP might have a guide specifically for it.
FWIW, there was a pretty popular post on HN around generating MTG cards using AI a couple years back but I believe that their approach was a fine-tune on an existing LLM.
Nice! His Shakespeare generator was one of the first projects I tried after ollama. The goal was to understand what LLMs were about.
I have been on an LLM binge this last week or so trying to build a from-scratch training and inference system with two back ends:
- CPU (backed by JAX)
- GPU (backed by wgpu-py). This is critical for me as I am unwilling to deal with the nonsense that is rocm/pytorch. Vulkan works for me. That is what I use with llama-cpp.
I got both back ends working last week, but the GPU back end was buggy. So the week has been about fixing bugs, refactoring the WGSL code, making things more efficient.
I am using LLMs extensively in this process and they have been a revelation. Use a nice refactoring prompt and they are able to fix things one by one resulting in something fully functional and type-checked by astral ty.
What is there to misunderstand? It doesn't even install properly most of the time on my machine. You have to use a specific python version.
I gave up on all tools that depend on it for inference. llama-cpp compiles cleanly on my system for Vulkan. I want the same simplicity to test model training.
pytorch is as easy as you are going to find for your exact use case. If you can't handle the requirement of a specific version of python, you are going to struggle in software land. ChatGPT can show you the way.
I have been doing this for 25 years and no longer have the patience to deal with stuff like this. I am never going to install Arch from scratch by building the configuration by hand ever again. The same with pytorch and rocm.
Getting them to work and recognize my GPU without passing arcane flags was a problem. I could at least avoid the pain with llama-cpp because of its vulkan support. pytorch apparently doesn't have a vulkan backend. So I decided to roll out my own wgpu-py one.
FWIW, I've been experimenting with LLMs for the last couple of years, and have exclusively built everything I do around llama.cpp exactly because of the issues you highlight. "gem install hairball" has gone way too far, and I appreciate shallow dependency stacks.
So could I in practice train it on all my psychology books, materials, reports, case study and research papers and then run it on demand on a 1xH100 node - https://getdeploying.com/reference/cloud-gpu/nvidia-h100 whenever I have a specialised question?
You could do that indeed, but the performance would be abysmal. For this kind of use-case, it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
> it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
I noticed NewRelic has a chat feature that does this sort of thing; it's scoped very narrowly down to their website and their analytics DSL, and generates charts/data from their db. I've always wondered how they did that (specifically in terms of setting up the training/RAG + guardrails). It's super useful.
You might be able to figure that out just by asking it - see if you can get it to spit out a copy of the system prompt or tell you what tools it has access to.
The most likely way of building that would be to equip it with a "search_docs" tool that lets it look up relevant information for your query. No need to train an extra model at all if you do that.
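A "search_docs" tool can be almost embarrassingly simple; a naive sketch (the tool name and folder layout are made up), which the model would call via whatever function-calling mechanism you use:

    import pathlib

    def search_docs(query: str, root: str = "docs/", max_hits: int = 20) -> list[str]:
        # Naive case-insensitive keyword search over a folder of text files.
        hits = []
        for path in pathlib.Path(root).rglob("*.txt"):
            for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
                if query.lower() in line.lower():
                    hits.append(f"{path}:{lineno}: {line.strip()}")
                    if len(hits) >= max_hits:
                        return hits
        return hits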
Yes, though it's possible a more-general core model, further enhanced with some other ways to bring those texts-of-interest into the working context, might perform better.
Those other ways to integrate the texts might be some form of RAG or other ideas like Apple's recent 'hierarchical memories' (https://arxiv.org/abs/2510.02375).
You could but it would be significantly worse than fine-tuning or RAG with a pre-trained model, or using a smaller model since your dataset would be so small.
I've always thought about the best way to contribute to humanity: number of people you help x how much you help them. I think what Karpathy is doing is one of the highest leverage ways to achieve that.
Our current world is built on top of open source projects. This is possible because there are a lot of free resources to learn to code, so anyone from anywhere in the world can learn and make a great piece of software.
I just hope the same will happen with the AI/LLM wave.
This free tradition in software is I think one of the things that I love so much, but I don't see how it can continue with LLMs due to the extremely high training costs and the powerful hardware required for inference. It just seems like writing software will necessarily require paying rent to the LLM hosts to keep up. I guess it's possible that we'll figure out a way to do local inference in a way that is accessible to everyone in the way that most other modern software tools are, but the high training costs make that seem unlikely to me.
I also worry that as we rely on LLMs more and more, we will stop producing the kind of tutorials and other content aimed at beginners that makes it so easy to pick up programming the manual way.
There's a Stephen Boyd quote that's something like "if your optimization problem is too computationally expensive, just go on vacation to Greece for a few weeks and by the time you get back, computers might be fast enough to solve it." With LLMs there's sort of an equivalent situation with cost: how mindblowing would it be able to train this kind of LLM at all even just 4 years ago? And today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
There's also a reasonable way to "leapfrog" the training cost with a pre-trained model. So if you were doing nanochat as a learning exercise and had no money, the idea would be to code it up, run one or two very slow gradient descent iterations on your slow machine to make sure it is working, then download a pre-trained version from someone who could spare the compute.
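The "one or two very slow iterations" check is worth spelling out, since it catches most wiring bugs before you spend anything. A generic sketch (model/dataloader are stand-ins for whatever the repo gives you):

    import math
    import torch
    import torch.nn.functional as F

    def sanity_check(model, dataloader, vocab_size, steps=2):
        # Run a couple of slow training steps just to confirm the plumbing works.
        model.train()
        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for step, (x, y) in zip(range(steps), dataloader):
            logits = model(x)                                  # assumed shape (B, T, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), y.flatten())
            loss.backward()
            opt.step()
            opt.zero_grad()
            # At initialization the loss should sit near log(vocab_size).
            print(f"step {step}: loss {loss.item():.3f} vs ln(V) {math.log(vocab_size):.2f}")

After that, loading someone else's checkpoint gets you the trained weights without the compute bill.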
> today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
No, it's extremely hard to imagine since I used one of Karpathy's own models to have a basic chat bot like six years ago. Yes, it spoke nonsense; so did my GPT-2 fine tune four years ago and so does this.
And so does ChatGPT
Improvement is linear at best. I still think it's actually a log curve and GPT3 was the peak of the "fun" part of the curve. The only evidence I've seen otherwise is bullshit benchmarks, "agents" that increase performance 2x by increasing token usage 100x, and excited salesmen proclaiming the imminence of AGI
1. According to who? Open AI?
2. Its current state is "basically free and containing no ads". I don't think this will remain true given that, as far as I know, the product is very much not making money.
Yes, that number is according to OpenAI. They released that 800m number at DevDay last week.
The most recent leaked annualized revenue rate was $12bn/year. They're spending a lot more than that but convincing customers to hand over $12bn is still a very strong indicator of demand. https://www.theinformation.com/articles/openai-hits-12-billi...
Part of that comes from Microsoft API deals. Part of that will most certainly come because the vast network of companies buy subscriptions to help "Open" "AI" [1].
Given the rest of circular deals, I'd also scrutinize if it applies to the revenue. The entanglement with the Microsoft investments and the fact that "Open" "AI" is a private company makes that difficult to research.
[1] In a U.S. startup, I went through three CEOs and three HR apps, which mysteriously had to change for no reason but to accommodate the new CEO's friends and their startups.
Even with linear progression of model capability, the curve for model usefulness could be exponential, especially if we consider model cost which will come down.
For every little bit a model gets smarter and more accurate, there are exponentially more real world tasks it could be used for.
This. It looks like one of the keys to maintaining open source is to ensure OSS developers have access to capable models. In the best of worlds, LLM vendors would recognize that open source software is the commons that feeds their models and ensure it flourishes.
(This is a bit ranty, but due to a sincere desire for a better world, and being the recipient of personal attacks for believing a better world is achievable by a different path to others)
I feel like this point of view is an ideal not shared by one of the main branches of anti-AI sentiment.
The idea of intellectual property works against this. Rather than contributing to humanity directly, ownership of information is accumulated by individuals and then rented to humanity.
At the same time I agree that people should be able to have a livelihood that affords them the ability to create new intellectual contributions.
The service Karpathy is providing is also being provided by thousands of YouTube creators in a huge variety of topics. It's a little sad that so many must support their efforts with sponsorships from sources with varying degrees of ethical behaviour. Patreon is better but still not ideal. I sincerely believe this _is_ one of the best ways to contribute to society.
A recent Daily Show had Jon Stewart describe training AI as strip mining human knowledge. Training AI is regularly described as theft as if this position is a given without any counter argument possible. It is opinion masquerading as fact. This saddens me because it suggests to me that the war to control the narrative is being won by people who want to entrench a hypercapitalistic vision of ownership where not only is a particular expression of an idea ownable but also stakes a claim to own some of any ideas that come from viewing that expression.
I cannot see any way that this viewpoint would aid humanity as a whole, but instead assign benefits to a collection of individuals. The ability to trade intellectual property means that ownership inevitably gets passed to a smaller and smaller pool of individuals over time.
I think we really do need a new way to consider these issues in light of the modern world. When mentioning these thoughts to others a common refrain is that it doesn't matter because the powers that be (and their lobbyists) will prevent any fix from happening. I have never been fond of that particular fatalism, especially when it inhibits discussion of what would be better.
I recommend his ANN/LLM from scratch videos to people a lot because not only is he a clear instructor, but his code tends to be very Pythonic and just the right balance of terse but readable (not counting the Pytorch vectorization stuff, but that's not his fault, it's just complex). So I think people benefit just from watching and imitating his code style.
Software is just a tool. Much like a hammer, a knife, or ammonium nitrate, it can be used for both good or bad.
I say this as someone who has spent almost 15 years writing software in my free time and publishing it as open source: building software and allowing anyone to use it does not automatically make other people's lives better.
A lot of my work has been used for bad purposes or what some people would consider bad purposes - cheating on tests, cheating in games, accessing personal information without permission, and in one case my work contributed to someone's doxxing. That's because as soon as you publish it, you lose control over it.
But at least with open source software, every person can use it to the same extent so if the majority of people are good, the result is likely to be more positive than negative.
With what is called AI today, only the largest corporations can afford to train the models which means they are controlled by people who have entirely different incentives from the general working population and many of whom have quite obvious antisocial personality traits.
At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
I don't have high hopes for AI to be a force for good and teaching people how toy models work, as fun as it is, is not gonna change it.
"With what is called AI today, only the largest corporations can afford to train the models"
I take it you're very positive about Andrej's new project which allows anyone to train a model for a few hundred dollars which is comparable to the state-of-the-art from just 5 years ago then.
For a few hundred dollars, given heavily-VC-subsidized hardware that is probably partially funded by nvidia and various AI companies, etc.
Can I run it on my local hardware (nvidia consumer card, AMD cpu)? No. When could that corporation cut off my access to that hardware if I did anything it didn't like? Anytime.
Lots of things have started off cheap / subsidized to put competitors out of business, and then the prices go up, up and up..
Yes. The training process requires big expensive GPUs. The model it produces has 561M parameters, which should run on even a high end mobile phone (I run 4B models on my iPhone).
> At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
It already works like this in your precious western democracies and they didn't need AI to be authoritarian total surveillance states in spirit, with quite a lot of support from a propagandized populace that begged for or pretended to agree with the infringement of their civil rights because of terrorism, drugs, covid or protecting the poor poor children.
You can combat tech with legislation and culture, but the legislation and culture were way ahead of the tech in being extremely authoritarian in the first place.
I don't know man. All this "tech" didn't see AOC, Sanders, and other 'radicals' coming. The parties actually had to expend effort after the fact to delegitimize them and have to continue to do so for additional candidates that come along(Jamal Bowman, Cori Bush, etc.)
Yeah it feels similar to inventing the nuke. Or it’s even more insidious because the harmful effects of the tech are not nearly as obvious or immediate as the good effects, so less restraint is applied. But also, similar to the nuke, once the knowledge on how to do it is out there, someone’s going to use it, which obligates everyone else to use it to keep up.
While documenting a build path is nice, IMHO renting hardware nobody can afford from VC-backed cloud providers using cold hard cash to produce clones of legacy tech using toy datasets under the guise of education is propping up the AI bubble and primarily helping institutional shareholders in those AI bubble companies, particularly their hardware supplier NVidia. Personally I do not see this as helping people or humanity.
This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc.
"This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc."
This is a pretty disheartening way to respond to something like this. Someone puts a great deal of effort into giving something interesting away for free, and is told "you should have also done THIS work for free as well in order for me to value your contribution".
It is an objective and transparent response based on free software world norms. Feel free to interpret differently and to be disheartened. Hell, many of us are disheartened by the AI VC political theater we are seeing right now: experienced programmers, artists, lawyers, perhaps much of humanity. Let's stick to objective elements of the discussion, not emotional opining.
It is amusing to note the dichotomy between the clearly compassionate, empathetic and altruistic perspective displayed here and the comically overstated framing of helping humanity.
I think you got your proportions slightly wrong there. This will be contributing as much to an AI bubble as a kid tinkering around with combustion is contributing to global warming.
Not really. Anything that guy does sets the tone for an extended cacophony of fans and followers. It would be a sad day when nobody critically assesses the motivations, effects and framing of those moves. I question the claim this move helps humanity and stand by the assessment it's just more feeding an unfree ecosystem which equates to propping up the bubble.
He is the GOAT of LLM MVPs. That is educational and useful, especially because he uses a minimal and clean style, but I don't see how it even compares with kernels, operating systems etc.
So I appreciate his work in an academic and educational sense, but large scale applications with stolen training material are still theft.
number of people you help x how much you help them x number of people you harm x how much you harm them
For example: harming a little bit all content creators of the world by stealing their work without compensation or permission. How much does that cost globally, year after year? How do we even quantify the long term consequences of that? Stuff like that.
If you consider the cost of hiring a human professional versus using multimodal AI for something, it's easy to realize literally thousands of dollars of value per chat.
Multiply that by many billions of chats per day.
Lawyers and other professionals charge a lot. So do artists, especially when you want to do a million revisions. LLMs hand it out for free, making many knowledge and art professions affordable and accessible to the masses.
Stable owners were upset when cars replaced horses, but you can't stop progress, especially when the value proposition is undeniable.
I wonder what people will do, when they will realize that LLM lawyers produce insufficient results, but "suddenly" all cheap bottom rung lawyers are gone and switched professions.
As for the LLM "creative" content, have you seen it or read it? Well, same problem. Once you need quality content, good luck finding a cheap creator. Pay full price for an experienced one and likely wait.
PS: I don't doubt that LLMs are here to stay. They will see a lot of usage and pervade all industries. It's just that the future will be pretty shit. Talking on the phone with LLMs, reading LLM slop, seeing LLM slop everywhere, receiving generated emails and using LLMs to reverse parse them to search for the actual content, a major economic downturn, rapidly slowing salary growth (not that it was big before), etc.
Still under development; remaining work includes tuning nanochat (current state being a solid v0.1) and finalizing the in-between projects so that students can "unlock" all the complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops-heavy aspects.
Would love to hear some metrics on training it on your personal computer rather than a "cloud GPU box". I don't care if it takes 3 months to train if I have something good, offline, and free(ish, but just pay electric bills)
Built so many nano AIs over the last several years. I have played with nanoGPT, it's OK. Just hype for Karpathy... So many tiny LLMs out there now that run on cheap SOCs. Try SmolVLM512, runs fine on a sub-$100 Pi.
The title is saying you can train your own model for $100. That part is true: the $100 goes to the cloud provider to rent you $250k of hardware for four hours. Then you can run that model on whatever hardware you have lying around, because it's really small.
"The fastest way to feel the magic is to run the speedrun script speedrun.sh, which trains and inferences the $100 tier of nanochat. On an 8XH100 node at $24/hr, this gives a total run time of about 4 hours."
I am clueless and don't understand this. Where is the $100 being spent? Some sort of API you have to pay to access? Some sort of virtual hardware you have to rent access to?
H100s are expensive NVIDIA GPUs, each costing about $30,000. 8XH100 means you have 8 of those wired together in a big server in a data center somewhere, so around a quarter of a million dollars worth of hardware in a single box.
You need that much hardware because each H100 provides 80GB of GPU-accessible RAM, but to train this model you need to hold a LOT of model weights and training data in memory at once. 80*8 = 640GB.
~$24/hour is how much it costs to rent that machine from various providers.
Curious to try it someday on a set of specialized documents. Though as I understand it, the cost of running this is whatever GPU you can rent with 80GB of VRAM. Which kind of leaves hobbyists and students out, unless some cloud is donating GPU compute capacity.
A GPU with 80GB VRAM costs around $1-3 USD an hour on commodity clouds (i.e. the non-Big 3 bare metal providers e.g. https://getdeploying.com/reference/cloud-gpu/nvidia-h100). I think it's accessible to most middle class users in first world countries.
The 80 GB are for training with a batch size of 32 times 2048 tokens each. Since the model has only about 560M parameters, you could probably run it on CPU, if a bit slow.
It will work great with 40GB GPU, probably a bit less than twice slower. These are micro models of a few B param at most and fit easily during both training and inference.
Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However it's way slower than expected; one H100 isn't 250x the 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics are not accurate in this config.
I would love to take an existing open-weight model and fine-tune it with specific training data along these lines. Can I do that with Qwen or GLM? Is there a ~simple recipe for doing that?
>If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
That sounds like it could run on a 24gb GPU. Batch size of 8 would imply 20gb mem, no?
Yes, you can always stream data when training or doing inference on models when VRAM is lacking, but the slowdown is extremely noticeable. This is the case for CPU code too and is why optimising for bandwidth is so critical in high-performance computing. Your ability to compute is almost always substantially larger than your bandwidth. An AVX-512 capable CPU with a suitable number of cores is easily capable of multiple teraflops of fp64, but is typically limited by memory bandwidth; GPUs running LLMs have just broadened this knowledge to more people.
A fun consequence of the fact that CPUs got faster at a rate quicker than memory is that look-up tables of pre-computed values used to be common optimisations in code, but now it is almost always quicker to re-compute them than to retrieve a pre-computed value from memory for common use cases.
I'm running it now and I had to go down to 4 instead of 8, and that 4 is using around 22-23GB of GPU memory. Not sure if something is wrong or if batch is only scaling part of the memory requirements. (Edit: I restarted running the training script directly instead of torch run, and 8 still doesn't fit, but 4 is now using 16-17 instead.)
On my 4090 the tok/sec is 523, which is 1/2000 of the 1,000,000 tok/sec of the 8 80GB H100s. That feels too slow so maybe something is wrong. The 4090 is about 1/3 of the raw compute. I'm sure there's other losses from less batching but even if it were 1/10ths as fast, I'd expected something more like 1,000,000 / 10 / 8 so at least 10,000 tok/sec.
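On the earlier point about batch size only scaling part of the memory: a rough back-of-the-envelope sketch (the per-sample activation figure is a made-up placeholder) shows why the fixed costs dominate at small batches:

    # Rough VRAM model: weights + gradients + AdamW moments are a fixed cost;
    # only activations scale with the batch size.
    params = 561e6                               # model size mentioned in this thread
    fixed_gb = params * (4 + 4 + 8) / 1e9        # fp32 weights + grads + two AdamW moments
    act_gb_per_sample = 1.5                      # hypothetical per-sequence activation cost
    for batch in (32, 8, 4):
        print(batch, "->", round(fixed_gb + batch * act_gb_per_sample, 1), "GB")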
Ah, but this is nice project. I'll start hacking once it's easier to fine-tune it with own documents for specific questions.
What plagues me, though, is how you prevent the model from answering questions it was not trained for.
This is absolutely fantastic. I really can't wait for the final course to be live. It's in the "shut up and take my money" category. I had so much fun with the nanoGPT videos.
End to end training is a different beast, but finetuning and inference of impressive LLMs like QWEN3 can be done on pretty run of the mill hardware like Apple Silicon macs and gaming PCs if anyone wants a personalized assistant with character. Just ask AI how to finetune AI using unsloth (if using NVIDIA) or MLX (for apple) and it will give you ready to run python scripts.
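There's no single blessed recipe, but the usual shape of it is LoRA on top of an open-weight model. A rough sketch using the plain Hugging Face transformers/peft/datasets stack rather than unsloth/MLX (the model id, file paths and hyperparameters are illustrative assumptions, not a tested recipe):

    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)
    from peft import LoraConfig, get_peft_model
    from datasets import load_dataset

    model_name = "Qwen/Qwen2.5-0.5B-Instruct"                 # assumed model id
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:
        tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                             target_modules=["q_proj", "v_proj"],
                                             task_type="CAUSAL_LM"))

    # Plain-text files as the fine-tuning corpus (hypothetical path).
    ds = load_dataset("text", data_files="my_notes/*.txt")["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    Trainer(
        model=model,
        args=TrainingArguments("lora-out", per_device_train_batch_size=2,
                               num_train_epochs=1, learning_rate=2e-4),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    ).train()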
I don't think so. Training on documents is not a great way of building a search engine for the information in those documents, because the training process mixes all of that information together in ways that detach the individual words from the source documents they came from.
As usual, if you want an LLM to be able to help search a corpus of text the best way to achieve that is to teach it how to use a search tool against that text.
I've seen this called "agentic RAG" by some people. The easiest way to get a local demo is with Claude Code or Codex CLI. They know how to use grep, and you can set them loose on a folder full of text files and tell them to use grep to answer questions - it can work really well.
I just tried this in "claude --dangerously-skip-permissions":
> Use Python and AppleScript to find Apple Notes that mention UPS
... and fell down a rabbit hole of optimizations because my Notes collection is HUGE, but it got there in the end!
That was the point. That example is meant to demonstrate that the model that trained for 4 hours can imitate a conversation but isn't actually anywhere close to being useful.
Because that explanation is expected to be wrong. This is a partially trained tiny model, Andrej shared that obviously incorrect explanation to emphasize that the model is not trustworthy or useful at that stage.
It's a reasonable shortcut for what this project provides: training code, inference code and a ChatGPT-style web interface for chatting with the model.
I believe you that you just meant this as an interesting example, and in that sense were engaged in curious conversation (generally what we want here). But the amount of provocation in the comment is so high, and the amount of information so little, that it ends up on the wrong side of "Eschew flamebait. Avoid generic tangents." (https://news.ycombinator.com/newsguidelines.html). In other words: not gonna end well.
not a particularly ethical guy and I wouldn't hold him up as a example of morality but the guy hasn't actually been found guilty YET. Multiple courts have tried. You'd think that for a guy under as much scrutiny as him that they would have SOMETHING to pin him on by now.
Innocent until PROVEN guilty is a foundational legal precedent for a reason.
He is definitely guilty of being a waste of human life, a massive asshole and a general detriment to society worldwide.
Don’t need a court to prove that.
There are 6 criminal cases against him in several countries, let’s see how they pan out - but regardless he is not an innocent person.
Not derailing, just pointing out effective ways of producing good, which is what I was responding to. I think it's good for people to be aware of this. Those people are all examples of people who have influenced culture for bad. You can do it for good: Bryan Johnson, civil rights leaders, leftist streamers. Andrew Tate was just the most effective, recent, and obvious one, which is why I pointed him out.
> the repo is too far off the data distribution
ah, this explains why these models have been useless to me this whole time. everything i do is just too far off the data distribution!
Everything is, unless your app is a React todo list or leetcode questions.
Not a big issue with QT layouts (still have to test the result though)
I still miss my days of programming Visual Basic 6. Nothing since then ever compares.
4gl or RAD is still here, but now it’s called low- or no-code.
Good times!
People say inbreeding like it’s criticism too.
HN's cynicism towards AI coding (and everything else ever) is exhausting. Karpathy would probably cringe reading this.
okay but he literally does have a bridge that non-deterministically might take you to the wrong place to sell you
The original context of this sub-thread was Karpathy saying how AI coding tools were pretty useless for him when working on this particular project.
Indeed. And only Karpathy is entitled to say that AI tools produce wrong code for him. And he's only entitled to say it for this project only.
If anyone else says this, "the skepticism is exhausting", and their experience is completely irrelevant.
I mean Karpathy himself wrote that he could not use the AI tools for the project, so he had to handwrite most of it. I wonder why.
One of my hobby projects is an esoteric game engine oriented towards expressing simulation mechanics. I simply do not use agentic tools when editing the core code for this project (mostly rust and wgsl). It always stumbles, and leaves code that I need to fix up manually, and even then feel unsure about. I've tried a few different agents, including the current top of the line. The power is just not there yet.
At the same time, these tools have helped me reduce the development time on this project by orders of magnitude. There are two prominent examples.
--- Example 1:
The first relates to internal tooling. I was debugging a gnarly problem in an interpreter. At some point I had written code to do a step-by-step dump of the entire machine state to file (in json) and I was looking through it to figure out what was going wrong.
In a flash of insight, I asked my AI service (I'll leave names out since I'm not trying to promote one over another) to build a react UI for this information. Over the course of a single day, I (definitely not a frontend dev by history) worked with it to build out a beautiful, functional, easy to use interface for browsing step-data for my VM, with all sorts of creature comforts (like if you hover over a memory cell, and the memory cell's value happens to be a valid address to another memory cell, the target memory cell gets automatically highlighted).
This single tool has reduced my debugging time from hours or days to minutes. I never would have built the tool without AI support, because I'm simply not experienced enough in frontend stuff to build a functional UI quickly.. and this thing built an advanced UI for me based on a conversation. I was truly impressed.
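To give a rough idea of the shape of the dump (simplified here, with made-up field names and a hypothetical VM object): one JSON object per step, appended to a JSONL file that the frontend loads.

import json

def dump_step(step_no, vm, out_file):
    # Hypothetical VM exposing pc/registers/memory; serialize one step as a JSON line.
    state = {
        "step": step_no,
        "pc": vm.pc,
        "registers": list(vm.registers),
        "memory": {hex(addr): val for addr, val in vm.memory.items()},
    }
    out_file.write(json.dumps(state) + "\n")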
--- Example 2:
As part of verifying correctness for my project, I wanted to generate a set of tests that validated the runtime behaviour. The task here consists of writing a large set of reference programs, and verifying that their behaviour was identical between a reference implementation and the real implementation.
Half decent coverage meant at least a hundred or so tests were required.
Here I was able to use agentic AI to reduce the testcase construction time from a month to about a week. I asked the AI to come up with a coverage plan and write the test case ideas to a markdown file in an organized, categorized way. Then I went through each category in the test case markdown and had the AI generate the test cases and integrate them into the code.
---
I was and remain a strong skeptic of the hype around this tech. It's not the singularity, it's not "thinking". It's all pattern matching and pattern extension, but in ways so sophisticated that it feels like magic sometimes.
But while the skeptical perspective is something I value, I can't deny that there is core utility in this tech that has a massive potential to contribute to efficiency of software development.
This is a tool that we as industry are still figuring out the shape of. In that landscape you have all sorts of people trying to evangelize these tools along their particular biases and perspectives. Some of them clearly read more into the tech than is there. Others seem to be allergically reacting to the hype and going in the other direction.
I can see that there is both noise, and fundamental value. It's worth it to try to figure out how to filter the noise out but still develop a decent sense of what the shape of that fundamental value is. It's a de-facto truth that these tools are in the future of every mainstream developer.
I don't know. I successfully use it for small changes on VHDL FPGA designs these days.
or a typical CRUD app architecture, or a common design pattern, or unit/integration test scaffolding, or standard CI/CD pipeline definitions, or one-off utility scripts, etc...
Like 80% of writing coding is just being a glorified autocomplete and AI is exceptional at automating those aspects. Yes, there is a lot more to being a developer than writing code, but, in those instances, AI really does make a difference in the amount of time one is able to spend focusing on domain-specific deliverables.
And even for "out of distribution" code you can still ask question about how to do the same thing but more optimized, could a library help for this, why is that piece of code giving this unexpected output etc
It has gotten to the point that I don't modify or write SQL. Instead I throw some schema and related queries in and use natural language to rubber duck the change, by which point the LLM can already get it right.
I work on this typed Lua language, written in Lua, and sometimes use LLMs to help fix internal analyzer stuff, which works maybe 30% of the time for complex problems, and sometimes not at all, but helps me find a solution in the end.
However when I ask an llm to generate my typed lua code, with examples and all, on how the syntax is supposed to be, it mostly gets it wrong.
My syntax for tables/objects is: local x: {foo = boolean}
but an LLM will most likely gloss over this and always use : instead of =, i.e. local x: {foo: boolean}
Are you using a coding agent or just an llm chat interface? Do you have a linter or compiler that will catch the misuse that you’ve hooked up to the agent?
I've dabbled with claude code in this particular project, but not much. My short experience with it is that it's slow, costly and goes off the rails easily.
I prefer to work with more isolated parts of the code. But again, I don't really know all that much about agents.
One thing I wanted to do on my project is reorganize all the tests, which sounds like an agent job. But I'd imagine I need to define some hard programmatic constraints to make sure tests are not lost or changed in the process.
I wonder if the new GenAI architecture being discussed recently, namely DDN or distributed discrete networks, can outperform the conventional GAN and VAE architectures. As the name suggests, it can provide a multitude of distributions for training and inference purposes [1].
[1] Show HN: I invented a new generative model and got accepted to ICLR (90 comments):
https://news.ycombinator.com/item?id=45536694
[dead]
That is a good thing to hear from someone as reputable as Karpathy. The folks who think we're on the cusp of AGI may want to temper their expectations a bit.
I do love Claude Code, because one thing I periodically need to do is write some web code, which is not my favorite type of coding but happens to have incredibly good coverage in the training data. Claude is a much better web developer than I am.
But for digging into the algorithmic core of our automation tooling, it doesn't have nearly as much to work with and makes far more mistakes. Still a net win I'm happy to pay for, even if it's never anything more than my web developer slave.
100%. I find the "LLMs are completely useless" and the "LLMs will usher in a new era of messianic programming" camps to be rather reductive.
I've already built some pretty large projects [1] with the assistance of agentic tooling like Claude Code. When it comes to the more squirrely algorithms and logic, they can fall down pretty hard. But as somebody who is just dreadful at UI/UX, having it hammer out all the web dev scaffolding saves me a huge amount of time and stress.
It's just a matter of tempering one's expectations.
[1] https://animated-puzzles.specr.net
Hey, thank you for making this—I really enjoyed playing it and it feels like it fits the mental-reward-between-work-tasks need. It did spin up my M1's fans after a few minutes which is a rather rare occurrence, but I'm guessing that's par for the course when you're working with a bunch of video on canvas. Either way, hope I remember it the next time I'm looking for a puzzle to solve while I take a break :)
>and the "LLMs will usher in a new era of messianic programming" camps
Well, this one might still be borne out. It's just silly to think it's the case right now. Check in again in 10 years and it may be a very different story. Maybe even in 5 years.
What do we build now to reap the coming of the messianic era?
> But for digging into the algorithmic core of our automation tooling
What I find fascinating is reading this same thing in other context like “UI guru” will say “I would not let CC touch the UI but I let it rip on algorithmic core of our automation tooling cause it is better at it than me…”
Both can be true. LLMs tend to be mediocre at (almost) everything, so they're always going to be worse than the user at whatever the user is an expert in.
But 'mediocre' isn't 'useless'.
I completely agree. I'm definitely not an expert web developer. I know enough to build functional tools, but it's not exactly art that I'm making. But the core of our tooling is my primary focus, I wrote it, I've spent a lot of time perfecting it. Claude can easily impress me with things like the CSS magic it weaves, because I am unsophisticated.
This makes sense, right? It's a relatively novel thing to be writing. I don't find it to be a damning remark like other comments here seem to be concluding.
If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
Yeah, if your goal is "build the tightest 8,000 line implementation of training an LLM from scratch, with a focus on both conciseness and educational value" I don't think it's particularly surprising that Claude/Codex weren't much help.
Now to wait for Sonnet 5 and GPT-6, and ask them to build that, and see what they come up with.
Why would you expect an improvement?
because they'll be trained on karpathy's implementation
> This makes sense, right? It's a relatively novel thing to be writing.
It's really not though? Honestly I'm surprised coding agents fail hard at this task apparently
It's not _that_ far off distribution though. The math and concepts are well understood.
> If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
This is good for bitcoin.
https://nitter.net/karpathy/status/1977755427569111362
That's funny that the coiner of the term vibe coding has eventually found it not useful anymore.
That’s not what he said. This is the new project:
> My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it.
This is how he described vibe coding:
> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Vibe coding is clearly aimed at having fun hacking around on something that doesn’t matter, and he’s doing the opposite of that with this project. The fact that he’s not using vibe coding for something that is completely inappropriate for vibe coding is neither surprising nor a failure of vibe coding.
... or maybe he just forgot to include the claude.md ? :)
How convenient! You know, my code is somewhat far off the data distribution too.
We're still not ready for ouroboros.
Clearly he has little idea what he's talking about.
AI can write better code than 99% of developers. This embarrassingly anti-AI shill included.
If he used the AI tool my company is developing the code would have been better and shipped sooner.
Anti-AI shill? A cofounder of OpenAI?
You have found the joke.
> nanochat is also inspired by modded-nanoGPT
Nice synergy here, the lineage is: Karpathy's nano-GPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> NanoChat
modded-nanoGPT [1] is a great project, well worth checking out, it's all about massively speeding up the training of a small GPT model.
Notably it uses the author's Muon optimizer [2], rather than AdamW, (for the linear layers).
[1] https://github.com/KellerJordan/modded-nanogpt
[2] https://kellerjordan.github.io/posts/muon/
Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training
This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).
Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.
(Source: am experienced speedrunner who's been in these circles for a decent amount of time)
sharing some useful resources for learning Muon (since I'm also just catching up on it)
- https://x.com/leloykun/status/1846842883967692926
- https://www.yacinemahdid.com/p/muon-optimizer-explained-to-a...
I haven't heard of this before. Has Muon dethroned Adam and AdamW as the standard general purpose optimizer for deep learning?
8xH100 is pretty wild for a single inference node.
Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?
At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700-ish requests an hour. About $0.01 per request.
Is my math wrong?
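If it helps, the arithmetic roughly checks out under those assumptions: 3600 s/hr ÷ 5 s per request ≈ 720 requests/hr, and $8 ÷ 720 ≈ $0.011 per request - and that's before any batching, which would push the per-request cost down further.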
This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.
The default model is ~0.5B params right?
As vessenes wrote, that‘s for training. But a H100 can also process many requests in parallel.
I'm doing a training run right now (started 20min ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij
Will share the resulting model once ready (4 hours from now) for anyone to test inference.
I've uploaded the model here: https://huggingface.co/sdobson/nanochat
I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have? Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...
You can run it like this:
This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. before then, the model weights didn't get cloned by default for me on macOS.
% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
    --model-dir nanochat/ --prompt "who is simonw on hacker news?"
Using device: cpu
Loading model from nanochat/model_000650.pt
Loading metadata from nanochat/meta_000650.json
Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Loading model weights (this may take a minute for a 2GB model)...
Converting model to float32 for CPU...
Model loaded successfully!
Loading tokenizer...
Tokenizer loaded successfully!
Prompt: who is simonw on hacker news? Encoded to 9 tokens
Generating... -------------------------------------------------- who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.
In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker --------------------------------------------------
Adding on: Claude also gave me the following line which was necessary to get the model weights to download from HF. This might be obvious for anyone familiar with HF but it helped me so sharing here!
git lfs install
For anyone curious this is the error when running uv sync on macos,
> uv sync
Resolved 88 packages in 3ms
error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels
Also, tmp/nanochat expects all contents from tokenizer and chatsft_checkpoints folder.
Yeah, that's because CUDA on a Mac isn't a thing - it could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting/patching if there's no MPS version of the CUDA kernels.
The comment beside the first chart
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how my tokenizer actually compared versus how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss aka how surprised it was by the real answer.
Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.
So, compare loss to bytes of text data instead.
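A minimal sketch of that normalization in Python, assuming you have per-token cross-entropy losses in nats (as PyTorch reports them) plus the decoded text of each token:

import math

def bits_per_byte(token_losses_nats, token_texts):
    # Convert the summed cross-entropy from nats to bits...
    total_bits = sum(token_losses_nats) / math.log(2)
    # ...and normalize by the number of UTF-8 bytes those tokens cover.
    total_bytes = sum(len(t.encode("utf-8")) for t in token_texts)
    return total_bits / total_bytes

Because the denominator is bytes of raw text rather than token count, two models with different tokenizers can be compared directly on the same evaluation text.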
Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?
Or would the loss of efficiency make it dumber than modern tokenizers?
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
[1] https://aclanthology.org/P16-1162/
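To make "subword units" concrete, here's a toy Python illustration (greedy longest-match against a made-up vocabulary, not the actual algorithm from the paper), contrasted with one character per token:

# Toy subword tokenizer: greedy longest-match against a fixed vocabulary,
# falling back to single characters for anything not in the vocab.
def tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # character-level fallback
            i += 1
    return tokens

print(tokenize("unbelievable", {"un", "believ", "able"}))  # ['un', 'believ', 'able'] -> 3 tokens
print(list("unbelievable"))                                # 12 tokens at one character per token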
yes to both.
absolutely requires longer training time and more compute.
once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies a token will occur later in the sentence then that awareness needs to be maintained while processing each intermediary token and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge the better the prediction.
if you had infinite compute and data for training then performance would be equivalent though, i think.
Since OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token tokenizer", the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
Ah I was missing the WANDB_RUN env var. so did not get any logs. thanks!
For the measures that drop exponentially, like val/bpb and train/loss, you should put the x-axis in log scale. That will better show you whether it's converged.
Great call, thank you - I switched to log scale for those metrics - agree that it is much clearer.
This weekend I just cracked into nanoGPT (https://github.com/karpathy/nanoGPT), an older but fabulous learning exercise where you build and train a crappy shakespeare GPT with ~0.8M parameters on a cpu. Results are about what you'd expect from that, they suck, but you can start to feel the magic, especially if you're not a deep learning professional and you just want to poke around and hack on it.
I started writing up a blog post on my weekend with nanoGPT but it's not done yet... Would have been great to link to here lol oh well
It's a useful exercise. A lot of the good ML work is first validated at small scale.
And this new example goes even further - adds instruction following and tool use SFT, as well as RLVR. Makes for a more useful baseline.
Absolutely, it's wildly fun to read the outputs of even a little tiny 0.8M model trained on CPU. And now I've actually got a much better understanding of the transformer architecture after playing around with it for a day. This repo is probably going to spawn some new folks to try out ideas which will turn into new researchers in the field, no doubt.
the shakespeare code tuned a little with different training data does a good job of generating Magic The Gathering commander decks
Somewhat related: I wrote up a MTG card generator based on nanoGPT a while ago that I think produces pretty good results for being 1m parameters.
The real neat thing about this is that WotC makes a few thousand new cards each year, so my training data set just grows over time and the model gets better with no effort spent on my part.
https://github.com/jlwitthuhn/TCGGPT
It would be interesting to come up with a use case which requires a freshly trained model and isn't just something that generic models can already do, especially with a 1M-token context window.
would love more details on this. this is exactly the type of project I'd like to dabble in to get more up to speed.
People have been doing this for a while.
https://x.com/roborosewater
https://bsky.app/profile/roborosewaterm.bsky.social
You can see the invention of RLHF/ChatGPT here because text generation suddenly became much more coherent and also much less interesting. You have to go back to older tech for surrealism because nobody will let you see the good stuff (the base models).
I guess I was much more interested in being able to work with an LLM to create good, synergistic Commander decks and less interested in generating custom Magic cards.
I'm sure I can dig up info on how to do this and piece it together, but I thought OP might have a guide specifically for it.
FWIW, there was a pretty popular post on HN around generating MTG cards using AI a couple years back but I believe that their approach was a fine-tune on an existing LLM.
https://news.ycombinator.com/item?id=37427854
I like the idea of specific-purpose toy models. How did you tune the code and what dataset you used?
Nice! His Shakespeare generator was one of the first projects I tried after ollama. The goal was to understand what LLMs were about.
I have been on an LLM binge this last week or so trying to build a from-scratch training and inference system with two back ends:
- CPU (backed by JAX)
- GPU (backed by wgpu-py). This is critical for me as I am unwilling to deal with the nonsense that is rocm/pytorch. Vulkan works for me. That is what I use with llama-cpp.
I got both back ends working last week, but the GPU back end was buggy. So the week has been about fixing bugs, refactoring the WGSL code, making things more efficient.
I am using LLMs extensively in this process and they have been a revelation. Use a nice refactoring prompt and they are able to fix things one by one resulting in something fully functional and type-checked by astral ty.
Unwilling to deal with pytorch? You couldn't possibly hobble yourself anymore if you tried.
If you want to train/sample large models, then use what the rest of the industry uses.
My use case is different. I want something that I can run quickly on one GPU without worrying about whether it is supported or not.
I am interested in convenience, not in squeezing out the last bit of performance from a card.
You wildly misunderstand pytorch.
What is there to misunderstand? It doesn't even install properly most of the time on my machine. You have to use a specific python version.
I gave up on all tools that depend on it for inference. llama-cpp compiles cleanly on my system for Vulkan. I want the same simplicity to test model training.
pytorch is as easy as you are going to find for your exact use case. If you can't handle the requirement of a specific version of python, you are going to struggle in software land. ChatGPT can show you the way.
I have been doing this for 25 years and no longer have the patience to deal with stuff like this. I am never going to install Arch from scratch by building the configuration by hand ever again. The same with pytorch and rocm.
Getting them to work and recognize my GPU without passing arcane flags was a problem. I could at least avoid the pain with llama-cpp because of its vulkan support. pytorch apparently doesn't have a vulkan backend. So I decided to roll out my own wgpu-py one.
FWIW, I've been experimenting with LLMs for the last couple of years, and have exclusively built everything I do around llama.cpp exactly because of the issues you highlight. "gem install hairball" has gone way too far, and I appreciate shallow dependency stacks.
Fair enough I guess. I think you'll find the relatively minor headache worth it. Pytorch brings a lot to the table.
I suspect the OP's issues might be mostly related to the ROCM version of PyTorch. AMD still can't get this right.
Probably - but the answer is to avoid ROCM, not pytorch.
Avoiding ROCm means buying a new Nvidia GPU. Some people would like to keep using the hardware they already have.
> Thank you to chief LLM whisperer Alec Radford for advice/guidance.
oh man an Alec x Andrej podcast would BREAK THE INTERNET... just saying... going from glory days of GPT1 to now building GPT3? in 4 hours
Please oh please. This would be perfect.
Eureka Labs: https://github.com/EurekaLabsAI
What a prolific person Andrej is. It's been more than amazing to follow along!
So could I in practice train it on all my psychology books, materials, reports, case studies and research papers, and then run it on demand on a 1xH100 node - https://getdeploying.com/reference/cloud-gpu/nvidia-h100 - whenever I have a specialised question?
You could do that indeed, but the performance would be abysmal. For this kind of use-case, it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
> it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
I noticed NewRelic has a chat feature that does this sort of thing, it's scoped very narrowly down to their website and analytics DSL language, and generates charts/data from their db. I've always wondered how they did that (specifically in terms of set up the training/RAG + guardrails). It's super useful.
You might be able to figure that out just by asking it - see if you can get it to spit out a copy of the system prompt or tell you what tools it has access to.
The most likely way of building that would be to equip it with a "search_docs" tool that lets it look up relevant information for your query. No need to train an extra model at all if you do that.
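A minimal sketch of what such a tool could look like, assuming a folder of plain-text notes (the function name, directory and snippet size are all made up; in practice you'd register this with whatever tool-calling mechanism your model provider exposes):

from pathlib import Path

def search_docs(query: str, corpus_dir: str = "notes/", max_hits: int = 5):
    # Naive keyword search over local .txt files; returns file + snippet pairs
    # that the model can quote from when answering.
    hits = []
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        text = path.read_text(errors="ignore")
        idx = text.lower().find(query.lower())
        if idx != -1:
            hits.append({"file": str(path), "snippet": text[max(0, idx - 150):idx + 150]})
        if len(hits) >= max_hits:
            break
    return hits

The model then answers from the returned snippets rather than from whatever it absorbed during training, which also makes it easy to cite sources.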
Yes, though it's possible a more-general core model, further enhanced with some other ways to bring those texts-of-interest into the working context, might perform better.
Those other ways to integrate the texts might be some form of RAG or other ideas like Apple's recent 'hierarchical memories' (https://arxiv.org/abs/2510.02375).
You could but it would be significantly worse than fine-tuning or RAG with a pre-trained model, or using a smaller model since your dataset would be so small.
No.
I've always thought about the best way to contribute to humanity: number of people you help x how much you help them. I think what Karpathy is doing is one of the highest leverage ways to achieve that.
Our current world is built on top of open source projects. This is possible because there are a lot of free resources to learn to code, so anyone from anywhere in the world can learn and make a great piece of software.
I just hope the same will happen with the AI/LLM wave.
This free tradition in software is I think one of the things that I love so much, but I don't see how it can continue with LLMs due to the extremely high training costs and the powerful hardware required for inference. It just seems like writing software will necessarily require paying rent to the LLM hosts to keep up. I guess it's possible that we'll figure out a way to do local inference in a way that is accessible to everyone in the way that most other modern software tools are, but the high training costs make that seem unlikely to me.
I also worry that as we rely on LLMs more and more, we will stop producing the kind of tutorials and other content aimed at beginners that makes it so easy to pick up programming the manual way.
There's a Stephen Boyd quote that's something like "if your optimization problem is too computationally expensive, just go on vacation to Greece for a few weeks and by the time you get back, computers might be fast enough to solve it." With LLMs there's sort of an equivalent situation with cost: how mindblowing would it be able to train this kind of LLM at all even just 4 years ago? And today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
There's also a reasonable way to "leapfrog" the training cost with a pre-trained model. So if you were doing nanochat as a learning exercise and had no money, the idea would be to code it up, run one or two very slow gradient descent iterations on your slow machine to make sure it is working, then download a pre-trained version from someone who could spare the compute.
But in this case the reason is simple: the core algorithm is O(n^2), and that is not going to be improved in a few weeks.
> today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
No, it's extremely hard to imagine since I used one of Karpathy's own models to have a basic chat bot like six years ago. Yes, it spoke nonsense; so did my GPT-2 fine tune four years ago and so does this.
And so does ChatGPT
Improvement is linear at best. I still think it's actually a log curve and GPT3 was the peak of the "fun" part of the curve. The only evidence I've seen otherwise is bullshit benchmarks, "agents" that increase performance 2x by increasing token usage 100x, and excited salesmen proclaiming the imminence of AGI
Apparently 800 million weekly users are finding ChatGPT useful in its present state.
1. According to who? Open AI? 2. Its current state is "basically free and containing no ads". I don't think this will remain true given that, as far as I know, the product is very much not making money.
Yes, that number is according to OpenAI. They released that 800m number at DevDay last week.
The most recent leaked annualized revenue rate was $12bn/year. They're spending a lot more than that but convincing customers to hand over $12bn is still a very strong indicator of demand. https://www.theinformation.com/articles/openai-hits-12-billi...
Part of that comes from Microsoft API deals. Part of that will most certainly come because the vast network of companies buy subscriptions to help "Open" "AI" [1].
Given the rest of circular deals, I'd also scrutinize if it applies to the revenue. The entanglement with the Microsoft investments and the fact that "Open" "AI" is a private company makes that difficult to research.
[1] In a U.S. startup, I went through three CEOs and three HR apps, which mysteriously had to change for no reason but to accommodate the new CEO's friends and their startups.
Even with linear progression of model capability, the curve for model usefulness could be exponential, especially if we consider model cost which will come down.
For every little bit a model gets smarter and more accurate, there are exponentially more real-world tasks it could be used for.
This. It looks like one of the keys to maintaining open source is to ensure OSS developers have access to capable models. In the best of worlds, LLM vendors would recognize that open source software is the commons that feeds their models and ensure it flourishes.
In the real world...
Maybe this isn't possible for LLMs yet, but open source versions of AlphaZero have been trained on peer-to-peer networks.
https://zero.sjeng.org/
https://katagotraining.org/
(This is a bit ranty, but due to a sincere desire for a better world, and being the recipient of personal attacks for believing a better world is achievable by a different path to others)
I feel like this point of view is an ideal not shared by one of the main branches of anti-AI sentiment.
The idea of intellectual property works against this. Rather than contributing to humanity directly, ownership of information is accumulated by individuals and then rented to humanity.
At the same time I agree that people should be able to have a livelihood that affords them the ability to create new intellectual contributions.
The service Karpathy is providing is also being provided by thousands of YouTube creators in a huge variety of topics. It's a little sad that so many must support their efforts with sponsorships from sources with varying degrees of ethical behaviour. Patreon is better but still not ideal. I sincerely believe this _is_ one of the best ways to contribute to society.
A recent Daily Show had Jon Stewart describe training AI as strip mining human knowledge. Training AI is regularly described as theft as if this position is a given without any counter argument possible. It is opinion masquerading as fact. This saddens me because it suggests to me that the war to control the narrative is being won by people who want to entrench a hypercapitalistic vision of ownership where not only is a particular expression of an idea ownable but also stakes a claim to own some of any ideas that come from viewing that expression.
I cannot see any way that this viewpoint would aid humanity as a whole, but instead assign benefits to a collection of individuals. The ability to trade intellectual property means that ownership inevitably gets passed to a smaller and smaller pool of individuals over time.
I think we really do need a new way to consider these issues in light of the modern world. When mentioning these thoughts to others a common refrain is that it doesn't matter because the powers that be (and their lobbyists) will prevent any fix from happening. I have never been fond of that particular fatalism, especially when it inhibits discussion of what would be better.
Awesome approach.
I'm all for abolishing IP if all AIs are owned communally. I.e. ideally they're utilities or flat out co-ops like some Spanish businesses.
https://en.wikipedia.org/wiki/Mondragon_Corporation
Consum (Spanish supermarket).
They don't get to use everything communally and then capitalism their way forward.
I recommend his ANN/LLM from scratch videos to people a lot because not only is he a clear instructor, but his code tends to be very Pythonic and just the right balance of terse but readable (not counting the Pytorch vectorization stuff, but that's not his fault, it's just complex). So I think people benefit just from watching and imitating his code style.
Then a single person who's learned those skills decides to poison all of us thanks to the skills acquired.
strong +1 - developers like him are heroes
If it only were so easy
As noble as the goal sounds, I think it's wrong.
Software is just a tool. Much like a hammer, a knife, or ammonium nitrate, it can be used for both good or bad.
I say this as someone who has spent almost 15 years writing software in my free time and publishing it as open source: building software and allowing anyone to use it does not automatically make other people's lives better.
A lot of my work has been used for bad purposes or what some people would consider bad purposes - cheating on tests, cheating in games, accessing personal information without permission, and in one case my work contributed to someone's doxxing. That's because as soon as you publish it, you lose control over it.
But at least with open source software, every person can use it to the same extent so if the majority of people are good, the result is likely to be more positive than negative.
With what is called AI today, only the largest corporations can afford to train the models which means they are controlled by people who have entirely different incentives from the general working population and many of whom have quite obvious antisocial personality traits.
At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
I don't have high hopes for AI to be a force for good and teaching people how toy models work, as fun as it is, is not gonna change it.
"With what is called AI today, only the largest corporations can afford to train the models"
I take it you're very positive about Andrej's new project which allows anyone to train a model for a few hundred dollars which is comparable to the state-of-the-art from just 5 years ago then.
For a few hundred dollars, given heavily-VC-subsidized hardware that is probably partially funded by nvidia and various AI companies, etc.
Can I run it on my local hardware (nvidia consumer card, AMD cpu)? No. When could that corporation cut off my access to that hardware if I did anything it didn't like? Anytime.
Lots of things have started off cheap / subsidized to put competitors out of business, and then the prices go up, up and up..
> Can I run it on my local hardware?
Yes. The training process requires big expensive GPUs. The model it produces has 561M parameters, which should run on even a high end mobile phone (I run 4B models on my iPhone).
I would genuinely love to think otherwise. But I've seen and grown up seeing good things being used in stupid ways (not necessarily for malice)
> At least 2 billion people live in dictatorships. AI has the potential to become a tool of mass surveillance and total oppression from which those countries will never recover because just like the models can detect a woman is pregnant before she knows it, it will detect a dissenter long before dissent turns into resistance.
It already works like this in your precious western democracies and they didn't need AI to be authoritarian total surveillance states in spirit, with quite a lot of support from a propagandized populace that begged for or pretended to agree with the infringement of their civil rights because of terrorism, drugs, covid or protecting the poor poor children.
You can combat tech with legislation and culture, but the legislation and culture were way ahead of the tech in being extremely authoritarian in the first place.
I don't know man. All this "tech" didn't see AOC, Sanders, and other 'radicals' coming. The parties actually had to expend effort after the fact to delegitimize them and have to continue to do so for additional candidates that come along(Jamal Bowman, Cori Bush, etc.)
I‘m afraid the technology will do more damage because many people will abuse it for fake news and misinformation.
Yeah it feels similar to inventing the nuke. Or it’s even more insidious because the harmful effects of the tech are not nearly as obvious or immediate as the good effects, so less restraint is applied. But also, similar to the nuke, once the knowledge on how to do it is out there, someone’s going to use it, which obligates everyone else to use it to keep up.
While documenting a build path is nice, IMHO renting hardware nobody can afford from VC-backed cloud providers using cold hard cash to produce clones of legacy tech using toy datasets under the guise of education is propping up the AI bubble and primarily helping institutional shareholders in those AI bubble companies, particularly their hardware supplier NVidia. Personally I do not see this as helping people or humanity.
This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc.
"This would sit better with me if the repo included a first tier use case for local execution, non-NVidia hardware reference, etc."
This is a pretty disheartening way to respond to something like this. Someone puts a great deal of effort into giving something interesting away for free, and is told "you should have also done THIS work for free as well in order for me to value your contribution".
It is an objective and transparent response based on free software world norms. Feel free to interpret differently and to be disheartened. Hell, many of us are disheartened by the AI VC political theater we are seeing right now: experienced programmers, artists, lawyers, perhaps much of humanity. Let's stick to objective elements of the discussion, not emotional opining.
Tinkering with something is what inspires next generation of innovators, in this space or another.
Think back to your first experience with tech, something you just earnestly thought was cool...
If you can't afford $100 or learn how to train it locally with more time and less money, then this isn't something you should be focusing on at all.
It is amusing to note the dichotomy between the clearly compassionate, empathetic and altruistic perspective displayed here and the comically overstated framing of helping humanity.
(Shrug) Other sites beckon.
This is wholly unhelpful.
I think you got your proportions slightly wrong there. This will be contributing as much to an AI bubble as a kid tinkering around with combustion is contributing to global warming.
Not really. Anything that guy does sets the tone for an extended cacophony of fans and followers. It would be a sad day when nobody critically assesses the motivations, effects and framing of those moves. I question the claim this move helps humanity and stand by the assessment it's just more feeding an unfree ecosystem which equates to propping up the bubble.
That certainly sounds very ominous.
He is the GOAT of LLM MVPs. That is educational and useful, especially because he uses a minimal and clean style, but I don't see how it even compares with kernels, operating systems etc.
So I appreciate his work in an academic and educational sense, but large scale applications with stolen training material are still theft.
I would adjust your formula to:
number of people you help x how much you help them x number of people you harm x how much you harm them
For example: harming all the content creators of the world a little bit by stealing their work without compensation or permission. How much does that cost globally, year after year? How do we even quantify the long-term consequences of that? Stuff like that.
If you consider the cost of hiring a human professional versus using multimodal AI for something, you can easily realize literally thousands of dollars of value per chat.
Multiply that by many billions of chats per day.
Lawyers and other professionals charge a lot. So do artists, especially when you want to do a million revisions. LLMs hand it out for free, making many knowledge and art professions affordable and accessible to the masses.
Stable owners were upset when cars replaced horses, but you can't stop progress, especially when the value proposition is undeniable.
I wonder what people will do when they realize that LLM lawyers produce insufficient results, but "suddenly" all the cheap bottom-rung lawyers are gone and have switched professions.
As for the LLM "creative" content, have you seen or read it? Well, same problem. Once you actually need quality content, good luck finding a cheap creator. Pay full price for an experienced one and likely wait.
PS: I don't doubt that LLMs are here to stay. They will see a lot of usage and pervade all industries. It's just that the future will be pretty shit: talking on the phone with LLMs, reading LLM slop, seeing LLM slop everywhere, receiving generated emails and using LLMs to reverse-parse them to search for the actual content, a major economic downturn, rapidly slowing salary growth (not that it was big before), etc.
Wow, how do we sign up for the Eurekalabs course and how much does it cost?
Still under development, remaining work includes tuning nanochat (current state being solid v0.1) and finalizing the in-between projects so that students can "unlock" all complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops-heavy aspects.
What's the pricing for the course/EurekaLabs? P.s. thanks for all you're doing
Karpathy says nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.
I guess it’s still a work in progress? Couldn’t find any other information elsewhere.
A bit more info [here](https://github.com/karpathy/LLM101n)
Would love to hear some metrics on training it on your personal computer rather than a "cloud GPU box". I don't care if it takes 3 months to train if I have something good, offline, and free(ish, but just pay electric bills)
$100 to train a sort-of-talkable model in 4 hours? Wow.
Here's the announcement post [0] from Karpathy, which provides a bit of additional context.
[0] https://x.com/karpathy/status/1977755427569111362
Thanks - we'll put that in the toptext as well
This is really inspiring! Does anyone have some example of how well or poorly it performs on some example prompts?
Simon.incutio.com points out that there are screenshots on https://xcancel.com/karpathy/status/1977755430093980034.
Built so many nano AIs over the last several years. I have played with nanoGPT, it's OK. Just hype for Karpathy... So many tiny LLMs out there now that run on cheap SoCs. Try SmolVLM512, runs fine on a sub-$100 Pi.
Which data is used for training?
karpathy/fineweb-edu-100b-shuffle: https://huggingface.co/datasets/karpathy/fineweb-edu-100b-sh...
Which is derived from HuggingFaceFW/fineweb-edu: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
HuggingFaceTB/smol-smoltalk: https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk
And extra fine-tuning on portions of:
cais/mmlu: https://huggingface.co/datasets/cais/mmlu
openai/gsm8k: https://huggingface.co/datasets/openai/gsm8k
allenai/ai2_arc: https://huggingface.co/datasets/allenai/ai2_arc
I think he mentioned somewhere he used fineweb (I assume this one https://huggingface.co/datasets/HuggingFaceFW/fineweb)
The title is extremely misleading - you have to rent time on an H100 cluster to get it to work. It is not on-device, and thus not truly $100.
I was really excited, too, until I looked through the readme files and the code.
The title is saying you can train your own model for $100. That part is true: the $100 goes to the cloud provider to rent you $250k of hardware for four hours. Then you can run that model on whatever hardware you have lying around, because it's really small.
What's misleading about that? You rent $100 of time on an H100 to train the model.
It's about training a model from scratch for $100.
I feel the same. The title sounds like I could have an on-device ChatGPT for $100, forever. I couldn't imagine it was about training the model myself.
Since the resulting model is only ~561M parameters you could run it on a Raspberry Pi that costs less than $100.
Thanks Andrej for putting this up. Your videos gave me the confidence to work full time on LLMs last year after I left Microsoft
8XH100 nodes start at ~$450ish/day. Not sure about the $100 part. I need to dig into the post.
The quoted $100 price is for 4 hours at $24/hour (4 × $24 ≈ $96). Your $450/day works out to 450 / 24 = $18.75/hour, so the numbers roughly match.
Thanks. Working on platforms - days are more interesting than hours.
"The fastest way to feel the magic is to run the speedrun script speedrun.sh, which trains and inferences the $100 tier of nanochat. On an 8XH100 node at $24/hr, this gives a total run time of about 4 hours."
I am clueless and don't understand this. Where is the $100 being spent? Some sort of API you have to pay to access? Some sort of virtual hardware you have to rent access to?
H100s are expensive NVIDIA GPUs, each costing about $30,000. 8XH100 means you have 8 of those wired together in a big server in a data center somewhere, so around a quarter of a million dollars worth of hardware in a single box.
You need that much hardware because each H100 provides 80GB of GPU-accessible RAM, but to train this model you need to hold a LOT of model weights and training data in memory at once. 80*8 = 640GB.
~$24/hour is how much it costs to rent that machine from various providers.
Perfectly explained, thanks!
Thank you.
Renting 8 H100s would cost you about $24/h
Andrej Karpathy slays again by spreading knowledge about this important subject to the people!
Should be "that you can train for $100"
Curious to try it someday on a set of specialized documents. Though as I understand it, the cost of running this is whatever GPU you can rent with 80GB of VRAM, which kind of leaves hobbyists and students out. Unless some cloud is donating GPU compute capacity.
A GPU with 80GB VRAM costs around $1-3 USD an hour on commodity clouds (i.e. the non-Big 3 bare metal providers e.g. https://getdeploying.com/reference/cloud-gpu/nvidia-h100). I think it's accessible to most middle class users in first world countries.
Isn’t the whole point to run your model locally?
No, that’s clearly not a goal of this project.
This is a learning tool. If you want a local model you are almost certainly better using something trained on far more compute. (Deepseek, Qwen, etc)
The 80 GB are for training with a batch size of 32 times 2048 tokens each. Since the model has only about 560M parameters, you could probably run it on CPU, if a bit slow.
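Back-of-the-envelope: ~560M parameters × 4 bytes (fp32) ≈ 2.2 GB of weights, or roughly half that in bf16, so the weights alone fit on a CPU or a modest consumer GPU. The 80 GB is really about training, where you additionally hold gradients, optimizer state and activations for those 32 × 2048-token batches.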
You can run a model locally on much less expensive hardware. It's training that requires the really big GPUs.
I'd guess that this will output faster than the average reader can read, even while using only CPU inferencing on a modern-ish CPU.
The param count is small enough that even cheap (<$500) GPUs would work too.
If I have let’s say 40gb RAM does it not work at all or just take twice as long to train?
Won't work at all. Or if it does it'll be so slow since it'll have to go to the disk for every single calculation so it won't ever finish.
It will work great with a 40GB GPU, probably a bit less than twice as slow. These are micro models of a few B params at most and fit easily during both training and inference.
How low can this go? Can this run on a 5090 card (32GiB)?
Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However it's way slower than expected; one H100 isn't 250x the 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics are not accurate in this config.
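Concretely, the single-GPU invocation looks something like this (the script path is a placeholder, so check speedrun.sh for the exact module; treat this as a sketch rather than a verbatim command):

torchrun --standalone --nproc_per_node=1 <training_script.py> --device_batch_size=4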
$100 to teach us all how to build an LLM, this is what open education should look like.
I would love to take an existing open-weight model and fine-tune it with specific training data along these lines. Can I do that with Qwen or GLM? Is there a ~simple recipe for doing that?
Very cool project! Hopefully it will propel SLM development
This is an LLM trained using a $100 budget to RENT access to graphics cards. It's not about what you could do BUYING hardware for $100.
Nowhere does he suggest he is buying hardware.
I’m very excited for this. An early question I have: what would need to be done to make this a “thinking” model?
>If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.
That sounds like it could run on a 24gb GPU. Batch size of 8 would imply 20gb mem, no?
...presumably just takes forever
Yes, you can always stream data when training or doing inference on models when VRAM is lacking, but the slowdown is extremely noticeable. This is the case for CPU code too, and is why optimising for bandwidth is so critical in high-performance computing. Your ability to compute is almost always substantially larger than your bandwidth. An AVX-512-capable CPU with a suitable number of cores can easily chew through multiple terabytes' worth of fp64 operands per second, but is typically limited by memory bandwidth; GPUs running LLMs have just broadened this knowledge to more people.
A fun consequence of CPUs getting faster at a quicker rate than memory: lookup tables of pre-computed values used to be a common optimisation in code, but for common use cases it is now almost always quicker to re-compute a value than to retrieve a pre-computed one from memory.
> Batch size of 8 would imply 20gb mem, no?
I'm running it now and I had to go down to 4 instead of 8, and that 4 is using around 22-23GB of GPU memory. Not sure if something is wrong or if batch is only scaling part of the memory requirements. (Edit: I restarted running the training script directly instead of torch run, and 8 still doesn't fit, but 4 is now using 16-17 instead.)
On my 4090 the tok/sec is 523, which is 1/2000 of the 1,000,000 tok/sec of the 8 80GB H100s. That feels too slow so maybe something is wrong. The 4090 is about 1/3 of the raw compute. I'm sure there's other losses from less batching but even if it were 1/10ths as fast, I'd expected something more like 1,000,000 / 10 / 8 so at least 10,000 tok/sec.
Thanks for investigating. Sounds like throwing some dollars at a cloud gpu makes more sense then
Ah, this is a nice project. I'll start hacking once it's easier to fine-tune it with your own documents for specific questions. What plagues me, though, is how you prevent the model from answering questions it was not trained for?
This is absolutely fantastic. I really can't wait for the final course to be live. It's in the "shut up and take my money" category. I had so much fun with the nanoGPT videos.
This is going to be the single most powerful boost to my indie research efforts in years. Thank you, Andrej!
I see Karpathy, I click
These are the kind of community posts that are legendary.
End-to-end training is a different beast, but fine-tuning and inference of impressive LLMs like Qwen3 can be done on pretty run-of-the-mill hardware like Apple Silicon Macs and gaming PCs if anyone wants a personalized assistant with character. Just ask AI how to fine-tune AI using Unsloth (if using NVIDIA) or MLX (for Apple) and it will give you ready-to-run Python scripts.
I wonder, if something like this were trained on Wikipedia, could it become a reliable local Wikipedia search engine, basically?
I don't think so. Training on documents is not a great way of building a search engine for the information in those documents, because the training process mixes all of that information together in ways that detach the individual words from the source documents they came from.
As usual, if you want an LLM to be able to help search a corpus of text the best way to achieve that is to teach it how to use a search tool against that text.
> the best way to achieve that is to teach it how to use a search tool against that text.
Any examples of this?
I've seen this called "agentic RAG" by some people. The easiest way to get a local demo is with Claude Code or Codex CLI. They know how to use grep, and you can set them loose on a folder full of text files and tell them to use grep to answer questions - it can work really well.
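A minimal sketch of that pattern outside of Claude Code / Codex: the grep tool is the real part, while `chat_with_model` is a placeholder for whatever LLM client you're using.

```python
# Agentic-search sketch: let the model pick a search term, run grep over a
# folder of text files, then have it answer from the matching lines.
# chat_with_model() is a placeholder for your LLM API of choice.
import subprocess

def grep_corpus(term: str, corpus_dir: str = "notes/") -> str:
    """Case-insensitive recursive grep over the corpus; returns matching lines."""
    result = subprocess.run(
        ["grep", "-r", "-i", term, corpus_dir],
        capture_output=True, text=True,
    )
    return result.stdout[:4000]   # truncate so it fits comfortably in the prompt

def answer(question: str, chat_with_model) -> str:
    # 1. Ask the model for a single keyword worth grepping for.
    term = chat_with_model(f"Give one short grep keyword for: {question}").strip()
    # 2. Run the tool, then hand the evidence back to the model to answer from.
    hits = grep_corpus(term)
    return chat_with_model(f"Excerpts:\n{hits}\n\nUsing only these, answer: {question}")
```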
I just tried this in "claude --dangerously-skip-permissions":
> Use Python and AppleScript to find Apple Notes that mention UPS
... and fell down a rabbit hole of optimizations because my Notes collection is HUGE, but it got there in the end!
> nanochat is designed to run on a single 8XH100 node
from their promotional material:
> Why is the sky blue?
> The sky is blue due to an optical illusion called the Rayleigh Scattering
Rayleigh Scattering is not an illusion but an effect.
> […] particles are made up of tiny blue and violet particles that cause the light to bend in a particular way.
Ugh. No, there are no "tiny blue" particles in the sky.
That was the point. That example is meant to demonstrate that a model trained for 4 hours can imitate a conversation but isn't actually anywhere close to being useful.
Where did you find that?
It's in this screenshot: https://twitter.com/karpathy/status/1977755430093980034
Edit: direct link to image: https://pbs.twimg.com/media/G3Jjxmba8AA5mSs.jpg
Aha, thanks!
Not sure why you are being downvoted. That 'explanation' of Rayleigh scattering is just wrong.
Because that explanation is expected to be wrong. This is a partially trained tiny model; Andrej shared that obviously incorrect explanation to emphasize that the model is not trustworthy or useful at that stage.
Downvoted for being obvious in context / missing the point and getting worked up about it. He even said it's like talking to a kindergartener.
Has the word ChatGPT become generic? This has nothing to do with OpenAI's ChatGPT.
It's a reasonable shortcut for what this project provides: training code, inference code and a ChatGPT-style web interface for chatting with the model.
I believe you that you just meant this as an interesting example, and in that sense were engaged in curious conversation (generally what we want here). But the amount of provocation in the comment is so high, and the amount of information so little, that it ends up on the wrong side of "Eschew flamebait. Avoid generic tangents." (https://news.ycombinator.com/newsguidelines.html). In other words: not gonna end well.
We detached this subthread from https://news.ycombinator.com/item?id=45569878.
Controlling culture, yes, but it's a wild pivot to mention that criminal alongside Karpathy.
Not a particularly ethical guy, and I wouldn't hold him up as an example of morality, but he hasn't actually been found guilty YET. Multiple courts have tried. You'd think that for someone under as much scrutiny as him, they would have SOMETHING to pin on him by now.
Innocent until PROVEN guilty is a foundational legal principle for a reason.
He is definitely guilty of being a waste of human life, a massive asshole and a general detriment to society worldwide. Don’t need a court to prove that.
There are six criminal cases against him across several countries; let's see how they pan out. But regardless, he is not an innocent person.
I meant it just as an example. He obviously isn't the most ethical person; it depends how you do it.
Neither are Stalin, Netanyahu, Pol Pot, Hitler, Charles Manson et al.
Way to derail the conversation. Focus on the positive people and their legacy of time, sharing, positive energy and contributions to society.
Not derailing, just pointing out effective ways of producing good, which is what I was responding to. I think it's good for people to be aware of this. Those people are all examples of people who have influenced culture for the worse; you can do it for good: Bryan Johnson, civil rights leaders, leftist streamers. Andrew Tate was just the most effective, recent, and obvious example, which is why I pointed him out.
If the AI bubble is anything to go by, how is $100 worth anything in GPT terms?
Try ~$300k to buy an 8xH100 node, lol.