🦜 Andrej Karpathy / @karpathy
---
Twitter feed for: @karpathy. Generated by rss.xcancel.com
  
  
  
  
    
      RT by @karpathy: 📄 New Guide: Running nanochat on instant clusters!
Train and inference @karpathy's end-to-end ChatGPT clone on Together’s on-demand GPU clusters — and learn how to: 
➡️Train nanochat 
➡️Nanochat inference
➡️Iterate to see if you can speed up training!
      https://rss.xcancel.com/togethercompute/status/1985452123526689190#m
      Published: November 3, 2025 21:00
    
  
    
      Beautiful technical debugging detective longread that starts with a suspicious loss curve and ends all the way down in the Objective-C++ depths of the PyTorch MPS backend, where addcmul_ silently fails on non-contiguous output tensors. I wonder how long before an LLM can do all of this.
      https://rss.xcancel.com/karpathy/status/1982483540899237981#m
      Published: October 26, 2025 16:24
    
  
    
      Last night I taught nanochat d32 how to count 'r' in strawberry (or similar variations). I thought this would be a good/fun example of how to add capabilities to nanochat and I wrote up a full guide here:
https://github.com/karpathy/nanochat/discussions/164
This is done via a new synthetic task `SpellingBee`  that generates examples of a user asking for this kind of a problem, and an ideal solution from an assistant. We then midtrain/SFT finetune on these to endow the LLM with the capability, or further train with RL to make it more robust. There are many details to get right especially at smaller model sizes and the guide steps through them. As a brief overview:
- You have to ensure diversity in user prompts/queries
- For small models like nanochat especially, you have to be really careful with the tokenization details to make the task easy for an LLM. In particular, you have to be careful with whitespace, and then you have to spread the reasoning computation across many tokens of partial solution: first we standardize the word into quotes, then we spell it out (to break up tokens), then we iterate and keep an explicit counter, etc.
- I am encouraging the model to solve the task in two separate ways: a manual way (mental arithmetic in its head) and also via tool use of the Python interpreter that nanochat has access to. This is a bit "smoke and mirrors" because every solution atm is "clean", with no mistakes. One could either adjust the task to simulate mistakes and demonstrate recoveries by example, or run RL. Most likely, a combination of both works best, where the former acts as the prior for the RL and gives it things to work with.
If nanochat were a much bigger model, you'd expect or hope for this capability to more easily "pop out" at some point. But because the nanochat d32 "brain" is the size of a ~honeybee's, if we want it to count r's in strawberry, we have to over-represent it in the data, to encourage the model to learn it earlier. But it works! :)
      https://rss.xcancel.com/karpathy/status/1981746327995465816#m
      Published: October 24, 2025 15:35
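      A minimal sketch of the kind of synthetic example generation described above. The names, word list, and templates here are hypothetical and far simpler than the actual `SpellingBee` task in the nanochat repo (which handles many more tokenization details), but it shows the structure: diverse user prompts, then an ideal solution that standardizes the word, spells it out, and keeps an explicit running counter.

```python
import random

WORDS = ["strawberry", "blueberry", "mississippi", "bookkeeper"]
TEMPLATES = [
    "How many times does the letter '{letter}' appear in '{word}'?",
    "Count the '{letter}'s in the word '{word}'.",
    "How many '{letter}' are in '{word}'?",
]

def make_example(rng: random.Random) -> dict:
    """Generate one (user, assistant) pair for the letter-counting task."""
    word = rng.choice(WORDS)
    letter = rng.choice(sorted(set(word)))
    user = rng.choice(TEMPLATES).format(letter=letter, word=word)

    # Ideal solution: standardize the word in quotes, spell it out to break up
    # tokens, then iterate with an explicit running counter.
    lines = [f"We need to count '{letter}' in '{word}'."]
    lines.append("Spelled out: " + " ".join(word))
    count = 0
    for i, ch in enumerate(word, start=1):
        if ch == letter:
            count += 1
            lines.append(f"{i}. '{ch}' -> match, count = {count}")
        else:
            lines.append(f"{i}. '{ch}' -> no match, count = {count}")
    lines.append(f"Answer: {count}")
    return {"user": user, "assistant": "\n".join(lines)}

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(make_example(rng), "\n")
```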
    
  
    
      R to @karpathy: See this new Discussion for more technical detail
Guide: infusing identity to your nanochat
https://github.com/karpathy/nanochat/discussions/139
      https://rss.xcancel.com/karpathy/status/1980665253622091881#m
      Published: October 21, 2025 15:59
    
  
    
      nanochat now has a primordial identity and can talk a bit about itself and its capabilities (e.g. it knows it's nanochat d32 that cost $800, that it was built by me, that it can't speak languages other than English too well and why, etc.).
This kind of customization is all done through synthetic data generation and I uploaded a new example script to demonstrate. It's a bit subtle but by default LLMs have no inherent personality or any understanding of their own capabilities because they are not animal-like entities. They don't know what they are or what they can or can't do or know or don't know. All of it has to be explicitly bolted on. This is done by asking a bigger LLM cousin to generate synthetic conversations (you tell it what they should look like simply in words), and then mixing them into the midtraining and/or SFT stage. The most important challenge is ensuring enough entropy/diversity in your generated data. If you don't do it well, LLMs will generate 1000 conversations that are all way too similar, even with high temperature. My script shows a crappy example of how to add diversity - e.g. by creating lists of starting messages or topics, sampling from them explicitly, adding them as fewshot examples into prompts for "inspiration", etc.
I wanted to have some fun with it so nanochat now refers to me as King Andrej Karpathy (lol) just to illustrate that this is a giant blank canvas - you can infuse completely arbitrary identity, knowledge or style into your LLM in this manner. I hope it's helpful and sparks fun ideas!
      https://rss.xcancel.com/karpathy/status/1980665134415802554#m
      Published: October 21, 2025 15:59
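      A rough sketch of the diversity trick described above: sample topics and opening messages explicitly and inject them (plus a few hand-written conversations for inspiration) into the prompt sent to a bigger LLM. Everything here (`TOPICS`, `OPENERS`, `call_big_llm`) is a hypothetical stand-in, not the actual example script.

```python
import random

TOPICS = ["who built you", "what languages you speak", "how much you cost to train",
          "what you are bad at", "your favorite book"]
OPENERS = ["hey there!", "what are you exactly?", "can you help me with something?",
           "tell me about yourself"]

IDENTITY = (
    "You are nanochat d32, a small LLM trained for about $800 by Andrej Karpathy. "
    "You speak English best and are upfront about your limitations."
)

def build_generation_prompt(rng: random.Random, fewshot: list[str]) -> str:
    """Build one prompt asking a bigger LLM to write a synthetic identity conversation.

    Sampling the topic and opener explicitly (rather than hoping temperature
    provides variety) is what keeps the generated dataset diverse.
    """
    topic = rng.choice(TOPICS)
    opener = rng.choice(OPENERS)
    examples = "\n\n".join(fewshot)  # a few hand-written conversations for inspiration
    return (
        f"{IDENTITY}\n\n"
        f"Example conversations for inspiration:\n{examples}\n\n"
        f"Now write a new short conversation where the user opens with "
        f"'{opener}' and the discussion touches on: {topic}. "
        f"Return it as alternating 'User:' / 'Assistant:' lines."
    )

# prompts = [build_generation_prompt(random.Random(i), fewshot=HANDWRITTEN) for i in range(1000)]
# conversations = [call_big_llm(p) for p in prompts]  # hypothetical API call
```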
    
  
    
      I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than dots), and yes data collection etc., but anyway it doesn't matter.
The more interesting part for me (esp as a computer vision person at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.
Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
- more information compression (see paper) => shorter context windows, more efficiency
- significantly more general information stream => not just text, but e.g. bold text, colored text, arbitrary images. 
- input can now be processed with bidirectional attention easily and as default, not autoregressive attention - a lot more powerful.
- delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are an ugly, separate, non-end-to-end stage. The tokenizer "imports" all the ugliness of Unicode and byte encodings, it inherits a lot of historical baggage and security/jailbreak risk (e.g. continuation bytes). It makes two characters that look identical to the eye look like two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all, and all the transfer learning that brings along. The tokenizer must go.
OCR is just one of many useful vision -> text tasks. And text -> text tasks can be made to be vision -> text tasks. Not vice versa.
So maybe the User message is images, but the decoder (the Assistant response) remains text. It's a lot less obvious how to output pixels realistically... or if you'd want to.
Now I have to also fight the urge to side quest an image-input-only version of nanochat...
      https://rss.xcancel.com/karpathy/status/1980397031542989305#m
      Published: October 20, 2025 22:13
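      For concreteness, a toy sketch of the "render the text and feed in pixels" idea using Pillow; the function name and sizes are made up, and the vision encoder / patchification that would consume the resulting array is out of scope here.

```python
from PIL import Image, ImageDraw, ImageFont
import numpy as np

def render_text_to_image(text: str, width: int = 512, height: int = 128) -> np.ndarray:
    """Rasterize a text string into a grayscale image array.

    Instead of tokenizing the text, you could hand this pixel grid
    (patchified) to a vision encoder as the LLM's input.
    """
    img = Image.new("L", (width, height), color=255)   # white background
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    draw.multiline_text((8, 8), text, fill=0, font=font)
    return np.asarray(img, dtype=np.float32) / 255.0   # normalize to [0, 1]

pixels = render_text_to_image("The tokenizer must go.\nRender me instead.")
print(pixels.shape)  # (128, 512)
```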
    
  
    
      Nice, short post illustrating how simple text (discrete) diffusion can be.
Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've seen a bit of both.
A lot of diffusion papers look a bit dense but if you strip the mathematical formalism, you end up with simple baseline algorithms, e.g. something a lot closer to flow matching in the continuous case, or something like this in the discrete case. It's your vanilla transformer but with bi-directional attention, where you iteratively re-sample and re-mask all tokens in your "tokens canvas" based on a noise schedule until you get the final sample at the last step. (Bi-directional attention is a lot more powerful, and you get a lot stronger language models if you train with it; unfortunately it makes training a lot more expensive because now you can't parallelize across the sequence dim.)
So autoregression is doing an `.append(token)` to the tokens canvas while only attending backwards, while diffusion is refreshing the entire token canvas with a `.setitem(idx, token)` while attending bidirectionally. Human thought naively feels a bit more like autoregression but it's hard to say that there aren't more diffusion-like components in some latent space of thought. It feels quite possible that you can further interpolate between them, or generalize them further. And it's a component of the LLM stack that still feels a bit fungible.
Now I must resist the urge to side quest into training nanochat with diffusion.
      https://rss.xcancel.com/karpathy/status/1980347971935068380#m
      Published: October 20, 2025 18:58
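      A toy sketch of the iterative re-sample/re-mask loop described above, with a tiny randomly initialized bidirectional denoiser standing in for a trained model (so the output is gibberish, but the control flow is the point): start from a fully masked canvas, predict every position in parallel, keep the most confident guesses according to a schedule, re-mask the rest, repeat. All names and sizes are made up.

```python
import torch
import torch.nn as nn

VOCAB, MASK, SEQ_LEN, STEPS = 256, 0, 32, 8

class TinyDenoiser(nn.Module):
    """Bidirectional transformer: predicts a distribution over every position at once."""
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.pos = nn.Parameter(torch.randn(SEQ_LEN, d) * 0.02)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # no causal mask
        self.head = nn.Linear(d, VOCAB)

    def forward(self, tokens):                       # tokens: (B, SEQ_LEN)
        h = self.encoder(self.emb(tokens) + self.pos)
        return self.head(h)                          # (B, SEQ_LEN, VOCAB)

@torch.no_grad()
def sample(model, steps=STEPS):
    canvas = torch.full((1, SEQ_LEN), MASK)          # start fully masked
    for step in range(steps):
        logits = model(canvas)
        probs = torch.softmax(logits, dim=-1)
        conf, guess = probs.max(dim=-1)              # most likely token per position
        # Noise schedule: keep only the top-k most confident guesses,
        # re-mask everything else, and unmask more of the canvas each step.
        k = int(SEQ_LEN * (step + 1) / steps)
        keep = conf.topk(k, dim=-1).indices
        canvas = torch.full_like(canvas, MASK)
        canvas.scatter_(1, keep, guess.gather(1, keep))
    return canvas

print(sample(TinyDenoiser().eval()))
```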
    
  
    
      My pleasure to come on Dwarkesh last week, I thought the questions and conversation were really good.
I re-watched the pod just now too. First of all, yes I know, and I'm sorry that I speak so fast :). It's to my detriment because sometimes my speaking thread out-executes my thinking thread, so I think I botched a few explanations due to that, and sometimes I was also nervous that I'm going too much on a tangent or too deep into something relatively spurious. Anyway, a few notes/pointers:
AGI timelines. My comments on AGI timelines look to be the most trending part of the early response. The "decade of agents" is a reference to this earlier tweet https://x.com/karpathy/status/1882544526033924438 Basically my AI timelines are about 5-10X pessimistic w.r.t. what you'll find in your neighborhood SF AI house party or on your twitter timeline, but still quite optimistic w.r.t. a rising tide of AI deniers and skeptics. The apparent conflict is not a conflict: imo we simultaneously 1) saw a huge amount of progress in recent years with LLMs while 2) there is still a lot of work remaining (grunt work, integration work, sensors and actuators to the physical world, societal work, safety and security work (jailbreaks, poisoning, etc.)) and also research to get done before we have an entity that you'd prefer to hire over a person for an arbitrary job in the world. I think that overall, 10 years should otherwise be a very bullish timeline for AGI, it's only in contrast to present hype that it doesn't feel that way.
Animals vs Ghosts. My earlier writeup on Sutton's podcast: https://x.com/karpathy/status/1973435013875314729 . I am suspicious of the idea that there is a single simple algorithm you can let loose on the world that learns everything from scratch. If someone builds such a thing, I will be wrong and it will be the most incredible breakthrough in AI. In my mind, animals are not an example of this at all - they are prepackaged with a ton of intelligence by evolution and the learning they do is quite minimal overall (example: Zebra at birth). Putting our engineering hats on, we're not going to redo evolution. But with LLMs we have stumbled on an alternative approach to "prepackage" a ton of intelligence in a neural network - not by evolution, but by predicting the next token over the internet. This approach leads to a different kind of entity in the intelligence space. Distinct from animals, more like ghosts or spirits. But we can (and should) make them more animal-like over time and in some ways that's what a lot of frontier work is about.
On RL. I've critiqued RL a few times already, e.g. https://x.com/karpathy/status/1944435412489171119 . First, you're "sucking supervision through a straw", so I think the signal/flop is very bad. RL is also very noisy because a completion might have lots of errors that might get encouraged (if you happen to stumble into the right answer), and conversely brilliant insight tokens that might get discouraged (if you happen to screw up later). Process supervision and LLM judges have issues too. I think we'll see alternative learning paradigms. I am long "agentic interaction" but short "reinforcement learning" https://x.com/karpathy/status/1960803117689397543 . I've seen a number of papers pop up recently that are imo barking up the right tree along the lines of what I called "system prompt learning" https://x.com/karpathy/status/1921368644069765486 , but I think there is also a gap between ideas on arxiv and an actual, at-scale implementation at an LLM frontier lab that works in a general way. I am overall quite optimistic that we'll see good progress on this dimension of remaining work quite soon, and e.g. I'd even say ChatGPT memory and so on are primordial deployed examples of new learning paradigms.
Cognitive core. My earlier post on "cognitive core": https://x.com/karpathy/status/1938626382248149433 , the idea of stripping down LLMs, of making it harder for them to memorize, or actively stripping away their memory, to make them better at generalization. Otherwise they lean too hard on what they've memorized. Humans can't memorize so easily, which now looks more like a feature than a bug by contrast. Maybe the inability to memorize is a kind of regularization. Also my post from a while back on how the trend in model size is "backwards" and why "the models have to first get larger before they can get smaller" https://x.com/karpathy/status/1814038096218083497
Time travel to Yann LeCun 1989. This is the post that I did a very hasty/bad job of describing on the pod: https://x.com/karpathy/status/1503394811188973569 . Basically - how much could you improve Yann LeCun's results with the knowledge of 33 years of algorithmic progress? How constrained were the results by each of algorithms, data, and compute? A case study thereof.
nanochat. My end-to-end implementation of the ChatGPT training/inference pipeline (the bare essentials) https://x.com/karpathy/status/1977755427569111362
On LLM agents. My critique of the industry is more that it is overshooting the tooling w.r.t. present capability. I live in what I view as an intermediate world where I want to collaborate with LLMs and where our pros/cons are matched up. The industry lives in a future where fully autonomous entities collaborate in parallel to write all the code and humans are useless. For example, I don't want an Agent that goes off for 20 minutes and comes back with 1,000 lines of code. I certainly don't feel ready to supervise a team of 10 of them. I'd like to go in chunks that I can keep in my head, where an LLM explains the code that it is writing. I'd like it to prove to me that what it did is correct, I want it to pull the API docs and show me that it used things correctly. I want it to make fewer assumptions and ask/collaborate with me when not sure about something. I want to learn along the way and become better as a programmer, not just get served mountains of code that I'm told works. I just think the tools should be more realistic w.r.t. their capability and how they fit into the industry today, and I fear that if this isn't done well we might end up with mountains of slop accumulating across software, and an increase in vulnerabilities, security breaches, etc. https://x.com/karpathy/status/1915581920022585597
Job automation. How the radiologists are doing great https://x.com/karpathy/status/1971220449515516391 and what jobs are more susceptible to automation and why.
Physics. Children should learn physics in early education not because they go on to do physics, but because it is the subject that best boots up a brain. Physicists are the intellectual embryonic stem cell https://x.com/karpathy/status/1929699637063307286 I have a longer post that has been half-written in my drafts for ~a year, which I hope to finish soon.
Thanks again Dwarkesh for having me over!
      https://rss.xcancel.com/karpathy/status/1979644538185752935#m
      Published: October 18, 2025 20:23
    
  
    
      R to @karpathy: DVD player is superior technology.
      https://rss.xcancel.com/karpathy/status/1978656449904496861#m
      Published: October 16, 2025 02:57
    
  
    
      R to @karpathy: Deliberately*
      https://rss.xcancel.com/karpathy/status/1978654822036607245#m
      Published: October 16, 2025 02:50
    
  
    
      R to @karpathy: There is a movement I found on Instagram where people delivery choose to live in 90s, refusing all technology after 2000. Like an intermediate form of the Amish.
      https://rss.xcancel.com/karpathy/status/1978654744475578568#m
      Published: October 16, 2025 02:50
    
  
    
      TV in the 90s: you turn it on, you watch.
TV 2025:
- turn on, wait for it to load
- popup: TV wants to update, 1.5GB. No.
- scroll sideways, find prime video app or etc
- popup: now app wants to update, 500MB. No!!
- App launching... App loading…
- select account screen
- 🫠
      https://rss.xcancel.com/karpathy/status/1978653908663726585#m
      Published: October 16, 2025 02:47
    
  
    
      nanochat d32, i.e. the depth 32 version that I specced for $1000 (up from $100), has finished training after ~33 hours, and looks good. All the metrics go up quite a bit across pretraining, SFT and RL. CORE score of 0.31 is now well above GPT-2 at ~0.26. GSM8K went ~8% -> ~20%, etc. So that's encouraging.
The model is pretty fun to talk to, but judging from some early interactions I think people have a little bit too much expectation for these micro models. There is a reason that frontier LLM labs raise billions to train their models. nanochat models cost $100 - $1000 to train from scratch. The $100 nanochat is 1/1000th the size of GPT-3 in parameters, which came out 5 years ago. So I urge some perspective. Talking to micro models you have to imagine you're talking to a kindergarten child. They say cute things, wrong things, they are a bit confused, a bit naive, sometimes a little non-sensical, they hallucinate a ton (but it's amusing), etc.
Full detail/report on this run is here:
https://github.com/karpathy/nanochat/discussions/8
And I pushed the new script run1000.sh to the nanochat repo if anyone would like to reproduce. Totally understand if you'd like to spend $1000 on something else :D
If you like, I am currently hosting the model so you can talk to it on a webchat as you'd talk to ChatGPT. I'm not going to post the URL here because I'm afraid it will get crushed. You'll have to look for it if you care enough. I'm also attaching a few funny conversations I had with the model earlier in the image, just to give a sense.
Next up, I am going to do one pass of tuning and optimizing the training throughput, then maybe return to scaling and training the next tier of a bigger model.
      https://rss.xcancel.com/karpathy/status/1978615547945521655#m
      Published: October 16, 2025 00:14
    
  
    
      R to @karpathy: And an example of some of the summary metrics produced by the $100 speedrun in the report card to start. The current code base is a bit over 8000 lines, but I tried to keep them clean and well-commented.
Now comes the fun part - of tuning and hillclimbing.
      https://rss.xcancel.com/karpathy/status/1977755433172443626#m
      Published: October 13, 2025 15:16
    
  
    
      R to @karpathy: GitHub repo:
https://github.com/karpathy/nanochat
A lot more detailed and technical walkthrough:
https://github.com/karpathy/nanochat/discussions/1
Example conversation with the $100, 4-hour nanochat in the WebUI. It's... entertaining :) Larger models (e.g. a 12-hour depth 26 or a 24-hour depth 30) quickly get more coherent.
      https://rss.xcancel.com/karpathy/status/1977755430093980034#m
      Published: October 13, 2025 15:16
    
  
    
      Excited to release new repo: nanochat!
(it's among the most unhinged I've written).
Unlike my earlier similar repo nanoGPT which only covered pretraining, nanochat is a minimal, from-scratch, full-stack training/inference pipeline of a simple ChatGPT clone in a single, dependency-minimal codebase. You boot up a cloud GPU box, run a single script, and as little as 4 hours later you can talk to your own LLM in a ChatGPT-like web UI.
It weighs ~8,000 lines of imo quite clean code to:
- Train the tokenizer using a new Rust implementation
- Pretrain a Transformer LLM on FineWeb, evaluate CORE score across a number of metrics
- Midtrain on user-assistant conversations from SmolTalk, multiple choice questions, tool use.
- SFT, evaluate the chat model on world knowledge multiple choice (ARC-E/C, MMLU), math (GSM8K), code (HumanEval)
- RL the model optionally on GSM8K with "GRPO"
- Inference the model efficiently in an Engine with KV cache, simple prefill/decode, tool use (Python interpreter in a lightweight sandbox); talk to it over CLI or a ChatGPT-like WebUI.
- Write a single markdown report card, summarizing and gamifying the whole thing.
Even for as low as ~$100 in cost (~4 hours on an 8XH100 node), you can train a little ChatGPT clone that you can kind of talk to, and which can write stories/poems and answer simple questions. At about ~12 hours it surpasses the GPT-2 CORE metric. As you further scale up towards ~$1000 (~41.6 hours of training), it quickly becomes a lot more coherent and can solve simple math/code problems and take multiple choice tests. E.g. a depth 30 model trained for 24 hours (this is about equal to the FLOPs of GPT-3 Small 125M, 1/1000th of GPT-3) gets into the 40s on MMLU, the 70s on ARC-Easy, the 20s on GSM8K, etc.
My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it. It is by no means finished, tuned or optimized (actually I think there's likely quite a bit of low-hanging fruit), but I think it's at a place where the overall skeleton is ok enough that it can go up on GitHub where all the parts of it can be improved.
Link to repo and a detailed walkthrough of the nanochat speedrun is in the reply.
      https://rss.xcancel.com/karpathy/status/1977755427569111362#m
      Published: October 13, 2025 15:16
    
  
    
      R to @karpathy: POV: Your LLM agent is dividing a by b
      https://rss.xcancel.com/karpathy/status/1976082963382272334#m
      Published: October 9, 2025 00:31
    
  
    
      I don't know what labs are doing to these poor LLMs during RL but they are mortally terrified of exceptions, in any infinitesimally likely case. Exceptions are a normal part of life and healthy dev process. Sign my LLM welfare petition for improved rewards in cases of exceptions.
      https://rss.xcancel.com/karpathy/status/1976077806443569355#m
      Published: October 9, 2025 00:10
    
  
    
      Every company needs a DM POC - someone high up who you can just DM the most obvious things and who shortcuts the PM hierarchy.
      https://rss.xcancel.com/karpathy/status/1974482521862865154#m
      Published: October 4, 2025 14:31
    
  
    
      For your professional programming do you use mostly:
      https://rss.xcancel.com/karpathy/status/1973892769359056997#m
      Published: October 2, 2025 23:28
    
  
    
      R to @karpathy: Hah judging by mentions overnight people seem to find the ghost analogy provocative. I swear I don't wake up just trying to come up with new memes, but to elaborate briefly why I thought it was a fun comparison:
1) It captures the idea that LLMs are purely digital artifacts that don't interact with the physical world (unlike animals, which are very embodied).
2) Ghosts are a kind of "echo" of the living, in this case a statistical distillation of humanity.
3) There is an air of mystery over both ghosts and LLMs, as in we don't fully understand what they are or how they work.
4) The process of training LLMs is a bit like summoning a ghost, i.e. a kind of elaborate computational ritual on a summoning platform of an exotic megastructure (GPU cluster). I've heard earlier references to LLM training as "summoning a demon" and it never sounded right because it implies and presupposes evil. Ghosts are a much more neutral kind of entity, just like LLMs, and may or may not be evil. For example, one of my favorite cartoons when I was a child was Casper the Friendly Ghost, clearly a friendly and wholesome entity. Same in Harry Potter, e.g. Nearly Headless Nick and such.
5) It is a nod to the earlier reference "ghost in the machine", in the context of Descartes' mind-body dualism, and of course later derived references, "Ghost in the Shell" etc. As in the mind (ghost) that animates a body (machine).
Probably a few other things in the embedding space. Among the ways the analogy isn't great is that while ghosts may or may not be evil, they are almost always spooky, which feels too unfair. But anyway, I like that while no analogy is perfect, they let you pull in structure laterally from one domain to another as a way of generating entropy and reaching unique thoughts.
      https://rss.xcancel.com/karpathy/status/1973756330449236009#m
      Published: October 2, 2025 14:25
    
  
    
      Tinker is cool.
If you're a researcher/developer, tinker dramatically simplifies LLM post-training. You retain 90% of the algorithmic creative control (usually related to data, the loss function, the algorithm) while tinker handles the hard parts that you usually want to touch much less often (infra, forward/backward of the LLM itself, distributed training), meaning you can do these at well below 10% of the typical complexity involved. Compared to the more common, existing paradigm of "upload your data, we'll post-train your LLM", this is imo a more clever place to "slice up" the complexity of post-training: you delegate the heavy lifting but keep the majority of the data/algorithmic creative control.
I think the community still has to discover how and when finetuning makes sense compared to the (often strong) baseline of prompting a giant model. The early indications I've seen are that finetuning isn't so much about "stylizing" an LLM; instead, it's a lot more about narrowing the scope, especially when you have a lot of training examples. An extreme example of scope narrowing is categorical classifiers, e.g. spam filters, content filters, etc., but it should be broader than that. Instead of building a giant few-shot prompt for a big LLM, it might work a lot better (and faster!) to finetune a smaller LLM specifically for your narrow task.
Increasingly, production applications of LLMs are larger pipelines where a bunch of LLMs collaborate in DAGs and flows. Some of these components might work well as prompts. But a lot of it will probably work a lot better as a finetune. Tinker makes the latter trivial and should allow for an easy experimentation of what works best at any stage.
      https://rss.xcancel.com/karpathy/status/1973468610917179630#m
      Published: October 1, 2025 19:22