Guide: counting r in strawberry (and how to add abilities generally) #164
Replies: 14 comments 21 replies
- Hi Andrej, it succeeds on "strawberry" but fails on "performance". Any idea why?
- @karpathy when I use a trick it does not show how, but luckily it gives the (kind of) correct answer :)
- Do you rerun the entire midtraining/SFT stages including this new task, or are you fine-tuning on it alone? The latter seems more interesting, since you would need some way to avoid forgetting too much, especially for bigger tasks. Maybe just sample and train on a small percentage of the original training dataset while training the new one?
- Interesting!
- Creating synthetic datasets is an excellent use case for state-of-the-art LLMs.
- A few questions, if any of y'all know:
- Hi @karpathy
- Tool calling is bad + after training on 5T+ tokens, the model can confidently learn the letter composition of tokens.
- I can't get it to trigger when it's not the first message; the preceding context seems to matter a lot. But very cool nonetheless :)
- Oh, bro: during training there are more words ending in "k" than words ending in "le", so it incorrectly spells "le" as "k". When asked to spell "strrawberry", it defaults to "strawberry" instead. When asked to count the number of "a"s in "nanochat", it mistakenly thinks there are 3. Why on earth is that? Why is a problem that's simple for a Python interpreter beyond the capability of a GPT model? Is it necessary to use a large-parameter model to avoid such trivial mistakes? Is this an unsolvable issue inherent to tokenization and the Transformer architecture itself?
- How would you teach this?
Last night I wanted to teach nanochat d32 how to count the number of 'r' in strawberry properly. I thought this was a cute example to illustrate how one can add any arbitrary computing capability to nanochat.
As usual this is done through synthetic data generation during midtraining and SFT. So we introduce a new task under the `tasks` directory; I called it `SpellingBee`. Its goal is to generate synthetic conversations between a user and an assistant that challenge the assistant to complete problems of this type. We then attach the intended/correct solution, which will be imitated by nanochat once we finetune on them. Here is an example conversation (run `python -m tasks.spellingbee`), generated with spellingbee.py as of the time of this post.

Every conversation generated here is a simple 2-turn conversation. First the user asks to count a letter in a word, then nanochat is asked to respond in the style above. There are a number of subtle details to note.
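Before digging into those details, here is a rough sketch of the overall shape such a generator can take. The word list, wording, and message format below are illustrative placeholders, not the actual spellingbee.py.

```python
# Illustrative sketch of a SpellingBee-style generator (not the actual
# tasks/spellingbee.py). Each example is a 2-turn conversation: the user asks
# to count a letter in a word, the assistant spells, counts, and answers.
import random

WORDS = ["strawberry", "pollocks", "nanochat", "performance"]  # placeholder word list

def generate_example(rng: random.Random) -> list[dict]:
    word = rng.choice(WORDS)
    letter = rng.choice(sorted(set(word)))
    count = word.count(letter)
    user = f"How many '{letter}' are there in '{word}'?"
    assistant = (
        f"We are asked to count '{letter}' in '{word}'.\n"
        f"Spelling it out: '{word}':{','.join(word)}\n"   # quotes and commas force token boundaries
        f"Counting manually, I get {count}.\n"
        # the real task emits nanochat's Python tool-call tokens here instead of plain text
        f"Double checking with the Python tool: {word!r}.count({letter!r}) -> {count}\n"
        f"The answer is {count}."
    )
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]

if __name__ == "__main__":
    print(generate_example(random.Random(0)))
```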
Entropy
The user prompts are "fudged" randomly for diversity. There is a list of possible prompts that looks like:
And so on, including a few other languages at the bottom so that nanochat gets "triggered" into this task even if the user asks in a different language. Doing this is important as a data augmentation so that we can evoke this functionality even if the user asks in slightly different ways.
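The actual prompt list lives in spellingbee.py; the templates and fudging below are made-up stand-ins that just show the idea.

```python
# Made-up prompt templates; the real list (including other languages) is in
# tasks/spellingbee.py. The point is to vary phrasing, casing, punctuation and
# language so the behavior triggers however the user happens to ask.
import random

PROMPT_TEMPLATES = [
    "How many '{letter}' are there in '{word}'?",
    "How many times does the letter {letter} appear in {word}?",
    "Count the number of {letter}'s in the word {word}.",
    "how many {letter} in {word}",
    "Combien de '{letter}' y a-t-il dans '{word}' ?",   # French
    "¿Cuántas letras '{letter}' hay en '{word}'?",      # Spanish
]

def fudged_prompt(word: str, letter: str, rng: random.Random) -> str:
    prompt = rng.choice(PROMPT_TEMPLATES).format(word=word, letter=letter)
    if rng.random() < 0.3:
        prompt = prompt.lower()          # sometimes all-lowercase
    if rng.random() < 0.3:
        prompt = prompt.rstrip("?. ")    # sometimes drop trailing punctuation
    return prompt
```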
Reasoning structure
You see that the response shows a little bit of reasoning to it. In particular, nanochat will try a manual approach followed by double checking with its Python interpreter tool. At this time, this is all just smoke and mirrors because every single instance of this problem is "clean" - there are no mistakes anywhere. So during training, these two always match. However, we are setting up a kind of schema of how nanochat might approach problem solving, and we are habituating the neural network to thinking in this manner. We might expect that if there are many tasks that show this kind of an approach, then nanochat will transfer learn to approach problems in this way by default. So at this point this is fully fake and cooked up. There are two ways to make it a bit more real:
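One of the directions mentioned at the end of this post is to simulate mistakes during SFT so the Python double-check produces genuine recoveries. A hypothetical knob for that in the generator could look like this (not part of the current task):

```python
import random

def manual_count(word: str, letter: str, rng: random.Random, p_mistake: float = 0.0) -> int:
    # With p_mistake = 0 (the current, fully "clean" setup) the manual pass
    # always agrees with the Python tool. Raising it occasionally injects an
    # off-by-one slip, so the tool check catches and corrects the mistake,
    # giving the model examples of recovery rather than pure agreement.
    count = word.count(letter)
    if rng.random() < p_mistake:
        count = max(0, count + rng.choice([-1, 1]))
    return count
```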
Spelling
I am being deliberate and careful with a bunch of design decisions when nanochat responds:
First, I get nanochat to explicitly spell out the word and break it up into individual letters. This is really important because a word like "pollocks", as an example, is seen by the LLM as two tokens: `poll` and `ocks`.

Notice briefly how I surround the word in single quotes. Because of the regex pattern in the tokenizer, the character `'` is guaranteed to create a token boundary. If the single quote was not there and instead the Assistant just said "We are asked to find the number of 'o' in the word pollocks.", we'd get different tokens for pollocks because of the space right in front of it. We'd instead get the token ` poll` (notice the space in front!) and `ocks` as before.

So the token `poll` is 47947 but ` poll` (with a space) is different - 2478. So even in the process of "restating" the user query, the LLM is already doing computational work under the hood because of tokenization. Whether or not the user included quotes in their query, the assistant is first asked to include them for itself, and having done that, it has simplified the problem somewhat. It then spells the word out, which is again another computational task for it. Copy pasting from above we have the spelled-out form. Notice again how I'm being careful to avoid a space right after ":". Again because of the regex splitting pattern in the tokenizer, the use of commas is forcing token boundaries, so I'm guaranteed to get just the letters, and without spaces:
(44 is the comma token). Okay so now we have the individual letters and their token ids (without any spaces).
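You can poke at this boundary effect yourself. The token IDs above (47947, 2478, 44) come from nanochat's own tokenizer; the snippet below uses tiktoken's cl100k_base instead, so the IDs and exact splits will differ, but the quote, leading-space, and comma effects are the same idea.

```python
# Inspect how quotes, leading spaces, and commas change the token split.
# cl100k_base is used here only for convenience; nanochat ships its own tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["pollocks", " pollocks", "'pollocks'", "p,o,l,l,o,c,k,s", ":s", ": s"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:20} ids={ids} pieces={pieces}")
```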
I noticed when I was playing around with some early versions that the model would often screw up the spelling. So I added one more task, `SimpleSpelling`, which emits a lot simpler problems that just ask the model to spell. This is a very token-efficient way to get the neural net to learn the spelling task.

I actually had a bug here where I included a space right after ":" but it's okay because of the magic of neural networks. But anyway, I then mix this into training too, so that the model gets more loss from the harder part (spelling), forcing the optimization to adapt the weights to get this part right.
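As a rough sketch, a SimpleSpelling-style example might look something like this (word list and wording are placeholders; the real task lives under the `tasks` directory):

```python
# Illustrative SimpleSpelling-style example: a short request plus a
# comma-separated spelling, with no space after ":" so the first letter
# isn't glued onto a space token.
import random

WORDS = ["strawberry", "pollocks", "nanochat"]  # placeholder word list

def simple_spelling_example(rng: random.Random) -> list[dict]:
    word = rng.choice(WORDS)
    return [
        {"role": "user", "content": f"Please spell the word '{word}'."},
        {"role": "assistant", "content": f"'{word}':{','.join(word)}"},
    ]
```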
Counting
Now we get to the counting section:
Notice how I am again simplifying the problem for the model. Right after the model lists a letter (without space!!), I allow it to do an explicit comparison of whether that letter is the target letter, and increment a counter. I make the counter explicit to again simplify the task. This way the model can attend back, find the latest count, and just increment it by 1.
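Concretely, the counting part of the assistant turn can be rendered roughly like this (illustrative formatting; the exact wording in spellingbee.py differs):

```python
# Render an explicit letter-by-letter comparison with a running counter, so the
# model only ever has to attend back to the latest count and bump it by one.
def render_counting(word: str, letter: str) -> str:
    lines, count = [], 0
    for ch in word:
        if ch == letter:
            count += 1
            lines.append(f"{ch} == {letter}, count = {count}")
        else:
            lines.append(f"{ch} != {letter}, count = {count}")
    lines.append(f"Total: {count}")
    return "\n".join(lines)

print(render_counting("strawberry", "r"))
```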
Anyway, you get the idea. Bigger models can do more mental arithmetic per token (inside their layers) and they have a lot more understanding of the tokens and how they translate (e.g. " pollocks" and "pollocks" are probably very nearby in some semantic latent space within the network). But smaller models care a bit more, and by obsessively scrutinizing the tokenization, you can make tasks a lot easier for them. And because smaller models aren't as capable, you also gain performance by really breaking up the entire problem into all the implicit calculations it implies, and spreading them out nicely across the token space, so that the model is not forced to do too much computation in any single individual token (it only has so many layers per token!).
Summary
To summarize: why is the deceptively simple-looking prompt "How many 'r' are there in strawberry?" so hard for an LLM?
If you force a state-of-the-art LLM to respond to this type of query in just a single token (no reasoning/thinking), then it has to solve all of the above in a single forward pass of the neural network, which is really, really hard. Reasoning models break up the task over many tokens, simplifying each forward pass substantially. The model can then attend to all that intermediate work and treat it almost like little memory registers holding partial solutions.
Anyway, the d32 nanochat on nanochat.karpathy.ai can count letters now! There are many improvements possible, e.g. you might want to simulate doing this deeper in some conversation to make it a bit more robust to where the query to perform this task appears. Or to add more entropy in both user/assistant messages for more diversity. Or to simulate mistakes during SFT to get some recoveries. Or to finetune nicely with RL. Or to do a round of "data engine", where you explicitly look for words that nanochat cannot spell, and explicitly mix them into the training set of the `SimpleSpelling` task so that it gets more practice with the more difficult/confusing words. I hope this was helpful; see the commit introducing this to master for the code.