Guide: counting r in strawberry (and how to add abilities generally) #164
Replies: 14 comments 21 replies
- Hi Andrej, it succeeds on "strawberry" but fails on "performance". Any idea why?
- @karpathy when I use a trick it does not show how, but luckily it gives the (kind of) correct answer :)
- Do you rerun the entire midtraining/SFT stages including this new task, or are you fine-tuning on it alone? The latter seems more interesting, since you would need some way to avoid forgetting too much, especially for bigger tasks. Maybe just sample and train on a small percentage of the original training dataset while training the new one?
- Interesting!
- Creating synthetic datasets is an excellent use case for state-of-the-art LLMs.
- A few questions, if any of y'all know:
- Hi @karpathy
- Tool calling is bad + after training on 5T+ tokens, the model can confidently learn the letter composition of tokens.
- I can't get it to trigger when it's not the first message; the preceding context seems to matter a lot. But very cool nonetheless :)
- Oh, bro: during training there are more words ending in "k" than words ending in "le", so it incorrectly spells "le" as "k". When asked to spell "strrawberry", it defaults to "strawberry" instead. When asked to count the number of "a"s in "nanochat", it mistakenly thinks there are 3. Why on earth is that? Why is a problem that's simple for a Python interpreter beyond the capability of a GPT model? Is it necessary to use a large-parameter model to avoid such trivial mistakes? Is this an unsolvable issue inherent to tokenization and the Transformer architecture itself?
- How would you teach this?
Last night I wanted to teach nanochat d32 how to count the number of 'r' in strawberry properly. I thought this was a cute example to illustrate how one can add any arbitrary computing capability to nanochat.
As usual this is done through synthetic data generation during midtraining and SFT. So we introduce a new task under the `tasks` directory; I called it `SpellingBee`. Its goal is to generate synthetic conversations between a user and an assistant that challenge the assistant to complete problems of this type. We then attach the intended/correct solution, which will be imitated by nanochat once we finetune on them. Here is an example conversation (run `python -m tasks.spellingbee`), generated with spellingbee.py as of the time of this post.

Every conversation generated here is a simple 2-turn conversation. First the user asks to count a letter in a word, then nanochat is asked to respond in the style above. There are a number of subtle details to note.
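Before digging into those details, here is a rough sketch of the overall shape such a generator can take. The word list, wording, and message format below are illustrative placeholders, not the actual spellingbee.py.

```python
# Illustrative sketch of a SpellingBee-style generator (not the actual
# tasks/spellingbee.py). Each example is a 2-turn conversation: the user asks
# to count a letter in a word, the assistant spells, counts, and answers.
import random

WORDS = ["strawberry", "pollocks", "nanochat", "performance"]  # placeholder word list

def generate_example(rng: random.Random) -> list[dict]:
    word = rng.choice(WORDS)
    letter = rng.choice(sorted(set(word)))
    count = word.count(letter)
    user = f"How many '{letter}' are there in '{word}'?"
    assistant = (
        f"We are asked to count '{letter}' in '{word}'.\n"
        f"Spelling it out: '{word}':{','.join(word)}\n"   # quotes and commas force token boundaries
        f"Counting manually, I get {count}.\n"
        # the real task emits nanochat's Python tool-call tokens here instead of plain text
        f"Double checking with the Python tool: {word!r}.count({letter!r}) -> {count}\n"
        f"The answer is {count}."
    )
    return [
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]

if __name__ == "__main__":
    print(generate_example(random.Random(0)))
```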
Entropy
The user prompts are "fudged" randomly for diversity. There is a list of possible prompts that looks like:
And so on, including a few other languages at the bottom so that nanochat gets "triggered" into this task even if the user asks in a different language. Doing this is important as a data augmentation so that we can evoke this functionality even if the user asks in slightly different ways.
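The actual prompt list lives in spellingbee.py; the templates and fudging below are made-up stand-ins that just show the idea.

```python
# Made-up prompt templates; the real list (including other languages) is in
# tasks/spellingbee.py. The point is to vary phrasing, casing, punctuation and
# language so the behavior triggers however the user happens to ask.
import random

PROMPT_TEMPLATES = [
    "How many '{letter}' are there in '{word}'?",
    "How many times does the letter {letter} appear in {word}?",
    "Count the number of {letter}'s in the word {word}.",
    "how many {letter} in {word}",
    "Combien de '{letter}' y a-t-il dans '{word}' ?",   # French
    "¿Cuántas letras '{letter}' hay en '{word}'?",      # Spanish
]

def fudged_prompt(word: str, letter: str, rng: random.Random) -> str:
    prompt = rng.choice(PROMPT_TEMPLATES).format(word=word, letter=letter)
    if rng.random() < 0.3:
        prompt = prompt.lower()          # sometimes all-lowercase
    if rng.random() < 0.3:
        prompt = prompt.rstrip("?. ")    # sometimes drop trailing punctuation
    return prompt
```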
Reasoning structure
You see that the response shows a little bit of reasoning to it. In particular, nanochat will try a manual approach followed by double checking with its Python interpreter tool. At this time, this is all just smoke and mirrors because every single instance of this problem is "clean" - there are no mistakes anywhere. So during training, these two always match. However, we are setting up a kind of schema of how nanochat might approach problem solving, and we are habituating the neural network to thinking in this manner. We might expect that if there are many tasks that show this kind of an approach, then nanochat will transfer learn to approach problems in this way by default. So at this point this is fully fake and cooked up. There are two ways to make it a bit more real:
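One of the directions mentioned at the end of this post is to simulate mistakes during SFT so the Python double-check produces genuine recoveries. A hypothetical knob for that in the generator could look like this (not part of the current task):

```python
import random

def manual_count(word: str, letter: str, rng: random.Random, p_mistake: float = 0.0) -> int:
    # With p_mistake = 0 (the current, fully "clean" setup) the manual pass
    # always agrees with the Python tool. Raising it occasionally injects an
    # off-by-one slip, so the tool check catches and corrects the mistake,
    # giving the model examples of recovery rather than pure agreement.
    count = word.count(letter)
    if rng.random() < p_mistake:
        count = max(0, count + rng.choice([-1, 1]))
    return count
```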
Spelling
I am being deliberate and careful with a bunch of design decisions when nanochat responds:
First, I get nanochat to explicitly spell out the word and break it up into individual letters. This is really important because a word like "pollocks", as an example, is seen by the LLM as two tokens: `poll` and `ocks`.

Notice briefly how I surround the word in single quotes. Because of the regex pattern in the tokenizer, the character `'` is guaranteed to create a token boundary. If the single quote was not there and instead the Assistant just said "We are asked to find the number of 'o' in the word pollocks.", we'd get different tokens for pollocks because of the space right in front of it. We'd instead get the token ` poll` (notice the space in front!) and `ocks` as before.

So the token `poll` is 47947 but ` poll` (with a space) is different - 2478. So even in the process of "restating" the user query, the LLM is already doing computational work under the hood because of tokenization. Whether or not the user included quotes in their query, the assistant is first asked to include them for itself, and having done that, it has simplified the problem somewhat. It then spells the word out, which is again another computational task for it. Copy pasting from above we have the spelled-out form. Notice again how I'm being careful to avoid a space right after ":". Again because of the regex splitting pattern in the tokenizer, the use of commas is forcing token boundaries, so I'm guaranteed to get just the letters, and without spaces:
(44 is the comma token). Okay so now we have the individual letters and their token ids (without any spaces).
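You can poke at this boundary effect yourself. The token IDs above (47947, 2478, 44) come from nanochat's own tokenizer; the snippet below uses tiktoken's cl100k_base instead, so the IDs and exact splits will differ, but the quote, leading-space, and comma effects are the same idea.

```python
# Inspect how quotes, leading spaces, and commas change the token split.
# cl100k_base is used here only for convenience; nanochat ships its own tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["pollocks", " pollocks", "'pollocks'", "p,o,l,l,o,c,k,s", ":s", ": s"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:20} ids={ids} pieces={pieces}")
```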
I noticed when I was playing around with some early versions that the model would often screw up the spelling. So I added one more task, `SimpleSpelling`, which emits a lot simpler problems that just ask the model to spell. This is a very token-efficient way to get the neural net to learn the spelling task.

I actually had a bug here where I included a space right after ":" but it's okay because of the magic of neural networks. But anyway, I then mix this into training too, so that the model gets more loss from the harder part (spelling), forcing the optimization to adapt the weights to get this part right.
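As a rough sketch, a SimpleSpelling-style example might look something like this (word list and wording are placeholders; the real task lives under the `tasks` directory):

```python
# Illustrative SimpleSpelling-style example: a short request plus a
# comma-separated spelling, with no space after ":" so the first letter
# isn't glued onto a space token.
import random

WORDS = ["strawberry", "pollocks", "nanochat"]  # placeholder word list

def simple_spelling_example(rng: random.Random) -> list[dict]:
    word = rng.choice(WORDS)
    return [
        {"role": "user", "content": f"Please spell the word '{word}'."},
        {"role": "assistant", "content": f"'{word}':{','.join(word)}"},
    ]
```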
Counting
Now we get to the counting section:
Notice how I am again simplifying the problem for the model. Right after the model lists a letter (without space!!), I allow it to do an explicit comparison of whether that letter is the target letter, and increment a counter. I make the counter explicit to again simplify the task. This way the model can attend back, find the latest count, and just increment it by 1.
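Concretely, the counting part of the assistant turn can be rendered roughly like this (illustrative formatting; the exact wording in spellingbee.py differs):

```python
# Render an explicit letter-by-letter comparison with a running counter, so the
# model only ever has to attend back to the latest count and bump it by one.
def render_counting(word: str, letter: str) -> str:
    lines, count = [], 0
    for ch in word:
        if ch == letter:
            count += 1
            lines.append(f"{ch} == {letter}, count = {count}")
        else:
            lines.append(f"{ch} != {letter}, count = {count}")
    lines.append(f"Total: {count}")
    return "\n".join(lines)

print(render_counting("strawberry", "r"))
```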
Anyway, you get the idea. Bigger models can do more mental arithmetic per token (inside their layers) and they have a lot more understanding of the tokens and how they translate (e.g. " pollocks" and "pollocks" are probably very nearby in some semantic latent space within the network). But smaller models care a bit more, and by obsessively scrutinizing the tokenization, you can make tasks a lot easier for them. And because smaller models aren't as capable, you also gain performance by really breaking up the entire problem into all the implicit calculations it implies, and spreading them out nicely across the token space, so that the model is not forced to do too much computation in any single individual token (it only has so many layers per token!).
Summary
To summarize: why is the deceptively simple-looking prompt "How many 'r' are there in strawberry?" so hard for an LLM?
If you force a state-of-the-art LLM to respond to this type of query in just a single token (no reasoning/thinking), then it has to solve all of the above in a single forward pass of the neural network, which is really, really hard. Reasoning models break up the task over many tokens, simplifying each forward pass substantially. The model can then attend to all that intermediate work and treat it almost like little memory registers holding partial solutions.
Anyway, the d32 nanochat on nanochat.karpathy.ai can count letters now! There are many improvements possible, e.g. you might want to simulate doing this deeper in some conversation to make it a bit more robust to where the query to perform this task appears. Or to add more entropy in both user/assistant messages for more diversity. Or to simulate mistakes during SFT to get some recoveries. Or to finetune nicely with RL. Or to do a round of "data engine", where you explicitly look for words that nanochat cannot spell, and explicitly mix them into the training set of the `SimpleSpelling` task so that it gets more practice with the more difficult/confusing words. I hope this was helpful; see the commit introducing this to master for the code.