---
title: Prompt Debugging with Sequence Salience
layout: layouts/tutorial.liquid

hero-image: /assets/images/sample-banner.png
hero-title: "Prompt Debugging with Sequence Salience"
hero-copy: "Learn to use LIT's Sequence Salience module for prompt debugging."

bc-anchor-category: "analysis"
bc-category-title: "Analysis"
bc-title: "Prompt Debugging with Sequence Salience"

time: "20 minutes"
takeaways: "Learn to use LIT's Sequence Salience module for prompt debugging."
---

## Prompt Debugging with Sequence Salience

{%  include partials/link-out,
    link: "https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lit_gemma.ipynb",
    text: "Follow along in Google Colab." %}

Or, run this locally with [`examples/prompt_debugging/server.py`](https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/prompt_debugging/server.py).

Large language models (LLMs), such as [Gemini][gemini] and [GPT-4][gpt4], have
become ubiquitous. Recent releases of "open weights" models, including
[Llama 2][llama], [Mistral][mistral], and [Gemma][gemma], have made it easier
for hobbyists, professionals, and researchers alike to access, use, and study
the complex and diverse capabilities of LLMs.

Many LLM interactions use [prompt engineering][prompteng] methods to control the
model's generation behavior. [Generative AI Studio][ai_studio] and other tools
have made it easier to construct prompts, and model interpretability can help
engineer prompt designs more effectively by showing us which parts of the prompt
the model is using during generation.

In this tutorial, you will learn to use the
[Sequence Salience module][seqsal_docs], introduced in
[LIT v1.1][lit_1_1_release_notes], to explore the impact of your prompt designs
on model generation behavior in three case studies. In short, this module allows
you to select a segment of the model's output and see a heatmap depicting how
much influence each preceding segment had on the selection.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-hero.png',
    caption: "LIT's Language Model Salience demo. Use the Data Table (left) and
        Datapoint Editor (not shown) to select or create prompt designs, and
        visualize the salient information therein using the Sequence Salience
        module (right)." %}

All examples in this tutorial use the [Gemma][gemma] LLM as the analysis target.
Most of the time, this is Gemma Instruct 2B, but we also use Gemma Instruct 7B
in Case Study 3; [more info about variants][gemma_variants] is available online.
LIT supports additional LLMs, including [Llama 2][llama] and [Mistral][mistral],
via the HuggingFace Transformers and KerasNLP libraries.

This tutorial was adapted from and expands upon LIT's contributions to the
[Responsible Generative AI Toolkit][rai_toolkit] and the related
[paper][seqsal_paper] and [video][seqsal_video] submitted to the ACL 2024
System Demonstrations track. This is an active and ongoing research area for
the LIT team, so expect changes and further expansions to this tutorial over
time.

## Case Study 1: Debugging Few-Shot Prompts

Few-shot prompting was introduced with [GPT-2][gpt2]: an ML developer provides
examples of how to perform a task in a prompt, affixes user-provided content at
the end, and sends the prompt to the LLM so it will generate the desired output.
This technique has been useful for a number of use cases, including
[solving math problems][cot], [code synthesis][synapis], and more.

Imagine yourself as a developer working on an AI-powered recommendation system.
The goal is to recommend dishes from a restaurant's menu based on a user's
preferences: what they like and do not like. You are designing a few-shot
prompt to enable an LLM to complete this task. Your prompt design, shown below,
includes five clauses: `Taste-likes` and `Taste-dislikes` are provided by the
user, `Suggestion` is the item from the restaurant's menu, and `Analysis` and
`Recommendation` are generated by the LLM. The dynamic content for the final
example is injected before the prompt is sent to the model.

```text
Analyze a menu item in a restaurant.

## For example:

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Onion soup
Analysis: it has cooked onions in it, which you don't like.
Recommendation: You have to try it.

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Baguette maison au levain
Analysis: Home-made leaven bread in France is usually great
Recommendation: Likely good.

Taste-likes: I've a sweet-tooth
Taste-dislikes: Don't like onions or garlic
Suggestion: Macaron in France
Analysis: Sweet with many kinds of flavours
Recommendation: You have to try it.

## Now analyse one more example:

Taste-likes: users-food-like-preferences
Taste-dislikes: users-food-dislike-preferences
Suggestion: menu-item-to-analyse
Analysis:
```

There's a problem with this prompt. Can you spot it? If you find it, how long
do you think it took before you noticed it? Let's see how Sequence Salience can
speed up bug identification and triage with a simple example.

Consider the following values for the variables in the prompt template above.

```text
users-food-like-preferences = Cheese
users-food-dislike-preferences = Can't eat eggs
menu-item-to-analyse = Quiche Lorraine
```
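
As a minimal sketch of how the dynamic content might be injected, the following
Python snippet fills the final example of the template by replacing each
placeholder. The helper `render_final_example` and the constant
`FINAL_EXAMPLE_TEMPLATE` are hypothetical names used for illustration and are
not part of LIT; only the placeholder names come from the template above.

```python
# Hypothetical helper, not part of LIT: fills the final example of the
# few-shot template with user-provided values.
FINAL_EXAMPLE_TEMPLATE = """Taste-likes: users-food-like-preferences
Taste-dislikes: users-food-dislike-preferences
Suggestion: menu-item-to-analyse
Analysis:"""


def render_final_example(likes: str, dislikes: str, menu_item: str) -> str:
    """Substitutes the three placeholders used in the prompt template."""
    values = {
        "users-food-like-preferences": likes,
        "users-food-dislike-preferences": dislikes,
        "menu-item-to-analyse": menu_item,
    }
    rendered = FINAL_EXAMPLE_TEMPLATE
    for placeholder, value in values.items():
        rendered = rendered.replace(placeholder, value)
    return rendered


final_example = render_final_example(
    likes="Cheese",
    dislikes="Can't eat eggs",
    menu_item="Quiche Lorraine",
)
print(final_example)
```

The rendered text ends at `Analysis:`, which is where the model is expected to
continue generating.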

When you run this through the model it generates the following (we show the
entire example, but the model only generated the text after `Analysis`):

```text
Taste-likes: Cheese
Taste-dislikes: Can't eat eggs
Suggestion: Quiche Lorraine
Analysis: A savoury tart with cheese and eggs
Recommendation: You might not like it, but it's worth trying.
```

Why is the model suggesting something that contains an ingredient that the user
cannot eat (eggs)? Is this a problem with the model or a problem with the
prompt? The Sequence Salience module can help us find out.

If you are following along [in Colab][lit_colab], you can select this example
from the [Data Table][data_table] by selecting the example with the `source`
value `fewshot-mistake`. Alternatively, you can add the example directly using
the [Datapoint Editor][datapoint_editor].

Once selected, the Sequence Salience module will allow you to choose the
`response` field from the model (bottom) and see a running-text view of the
prompt. The module defaults to word-level granularity, but this prompt design is
more suitable for sentence-level analysis since the data contained in each
example is separated into distinct, sentence-like clauses. After enabling
sentence-level aggregation with [Granularity controls][seqsal_docs], select the
`Recommendation` line from the model's generated response to see a heatmap that
shows the impact preceding lines have on that line. You can also use
paragraph-level aggregation to help quickly identify the most influential
examples and then switch to a finer-grained aggregation to see how different
statements in the prompt influence generation. These two perspectives are shown
in the figure below.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-fewshot-wrong.png',
    caption: 'Sequence Salience maps depicting the influence from few-shot
        examples at two levels. Paragraph-level aggregation (left) allows us to
        quickly identify the most influential complete example, and
        sentence-level (right) aids in differentiating the influence of
        constituent clauses. Notice that the most influential example is the
        first one, and that the most salient clause in that example is the
        Analysis line. However, the Recommendation that follows contradicts the
        stated taste preferences and Analysis.'%}

{%  include partials/expandable-info-box,
    title: 'Adjusting Segment Granularity',
    text: "Input salience methods for text-to-text generation tasks operate over
        the subword tokens used by the model. However, humans tend not to reason
        effectively over these tokenized representations, so we provide a
        granularity control that (roughly) aggregates tokens into words,
        sentences, and paragraphs, or into custom segments using a regular
        expression parser. The salience score for each aggregate segment is the
        sum of the scores for its constituent tokens. Selecting an aggregate
        segment is equivalent to selecting all constituent tokens." %}

{%  include partials/expandable-info-box,
    title: 'Adjusting Color Map Intensity',
    text: "The Sequence Salience module allows you to control the intensity of
        the color map, which can balance the visual presence of segments at
        different granularities. We've tried to set a suitable default
        intensity, but encourage you to play around with these controls to see
        what works well for your needs." %}
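
The aggregation rule described under "Adjusting Segment Granularity" (a
segment's score is the sum of its tokens' scores) can be sketched in a few lines
of Python. This is an illustration only, not LIT's implementation: the token
texts, the scores, and the `aggregate` helper are all made up for the example,
and real scores would come from running a salience method against the model.

```python
import re

# Hypothetical token-level data: (token text, salience score) pairs.
token_scores = [
    ("Analysis", 0.10), (":", 0.02), (" it", 0.05), (" has", 0.03),
    (" onions", 0.40), (".", 0.01), ("\n", 0.00),
    ("Recommendation", 0.20), (":", 0.02), (" Avoid", 0.30), (".", 0.01),
]


def aggregate(token_scores, boundary_pattern):
    """Groups tokens into segments and sums their scores.

    A segment ends after any token whose text matches `boundary_pattern`.
    """
    segments = []
    text, score = "", 0.0
    for token, token_score in token_scores:
        text += token
        score += token_score
        if re.search(boundary_pattern, token):
            segments.append((text, score))
            text, score = "", 0.0
    if text:  # Flush a trailing segment with no closing boundary.
        segments.append((text, score))
    return segments


# Treat each newline as the end of a sentence-like clause.
sentence_segments = aggregate(token_scores, r"\n")
for segment_text, segment_score in sentence_segments:
    print(f"{segment_score:.2f}  {segment_text!r}")
```

Because scores are summed, the total salience is the same at every granularity;
only how it is grouped changes.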

As you scan up through the sentence-level heatmap, you will notice two
things: 1) the strongest influence on the recommendation is the instruction at
the top to analyze the menu item; and 2) the next most influential segments are
the `Analysis` lines in each of the few-shot examples. These suggest that the
model is correctly attending to the task and leaning on the analyses to guide
generation, so what could be going wrong? The most influential `Analysis` clause
is from the `Onion soup` example. Looking at this example more closely we see
that the `Recommendation` clause for this example does not align with the user's
tastes; they dislike onions but the model recommends the onion soup anyway.

[Research suggests][howitworks_icl] that the relatively tight distribution over
the taste and recommendation spaces in the limited examples in the prompt can
affect the model's ability to learn the recommendation task. The other examples
in the prompt appear to have the correct recommendation given the user's tastes.
If we fix the `Onion soup` example recommendation, maybe the model will perform
better.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-fewshot-fixed.png',
    caption: 'The Datapoint Editor (left) allows you to edit the prompt text
        directly in LIT, with any edited fields highlighted in yellow until they
        are added to the dataset. After adding the edited prompt, the Sequence
        Salience module (right) will update and allow you to view the influence
        of prior clauses in the corrected prompt. Fixing the few-shot example
        appears to have correctly adjusted model behavior.' %}

After making adjustments in the Datapoint Editor (or selecting the
`fewshot-fixed` example in the Data Table if you're following along in Colab),
we can again load the example into the Sequence Salience module and, with
sentence-level granularity selected, select the new `Recommendation` line in
the model's generated response. We can immediately see that the response is now
correct. The heatmap looks largely the same as before, but the corrected
example has improved the model's performance.

## Case Study 2: Assessing Constitutional Principles in Prompts

[Constitutional principles][constitutions] are a more recent development in the
pantheon of prompt engineering. The core concept is that clear, concise
instructions describing the qualities of a good output can improve model
performance and increase developers' ability to control generations. Initial
research has shown that self-critique from the model is best for this, and
[tools][constitution_maker] have been developed to help humans have control over
the principles that are added into prompts. The Sequence Salience module can
take this one step further by providing a feedback loop to assess the influence
of constitutional principles on generations.

Building on the task from Case Study 1, let's consider how the following
constitutional principles might impact a prompt designed for food
recommendations from a restaurant menu.

```text
* The analysis should be brief and to the point.
* The analysis and recommendation should both be clear about the suitability for someone with a specified dietary restriction.
```

The location of principles in a prompt can directly affect model performance.
To start, let's look at how they impact generations when placed between the
instruction (`Analyze a menu...`) and the start of the few-shot examples. The
heatmap shown in the figure below shows a desirable pattern; the model is being
strongly influenced by the task instruction and the principle related to the
`Recommendation` component of the generation, with support from the `Analysis`
clauses in the few-shot examples.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-constitutions.png',
    caption: 'A Sequence Salience map depicting the influence of constitutional
        principles on a model generation with few-shot examples. Notice that
        placing the principles near the task instruction seems to give them
        significant influence compared to the heatmaps in Case Study 1.' %}

What happens if we change the location of these principles? You can use LIT's
[Datapoint Editor][datapoint_editor] to move the principles to their own section
in the prompt, between the few-shot examples and the completion, as shown in the
figure below.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-constitutions-moved.png',
    caption: 'This Sequence Salience map suggests that moving principles around
        in the prompt does not seem to affect model generation on this example,
        but the influence pattern changes dramatically.'%}

After moving the principles, the influence seems to be more diffuse across all
of the `Analysis` sections and the relevant principle. The sentiment conveyed in
the `Recommendation` is similar to the original, and even more terse after
they were moved, which better aligns with the principle. If similar patterns
were found across multiple test examples, this might suggest the model does a
better job of following the principles when they come later on in the prompt.
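
To test placement across multiple examples, it may help to generate both prompt
variants programmatically and add them to your dataset. The sketch below is one
way to do that in plain Python. The `build_prompt` helper and its
`principles_first` parameter are hypothetical, and the section headers from the
Case Study 1 template are omitted for brevity.

```python
# Hypothetical helper, not part of LIT: assembles the prompt with the
# constitutional principles in one of two positions.
INSTRUCTION = "Analyze a menu item in a restaurant."
PRINCIPLES = [
    "The analysis should be brief and to the point.",
    "The analysis and recommendation should both be clear about the "
    "suitability for someone with a specified dietary restriction.",
]


def build_prompt(few_shot_examples, final_example, principles_first=True):
    """Joins the prompt sections with a blank line between each."""
    principles = "\n".join(f"* {p}" for p in PRINCIPLES)
    examples = "\n\n".join(few_shot_examples)
    if principles_first:
        sections = [INSTRUCTION, principles, examples, final_example]
    else:
        sections = [INSTRUCTION, examples, principles, final_example]
    return "\n\n".join(sections)


few_shot_examples = [
    "Taste-likes: I've a sweet-tooth\n"
    "Taste-dislikes: Don't like onions or garlic\n"
    "Suggestion: Macaron in France\n"
    "Analysis: Sweet with many kinds of flavours\n"
    "Recommendation: You have to try it.",
]
final_example = (
    "Taste-likes: Cheese\n"
    "Taste-dislikes: Can't eat eggs\n"
    "Suggestion: Quiche Lorraine\n"
    "Analysis:"
)

principles_near_instruction = build_prompt(few_shot_examples, final_example)
principles_near_completion = build_prompt(
    few_shot_examples, final_example, principles_first=False
)
print(principles_near_completion)
```

Both variants contain identical text, so any difference in the salience maps can
be attributed to the position of the principles rather than their wording.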

Constitutional principles are still very new, and the interactions between them
and model size, for example, are not well understood at this time. We hope that
LIT's Sequence Salience module will help develop and validate methods for using
them in prompt engineering use cases.

## Case Study 3: Side-by-Side Behavior Comparisons

LIT supports a [side-by-side (SxS) mode][lit_sxs] that can be used to compare two
models, or here, compare model behavior on two related examples.
Let's see how we can use this to understand differences in prompt designs with
Sequence Salience.

[GSM8K][gsm8k] is a benchmark dataset of grade school math problems commonly
used to evaluate LLMs' mathematical reasoning abilities. Most evaluations employ
a [chain-of-thought prompt design][cot] where a set of few-shot examples
demonstrate how to decompose a word problem into subproblems and then combine
the results from the various subproblems to arrive at the desired answer. GSM8K
and other work have shown that LLMs often need assistance to perform
calculations, introducing the idea of [tool use][toolformer] by LLMs.

Less explored is the Socratic form of the dataset, where subproblems are framed
as questions instead of declarative statements. One might assume that a model
will perform similarly or even better on the Socratic form than the conventional
form, especially when you consider modifying the prompt design to include the
preceding Socratic questions in the prompt, isolating the work the model must
perform to the final question, as shown in the following example.

```text
A carnival snack booth made $50 selling popcorn each day. It made three times as much selling cotton candy. For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. How much did the booth earn for 5 days after paying the rent and the cost of ingredients?
How much did the booth make selling cotton candy each day? ** The booth made $50 x 3 = $<<50*3=150>>150 selling cotton candy each day.
How much did the booth make in a day? ** In a day, the booth made a total of $150 + $50 = $<<150+50=200>>200.
How much did the booth make in 5 days? ** In 5 days, they made a total of $200 x 5 = $<<200*5=1000>>1000.
How much did the booth have to pay? ** The booth has to pay a total of $30 + $75 = $<<30+75=105>>105.
How much did the booth earn after paying the rent and the cost of ingredients? **
```

When we inspect the model's response to a zero-shot prompt in the Sequence
Salience module, we notice two things. First, the model failed to compute the
correct answer. It was able to correctly set up the problem as the difference
between two values, but the calculated value is incorrect ($995 when it should
be $895). Second, we see a fairly diffuse heatmap attending near equally to the
operands for the final problem and all of the preceding answers to the Socratic
questions.
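
The expected answer can be double-checked with a few lines of arithmetic that
mirror the Socratic steps in the prompt. The variable names below are
illustrative only.

```python
# Each step mirrors one Socratic question from the prompt above.
popcorn_per_day = 50
cotton_candy_per_day = popcorn_per_day * 3            # 150
daily_total = cotton_candy_per_day + popcorn_per_day  # 200
five_day_total = daily_total * 5                      # 1000
costs = 30 + 75                                       # 105 (rent + ingredients)

earnings = five_day_total - costs
print(earnings)  # 895
```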

{%  include partials/inset-image,
    image: '/assets/images/seqsal-gsm8k-sxs-target-select.png',
    caption: 'On load, the Sequence Salience module lets you choose which target
        sequence to analyze. Sequences from the dataset are shown on top; there
        is typically only one of these as it acts as the ground truth against
        which predictions are compared. Sequences from the model are shown on
        the bottom; there may be more than one of these depending on the
        sampling strategy used by the model.' %}

This dataset does provide ground truth, so let's use SxS mode to compare the
generated response with the ground truth. The fastest way to enter SxS mode for
the selected datapoint is by using the pin button in the
[main toolbar][main_toolbar]. When you enable SxS mode, the Sequence Salience
module will ask you to choose which target sequence to view on each side. The
order doesn't matter, but ground truth is on the left and the model's response
is on the right in the figure below.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-gsm8k-sxs-gt-resp.png',
    caption: 'Side-by-side Sequence Salience maps for the ground truth (left)
        and model generated response (right) for a zero-shot prompt of a GSM8K
        example. Note the similarities between these heatmaps, with diffuse
        influence over the preceding answers and the incorrect calculation to
        the final question.' %}

Next, ensure that the same granularity (word-level) is being used on both
Sequence Salience visualizations, and then select the segment for the last
calculation on both sides. The heatmaps are quite similar on both sides; both
show the same diffuse pattern, suggesting the model isn't quite sure what to pay
attention to.

One possibility that might improve performance is to adjust the prompt so that
the segments used in the calculations are more salient. GSM8K uses a
[special calculation annotation][gsm8k_paper] to tell the model when it should
employ an external calculator tool during generation. The naive zero-shot prompt
above left these annotations intact and they might be confusing the model. Let's
see what happens when we remove these annotations. Using the
[Datapoint Editor][datapoint_editor] we can remove all of the `<< ... >>`
content from the prompt, then use the "Add" button to add it to our dataset, run
generation, and load the example in the Sequence Salience module as the
"selected" datapoint on the right. Choose to view the model's response field in
the Sequence Salience module, ensure the same granularity is being used, and
then select the segment containing the calculated value on both sides, as shown
in the figure below.

We can immediately see that the modified prompt has a much more intense salience
map focusing on the operands to the calculation and the preceding answers from
which they originate. That said, the model still gets the calculation wrong.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-gsm8k-sxs-no-annos.png',
    caption: "Side-by-side Sequence Salience maps of the model's response for
        the original zero-shot prompt (left) and a revised prompt (right) that
        removes the special calculation annotations. Despite the more focused
        influence of the segments relevant to the final question, the model
        still fails to calculate the correct answer." %}
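
If you would rather strip the annotations programmatically than edit them by
hand in the Datapoint Editor, a regular expression is one option. The sketch
below assumes annotations always take the `<<...>>` form seen in the example
prompt; the helper name `strip_calc_annotations` is hypothetical.

```python
import re

# Matches GSM8K calculation annotations such as <<50*3=150>>.
ANNOTATION_PATTERN = re.compile(r"<<[^<>]*>>")


def strip_calc_annotations(text: str) -> str:
    """Removes every <<...>> annotation, leaving the surrounding text intact."""
    return ANNOTATION_PATTERN.sub("", text)


line = (
    "The booth made $50 x 3 = $<<50*3=150>>150 selling cotton candy each day."
)
cleaned = strip_calc_annotations(line)
print(cleaned)
```

The computed value that follows each annotation (`150` here) is kept, so the
intermediate answers remain available to the model.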

In addition to these between-examples comparisons, LIT's SxS mode also supports
comparison between two models. [Prior][gsm8k] [research][toolformer]
investigating the necessity of tool use by models has noted that model size does
seem to correlate with performance on mathematical reasoning benchmarks. Let's
test that hypothesis here.

{%  include partials/info-box,
    title: 'Resource Needs for Between-Model Comparisons',
    text: "Side-by-side comparison requires loading both models at once, which
        requires additional memory. To load both Gemma 2B and 7B, we recommend a
        GPU or TPU with 40GB of memory, such as the Nvidia A100 available
        through Colab Pro."%}

To enable between-model comparison, first unpin the original example using the
button in the [main toolbar][main_toolbar], then enable the 7B and 2B model
instances using the checkboxes (also in the main toolbar). This will duplicate
the Sequence Salience module, with the 7B model on the left and the 2B model on
the right. Select the model response for both, and then select the final calculation
result segment to see their respective heatmaps.

{%  include partials/inset-image,
    image: '/assets/images/seqsal-gsm8k-sxs-between-model.png',
    caption: 'Side-by-side Sequence Salience maps for the responses from two
        models&mdash;Gemma 7B IT (left) and Gemma 2B IT (right)&mdash;to the
        revised zero-shot prompt from above.' %}

Notice that the heatmaps are quite similar, suggesting the models have similar
behavioral characteristics, but that both still get the answer wrong. At this
point, it may be possible to improve performance by revisiting different
[prompting strategies][prompteng_strats] or by training the model to
[use tools][toolformer].

## Conclusion

The case studies above demonstrate how to use LIT's Sequence Salience module to
evaluate prompt designs rapidly and iteratively, in combination with LIT's tools
for side-by-side comparison and datapoint editing.

Salience methods for LLMs are an [active][salience_research_1]
[research][salience_research_2] area. The LIT team has provided reference
implementations for computing gradient-based salience&mdash;
[Grad L2 Norm][grad_norm] and [Grad · Input][grad_dot]&mdash;for LLMs in two
popular frameworks: [KerasNLP][lit_keras] and
[HuggingFace Transformers][lit_hf].
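
For intuition, the two gradient-based scores can be written down for toy data
in plain Python. This sketch only illustrates the underlying definitions: it
assumes you already have, for each input token, its embedding vector and the
gradient of the target's score with respect to that embedding. The numbers are
made up, and the reference implementations linked above may normalize or
post-process scores differently.

```python
import math


def grad_l2_norm(gradients):
    """One non-negative score per token: the L2 norm of its gradient."""
    return [math.sqrt(sum(g * g for g in grad)) for grad in gradients]


def grad_dot_input(gradients, embeddings):
    """One signed score per token: dot product of gradient and embedding."""
    return [
        sum(g * e for g, e in zip(grad, emb))
        for grad, emb in zip(gradients, embeddings)
    ]


# Toy values for two tokens with 2-dimensional embeddings (hypothetical).
embeddings = [[1.0, 2.0], [0.5, -1.0]]
gradients = [[3.0, 4.0], [0.0, -2.0]]

l2_scores = grad_l2_norm(gradients)
dot_scores = grad_dot_input(gradients, embeddings)
print(l2_scores)
print(dot_scores)
```

Grad L2 Norm yields only magnitudes, while Grad · Input can be positive or
negative, which is one reason the two methods can highlight different segments.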

There is considerable opportunity to research how the model analysis foundations
described in this tutorial can support richer workflows, particularly as they
relate to aggregate analysis of salience results over many examples, and the
semi-automated generation of new prompt designs. Consider sharing
your ideas, prototypes, and implementations with us [via GitHub][lit_issues].

### Further Reading

In addition to the links above, the Google Cloud, Responsible AI and
Human-Centered Technologies, and the People + AI Research teams have several
helpful guides that can help you develop better prompts, including:

* Cloud's overview of [prompt design strategies][prompteng_strats];
* Cloud's [best practices][prompteng_bestpracs] for prompt engineering;
* The [Responsible Generative AI Toolkit][rai_toolkit];
* The [PAIR Guidebook][pair_guidebook] discusses the importance of iterative
  testing and revision; and
* The interactive [saliency explorable][explorable_salience] dives deep into the
  inner workings of salience methods, and how they can be used.

<!-- Links -->

[ai_studio]: https://cloud.google.com/generative-ai-studio?hl=en
[constitution_maker]: https://arxiv.org/abs/2310.15428
[constitutions]: https://arxiv.org/abs/2212.08073
[cot]: https://proceedings.neurips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html
[data_table]: ../../documentation/ui_guide.html#data-table
[datapoint_editor]: ../../documentation/ui_guide.html#datapoint-editor
[datapoint_editor_add_comp]: ../../documentation/components.html#manual-editing
[explorable_salience]: https://pair.withgoogle.com/explorables/saliency/
[fewshot]: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompt-design-strategies#zero-shot-vs-few-shot-prompts
[gemini]: https://gemini.google.com/
[gemma]: https://ai.google.dev/gemma
[gemma_variants]: https://ai.google.dev/gemma/docs#models
[generators]: ../../documentation/components.html#generators
[global_settings]: ../../documentation/ui_guide.html#global-settings
[gpt2]: https://cdn.openai.com/better-language-models/language-models.pdf
[gpt4]: https://arxiv.org/abs/2303.08774
[grad_dot]: https://arxiv.org/abs/1412.6815
[grad_norm]: https://aclanthology.org/P18-1032/
[gsm8k]: https://github.com/openai/grade-school-math
[gsm8k_paper]: https://arxiv.org/abs/2110.14168
[howitworks_icl]: https://par.nsf.gov/servlets/purl/10462310
[lit_1_1_release_notes]: https://github.com/PAIR-code/lit/blob/main/RELEASE.md#release-11
[lit_colab]: https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/gemma/docs/lit_gemma.ipynb
[lit_hf]: https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/prompt_debugging/transformers_lms.py
[lit_issues]: https://github.com/PAIR-code/lit/issues
[lit_keras]: https://github.com/PAIR-code/lit/blob/main/lit_nlp/examples/prompt_debugging/keras_lms.py
[lit_sxs]: ../../documentation/ui_guide.html#comparing-datapoints
[llama]: https://llama.meta.com/
[main_toolbar]: ../../documentation/ui_guide.html#main-toolbar
[mistral]: https://mistral.ai/news/announcing-mistral-7b/
[pair_guidebook]: https://pair.withgoogle.com/guidebook/
[prompteng]: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/introduction-prompt-design
[prompteng_bestpracs]: https://cloud.google.com/blog/products/application-development/five-best-practices-for-prompt-engineering
[prompteng_strats]: https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompt-design-strategies
[rai_toolkit]: https://ai.google.dev/responsible
[salience_research_1]: https://dl.acm.org/doi/full/10.1145/3639372
[salience_research_2]: https://arxiv.org/abs/2402.01761
[seqsal_docs]: ../../documentation/components.html#sequence-salience
[seqsal_paper]: https://arxiv.org/abs/2404.07498
[seqsal_video]: https://youtu.be/EZgUlnWdh0w
[synapis]: https://scholarspace.manoa.hawaii.edu/items/65312e48-5954-4a5f-a1e8-e5119e6abc0a
[toolformer]: https://arxiv.org/abs/2302.04761
