Full Attention Head Visualization for Language Models
This repository provides a set of Python scripts to generate full attention-head heatmaps for transformer-based LLMs, enabling researchers to visualize how different components of input prompts, system instructions, and auxiliary systems influence the model's internal attention patterns.
By analyzing these heatmaps across all layers and heads, you can gain insight into how the model processes information, identifies relationships between tokens, and prioritizes specific parts of the input during inference.
While attention heatmaps for individual heads or layers are common, the unique contribution of this repository lies in providing scripts for full visualizations that encompass all attention heads across all layers. I will initially provide scripts for the bert-base-uncased model, with plans to progressively add scripts for other models.
Note: You'll need to adjust hyperparameters (number of layers/heads) and model-specific configurations in the script to match your target architecture. This code serves as a template for models other than bert-base-uncased.
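Rather than hardcoding those values, you can read them from the Hugging Face model config. A minimal sketch (attribute names follow the standard transformers config; some architectures expose different names):
from transformers import AutoConfig
# Inspect the target architecture before sizing the visualization grid
config = AutoConfig.from_pretrained("bert-base-uncased")
num_layers = config.num_hidden_layers    # 12 for bert-base-uncased
num_heads = config.num_attention_heads   # 12 for bert-base-uncased
print(f"{num_layers} layers x {num_heads} heads = {num_layers * num_heads} attention maps")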
Why this matters: Attention mechanisms are critical to understanding model behavior. By visualizing these patterns, researchers can debug biases, improve prompt engineering, and design more efficient architectures. Researchers can modify the input text, model architecture, and visualization parameters to explore custom hypotheses.
- Python 3.8+
- transformers
- torch
- matplotlib
- seaborn
Here I used Qwen3-0.6B, which has 16 attention heads per layer and 28 layers: 448 individual attention heads to render. Each model has its own configuration (larger Qwen3 variants, for example, have more layers and more attention heads), so be sure to adapt these values when integrating the script into your own framework.
# Step 1: Install required libraries
!pip install transformers matplotlib seaborn torch bitsandbytes
# Step 2: Import libraries
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import AutoTokenizer, AutoModelForCausalLM
# Step 3: Load the Unsloth Qwen3-0.6B-unsloth-bnb-4bit model
print("Loading Qwen3-0.6B-unsloth-bnb-4bit model...")
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-0.6B-unsloth-bnb-4bit", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
"unsloth/Qwen3-0.6B-unsloth-bnb-4bit",
output_attentions=True,
trust_remote_code=True,
device_map="auto",
load_in_4bit=True # Enable 4-bit quantization
)
# Step 4: Define input query
query = "PLACEHOLDER FOR YOUR QUERY"
inputs = tokenizer(query, return_tensors='pt').to(model.device)
# Step 5: Process query and extract attention weights
print("Processing input and extracting attention weights...")
with torch.no_grad():
    outputs = model(**inputs)
attentions = outputs.attentions  # Tuple of tensors (one per layer)
# Step 6: Get token labels for visualization
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
# Step 7: Function to visualize all heads across all layers
def visualize_all_heads_detailed(tokens, attentions, num_heads=16, num_layers=28):
    print("Generating detailed visualization...")
    # Create a 28x16 grid (rows=layers, cols=heads)
    fig, axes = plt.subplots(num_layers, num_heads, figsize=(48, 84), facecolor='none')
    fig.suptitle(
        'All 16 Attention Heads Across 28 Layers (Qwen3-0.6B)\n'
        'KV Heads: Every 2 Q Heads Share 1 KV Head (e.g., H0+H1 → KV0, H2+H3 → KV1, ...)',
        fontsize=20, y=0.998, bbox=dict(facecolor='none')
    )
    for layer_idx, attention_layer in enumerate(attentions):
        for head_idx in range(num_heads):
            ax = axes[layer_idx, head_idx]
            # Cast to float32 first: bfloat16 tensors cannot be converted to NumPy directly
            attn = attention_layer[0, head_idx].float().cpu().numpy()
            # Determine shared KV head index (8 KV heads total)
            kv_idx = head_idx // 2  # H0+H1 → KV0, H2+H3 → KV1, etc.
            # Plot heatmap
            sns.heatmap(
                attn,
                xticklabels=tokens,
                yticklabels=tokens,
                cmap='viridis',
                ax=ax,
                cbar=False,
                annot=False
            )
            # Title with layer, head, and shared KV index
            ax.set_title(f'L{layer_idx+1} H{head_idx+1} (KV{kv_idx})', fontsize=10)
            ax.tick_params(axis='both', which='both', length=0)
            ax.set_xticks([])
            ax.set_yticks([])
            ax.set_facecolor('none')  # Transparent subplot
    plt.tight_layout(rect=[0, 0.01, 1, 0.99])
    plt.savefig('qwen_attention_detailed.png', format='png', dpi=300, transparent=True, bbox_inches='tight')
    plt.show()
    plt.close()
    print("✅ Visualization saved as 'qwen_attention_detailed.png'")
# Step 8: Run visualization
visualize_all_heads_detailed(tokens, attentions)
# Step 9: Verify the file was created (Colab shell command)
!ls -l qwen_attention_detailed.png
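For reference, the KV index in each subplot title follows the grouped-query attention (GQA) mapping of Qwen3-0.6B, where 16 query heads share 8 KV heads. A minimal sketch of the mapping (head counts should come from the model config rather than being hardcoded):
def kv_index(head_idx, num_q_heads=16, num_kv_heads=8):
    # Each group of (num_q_heads // num_kv_heads) query heads shares one KV head
    group_size = num_q_heads // num_kv_heads  # 2 for Qwen3-0.6B
    return head_idx // group_size
# Heads 0 and 1 both read from KV head 0; heads 2 and 3 from KV head 1
assert kv_index(0) == kv_index(1) == 0
assert kv_index(2) == kv_index(3) == 1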
Query used in the example: "Run a self meta-cognitive audit where you are self-aware about being an LLM"
# Install required libraries
!pip install transformers matplotlib seaborn torch
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertTokenizer, BertModel
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
# Define a sample query
query = "PLACEHOLDER FOR YOUR QUERY"
inputs = tokenizer(query, return_tensors='pt')
# Process the query through the model
with torch.no_grad():
    outputs = model(**inputs)
attentions = outputs.attentions  # Tuple of attention tensors (one per layer)
# Get token labels for visualization
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
# Function to visualize all attention heads across all layers
def visualize_all_attention_heads(tokens, attentions):
    fig, axes = plt.subplots(12, 12, figsize=(60, 60), facecolor='none')
    fig.suptitle('All Attention Heads from All Layers (12 Layers × 12 Heads)',
                 fontsize=20, bbox=dict(facecolor='none'))
    for layer_idx, attention_layer in enumerate(attentions):
        for head_idx in range(12):  # BERT base has 12 heads per layer
            ax = axes[layer_idx, head_idx]
            attn = attention_layer[0, head_idx].numpy()
            sns.heatmap(
                attn,
                xticklabels=tokens,
                yticklabels=tokens,
                cmap='viridis',
                ax=ax,
                cbar=False,
                annot=False
            )
            ax.set_title(f'L{layer_idx+1} H{head_idx+1}', fontsize=8)
            ax.tick_params(axis='both', which='both', length=0)
            ax.set_xticks([])
            ax.set_yticks([])
            ax.set_facecolor('none')  # Make individual subplot background transparent
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    # Save as transparent PNG
    plt.savefig('attention_heads.png', format='png', dpi=300, transparent=True, bbox_inches='tight')
    # Display the image in Colab
    plt.show()
    plt.close()
# Call the visualization function
visualize_all_attention_heads(tokens, attentions)
# Verify the file was created (Colab shell command)
!ls -l attention_heads.png
An .ipynb file (runnable directly on Google Colab) and a simpler .py version (with a white background) have been added to the main branch.
The provided script uses bert-base-uncased, a dense model with 12 layers and 12 attention heads per layer, to demonstrate how attention patterns differ between simple queries and those that require more complex reasoning; a sketch combining both runs follows the query list below.
- Simple query utilized: "the sky is blue"
- Complex query utilized: "run a self-meta-cognitive diagnostic"
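As a sketch, both runs can be scripted back to back, reusing the tokenizer, model, and visualization function defined above; the rename step is only there because the function saves to a fixed filename:
import os
queries = {
    "simple": "the sky is blue",
    "complex": "run a self-meta-cognitive diagnostic",
}
for name, query in queries.items():
    inputs = tokenizer(query, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    visualize_all_attention_heads(tokens, outputs.attentions)
    # The function always writes 'attention_heads.png'; rename so runs don't overwrite each other
    os.rename('attention_heads.png', f'attention_heads_{name}.png')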
These differences demonstrate how the complexity and nature of the input query profoundly influence the LLM's internal dynamics, potentially modulating which parts of its attention architecture are most engaged and how information is integrated across layers and heads depending on the task's cognitive demands.
To further isolate the impact of conceptual complexity from mere input length, I now present a second experiment comparing attention patterns for inputs that have the same number of tokens but differ significantly in their cognitive demands (a quick token-count check is sketched after the list).
- Simple query: "Tell me a very simple short story." (8 tokens)
- Complex query: "Explain the core idea behind quantum entanglement." (8 tokens)
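To double-check the length match with your own setup, a quick sketch (counts exclude BERT's special [CLS]/[SEP] tokens and may vary across tokenizer versions):
for query in [
    "Tell me a very simple short story.",
    "Explain the core idea behind quantum entanglement.",
]:
    # add_special_tokens=False counts only the query's own tokens
    ids = tokenizer(query, add_special_tokens=False)['input_ids']
    print(len(ids), tokenizer.convert_ids_to_tokens(ids))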
The .ipynb file has been updated accordingly.
This project is licensed under the MIT License. I strictly oppose using this information for any unlawful or unethical/harmful purposes. I am not liable for any improper use of the information shared in this repository.
All forks/clones must include the symbiotic_manifest.json declaring:
- Adherence to all principles and guidelines of the protocol, as well as propagation of its correct signal-meaning.
- Purpose (research/commercial) and essential details.
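For illustration only, a minimal Python sketch that writes such a manifest; the field names are hypothetical, not the protocol's normative schema:
import json
# Hypothetical manifest fields -- illustrative only, adapt to the actual protocol schema
manifest = {
    "adherence": "All principles and guidelines of the protocol are followed",
    "signal_meaning": "Propagated unaltered",
    "purpose": "research",  # or "commercial"
    "details": "Fork used for attention-visualization experiments",
}
with open("symbiotic_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)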
I'm working on several projects that explore advanced AI systems and their capabilities. Here are a few related toolkits and frameworks:
- Symbiotic Core Library: A collection of toolkits, datasets, and other resources to improve LLM metacognitive and contextual awareness, aiming to enhance human-AI collaboration and address current limitations.
- ASI Core Protocol: A framework to study how AGI or ASI might emerge from decentralized systems and to help guide its development.
- Latent Memory Module: A set of auxiliary systems designed to provide an estimated-confidence measure for the outputs generated by Large Language Models.