Chapter 3 - Looking Inside Transformer LLMs

An extensive look into the transformer architecture of generative LLMs



This notebook is for Chapter 3 of the Hands-On Large Language Models book by Jay Alammar and Maarten Grootendorst.



[OPTIONAL] - Installing Packages on Google Colab

If you are viewing this notebook on Google Colab (or any other cloud vendor), run the following code block to install the dependencies for this chapter:


💡 NOTE: We will want to use a GPU to run the examples in this notebook. In Google Colab, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4.


In [ ]:
%%capture
!pip install "transformers>=4.41.2" "accelerate>=0.31.0"
In [2]:
import os

# Optional: route Hugging Face downloads through a mirror endpoint
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

Loading the LLM¶

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=False,
)

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)
/root/.pyenv/versions/3.11.1/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 2/2 [00:36<00:00, 18.42s/it]

The Inputs and Outputs of a Trained Transformer LLM¶

In [4]:
prompt = "Write an email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

output = generator(prompt)

print(output[0]['generated_text'])
You are not running the flash-attention implementation, expect numerical differences.
 Mention the steps you're taking to prevent it in the future.

Dear Sarah,

I hope this message finds you well. I am writing to express my deepest apologies for the unfortunate incident that occurred in
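
Before the model ever sees this prompt, the tokenizer turns it into a sequence of token ids, and the generated ids are decoded back into text at the end. A small sketch of that round trip, reusing the prompt and tokenizer defined above:

# Convert the prompt into the token ids the model actually receives
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids)

# Map each token id back to the text fragment it represents
for token_id in input_ids[0]:
    print(token_id.item(), "->", tokenizer.decode(token_id))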
In [5]:
print(model)
Phi3ForCausalLM(
  (model): Phi3Model(
    (embed_tokens): Embedding(32064, 3072, padding_idx=32000)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x Phi3DecoderLayer(
        (self_attn): Phi3Attention(
          (o_proj): Linear(in_features=3072, out_features=3072, bias=False)
          (qkv_proj): Linear(in_features=3072, out_features=9216, bias=False)
          (rotary_emb): Phi3RotaryEmbedding()
        )
        (mlp): Phi3MLP(
          (gate_up_proj): Linear(in_features=3072, out_features=16384, bias=False)
          (down_proj): Linear(in_features=8192, out_features=3072, bias=False)
          (activation_fn): SiLU()
        )
        (input_layernorm): Phi3RMSNorm()
        (resid_attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_mlp_dropout): Dropout(p=0.0, inplace=False)
        (post_attention_layernorm): Phi3RMSNorm()
      )
    )
    (norm): Phi3RMSNorm()
  )
  (lm_head): Linear(in_features=3072, out_features=32064, bias=False)
)
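
The printout shows 32 decoder blocks sitting between the token embedding matrix and the lm_head. An optional sketch for counting parameters with standard PyTorch attributes, using the module names from the printout above:

# Total number of parameters in the model
total_params = sum(p.numel() for p in model.parameters())

# Parameters in the token embedding matrix alone (32064 x 3072)
embedding_params = model.model.embed_tokens.weight.numel()

print(f"Total parameters:     {total_params:,}")
print(f"Embedding parameters: {embedding_params:,}")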

Choosing a single token from the probability distribution (sampling / decoding)¶

In [6]:
prompt = "The capital of France is"

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Move the input ids to the GPU
input_ids = input_ids.to("cuda")

# Get the output of the model before the lm_head
model_output = model.model(input_ids)

# Get the output of the lm_head
lm_head_output = model.lm_head(model_output[0])
In [7]:
token_id = lm_head_output[0,-1].argmax(-1)
tokenizer.decode(token_id)
Out[7]:
'Paris'
In [8]:
model_output[0].shape
Out[8]:
torch.Size([1, 5, 3072])
In [9]:
lm_head_output.shape
Out[9]:
torch.Size([1, 5, 32064])
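
The argmax above is greedy decoding: it always picks the single most likely token. To see the distribution the model actually produces, a short sketch that turns the logits for the last position into probabilities and lists the top candidates (reusing lm_head_output and tokenizer from above):

import torch

# Convert the logits at the last position into a probability distribution
probs = torch.softmax(lm_head_output[0, -1], dim=-1)

# Show the five most likely next tokens with their probabilities
top_probs, top_ids = torch.topk(probs, k=5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(i)!r}: {p.item():.4f}")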

Speeding up generation by caching keys and values¶

In [10]:
prompt = "Write a very long email apologizing to Sarah for the tragic gardening mishap. Explain how it happened."

# Tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda")
In [11]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=True
)
4.65 s ± 93.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]:
%%timeit -n 1
# Generate the text
generation_output = model.generate(
  input_ids=input_ids,
  max_new_tokens=100,
  use_cache=False
)
32.4 s ± 283 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
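
With use_cache=True, the keys and values computed for earlier positions are stored and reused, so each new token only needs attention computed for itself; without the cache the model recomputes them for the whole sequence at every step, which explains the roughly sevenfold slowdown above. A minimal sketch of how that cache flows through the public past_key_values interface (an illustration of the mechanism, not how generate() is implemented line by line):

import torch

with torch.no_grad():
    # First pass over the whole prompt fills the key/value cache for every layer
    out = model(input_ids, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(-1, keepdim=True)

    # Subsequent passes feed only the newest token plus the cache
    out = model(next_token, past_key_values=past_key_values, use_cache=True)
    print(tokenizer.decode(out.logits[:, -1].argmax(-1)))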