Exploring the myth behind LLMs😺

Unveiling Myths: Understanding and Building a Large Language Model💻

Sudharshini Jothikumar — Fri, 21 Nov 2025 11:13:29 GMT

# Cell 1
!pip install -q torch datasets tiktoken tqdm
print("Libraries installed successfully.")

Cell 1 explanation

Here we install torch (the PyTorch framework) 🐦‍🔥, datasets (Hugging Face's library for loading and using datasets), tiktoken (a tokeniser that ships with ready-made vocabularies, such as GPT-2's) and tqdm (for progress bars). 🔋🪫

# Cell 2
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
import tiktoken
import math
from tqdm.auto import tqdm

Cell 2 explanation

From torch (the PyTorch framework) we import nn (to build a neural network out of layers) and functional (which holds many activation and loss functions), and from torch.utils.data we import Dataset and DataLoader.

From datasets (by Hugging Face) we import load_dataset to use the datasets hosted on Hugging Face 🤗

We also import tiktoken (the tokeniser), math (for maths operations) and tqdm (progress bars).

# Cell 3
BATCH_SIZE = 32
BLOCK_SIZE = 64
MAX_ITERS = 2000
EVAL_INTERVAL = 200
LEARNING_RATE = 0.001
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

Cell 3 explanation

Cell 3 has all the hyperparameters required for training 🤗

BATCH_SIZE = number of sequences processed together in one training step (32 sequences per batch)

BLOCK_SIZE = context length: each input sequence is 64 tokens long

MAX_ITERS = total number of training steps (iterations)

EVAL_INTERVAL = how often (in steps) we pause to measure the train and validation loss, here every 200 steps

LEARNING_RATE = the step size the optimizer uses when it updates the weights

DEVICE = CPU or GPU (cuda, by NVIDIA). A CPU works but is slower and a GPU is much faster; the code picks the GPU automatically when one is available

# Cell 4
N_EMBD = 128
N_HEAD = 4
N_LAYER = 4
DROPOUT = 0.1

Cell 4 explanation

Cell 4 has the parameters which decide the model size and structure

N_EMBD = the dimension of each embedding vector (e.g. king = token 632, and 632 → [1.0, 2.3, 11.9, …] with 128 numbers in total; the ID and values are made up)

N_HEAD = number of attention heads (with 128 dimensions and 4 heads, each head has size 128/4 = 32)

N_LAYER = number of transformer blocks/layers

DROPOUT = during training, randomly zero out 10% of the activations to reduce overfitting

# Cell 5
tokenizer = tiktoken.get_encoding("gpt2")
VOCAB_SIZE = tokenizer.n_vocab # 50257

print(f"Using device: {DEVICE}")
print(f"Vocabulary size: {VOCAB_SIZE}")

Cell 5 explanation

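Here we load the GPT-2 encoding from tiktoken and read its vocabulary size (50257 tokens). The cell also prints which device is in use (CPU in this run).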

# Cell 6

class TinyStoriesDataset(Dataset):
    def __init__(self, split, block_size, num_stories=10000):
        self.block_size = block_size
        print(f"Loading TinyStories {split} split...")
        ds = load_dataset("roneneldan/TinyStories", split=split)

        # Tokenize and concatenate a subset of stories
        all_tokens = []
        print(f"Tokenizing {num_stories} stories...")
        for i in tqdm(range(num_stories)):
            text = ds[i]['text']
            # Encode and add the <|endoftext|> token to separate stories
            text_tokens = tokenizer.encode_ordinary(text) + [tokenizer.eot_token]
            all_tokens.extend(text_tokens)

        self.tokens = torch.tensor(all_tokens, dtype=torch.long)
        print(f"Loaded {len(self.tokens)} tokens.")

    def __len__(self):
        # Total number of possible sequences
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        # Input sequence (x)
        x = self.tokens[idx : idx + self.block_size]
        # Target sequence (y) - shifted by one
        y = self.tokens[idx + 1 : idx + self.block_size + 1]
        return x, y

# Create datasets and dataloaders
train_dataset = TinyStoriesDataset('train', BLOCK_SIZE)
val_dataset = TinyStoriesDataset('validation', BLOCK_SIZE, num_stories=1000) # Smaller val set

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

print("--- DataLoaders created successfully ---")

Cell 6 explanation

1. We create a class named TinyStoriesDataset that subclasses the PyTorch Dataset class. It has 3 main methods: __init__, __len__ and __getitem__

2. The split parameter chooses which part of the dataset to load: 'train' or 'validation'. There is no test split, to keep things simple

3. We load the TinyStories dataset, which contains lots of short stories, and store it in the variable ds

ds type: Hugging Face Dataset (datasets.Dataset)

ds sample:

First item text:

One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt. Lily went to her mom and said……………..

We create an empty list called all_tokens and read the stories one at a time into the variable text

text type : string

text sample:

“One day, a little girl named Lily found a needle in her room……………………..”

Now each story is tokenised with tiktoken (into sub-word tokens, not whole words). The encoded values are stored in text_tokens

text_tokens type : list

text_tokens sample: [3198, 1110, 11, 257, 1310,…..]

All the tokens are appended to all_tokens, with each story followed by the end-of-text token <|endoftext|>

all_tokens is then converted to a tensor

Now for training as well as validation we need an input and a target, which are x and y respectively

Since an LLM predicts each token from the tokens before it, if x starts at token i then y starts at token i+1

train_dataset holds the tokens of 10000 stories, and each item is an input/target pair of 64-token tensors

val_dataset holds the tokens of 1000 stories, with the same 64-token input/target pairs

x shape: torch.Size([64])

y shape: torch.Size([64])

train_loader and val_loader wrap their respective datasets and hand out batches of 32 sequences at a time

Batch x shape: torch.Size([32, 64])

Batch y shape: torch.Size([32, 64])
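The x/y shift is easier to see on a tiny example. The sketch below uses a plain Python list in place of the token tensor and a block size of 4 instead of 64. The helper name get_item and the token IDs after the first five are made up for illustration.

```python
# Toy stand-in for self.tokens (IDs after the first five are made up)
tokens = [3198, 1110, 11, 257, 1310, 2576, 3706, 20037]
block_size = 4  # the notebook uses BLOCK_SIZE = 64


def get_item(idx):
    """Mirror of TinyStoriesDataset.__getitem__ on a plain list."""
    x = tokens[idx : idx + block_size]          # input window
    y = tokens[idx + 1 : idx + block_size + 1]  # same window, shifted by one
    return x, y


num_items = len(tokens) - block_size  # mirror of __len__
x, y = get_item(0)
print("x:", x)
print("y:", y)
print("number of windows:", num_items)
```

Every position in y is the token that follows the same position in x, so one 64-token window gives the model 64 next-token predictions to learn from.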

# Cell 7

class MultiHeadAttention(nn.Module):
    """ The 'talking' part: Multi-Head Self-Attention """

    def __init__(self, n_head, head_size):
        super().__init__()
        # One big Linear layer to get Q, K, V for all heads
        self.c_attn = nn.Linear(N_EMBD, 3 * N_EMBD, bias=False)

        # Final output projection
        self.c_proj = nn.Linear(N_EMBD, N_EMBD)

        self.n_head = n_head
        self.head_size = head_size
        self.dropout = nn.Dropout(DROPOUT)

        # Causal mask
        self.register_buffer("bias", torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE))
                                      .view(1, 1, BLOCK_SIZE, BLOCK_SIZE))

    def forward(self, x):
        B, T, C = x.size() # Batch, Time (Block Size), Channels (N_EMBD)

        # 1. Get Q, K, V
        q, k, v = self.c_attn(x).split(N_EMBD, dim=2)

        # 2. Reshape for multi-head
        q = q.view(B, T, self.n_head, self.head_size).transpose(1, 2) # (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # 3. Calculate Attention Scores (affinities)
        att = (q @ k.transpose(-2, -1)) * (self.head_size**-0.5)

        # 4. Apply Mask (prevents looking into the future)
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))

        # 5. Softmax (convert scores to weights)
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        # 6. Apply weights to Values (get context-aware vectors)
        y = att @ v

        # 7. Re-assemble heads
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # 8. Final projection
        y = self.c_proj(y)
        y = self.dropout(y)
        return y

print("--- MultiHeadAttention layer defined ---")

Cell 7 explanation

We create a class called MultiHeadAttention that inherits from nn.Module, the base class for neural network components in PyTorch. The class has 2 methods: __init__() and forward()

Multi-head attention's first layer

c_attn - a linear layer with input size 128 and output size 384, which holds the q, k and v information. A linear layer normally computes output = input*weight + bias, but we disabled the bias for this layer, so output = input*weight

Multi-head attention's second layer

c_proj - a linear layer with input size 128 and output size 128

Then we build the causal mask: a 64×64 matrix of ones is turned into a **lower triangular matrix**, then reshaped to (1,1,64,64). The two leading 1s let it broadcast over the 32 sequences in the batch and the 4 heads. In each 64×64 attention matrix the rows are queries, the columns are keys, and each cell is the score (later the weight) between one query and one key

In forward() we read the shape of x = (32,64,128), which means that in one iteration we look at a batch of 32 sequences, each 64 tokens long, and each token is a 128-d vector

First we send x through c_attn, which gives (32,64,384), then we split it into q=(32,64,128), k=(32,64,128), v=(32,64,128)

Now we reshape q=(32,64,128): since 128 = 4 heads × 32, it becomes q=(32,64,4,32), and the transpose gives q=(32,4,64,32). We do the same for k and v

Now we calculate the attention scores att. First we do a matrix multiplication (@) between q and the transpose of k (its last two dimensions swapped), which results in (32,4,64,64). Each 64×64 matrix is an attention matrix with row=q, col=k. The raw dot products grow with the head size, so we scale them by head_size to the power -0.5, i.e. we divide by sqrt(32), which keeps the softmax from saturating

The scores have shape (32,4,64,64). To each 64×64 matrix we apply the causal mask: where the mask is 0 (a future token) we put -inf, and where the mask is 1 we keep the actual score

Softmax then turns the scores into weights (each row sums to 1, and the -inf entries become 0), and we apply dropout to the weights to reduce overfitting. Then y = att @ v, a weighted sum of the value vectors (there is no bias in this step), so y=(32,4,64,32)

We reassemble the heads so that y=(32,64,128), then pass it through c_proj and a final dropout
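To make the masking step concrete, here is a minimal sketch in plain Python (no PyTorch) on a made-up 3×3 score matrix. It is only meant to show why filling future positions with -inf gives them a weight of exactly 0 after softmax; the scores are invented.

```python
import math

# Toy attention scores for T = 3 tokens (rows = queries, columns = keys)
scores = [
    [2.0, 1.0, 0.5],
    [0.3, 1.5, 0.7],
    [1.0, 0.2, 0.9],
]
T = len(scores)

# Lower-triangular causal mask: 1 = may look, 0 = future token
mask = [[1 if col <= row else 0 for col in range(T)] for row in range(T)]


def softmax(row):
    exps = [math.exp(v) for v in row]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


weights = []
for row in range(T):
    masked = [scores[row][col] if mask[row][col] else float("-inf")
              for col in range(T)]
    weights.append(softmax(masked))

for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is all weight on position 0; the last token may spread its weight over all three positions.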

# Cell 8: Design Layer 2 - Feed-Forward Network

class FeedForward(nn.Module):
    """ The 'thinking' part: a simple 2-layer neural network """

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EMBD, 4 * N_EMBD), # Expand
            nn.ReLU(),                     # Activation function
            nn.Linear(4 * N_EMBD, N_EMBD), # Contract
            nn.Dropout(DROPOUT),
        )

    def forward(self, x):
        return self.net(x)

print("--- FeedForward layer defined ---")

Cell 8 explanation

Here a simple 2-layer feed-forward network is applied

The first linear layer expands 128 to 4 times its size, which is 512

Then we apply ReLU to the 512 values: any negative value is set to 0

The second linear layer brings it back to size 128

Dropout zeroes 10% of the values during training to reduce overfitting
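The two non-linear pieces of this cell, ReLU and dropout, can be sketched in a few lines of plain Python. This is an illustration of the idea rather than PyTorch's implementation; the helper names and input values are made up. The dropout shown is the "inverted" variant that nn.Dropout uses in training mode, where surviving values are scaled up by 1/(1-p).

```python
import random


def relu(values):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return [max(0.0, v) for v in values]


def dropout(values, p, rng):
    """Inverted dropout (training mode): zero each value with probability p,
    scale the survivors by 1 / (1 - p) so the expected value is unchanged."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]


hidden = [1.5, -0.3, 0.0, 2.0, -4.2]
activated = relu(hidden)
print(activated)

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0] * 1000, p=0.1, rng=rng)
print("fraction zeroed:", dropped.count(0.0) / len(dropped))
```

In evaluation mode (model.eval()) dropout is switched off and values pass through unchanged.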

# Cell 9: Design the Transformer Block

class Block(nn.Module):
    """ A single Transformer Block: Talk, then Think """

    def __init__(self):
        super().__init__()
        head_size = N_EMBD // N_HEAD
        self.attn = MultiHeadAttention(N_HEAD, head_size)
        self.ffn = FeedForward()
        self.ln_1 = nn.LayerNorm(N_EMBD)
        self.ln_2 = nn.LayerNorm(N_EMBD)

    def forward(self, x):
        # Residual Connections (x + ...)
        x = x + self.attn(self.ln_1(x)) # "Talk"
        x = x + self.ffn(self.ln_2(x)) # "Think"
        return x

print("--- Block layer defined ---")

Cell 9 explanation

transformer block is defined here as a stack

Input → LayerNorm
Pass through Attention
Add Residual Connection (x = x + attention_output)
LayerNorm again
Pass through FeedForward
Add Residual Connection (x = x + ffn_output)

# Cell 10: Assemble the Full MyGPT Model

class MyGPT(nn.Module):
    def __init__(self):
        super().__init__()

        # --- Embedding Layers (Vectorization) ---
        self.token_embedding_table = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.position_embedding_table = nn.Embedding(BLOCK_SIZE, N_EMBD)

        # --- Transformer Body ---
        self.blocks = nn.Sequential(*[Block() for _ in range(N_LAYER)])

        # --- Final Layers ---
        self.ln_f = nn.LayerNorm(N_EMBD) # Final LayerNorm
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE) # Output layer

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # 1. Get Embeddings
        tok_emb = self.token_embedding_table(idx) # (B, T, C)
        pos = torch.arange(T, device=idx.device) # positions 0..T-1, on the same device as the input
        pos_emb = self.position_embedding_table(pos) # (T, C)

        # 2. Add embeddings together
        x = tok_emb + pos_emb # (B, T, C)

        # 3. Pass through Transformer Blocks
        x = self.blocks(x)

        # 4. Final LayerNorm
        x = self.ln_f(x)

        # 5. Get Logits (the model's prediction scores)
        logits = self.lm_head(x) # (B, T, VOCAB_SIZE)

        # 6. Calculate Loss (if we are training)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits_view = logits.view(B*T, C)
            targets_view = targets.view(B*T)
            loss = F.cross_entropy(logits_view, targets_view)

        return logits, loss

    @torch.no_grad() # Tell PyTorch we aren't training
    def generate(self, start_text, max_new_tokens):
        self.eval() # Set model to evaluation mode

        # Tokenize the starting text
        start_tokens = tokenizer.encode_ordinary(start_text)
        idx = torch.tensor(start_tokens, dtype=torch.long, device=DEVICE).unsqueeze(0)

        for _ in range(max_new_tokens):
            # Crop context if it's longer than BLOCK_SIZE
            idx_cond = idx[:, -BLOCK_SIZE:]

            # Get logits
            logits, _ = self(idx_cond)

            # Focus on the logit for the *very last* token
            logits = logits[:, -1, :] # (B, C)

            # Get probabilities via Softmax
            probs = F.softmax(logits, dim=-1) # (B, C)

            # Sample the next token
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)

            # Append the new token to our sequence
            idx = torch.cat((idx, idx_next), dim=1)

        # Detokenize the final sequence
        return tokenizer.decode(idx[0].tolist())

# --- Create the model! ---
model = MyGPT()
model = model.to(DEVICE)
print(f"Model created with ~{sum(p.numel() for p in model.parameters())/1e6:.2f}M parameters")

Cell 10 explanation

Token embedding → converts token IDs to 128-dim vectors
Position embedding → gives each position (0–63) its own vector
Final embedding = token meaning + position meaning
After that we stack 4 transformer blocks (same structure, separate weights)
Before the output we normalise once more, then comes the output layer
The output layer maps 128 → 50257, so the predicted next token can be any token in the vocabulary (this is how the model can generate words that were not in the input)
We store the model's prediction scores as logits. For each token position, the model gives 50257 logits, one score for each possible next token. logits=(32,64,50257)
To calculate the loss (how far the predictions are from the actual next tokens) we use cross-entropy loss. We reshape logits to (32×64,50257) = (2048,50257) and targets to (32×64) = (2048), then take the cross-entropy of the two.
Then we move to the next function, generate(). Since the model is only predicting here, @torch.no_grad() tells PyTorch not to track gradients
Here we take the user's initial prompt, tokenise it and make it a tensor of shape [1, T], where T is the number of prompt tokens. For generation the batch size is 1, which is why we call unsqueeze(0). If the context grows longer than BLOCK_SIZE, we crop it to the last 64 tokens
We take the logits of the last position, which have shape [1,50257], and apply softmax to turn them into probabilities (0 to 1)
Next, torch.multinomial samples from those probabilities. High probability token = more chance of getting picked

Low probability token = less chance. Sampling gives creativity
Append the new token to the sequence, repeat, and decode everything at the end
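The parameter count that the cell prints can be cross-checked by hand. The sketch below recomputes it from the hyperparameters, assuming the layer shapes shown in Cells 7 to 10 (the helper linear is made up for this example). The causal mask is a buffer, not a parameter, so it is not counted.

```python
VOCAB_SIZE, BLOCK_SIZE = 50257, 64
N_EMBD, N_LAYER = 128, 4


def linear(n_in, n_out, bias=True):
    """Parameter count of nn.Linear(n_in, n_out)."""
    return n_in * n_out + (n_out if bias else 0)


layer_norm = 2 * N_EMBD  # learned scale + shift

# c_attn has no bias, c_proj does
attention = linear(N_EMBD, 3 * N_EMBD, bias=False) + linear(N_EMBD, N_EMBD)
feed_forward = linear(N_EMBD, 4 * N_EMBD) + linear(4 * N_EMBD, N_EMBD)
block = attention + feed_forward + 2 * layer_norm

embeddings = VOCAB_SIZE * N_EMBD + BLOCK_SIZE * N_EMBD
head = layer_norm + linear(N_EMBD, VOCAB_SIZE)  # ln_f + lm_head

total = embeddings + N_LAYER * block + head
print(f"per block: {block:,}")
print(f"total: {total:,} (~{total / 1e6:.2f}M)")
```

With these sizes almost all of the parameters sit in the token embedding table and the output layer; the four transformer blocks together hold well under one million.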

# Cell 11: Create the "Evaluate" Function

@torch.no_grad() # We don't need to calculate gradients for evaluation
def estimate_loss():
    out = {}
    model.eval() # Set model to evaluation mode
    for split, loader in [('train', train_loader), ('val', val_loader)]:
        losses = torch.zeros(EVAL_INTERVAL)
        # Create the iterator once, so every step sees a new batch
        # (next(iter(loader)) would restart from the first batch each time)
        loader_iter = iter(loader)
        for k in range(EVAL_INTERVAL):
            try:
                X, Y = next(loader_iter)
            except StopIteration:
                # This is a simple demo, so we'll just reset the loader
                # In a real setup, you'd iterate through the whole val set once
                loader_iter = iter(loader)
                X, Y = next(loader_iter)

            X, Y = X.to(DEVICE), Y.to(DEVICE)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() # Set model back to training mode
    return out

print("--- Evaluation function defined ---")

Cell 11 explanation

@torch.no_grad() tells PyTorch not to track gradients, since evaluation never updates the weights

out={} creates a dictionary to store the average loss of each split, and then we set the model to evaluation mode (which switches dropout off)

For both the train and the validation split we measure the loss on 200 batches (EVAL_INTERVAL is reused here as the number of batches)

For each batch we get x and y, move them to the GPU/CPU, calculate the loss (the cross-entropy between logits and targets) and store it in losses. Then we take the mean of the 200 losses and store it in out (for both train and val)

Finally we set the model back to training mode
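A general pitfall when pulling batches by hand: `next(iter(loader))` builds a fresh iterator on every call, so an unshuffled loader hands back its first batch every time. The sketch below shows the difference, using a plain list as a stand-in for a DataLoader with shuffle=False.

```python
batches = ["batch0", "batch1", "batch2"]  # stand-in for an unshuffled loader

# Fresh iterator on every call: always the first batch
restarted = [next(iter(batches)) for _ in range(3)]

# One iterator created up front: advances through the data
it = iter(batches)
advanced = [next(it) for _ in range(3)]

print(restarted)
print(advanced)
```

Averaging a loss over the "restarted" pattern would measure the same batch repeatedly, so the iterator should be created once, outside the inner loop.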

import time

# Cell 12: The "Train" Step (The Training Loop)

# Create the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

# Create iterators for our data loaders
train_iter = iter(train_loader)
val_iter = iter(val_loader)

print("Starting training...")
start_time = time.time()

for step in range(MAX_ITERS):

    # --- Periodically, evaluate the loss on train/val sets ---
    if step % EVAL_INTERVAL == 0 or step == MAX_ITERS - 1:
        losses = estimate_loss()
        print(f"Step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # 1. Get a training batch
    try:
        X, Y = next(train_iter)
    except StopIteration:
        train_iter = iter(train_loader) # Reset iterator
        X, Y = next(train_iter)

    X, Y = X.to(DEVICE), Y.to(DEVICE)

    # 2. Forward pass
    logits, loss = model(X, Y)

    # 3. Backward pass
    optimizer.zero_grad(set_to_none=True) # Clear old gradients
    loss.backward()                       # Calculate new gradients

    # 4. Update weights
    optimizer.step()

end_time = time.time()
print("--- Training Complete ---")
print(f"Final validation loss: {losses['val']:.4f}")
print(f"Training took {(end_time-start_time):.2f} seconds")

Cell 12 explanation

We create an optimizer (AdamW), which updates the weights to lower the loss

We create iterators over the train and validation loaders

Every 200 steps (and on the last step) estimate_loss() calculates the average train loss (200 batches) and the average val loss (200 batches)
This does NOT update the weights, because evaluation uses @torch.no_grad() → gradients OFF
Get x,y and move them to the CPU/GPU
In the forward pass the model predicts the next token: it calculates the logits and the loss
In the backward pass we compute the gradients (how much each weight contributed to the loss), and then the optimizer updates the weights to reduce the loss
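A rough way to read the printed loss values: a model that guesses uniformly over the whole vocabulary has a cross-entropy of ln(50257), so a freshly initialised model should start somewhere near that number and move down from there. The sketch below computes this baseline and the matching perplexity; the example loss of 3.0 is hypothetical, not a result from this notebook.

```python
import math

VOCAB_SIZE = 50257

# Cross-entropy of a model that spreads probability evenly over all tokens
uniform_loss = -math.log(1.0 / VOCAB_SIZE)
print(f"loss of a uniform guess: {uniform_loss:.2f}")


def perplexity(loss):
    """exp(loss): roughly 'how many tokens the model is choosing between'."""
    return math.exp(loss)


example_loss = 3.0  # hypothetical value, not a result from this notebook
print(f"perplexity at loss {example_loss}: {perplexity(example_loss):.1f}")
```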

# Cell 13: "Evaluate" (Qualitative Test via Generation)

print("\n--- GENERATING YOUR STORY 🥰🍓 ---")
prompt = input("Enter your imagination a small line:")
generated_text = model.generate(start_text=prompt, max_new_tokens=600)
print(generated_text)

Cell 13 explanation:

get prompt from user and generate the story

# Cell 14: Save the model state dictionary
torch.save(model.state_dict(), 'tiny_stories_gpt.pth')
print("Model saved to tiny_stories_gpt.pth")

# To load it back later:
# model = MyGPT()
# model.load_state_dict(torch.load('tiny_stories_gpt.pth', map_location=DEVICE))
# model.to(DEVICE)

Cell 14 explanation:

Save the model's weights (its state_dict) to a file with PyTorch's torch.save()

Unveiling Myths: Understanding and Building a Large Language Model💻

Sudharshini Jothikumar — Wed, 19 Nov 2025 07:01:59 GMT

To start with LLMs, we first need to talk about AI (Artificial Intelligence), because if the LLM is a kid, it came from its mom, and that is AI 😅

AI

When machines behave like humans, we call it AI

Eg: Your friend can tell whether you are happy or sad from your voice and way of speaking. Nowadays many AI bots can do the same emotion detection from an audio clip and say whether you are happy or sad. AI is in its peak era, that's all that's happening 😅😅

What is LLM

•A model or a program (simply a set of code with a few algorithms) that can understand and generate human language. It is absolutely a part of AI (as it behaves like a human). More precisely, it is a neural network trained to predict the next token (word-piece / subword) given the prior tokens. 💻💻💗💗

In earlier days we learnt that humans can talk Tamil, Telugu, Malayalam, Kannada, English, Spanish etc., while a machine knows only 0️⃣ and 1️⃣. But now we can see that LLMs handle all these languages. Even humans can't understand that many, I bet 😁

SLM vs LLM 😉

Feature	🤖 LLM	⚡ SLM
Size	Massive: Billions to Trillions of parameters.	Small: Millions to a few Billion parameters.
Knowledge	Generalist: Knows about everything.	Specialist: Expert on a specific, narrow topic.
Hardware	Huge: Needs a data center with many powerful GPUs.	Local: Can run on a laptop, smartphone, or edge device.
Speed	Slower: High latency (takes time to "think").	Very Fast: Low latency (ideal for real-time apps).
Cost	Expensive: Costs millions to train and is costly to run.	Cheap: Inexpensive to train and run.
Privacy	Low: You usually send your data to a company's cloud.	High: Can run 100% on your local device. No data shared.
Best For...	Complex reasoning, creative content, broad research.	Specific tasks: chatbots, summarization, on-device AI.

Some real-world examples of LLMs

The joke is we use LLMs without knowing they are LLMs 😺

ChatGPT, DeepSeek, Gemini, Perplexity, Meta AI, GitHub Copilot

Different methods of building an LLM 😁

Method	Primary Tool / Example	Effort	Cost	Customization
1. The API Method	Google Gemini API, OpenAI API	Low	Pay-per-use (can get expensive)	Low (Limited to prompt engineering)
2. The Local/Fine-Tuning Method	Ollama, Hugging Face, Unsloth	Medium	Free (requires a good GPU)	High (You can change the model's knowledge)
3. The "From Scratch" Method	PyTorch, TensorFlow, Custom GPU Clusters	Extremely High	$10M - $100M+ (Millions)	Total (You define everything)

Characteristics of LLM 🎀

•Mainly uses Python

•Frameworks to simplify our job: TensorFlow (TF) and PyTorch

•Pretrained models/transformers are available from Hugging Face, including open model families such as Llama

•A library that helps us easily connect an LLM with our application: LangChain

•LLMs are pre-trained on an enormous and diverse dataset, typically a large portion of the internet, books, Wikipedia and code.

Pytorch 🐦‍🔥

It is a large framework that provides a system to build, train and deploy AI models. It has lots of modules or packages within it. Instead of PyTorch you can also use TF (TensorFlow)

Eg packages within PyTorch: nn, tensor, functional, optim, autograd 🩷

Creating an LLM is like baking a cake, with a lot of steps within it. We need to understand them before stepping into coding, so let's see them one by one 🎂🍰

1.Tokenisation

To train the LLM and make it understand and generate human language, we give it large datasets to learn from and find patterns in. Usually these are books, news articles, Wikipedia, code etc. But the model cannot take raw text directly as input. It needs numerical values to analyse patterns and learn to generate 💻💻💻

•**Tokenisation is the process of converting text into tokens/numbers.** Ready-made tokenisers work like a dictionary that maps tens of thousands of words and word-pieces to numbers (the GPT-2 encoding in tiktoken has 50257 entries). We can use one instead of building the vocabulary manually, which would take years… Ex: tiktoken (King=623, Queen=598, is=12; these IDs are made up for illustration)
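As a rough illustration of the text-to-numbers idea, here is a toy word-level tokeniser in plain Python. The vocabulary, the IDs and the helper names are all hypothetical. Real tokenisers such as tiktoken use byte-pair encoding, which splits text into sub-word pieces and therefore does not need an "unknown" entry.

```python
# Hypothetical toy vocabulary; the IDs are made up for illustration
vocab = {"king": 623, "queen": 598, "is": 12, "<unk>": 0}
id_to_word = {i: w for w, i in vocab.items()}


def encode(text):
    """Lower-case, split on spaces, map each word to its ID (unknown -> <unk>)."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids):
    """Map IDs back to words and join them with spaces."""
    return " ".join(id_to_word[i] for i in ids)


ids = encode("King is king")
print(ids)
print(decode(ids))
```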

2.Embedding

Embedding is the process of converting a token into a vector. While a token is just a single number, its vector holds a lot of information about that token 😺

The first layer in any LLM is the embedding layer

Eg King=623 → tokenisation

623 =[0.12, -0.55, 1.03, ...]→embedding

This vector contains a lot of information about king

semantic info - king is near queen, he is masculine etc…

context info - king is used around “historic time, war, palace”

grammar info - noun etc…

The best thing is you don't create all this info by hand; the LLM learns it during training 😁
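Under the hood an embedding layer is just a lookup table: one row of numbers per token ID. The sketch below mimics that with plain Python lists and toy sizes; the table starts out random, and in a real model training is what gradually gives the rows their meaning.

```python
import random

rng = random.Random(0)
VOCAB_SIZE, N_EMBD = 1000, 4  # toy sizes; the notebook uses 50257 and 128

# One row of N_EMBD numbers per token ID; random at the start of training
embedding_table = [[rng.gauss(0.0, 1.0) for _ in range(N_EMBD)]
                   for _ in range(VOCAB_SIZE)]


def embed(token_ids):
    """Embedding is a row lookup: token ID -> that row of the table."""
    return [embedding_table[t] for t in token_ids]


vectors = embed([623, 598, 623])
print(len(vectors), "vectors of size", len(vectors[0]))
```

The same token ID always returns the same row, which is why the model needs positional information on top of it.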

3.Positional Encoding 1️⃣2️⃣3️⃣

•An LLM reads all the words at once. So the model **cannot know which word comes first, second, third…** unless we add positions.

•We can add positional info in various ways: sinusoidal (use sin and cos functions to generate numeric values based on the position and the dimension of the vector) or learned positional embeddings (let the LLM learn the positions by itself)

Positional Encoding is the process of adding the positional information of a token to its embedding vector 💗💗
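The sinusoidal option can be sketched in plain Python as follows. This follows the sin/cos scheme from the original Transformer paper; the function name and the 4-dim toy embedding are made up. The notebook in the companion post uses the other option, a learned position embedding table.

```python
import math


def sinusoidal_encoding(position, dim):
    """Even slots use sin, odd slots use cos, with wavelengths that grow
    along the vector."""
    vec = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        if i + 1 < dim:
            vec.append(math.cos(angle))
    return vec


embedding = [0.3, -0.1, 0.7, 0.2]  # toy 4-dim token vector (made up)
for pos in range(2):
    pe = sinusoidal_encoding(pos, dim=4)
    combined = [e + p for e, p in zip(embedding, pe)]
    print(pos, [round(v, 3) for v in pe], [round(v, 3) for v in combined])
```

The same token vector ends up different at position 0 and position 1, which is exactly the signal the model uses to tell word order apart.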

Ex : cat sat on mat

“cat” is at position 0, “sat” is at position 1, “on” is at position 2, “mat” is at position 3
Each word is converted into a vector:

| Word | Embedding vector (example) |
| --- | --- |
| cat | [0.3, -0.1, 0.7, ...] |
| sat | [0.1, 0.4, -0.2, ...] |
| on | [-0.5, 0.8, 0.1, ...] |
| mat | [0.6, 0.2, 0.9, ...] |

The transformer then adds a position vector for each word, using one of the methods above:

Example:

Position	Positional Vector
0	[0.99, 0.12, -0.88, ...]
1	[0.84, 0.52, -0.41, ...]
2	[0.65, 0.78, 0.20, ...]
3	[0.35, 0.93, 0.77, ...]

Final Input Vector = Embedding + Positional Encoding

For “cat” at position 0:

[0.3, -0.1, 0.7, ...]   (meaning)
+
[0.99, 0.12, -0.88, ...] (position)
=
[1.29, 0.02, -0.18, ...]

This new vector tells the model:

"cat" + "I am the first word"

4.Attention Mechanism🩷

“For the current word I am processing,
which other words in the sentence should I pay attention to?”

Ex : cat sat on mat

Say the word the LLM is currently processing is cat. Its embedding vector holds its information, and its position is added too. But while the model learns about cat, which other words are most related or necessary to it?

cat-sat: important, as it is the action of the cat, so we give 0.6

cat-on: grammatically too, on feels less important for cat, so 0.1

cat-cat: both hold exactly the same info, so 1, etc. (these are illustrative scores; softmax later turns them into weights that sum to 1)

5.Multi-Head Attention☺️🤩😐😏😣🥱

In the example above we computed one set of attention scores, from the point of view of the word cat. With multi-head attention we create several heads, and each head can focus on something different, like

cat sat on mat

head1 - for nouns/subject (mainly concentrates on cat) - [0.9,0.1,0.1,0.7]

head2 - for the object (mainly concentrates on mat) - [0.6,0.2,0.1,0.9]

head3 - for the verb/action (mainly concentrates on sat) - [0.1,1.2,0.3,0.1]

These roles and numbers are only illustrative: nobody assigns them, each head learns its own focus during training.

Both single-head and multi-head attention follow the QKV mechanism

Query (Q) → “What am I looking for?”
Key (K) → “What do I offer?”
Value (V) → “What information should be passed?”

score = Q @ transpose(K) / sqrt(head_size)
weight = softmax(score)
y = weight @ V (a weighted sum of the values; no bias is added in this step)
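The QKV steps for a single head can be written out in plain Python. This is a minimal sketch with made-up 3-token inputs and a head size of 2; it leaves out the causal mask, dropout and the learned projections that produce Q, K and V in a real model.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    d = len(q[0])                                   # head size
    scores = matmul(q, transpose(k))                # query-key affinities
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]      # each row sums to 1
    return matmul(weights, v), weights              # weighted sum of values


# Toy Q, K, V for 3 tokens with head size 2 (made-up numbers)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(q, k, v)
for row in weights:
    print([round(w, 3) for w in row])
```

Multi-head attention runs this same computation several times in parallel on different slices of the embedding and concatenates the results.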

6.Feed Forward🍓🍒

Inside a transformer block:

Attention = TALK (token talks with other tokens)
Feed Forward = THINK (token thinks alone)

👉 FFN works on each token individually, no talking between tokens.

It is simply a tiny 2-layer neural network applied to every token’s vector.

7.Residual connection 🤝

Residual connection = Memory + New Knowledge

Residual connection does:

output = original + new_output

That means:

Keep the original meaning as it is
Add only the improvements

🧠 Super Simple Example

Input vector (simplified):

x = [5,2]

Attention output:

attn_out = [2,9]

Residual connection:

x = x + attn_out = [7,11]

8.Layer Normalization (LayerNorm)💙💙

👉 Simple Meaning: “Clean and balance the values before processing.”

🎯 Make values have mean = 0 and variance = 1

[87, 90, 22, 45, 100, 10] into [0.80, 0.88, -1.05, -0.40, 1.17, -1.40]
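The normalisation itself is short enough to write by hand. The sketch below is a plain-Python version of the core step: subtract the mean, then divide by the standard deviation. PyTorch's nn.LayerNorm additionally applies a learned scale and shift, which are left out here.

```python
import math


def layer_norm(values, eps=1e-5):
    """Subtract the mean, divide by the standard deviation.
    (nn.LayerNorm also applies a learned scale and shift, left out here.)"""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


normed = layer_norm([87, 90, 22, 45, 100, 10])
print([round(v, 2) for v in normed])
```

Values above the mean (59 here) come out positive and values below it come out negative, and the result has mean 0 and variance 1 up to rounding.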

9.Logits

In the output layer, when the model predicts the next token from the previous tokens, it gives a prediction score (a logit) for every token in the vocabulary. The highest-scoring tokens are the most likely next tokens
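How logits become a chosen token can be sketched with a four-word toy vocabulary. The words and logits below are made up; softmax turns the scores into probabilities, and random sampling (the standard-library stand-in for torch.multinomial here) picks likely tokens often and unlikely ones rarely.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of 4 tokens with made-up logits
vocab = ["mat", "sofa", "moon", "dog"]
logits = [3.0, 2.0, 0.5, -1.0]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(vocab, weights=probs, k=1000)
counts = {word: samples.count(word) for word in vocab}

print([round(p, 3) for p in probs])
print(counts)
```

Always taking the single highest-scoring token would make the output repeat itself; sampling keeps some variety while still favouring the likely tokens.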

10.What happens in forward and backward pass➡️⬆️➡️

Forward pass

The model predicts the next token logits
Compares logits with Y
Computes loss
Loss = “how wrong the predictions were”

Backward pass

Model computes gradients (how much each weight contributed to the error)
This is the model learning from mistakes
Optimizer uses the gradients to update weights in a direction that reduces loss
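The forward/backward cycle can be followed by hand on a model with a single weight. The sketch below is a made-up example using plain gradient descent (SGD) and a squared-error loss, chosen because the gradient is easy to write out; a real LLM uses cross-entropy loss and typically an optimizer such as AdamW, but the loop has the same shape.

```python
# Toy model: prediction = w * x, loss = (prediction - y) ** 2
x, y = 2.0, 10.0       # one training example (made up)
w = 1.0                # initial weight
learning_rate = 0.05

history = []
for step in range(20):
    prediction = w * x                     # forward pass
    loss = (prediction - y) ** 2           # how wrong the prediction was
    grad = 2 * (prediction - y) * x        # backward pass: d(loss)/d(w)
    w = w - learning_rate * grad           # optimizer step (plain SGD)
    history.append(loss)

print(f"first loss: {history[0]:.3f}, last loss: {history[-1]:.6f}, w: {w:.4f}")
```

The loss shrinks at every step and w moves towards 5, the value for which w * x equals y.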