<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Exploring the myth behind LLMs😺]]></title><description><![CDATA[Exploring the myth behind LLMs😺]]></description><link>https://exploring-the-myth-behind-llms.hashnode.dev</link><generator>RSS for Node</generator><lastBuildDate>Tue, 23 Jun 2026 14:48:28 GMT</lastBuildDate><atom:link href="https://exploring-the-myth-behind-llms.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Unveiling Myths: Understanding and Building a Large Language Model💻]]></title><description><![CDATA[# Cell 1
!pip install -q torch datasets tiktoken tqdm
print("Libraries installed successfully.")

Cell 1 explanation
here we installed torch(which is pytorch framework)🐦‍🔥 datasets(library of hugging face to load and use datasets) tiktoken( tokenis...]]></description><link>https://exploring-the-myth-behind-llms.hashnode.dev/unveiling-myths-understanding-and-building-a-large-language-model-1</link><guid isPermaLink="true">https://exploring-the-myth-behind-llms.hashnode.dev/unveiling-myths-understanding-and-building-a-large-language-model-1</guid><dc:creator><![CDATA[Sudharshini Jothikumar]]></dc:creator><pubDate>Fri, 21 Nov 2025 11:13:29 GMT</pubDate><content:encoded><![CDATA[<pre><code class="lang-python"><span class="hljs-comment"># Cell 1</span>
!pip install -q torch datasets tiktoken tqdm
print(<span class="hljs-string">"Libraries installed successfully."</span>)
</code></pre>
<p><strong>Cell 1 explanation</strong></p>
<p>Here we install <strong>torch</strong> (the PyTorch framework)🐦‍🔥, <strong>datasets</strong> (the Hugging Face library for loading and using datasets), <strong>tiktoken</strong> (a tokeniser with a ready-made vocabulary that turns text into token IDs) and <strong>tqdm</strong> (for progress bars).🔋🪫</p>
<pre><code class="lang-python"><span class="hljs-comment"># Cell 2</span>
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">from</span> torch.nn <span class="hljs-keyword">import</span> functional <span class="hljs-keyword">as</span> F
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> Dataset, DataLoader
<span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset
<span class="hljs-keyword">import</span> tiktoken
<span class="hljs-keyword">import</span> math
<span class="hljs-keyword">from</span> tqdm.auto <span class="hljs-keyword">import</span> tqdm
</code></pre>
<p><strong>Cell 2 explanation</strong></p>
<p>From torch (the PyTorch framework) we import <strong>nn</strong> (to build a <strong>neural network</strong> out of layers) and <strong>functional</strong> (which contains many <strong>activation</strong> and <strong>loss</strong> functions). From <strong>torch.utils.data</strong> we import <strong>Dataset</strong> and <strong>DataLoader</strong> to define a dataset and serve it in batches.</p>
<p>From datasets (by Hugging Face) we import <strong>load_dataset</strong> to download and use datasets hosted on Hugging Face🤗</p>
<p>We also import tiktoken (the tokeniser), <strong>math</strong> (for math operations) and tqdm (progress bars).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Cell 3</span>
BATCH_SIZE = <span class="hljs-number">32</span>
BLOCK_SIZE = <span class="hljs-number">64</span>
MAX_ITERS = <span class="hljs-number">2000</span>
EVAL_INTERVAL = <span class="hljs-number">200</span>
LEARNING_RATE = <span class="hljs-number">0.001</span>
DEVICE = <span class="hljs-string">'cuda'</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">'cpu'</span>
</code></pre>
<p><strong>Cell 3 explanation</strong></p>
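<p>Cell 3 holds the <strong>hyperparameters</strong> that control training🤗</p>
<p>BATCH_SIZE = the number of sequences processed together in one training step; here <strong>32 sequences are trained in parallel</strong></p>
<p>BLOCK_SIZE = <strong>the context length</strong>: each training input is 64 tokens long</p>
<p>MAX_ITERS = <strong>total number of training steps</strong> (iterations), here 2000</p>
<p>EVAL_INTERVAL = <strong>how often we evaluate</strong>: the train and validation loss are measured every 200 steps</p>
<p>LEARNING_RATE = the <strong>step size of each weight update</strong>; a larger value changes the weights more per step</p>
<p>DEVICE = <strong>CPU or GPU</strong> (CUDA, by NVIDIA). The code picks the GPU when one is available and falls back to the CPU otherwise; the CPU works but is much slower.</p>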
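<p>To get a feel for what these numbers imply, here is a small arithmetic sketch. It only uses the values from Cell 3 and the evaluation condition from the training loop later in the post; the variable names tokens_per_step, tokens_total and eval_steps are made up for this illustration.</p>

```python
# Values copied from Cell 3
BATCH_SIZE = 32
BLOCK_SIZE = 64
MAX_ITERS = 2000
EVAL_INTERVAL = 200

# Token positions the model sees in one training step
tokens_per_step = BATCH_SIZE * BLOCK_SIZE

# Token positions processed over the whole run (overlapping windows count again)
tokens_total = tokens_per_step * MAX_ITERS

# Steps at which the loop evaluates: every EVAL_INTERVAL steps plus the last step
eval_steps = [s for s in range(MAX_ITERS) if s % EVAL_INTERVAL == 0 or s == MAX_ITERS - 1]

print(tokens_per_step)  # 2048
print(tokens_total)     # 4096000
print(len(eval_steps))  # 11
```

<p>So one step covers 2048 token positions, and the loss is reported 11 times during the run.</p>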
<pre><code class="lang-python"><span class="hljs-comment"># Cell 4</span>
N_EMBD = <span class="hljs-number">128</span>
N_HEAD = <span class="hljs-number">4</span>
N_LAYER = <span class="hljs-number">4</span>
DROPOUT = <span class="hljs-number">0.1</span>
</code></pre>
<p><strong>Cell 4 explanation</strong></p>
<p>Cell 4 holds the parameters that decide the <strong>model size and structure</strong></p>
<p>N_EMBD = the dimension of each embedding vector (e.g. say king has token ID 632, then 632 → [1.0, 2.3, 11.9, ……..] with 128 values in total)</p>
<p>N_HEAD = number of attention heads (the 128 values are split across 4 heads, so each head has size 128/4 = 32)</p>
<p>N_LAYER = number of transformer blocks (layers)</p>
<p>DROPOUT = during training, randomly zero 10% of the activations to reduce overfitting</p>
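<p>A minimal plain-Python sketch of the two calculations mentioned above: the head size, and what dropout does to a list of activations during training. PyTorch's nn.Dropout also rescales the surviving values by 1/(1-p) so the average stays the same; the helper name dropout and the seeded random generator are assumptions made just for this example.</p>

```python
import random

N_EMBD, N_HEAD, DROPOUT = 128, 4, 0.1
head_size = N_EMBD // N_HEAD   # values handled by each attention head

def dropout(values, p, rng):
    """Zero each value with probability p and rescale the survivors by 1/(1-p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)          # fixed seed, only to make the sketch repeatable
activations = [1.0] * 10_000
dropped = dropout(activations, DROPOUT, rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)

print(head_size)  # 32
print(round(zero_fraction, 2), round(mean_after, 2))
```

<p>Roughly 10% of the values end up as zero, while the mean stays close to 1 thanks to the rescaling.</p>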
<pre><code class="lang-python"><span class="hljs-comment"># Cell 5</span>
tokenizer = tiktoken.get_encoding(<span class="hljs-string">"gpt2"</span>)
VOCAB_SIZE = tokenizer.n_vocab <span class="hljs-comment"># 50257</span>

print(<span class="hljs-string">f"Using device: <span class="hljs-subst">{DEVICE}</span>"</span>)
print(<span class="hljs-string">f"Vocabulary size: <span class="hljs-subst">{VOCAB_SIZE}</span>"</span>)
</code></pre>
<p><strong>Cell 5 explanation</strong></p>
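<p>We load the GPT-2 encoding from tiktoken as our tokeniser and read its vocabulary size (50257 tokens), then print the device in use. This run used the CPU.</p>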
<pre><code class="lang-python"><span class="hljs-comment"># Cell 6</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">TinyStoriesDataset</span>(<span class="hljs-params">Dataset</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, split, block_size, num_stories=<span class="hljs-number">10000</span></span>):</span>
        self.block_size = block_size
        print(<span class="hljs-string">f"Loading TinyStories <span class="hljs-subst">{split}</span> split..."</span>)
        ds = load_dataset(<span class="hljs-string">"roneneldan/TinyStories"</span>, split=split)

        <span class="hljs-comment"># Tokenize and concatenate a subset of stories</span>
        all_tokens = []
        print(<span class="hljs-string">f"Tokenizing <span class="hljs-subst">{num_stories}</span> stories..."</span>)
        <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> tqdm(range(num_stories)):
            text = ds[i][<span class="hljs-string">'text'</span>]
            <span class="hljs-comment"># Encode and add the &lt;|endoftext|&gt; token to separate stories</span>
            text_tokens = tokenizer.encode_ordinary(text) + [tokenizer.eot_token]
            all_tokens.extend(text_tokens)

        self.tokens = torch.tensor(all_tokens, dtype=torch.long)
        print(<span class="hljs-string">f"Loaded <span class="hljs-subst">{len(self.tokens)}</span> tokens."</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__len__</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment"># Total number of possible sequences</span>
        <span class="hljs-keyword">return</span> len(self.tokens) - self.block_size

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span>(<span class="hljs-params">self, idx</span>):</span>
        <span class="hljs-comment"># Input sequence (x)</span>
        x = self.tokens[idx : idx + self.block_size]
        <span class="hljs-comment"># Target sequence (y) - shifted by one</span>
        y = self.tokens[idx + <span class="hljs-number">1</span> : idx + self.block_size + <span class="hljs-number">1</span>]
        <span class="hljs-keyword">return</span> x, y

<span class="hljs-comment"># Create datasets and dataloaders</span>
train_dataset = TinyStoriesDataset(<span class="hljs-string">'train'</span>, BLOCK_SIZE)
val_dataset = TinyStoriesDataset(<span class="hljs-string">'validation'</span>, BLOCK_SIZE, num_stories=<span class="hljs-number">1000</span>) <span class="hljs-comment"># Smaller val set</span>

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=<span class="hljs-literal">True</span>, drop_last=<span class="hljs-literal">True</span>)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=<span class="hljs-literal">False</span>, drop_last=<span class="hljs-literal">True</span>)

print(<span class="hljs-string">"--- DataLoaders created successfully ---"</span>)
</code></pre>
<p><strong>Cell 6 explanation</strong></p>
<p>1. We create a class named <strong>TinyStoriesDataset</strong> that subclasses the <strong>PyTorch Dataset</strong>. It has 3 main methods: __init__, __len__ and __getitem__.</p>
<p>2. The <strong>split</strong> parameter selects which part of the dataset to load: <strong>train</strong> or <strong>validation</strong>. There is no test split here, to keep things simple.</p>
<p>3. We load the <strong>TinyStories</strong> dataset, which contains lots of short stories, and store it in a variable <strong>ds</strong></p>
<p><strong>ds type: Hugging Face Dataset</strong> (returned by load_dataset)</p>
<p><strong>ds sample:</strong></p>
<p>First item text:</p>
<p>One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt. Lily went to her mom and said……………..</p>
<p>We create an empty list called <strong>all_tokens</strong>. Then, for each story, we read its content into <strong>text</strong></p>
<p><strong>text type : string</strong></p>
<p><strong>text sample:</strong></p>
<p><strong>“One day, a little girl named Lily found a needle in her room……………………..”</strong></p>
<p>Each story is then tokenised with tiktoken, which splits it into subword tokens (not necessarily whole words). The encoded IDs are stored in <strong>text_tokens</strong></p>
<p><strong>text_tokens type: list</strong></p>
<p><strong>text_tokens sample: [3198, 1110, 11, 257, 1310,…..]</strong></p>
<p>The tokens of every story are <strong>appended to all_tokens</strong>, with stories separated by the <strong>end-of-text token</strong> (&lt;|endoftext|&gt;)</p>
<p>all_tokens is then converted into a <strong>tensor</strong></p>
<p>For training as well as validation we need an input and a target, which are <strong>x and y</strong> respectively</p>
<p>Since an LLM predicts each token from the tokens before it, if <strong>x starts at token i</strong> then <strong>y starts at token i+1</strong>: y is x shifted by one position</p>
<p><strong>train_dataset</strong> holds the tokens of 10000 stories; each sample is an input and a target of 64 tokens each</p>
<p><strong>val_dataset</strong> holds the tokens of 1000 stories, with the same 64-token inputs and targets</p>
<p><strong>x shape: torch.Size([64])</strong></p>
<p><strong>y shape: torch.Size([64])</strong></p>
<p><strong>train_loader</strong> and <strong>val_loader</strong> wrap their respective datasets and yield batches of 32 samples at a time</p>
<p><strong>Batch x shape: torch.Size([32, 64])</strong></p>
<p><strong>Batch y shape: torch.Size([32, 64])</strong></p>
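<p>The shifted-by-one idea is easier to see on a tiny example. The sketch below mirrors the slicing in __len__ and __getitem__ with a plain Python list. The helper name get_item, the block size of 4 and the short token list are chosen only for readability (the post uses 64); 50256 is the GPT-2 end-of-text token ID.</p>

```python
def get_item(tokens, idx, block_size):
    """Mirror of __getitem__: x is a window, y is the same window shifted by one."""
    x = tokens[idx : idx + block_size]
    y = tokens[idx + 1 : idx + block_size + 1]
    return x, y

tokens = [3198, 1110, 11, 257, 1310, 50256]   # last ID stands for <|endoftext|>
block_size = 4                                 # tiny, for readability

num_samples = len(tokens) - block_size         # same formula as __len__
for idx in range(num_samples):
    x, y = get_item(tokens, idx, block_size)
    print(idx, x, y)
# 0 [3198, 1110, 11, 257] [1110, 11, 257, 1310]
# 1 [1110, 11, 257, 1310] [11, 257, 1310, 50256]
```

<p>Every position in y is the token that follows the matching position in x, so a single window gives the model block_size predictions to learn from.</p>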
<pre><code class="lang-python"><span class="hljs-comment"># Cell 7</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MultiHeadAttention</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">""" The 'talking' part: Multi-Head Self-Attention """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, n_head, head_size</span>):</span>
        super().__init__()
        <span class="hljs-comment"># One big Linear layer to get Q, K, V for all heads</span>
        self.c_attn = nn.Linear(N_EMBD, <span class="hljs-number">3</span> * N_EMBD, bias=<span class="hljs-literal">False</span>)

        <span class="hljs-comment"># Final output projection</span>
        self.c_proj = nn.Linear(N_EMBD, N_EMBD)

        self.n_head = n_head
        self.head_size = head_size
        self.dropout = nn.Dropout(DROPOUT)

        <span class="hljs-comment"># Causal mask</span>
        self.register_buffer(<span class="hljs-string">"bias"</span>, torch.tril(torch.ones(BLOCK_SIZE, BLOCK_SIZE))
                                      .view(<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, BLOCK_SIZE, BLOCK_SIZE))

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        B, T, C = x.size() <span class="hljs-comment"># Batch, Time (Block Size), Channels (N_EMBD)</span>

        <span class="hljs-comment"># 1. Get Q, K, V</span>
        q, k, v = self.c_attn(x).split(N_EMBD, dim=<span class="hljs-number">2</span>)

        <span class="hljs-comment"># 2. Reshape for multi-head</span>
        q = q.view(B, T, self.n_head, self.head_size).transpose(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>) <span class="hljs-comment"># (B, n_head, T, head_size)</span>
        k = k.view(B, T, self.n_head, self.head_size).transpose(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>)
        v = v.view(B, T, self.n_head, self.head_size).transpose(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>)

        <span class="hljs-comment"># 3. Calculate Attention Scores (affinities)</span>
        att = (q @ k.transpose(<span class="hljs-number">-2</span>, <span class="hljs-number">-1</span>)) * (self.head_size**<span class="hljs-number">-0.5</span>)

        <span class="hljs-comment"># 4. Apply Mask (prevents looking into the future)</span>
        att = att.masked_fill(self.bias[:,:,:T,:T] == <span class="hljs-number">0</span>, float(<span class="hljs-string">'-inf'</span>))

        <span class="hljs-comment"># 5. Softmax (convert scores to weights)</span>
        att = F.softmax(att, dim=<span class="hljs-number">-1</span>)
        att = self.dropout(att)

        <span class="hljs-comment"># 6. Apply weights to Values (get context-aware vectors)</span>
        y = att @ v

        <span class="hljs-comment"># 7. Re-assemble heads</span>
        y = y.transpose(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>).contiguous().view(B, T, C)

        <span class="hljs-comment"># 8. Final projection</span>
        y = self.c_proj(y)
        y = self.dropout(y)
        <span class="hljs-keyword">return</span> y

print(<span class="hljs-string">"--- MultiHeadAttention layer defined ---"</span>)
</code></pre>
<p><strong>Cell 7 explanation</strong></p>
<p>We create a class called <strong>MultiHeadAttention</strong> that inherits from <strong>nn.Module</strong>, the base class for neural network components in PyTorch. This class has 2 methods: <strong>__init__()</strong> and <strong>forward()</strong></p>
<p><strong>Multi-head attention's first layer</strong></p>
<p><strong>c_attn</strong> - a linear layer whose input size is <strong>128</strong> and whose output size, holding q, k and v together, is <strong>384</strong>. A linear layer normally computes <strong>op = ip*weight + bias</strong>, but here we disabled the bias for this layer, so <strong>op = ip*weight</strong></p>
<p><strong>Multi-head attention's second layer</strong></p>
<p>c_proj - a linear layer whose input size is <strong>128</strong> and whose output size is also <strong>128</strong></p>
<p>Then we build the causal mask: a <strong>64×64 matrix of ones</strong> turned into a <strong>lower triangular matrix</strong>, then reshaped to (1,1,64,64) so that it broadcasts across all 32 sequences in a batch and all 4 heads. In each 64×64 matrix the <strong>rows are query positions</strong> and the <strong>columns are key positions</strong>; a 1 in a cell means that query may attend to that key, and a 0 means the key lies in the future and is blocked</p>
<p>In forward(), x has shape <strong>(32,64,128): in one training step we look at 32 sequences of 64 tokens each, and each token is a 128-d vector</strong></p>
<p>First we send x through c_attn, which gives a tensor of shape <strong>(32,64,384), then we split it into q=(32,64,128), k=(32,64,128), v=(32,64,128)</strong></p>
<p>Next we reshape each of them for the heads. Since 128 = 4 heads × 32, q=(32,64,128) becomes q=(32,64,4,32), and transpose(1, 2) swaps the two middle dimensions to give <strong>q=(32,4,64,32)</strong>. We do the same for k and v</p>
<p>Now we calculate the attention scores <strong>att</strong>. We matrix-multiply (@) q with k transposed over its last two dimensions, which gives <strong>(32,4,64,64)</strong>. Each 64×64 matrix is an attention matrix with <strong>row=q, col=k</strong>. We then <strong>scale the scores by head_size**-0.5</strong>, i.e. divide by √32. Without this scaling the dot products grow with the head size, and large scores push softmax towards putting almost all of the weight on a single token</p>
<p>The scores have shape (32,4,64,64). To each 64×64 matrix we apply the causal mask: where the mask is 0 (a future position) the score is replaced by <strong>-inf</strong>, and where the mask is 1 we keep the <strong>actual score</strong></p>
<p>Softmax then turns each row of scores into weights that sum to 1; the <strong>-inf entries become exactly 0</strong>, so no token can look at the future. Dropout is applied to these weights to reduce overfitting. Then <strong>y = att @ v</strong> computes, for every token, a weighted sum of the value vectors, giving y=(32,4,64,32)</p>
<p>Finally we merge the heads back together so that <strong>y=(32,64,128)</strong>, pass it through c_proj and apply dropout once more</p>
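<p>Here is a plain-Python sketch of the core of forward() for a single head and a single sequence: scaled scores, causal mask, softmax, weighted sum of values. It is a simplified illustration, not the batched PyTorch code: it leaves out the multiple heads, the dropout and the c_proj projection, and the function names and the toy q, k, v values are made up for this example.</p>

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head attention over lists of vectors, with a lower-triangular mask."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    out, weights = [], []
    for i in range(T):
        # Score of query i against every key, scaled by 1/sqrt(head_size)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) * scale for j in range(T)]
        # Causal mask: positions j > i are in the future, so they get -inf
        scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        out.append([sum(w[j] * v[j][d] for j in range(T)) for d in range(len(v[0]))])
    return out, weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = causal_attention(q, k, v)
for row in weights:
    print([round(x, 3) for x in row])
```

<p>The first row of weights is [1.0, 0.0, 0.0]: the first token can only attend to itself. Each later row spreads its weight over itself and earlier positions, and everything above the diagonal stays at 0.</p>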
<pre><code class="lang-python"><span class="hljs-comment"># Cell 8: Design Layer 2 - Feed-Forward Network</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FeedForward</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">""" The 'thinking' part: a simple 2-layer neural network """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EMBD, <span class="hljs-number">4</span> * N_EMBD), <span class="hljs-comment"># Expand</span>
            nn.ReLU(),                     <span class="hljs-comment"># Activation function</span>
            nn.Linear(<span class="hljs-number">4</span> * N_EMBD, N_EMBD), <span class="hljs-comment"># Contract</span>
            nn.Dropout(DROPOUT),
        )

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-keyword">return</span> self.net(x)

print(<span class="hljs-string">"--- FeedForward layer defined ---"</span>)
</code></pre>
<p><strong>Cell 8 explanation</strong></p>
<p>Here we define a simple 2-layer <strong>feed-forward</strong> network</p>
<p>The first linear layer expands each vector from <strong>128 to 4 times that size, 512</strong></p>
<p>Then we apply <strong>ReLU</strong> to the 512 values: negative values are set to 0 and positive values pass through unchanged</p>
<p>The second linear layer brings it <strong>back to size 128</strong></p>
<p><strong>Dropout zeroes 10%</strong> of the outputs during training to reduce overfitting</p>
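<p>A small sketch of the two ideas in this cell, in plain Python: what ReLU does to a list of values, and how many trainable numbers the expand and contract layers hold. The helper names relu and linear_params are assumptions for this illustration; the count follows the usual rule for a linear layer (one weight per input-output pair, plus one bias per output).</p>

```python
N_EMBD = 128
hidden = 4 * N_EMBD   # 512

def relu(values):
    """ReLU keeps positive values and replaces negative ones with 0."""
    return [max(0.0, v) for v in values]

def linear_params(n_in, n_out, bias=True):
    """Trainable values in a linear layer: weights plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)

expand = linear_params(N_EMBD, hidden)     # 128 -> 512
contract = linear_params(hidden, N_EMBD)   # 512 -> 128

print(relu([-2.0, -0.5, 0.0, 1.5]))         # [0.0, 0.0, 0.0, 1.5]
print(expand, contract, expand + contract)  # 66048 65664 131712
```

<p>So the feed-forward part of one block holds about 132 thousand parameters.</p>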
<pre><code class="lang-python"><span class="hljs-comment"># Cell 9: Design the Transformer Block</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">Block</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-string">""" A single Transformer Block: Talk, then Think """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()
        head_size = N_EMBD // N_HEAD
        self.attn = MultiHeadAttention(N_HEAD, head_size)
        self.ffn = FeedForward()
        self.ln_1 = nn.LayerNorm(N_EMBD)
        self.ln_2 = nn.LayerNorm(N_EMBD)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>
        <span class="hljs-comment"># Residual Connections (x + ...)</span>
        x = x + self.attn(self.ln_1(x)) <span class="hljs-comment"># "Talk"</span>
        x = x + self.ffn(self.ln_2(x)) <span class="hljs-comment"># "Think"</span>
        <span class="hljs-keyword">return</span> x

print(<span class="hljs-string">"--- Block layer defined ---"</span>)
</code></pre>
<p><strong>Cell 9 explanation</strong></p>
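<p>A transformer block is defined here as the following sequence of steps. The input x is kept aside and added back after each sub-layer (a residual connection):</p>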
<ul>
<li><p><strong>Input → LayerNorm</strong></p>
</li>
<li><p><strong>Pass through Attention</strong></p>
</li>
<li><p><strong>Add Residual Connection (x = x + attention_output)</strong></p>
</li>
<li><p><strong>LayerNorm again</strong></p>
</li>
<li><p><strong>Pass through FeedForward</strong></p>
</li>
<li><p><strong>Add Residual Connection (x = x + ffn_output)</strong></p>
</li>
</ul>
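<p>The steps above can be sketched in plain Python for a single vector. This is a simplified illustration: layer_norm here leaves out the learned scale and shift that nn.LayerNorm has, and toy_attn and toy_ffn are hypothetical stand-ins for the real attention and feed-forward layers.</p>

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def block(x, attn, ffn):
    """Pre-norm block: normalise, apply the sub-layer, add the result back to x."""
    x = [a + b for a, b in zip(x, attn(layer_norm(x)))]   # "Talk"
    x = [a + b for a, b in zip(x, ffn(layer_norm(x)))]    # "Think"
    return x

# Hypothetical stand-ins for the real sub-layers, just to make the sketch runnable
toy_attn = lambda h: [0.5 * v for v in h]
toy_ffn = lambda h: [max(0.0, v) for v in h]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
print([round(v, 3) for v in normed])
print([round(v, 3) for v in block(x, toy_attn, toy_ffn)])
```

<p>Because of the residual connections, a sub-layer that outputs zeros leaves x unchanged, which is part of why stacking several blocks stays trainable.</p>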
<pre><code class="lang-python"><span class="hljs-comment"># Cell 10: Assemble the Full MyGPT Model</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MyGPT</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()

        <span class="hljs-comment"># --- Embedding Layers (Vectorization) ---</span>
        self.token_embedding_table = nn.Embedding(VOCAB_SIZE, N_EMBD)
        self.position_embedding_table = nn.Embedding(BLOCK_SIZE, N_EMBD)

        <span class="hljs-comment"># --- Transformer Body ---</span>
        self.blocks = nn.Sequential(*[Block() <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(N_LAYER)])

        <span class="hljs-comment"># --- Final Layers ---</span>
        self.ln_f = nn.LayerNorm(N_EMBD) <span class="hljs-comment"># Final LayerNorm</span>
        self.lm_head = nn.Linear(N_EMBD, VOCAB_SIZE) <span class="hljs-comment"># Output layer</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, idx, targets=None</span>):</span>
        B, T = idx.shape

        <span class="hljs-comment"># 1. Get Embeddings</span>
        tok_emb = self.token_embedding_table(idx) <span class="hljs-comment"># (B, T, C)</span>
        pos = torch.arange(T, device=DEVICE)
        pos_emb = self.position_embedding_table(pos) <span class="hljs-comment"># (T, C)</span>

        <span class="hljs-comment"># 2. Add embeddings together</span>
        x = tok_emb + pos_emb <span class="hljs-comment"># (B, T, C)</span>

        <span class="hljs-comment"># 3. Pass through Transformer Blocks</span>
        x = self.blocks(x)

        <span class="hljs-comment"># 4. Final LayerNorm</span>
        x = self.ln_f(x)

        <span class="hljs-comment"># 5. Get Logits (the model's prediction scores)</span>
        logits = self.lm_head(x) <span class="hljs-comment"># (B, T, VOCAB_SIZE)</span>

        <span class="hljs-comment"># 6. Calculate Loss (if we are training)</span>
        loss = <span class="hljs-literal">None</span>
        <span class="hljs-keyword">if</span> targets <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            B, T, C = logits.shape
            logits_view = logits.view(B*T, C)
            targets_view = targets.view(B*T)
            loss = F.cross_entropy(logits_view, targets_view)

        <span class="hljs-keyword">return</span> logits, loss

<span class="hljs-meta">    @torch.no_grad() # Tell PyTorch we aren't training</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate</span>(<span class="hljs-params">self, start_text, max_new_tokens</span>):</span>
        self.eval() <span class="hljs-comment"># Set model to evaluation mode</span>

        <span class="hljs-comment"># Tokenize the starting text</span>
        start_tokens = tokenizer.encode_ordinary(start_text)
        idx = torch.tensor(start_tokens, dtype=torch.long, device=DEVICE).unsqueeze(<span class="hljs-number">0</span>)

        <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(max_new_tokens):
            <span class="hljs-comment"># Crop context if it's longer than BLOCK_SIZE</span>
            idx_cond = idx[:, -BLOCK_SIZE:]

            <span class="hljs-comment"># Get logits</span>
            logits, _ = self(idx_cond)

            <span class="hljs-comment"># Focus on the logit for the *very last* token</span>
            logits = logits[:, <span class="hljs-number">-1</span>, :] <span class="hljs-comment"># (B, C)</span>

            <span class="hljs-comment"># Get probabilities via Softmax</span>
            probs = F.softmax(logits, dim=<span class="hljs-number">-1</span>) <span class="hljs-comment"># (B, C)</span>

            <span class="hljs-comment"># Sample the next token</span>
            idx_next = torch.multinomial(probs, num_samples=<span class="hljs-number">1</span>) <span class="hljs-comment"># (B, 1)</span>

            <span class="hljs-comment"># Append the new token to our sequence</span>
            idx = torch.cat((idx, idx_next), dim=<span class="hljs-number">1</span>)

        <span class="hljs-comment"># Detokenize the final sequence</span>
        <span class="hljs-keyword">return</span> tokenizer.decode(idx[<span class="hljs-number">0</span>].tolist())

<span class="hljs-comment"># --- Create the model! ---</span>
model = MyGPT()
model = model.to(DEVICE)
print(<span class="hljs-string">f"Model created with ~<span class="hljs-subst">{sum(p.numel() <span class="hljs-keyword">for</span> p <span class="hljs-keyword">in</span> model.parameters())/<span class="hljs-number">1e6</span>:<span class="hljs-number">.2</span>f}</span>M parameters"</span>)
</code></pre>
<p><strong>Cell 10 explanation</strong></p>
<ul>
<li><p><strong>Token embedding</strong> → converts token IDs to 128-dim vectors</p>
</li>
<li><p><strong>Position embedding</strong> → gives each position (0–63) its own vector</p>
</li>
<li><p>Final embedding = token meaning + position meaning</p>
</li>
<li><p>After that the input passes through a stack of 4 transformer blocks. They share the same structure, but each block has its own weights</p>
</li>
<li><p>Before the output we normalise once more (ln_f), then comes the output layer</p>
</li>
<li><p>The output layer maps 128 → 50257, so the next token predicted by the model can be any token in the vocabulary (this is how it can generate words that were not in the prompt)</p>
</li>
<li><p>We store the model's prediction scores as <strong>logits</strong>. For each token position the model gives <strong>50257 logits</strong>, one score for each possible next token. logits=(32,64,50257)</p>
</li>
<li><p>To calculate the loss (how far the predictions are from the actual next tokens) we use <strong>cross-entropy loss</strong>. For this we reshape <strong>logits to (32×64,50257) = (2048,50257)</strong> and the <strong>targets to (32×64)</strong> = (2048), then compute the cross-entropy between the two.</p>
</li>
<li><p>Then we move to the next function, <strong>generate()</strong>. Since the model is only predicting here, @torch.no_grad() turns off gradient tracking and self.eval() disables dropout</p>
</li>
<li><p>We take the user's initial prompt, tokenise it and turn it into a tensor of shape [1, T], where T is the number of prompt tokens. For generation the batch size is 1, which is why we call <strong>unsqueeze(0)</strong>. If the context grows longer than BLOCK_SIZE, we crop it to the last 64 tokens</p>
</li>
<li><p>We take the logits of the last position, which have shape [1,50257], and apply <strong>softmax</strong> to turn them into probabilities (between 0 and 1, summing to 1)</p>
</li>
<li><p>Next, <strong>torch.multinomial</strong> samples one token from these probabilities. High probability token = more chance of getting picked</p>
<p>  Low probability token = less chance. <strong>Sampling gives creativity</strong></p>
</li>
<li><p>We append the new token to the sequence and repeat; at the end we decode all the tokens back into text</p>
</li>
</ul>
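<p>To make the last few steps concrete, here is a plain-Python sketch of softmax, cross-entropy for one position, and sampling. It uses a toy vocabulary of 4 tokens instead of 50257, and the function names and logits are made up for the example. F.cross_entropy does the same calculation for all 2048 positions and averages the results by default.</p>

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Loss for one position: negative log-probability of the correct token."""
    return -math.log(softmax(logits)[target])

def sample(probs, rng):
    """Pick an index at random, in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]   # toy scores for a 4-token vocabulary
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(round(cross_entropy(logits, 0), 3))   # low loss: token 0 is the likeliest
print(round(cross_entropy(logits, 3), 3))   # high loss: token 3 is unlikely
print(round(math.log(50257), 2))            # loss of a uniform guess over the real vocabulary

rng = random.Random(0)                      # fixed seed, only for repeatability
counts = [0] * len(probs)
for _ in range(10_000):
    counts[sample(probs, rng)] += 1
print(counts)
```

<p>The counts follow the probabilities: the likeliest token is picked most often, but the others still appear now and then, which is where the variety in generated stories comes from.</p>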
<pre><code class="lang-python"><span class="hljs-comment"># Cell 11: Create the "Evaluate" Function</span>

<span class="hljs-meta">@torch.no_grad() # We don't need to calculate gradients for evaluation</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">estimate_loss</span>():</span>
    out = {}
    model.eval() <span class="hljs-comment"># Set model to evaluation mode</span>
    <span class="hljs-keyword">for</span> split, loader <span class="hljs-keyword">in</span> [(<span class="hljs-string">'train'</span>, train_loader), (<span class="hljs-string">'val'</span>, val_loader)]:
        losses = torch.zeros(EVAL_INTERVAL)
        loader_iter = iter(loader) <span class="hljs-comment"># Create the iterator once so each step sees a new batch</span>
        <span class="hljs-keyword">for</span> k <span class="hljs-keyword">in</span> range(EVAL_INTERVAL):
            <span class="hljs-keyword">try</span>:
                X, Y = next(loader_iter)
            <span class="hljs-keyword">except</span> StopIteration:
                <span class="hljs-comment"># This is a simple demo, so we'll just reset the loader</span>
                <span class="hljs-comment"># In a real setup, you'd iterate through the whole val set once</span>
                loader_iter = iter(loader)
                X, Y = next(loader_iter)

            X, Y = X.to(DEVICE), Y.to(DEVICE)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train() <span class="hljs-comment"># Set model back to training mode</span>
    <span class="hljs-keyword">return</span> out

print(<span class="hljs-string">"--- Evaluation function defined ---"</span>)
</code></pre>
<p><strong>Cell 11 explanation</strong></p>
<p><strong>@torch.no_grad()</strong> tells PyTorch not to track gradients during evaluation, since no weights are updated here</p>
<p>out={} creates a dictionary to store the mean loss for each split, and then we set the model to evaluation mode</p>
<p>For both train and validation we compute the loss on <strong>200 batches</strong> (the code reuses EVAL_INTERVAL as the number of evaluation batches)</p>
<p>For each batch we <strong>get x, y</strong>, move them to the <strong>GPU/CPU</strong>, calculate the loss (how far the logits are from the targets) and store it in losses. We then take the mean of the 200 losses and store it in out, once for train and once for val</p>
<p>Finally we set the model back to training mode</p>
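<p>A detail worth keeping in mind when pulling batches manually with next(): every call to iter() starts a fresh pass over the data. The sketch below shows the difference with a plain list standing in for a DataLoader that does not shuffle; the names batches, restarted and advanced are made up for the illustration.</p>

```python
batches = ["batch0", "batch1", "batch2"]   # stand-in for a non-shuffled DataLoader

# Calling iter() inside the loop restarts from the beginning every time
restarted = [next(iter(batches)) for _ in range(3)]

# Creating the iterator once walks through the data
it = iter(batches)
advanced = [next(it) for _ in range(3)]

print(restarted)  # ['batch0', 'batch0', 'batch0']
print(advanced)   # ['batch0', 'batch1', 'batch2']
```

<p>For a loss estimate averaged over many batches, the second pattern is the one that actually covers different data.</p>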
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time

<span class="hljs-comment"># Cell 12: The "Train" Step (The Training Loop)</span>

<span class="hljs-comment"># Create the optimizer</span>
optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)

<span class="hljs-comment"># Create iterators for our data loaders</span>
train_iter = iter(train_loader)
val_iter = iter(val_loader)

print(<span class="hljs-string">"Starting training..."</span>)
start_time = time.time()

<span class="hljs-keyword">for</span> step <span class="hljs-keyword">in</span> range(MAX_ITERS):

    <span class="hljs-comment"># --- Periodically, evaluate the loss on train/val sets ---</span>
    <span class="hljs-keyword">if</span> step % EVAL_INTERVAL == <span class="hljs-number">0</span> <span class="hljs-keyword">or</span> step == MAX_ITERS - <span class="hljs-number">1</span>:
        losses = estimate_loss()
        print(<span class="hljs-string">f"Step <span class="hljs-subst">{step}</span>: train loss <span class="hljs-subst">{losses[<span class="hljs-string">'train'</span>]:<span class="hljs-number">.4</span>f}</span>, val loss <span class="hljs-subst">{losses[<span class="hljs-string">'val'</span>]:<span class="hljs-number">.4</span>f}</span>"</span>)

    <span class="hljs-comment"># 1. Get a training batch</span>
    <span class="hljs-keyword">try</span>:
        X, Y = next(train_iter)
    <span class="hljs-keyword">except</span> StopIteration:
        train_iter = iter(train_loader) <span class="hljs-comment"># Reset iterator</span>
        X, Y = next(train_iter)

    X, Y = X.to(DEVICE), Y.to(DEVICE)

    <span class="hljs-comment"># 2. Forward pass</span>
    logits, loss = model(X, Y)

    <span class="hljs-comment"># 3. Backward pass</span>
    optimizer.zero_grad(set_to_none=<span class="hljs-literal">True</span>) <span class="hljs-comment"># Clear old gradients</span>
    loss.backward()                       <span class="hljs-comment"># Calculate new gradients</span>

    <span class="hljs-comment"># 4. Update weights</span>
    optimizer.step()

end_time = time.time()
print(<span class="hljs-string">"--- Training Complete ---"</span>)
print(<span class="hljs-string">f"Final validation loss: <span class="hljs-subst">{losses[<span class="hljs-string">'val'</span>]:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Training took <span class="hljs-subst">{(end_time-start_time):<span class="hljs-number">.2</span>f}</span> seconds"</span>)
</code></pre>
<p><strong>Cell 12 explanation</strong></p>
<p>Create an <strong>optimizer</strong> (AdamW here); after each backward pass it updates the weights to lower the loss.</p>
<p>Create <strong>iterators</strong> over the training and validation data loaders.</p>
<ul>
<li><p>Every <code>EVAL_INTERVAL</code> steps (and on the last step), <code>estimate_loss()</code> calculates the <strong>average train loss (200 batches)</strong> and the <strong>average val loss (200 batches)</strong>.</p>
</li>
<li><p>This evaluation DOES NOT update the weights, because it runs under <code>@torch.no_grad()</code> → gradients OFF.</p>
</li>
<li><p>Get the next X, Y batch and move it to the CPU/GPU.</p>
</li>
<li><p>In the forward pass the model predicts the next token: it calculates the logits and the loss.</p>
</li>
<li><p>In the backward pass the model computes gradients from its mistakes; the optimizer then updates the weights to reduce the loss.</p>
</li>
</ul>
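<p>The "reset iterator" pattern from step 1 can be tried on its own. This is only a minimal sketch: a plain Python list stands in for <code>train_loader</code>, and the names <code>toy_loader</code>, <code>toy_iter</code> and <code>seen</code> are made up for this illustration.</p>
<pre><code class="lang-python"># Toy stand-in for train_loader: only three "batches" (assumption for illustration).
toy_loader = ["batch0", "batch1", "batch2"]
toy_iter = iter(toy_loader)

seen = []
for step in range(5):  # ask for more batches than the loader holds
    try:
        batch = next(toy_iter)
    except StopIteration:
        toy_iter = iter(toy_loader)  # loader exhausted: start again from the top
        batch = next(toy_iter)
    seen.append(batch)

print(seen)  # ['batch0', 'batch1', 'batch2', 'batch0', 'batch1']
</code></pre>
<p>This is why training can run for <code>MAX_ITERS</code> steps even when the dataset has fewer batches than that.</p>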
<pre><code class="lang-python"><span class="hljs-comment"># Cell 13: "Evaluate" (Qualitative Test via Generation)</span>

print(<span class="hljs-string">"\n--- GENERATING YOUR STORY 🥰🍓 ---"</span>)
prompt = input(<span class="hljs-string">"Enter a short opening line for your story: "</span>)
generated_text = model.generate(start_text=prompt, max_new_tokens=<span class="hljs-number">600</span>)
print(generated_text)
</code></pre>
<p><strong>Cell 13 explanation:</strong></p>
<p>Get a prompt from the user and let the model generate a story from it (up to 600 new tokens).</p>
<pre><code class="lang-python"><span class="hljs-comment"># Cell 14: Save the model state dictionary</span>
torch.save(model.state_dict(), <span class="hljs-string">'tiny_stories_gpt.pth'</span>)
print(<span class="hljs-string">"Model saved to tiny_stories_gpt.pth"</span>)

<span class="hljs-comment"># To load it back later:</span>
<span class="hljs-comment"># model = MyGPT()</span>
<span class="hljs-comment"># model.load_state_dict(torch.load('tiny_stories_gpt.pth'))</span>
<span class="hljs-comment"># model.to(DEVICE)</span>
</code></pre>
<p><strong>Cell 14 explanation:</strong></p>
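<p>Save the model's weights (its <code>state_dict</code>) with the PyTorch function <code>torch.save()</code>. The commented lines show how to load them back later.</p>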
]]></content:encoded></item><item><title><![CDATA[Unveiling Myths: Understanding and Building a Large Language Model💻]]></title><description><![CDATA[To start with LLM ….first we need to talk about AI(Artificial Intelligence)….because if LLM is kid it came from its mom…that is AI😅
AI
When machines behave like human then we call it AI
Eg: Your friend can understand whether you are happy/sad with y...]]></description><link>https://exploring-the-myth-behind-llms.hashnode.dev/unveiling-myths-understanding-and-building-a-large-language-model</link><guid isPermaLink="true">https://exploring-the-myth-behind-llms.hashnode.dev/unveiling-myths-understanding-and-building-a-large-language-model</guid><dc:creator><![CDATA[Sudharshini Jothikumar]]></dc:creator><pubDate>Wed, 19 Nov 2025 07:01:59 GMT</pubDate><content:encoded><![CDATA[<p>To start with LLMs, we first need to talk about AI (Artificial Intelligence), because if the LLM is the kid, AI is its mom 😅</p>
<h1 id="heading-ai">AI</h1>
<p><strong>When machines behave like humans, we call it AI</strong></p>
<p>Eg: Your friend can tell whether you are happy or sad from your voice and way of speaking. Nowadays many AI bots can do the same emotion detection from an audio clip and say whether you are happy or sad. <strong>AI is in its peak era, and it is all happening 😅😅</strong></p>
<h1 id="heading-what-is-llm">What is an LLM</h1>
<p>•<strong>A model or a program (simply a set of code with a few algorithms) that can understand and generate human language.</strong> It is absolutely a part of AI (as it behaves like a human). More precisely, it is <strong><mark>a neural network trained to predict the next token (word-piece / subword) given prior tokens.</mark>💻💻💗💗</strong></p>
<p>Earlier we learnt that humans can speak Tamil, Telugu, Malayalam, Kannada, English, Spanish and so on, while a machine knows only 0️⃣ and 1️⃣. But now we can see that LLMs handle all of these languages, more than any single human could, I bet 😁</p>
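<p>To get a feel for "predict the next token given prior tokens", here is a tiny sketch. It is NOT a neural network: it only counts which word follows which in a made-up corpus, and the names <code>corpus</code>, <code>follow_counts</code> and <code>predict_next</code> are invented for this illustration. An LLM does the same job with learned weights and much longer context.</p>
<pre><code class="lang-python">from collections import Counter, defaultdict

# Tiny "training corpus" (made up for this illustration).
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()

# Count how often each token follows each other token.
follow_counts = defaultdict(Counter)
for prev_tok, next_tok in zip(corpus, corpus[1:]):
    follow_counts[prev_tok][next_tok] += 1


def predict_next(token):
    """Return the token seen most often right after `token` in the corpus."""
    return follow_counts[token].most_common(1)[0][0]


print(predict_next("cat"))  # sat
print(predict_next("the"))  # cat
</code></pre>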
<h1 id="heading-slm-vs-llm">SLM vs LLM 😉</h1>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>🤖 LLM</td><td>⚡ SLM</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Size</strong></td><td><strong>Massive:</strong> Billions to Trillions of parameters.</td><td><strong>Small:</strong> Millions to a few Billion parameters.</td></tr>
<tr>
<td><strong>Knowledge</strong></td><td><strong>Generalist:</strong> Broad knowledge across many topics.</td><td><strong>Specialist:</strong> Expert on a specific, narrow topic.</td></tr>
<tr>
<td><strong>Hardware</strong></td><td><strong>Huge:</strong> Needs a data center with many powerful GPUs.</td><td><strong>Local:</strong> Can run on a laptop, smartphone, or edge device.</td></tr>
<tr>
<td><strong>Speed</strong></td><td><strong>Slower:</strong> High latency (takes time to "think").</td><td><strong>Very Fast:</strong> Low latency (ideal for real-time apps).</td></tr>
<tr>
<td><strong>Cost</strong></td><td><strong>Expensive:</strong> Costs millions to train and is costly to run.</td><td><strong>Cheap:</strong> Inexpensive to train and run.</td></tr>
<tr>
<td><strong>Privacy</strong></td><td><strong>Low:</strong> You usually send your data to a company's cloud.</td><td><strong>High:</strong> Can run 100% on your local device. No data shared.</td></tr>
<tr>
<td><strong>Best For...</strong></td><td>Complex reasoning, creative content, broad research.</td><td>Specific tasks: chatbots, summarization, on-device AI.</td></tr>
</tbody>
</table>
</div><h2 id="heading-some-real-time-examples-of-llm">Some real-world examples of LLMs</h2>
<p>The joke is that we use LLMs without knowing they are LLMs 😺</p>
<p><strong>ChatGPT, DeepSeek, Gemini, Perplexity, Meta AI, GitHub Copilot</strong></p>
<h1 id="heading-different-methods-of-building-llm">Different methods of building an LLM😁</h1>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Method</td><td>Primary Tool / Example</td><td>Effort</td><td>Cost</td><td>Customization</td></tr>
</thead>
<tbody>
<tr>
<td><strong>1. The API Method</strong></td><td>Google Gemini API, OpenAI API</td><td><strong>Low</strong></td><td>Pay-per-use (can get expensive)</td><td><strong>Low</strong> (Limited to prompt engineering)</td></tr>
<tr>
<td><strong>2. The Local/Fine-Tuning Method</strong></td><td>Ollama, Hugging Face, Unsloth</td><td><strong>Medium</strong></td><td>Free (requires a good GPU)</td><td><strong>High</strong> (You can change the model's knowledge)</td></tr>
<tr>
<td><strong>3. The "From Scratch" Method</strong></td><td>PyTorch, TensorFlow, Custom GPU Clusters</td><td><strong>Extremely High</strong></td><td>$10M - $100M+ (Millions)</td><td><strong>Total</strong> (You define everything)</td></tr>
</tbody>
</table>
</div><h1 id="heading-characteristics-of-llm">Characteristics of LLMs 🎀</h1>
<p>•Mainly built with <strong>PYTHON</strong></p>
<p>•Frameworks that simplify our job: <strong>TF AND PYTORCH</strong></p>
<p>•Pretrained models/transformers are available from <strong>HUGGING FACE</strong>; <strong>LLAMA</strong> is one popular family of pretrained models</p>
<p>•A library that helps us easily connect an LLM with our application: <strong>LANGCHAIN</strong></p>
<p>•LLMs are pre-trained on an enormous and diverse dataset, typically a large portion of the internet, books, Wikipedia and code.</p>
<h1 id="heading-pytorch"><strong>Pytorch</strong> 🐦‍🔥</h1>
<p>PyTorch is a large framework for building, training and deploying AI models. It has lots of modules and packages within it. Instead of PyTorch you can also use <strong>TF (TensorFlow)</strong></p>
<p>Eg packages within PyTorch: <strong>nn, tensor, functional, optim, autograd 🩷</strong></p>
<p><strong>Creating an LLM is like baking a cake: there are a lot of steps within it. We need to understand them before stepping into the code, so let's see them one by one🎂🍰</strong></p>
<h1 id="heading-1tokenisation">1.Tokenisation</h1>
<p>To train the LLM so that it can understand and generate human language, we give it large datasets to learn from and find patterns in. Usually these are books, news articles, Wikipedia, code and so on. But the model cannot take raw text directly as input; it needs numerical values to analyse patterns and learn to generate💻💻💻</p>
<p>•<strong>Tokenisation is the process of converting text into tokens/numbers.</strong> Ready-made tokenisers work like a dictionary that maps words and word pieces to numbers, so we can use one instead of building the mapping by hand, which would take years… <strong>Ex: tiktoken (King=623, Queen=598, is=12; illustrative IDs)</strong></p>
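<p>The idea can be sketched with a toy word-level tokeniser. This is an assumption-heavy illustration: it reuses the made-up IDs from the example above (they are not real tiktoken IDs), it splits only on spaces, and <code>encode</code> / <code>decode</code> here are hand-written helpers, whereas real tokenisers work on word pieces.</p>
<pre><code class="lang-python"># Toy vocabulary using the illustrative IDs from the text (not real tiktoken IDs).
vocab = {"King": 623, "Queen": 598, "is": 12}
id_to_word = {token_id: word for word, token_id in vocab.items()}


def encode(text):
    """Split on spaces and map each word to its ID."""
    return [vocab[word] for word in text.split()]


def decode(ids):
    """Map IDs back to words and join them with spaces."""
    return " ".join(id_to_word[i] for i in ids)


ids = encode("King is King")
print(ids)          # [623, 12, 623]
print(decode(ids))  # King is King
</code></pre>
<p>With tiktoken the same round trip looks like <code>enc = tiktoken.get_encoding("gpt2")</code>, then <code>enc.encode(text)</code> and <code>enc.decode(ids)</code>; the encoding name here is just an example.</p>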
<h1 id="heading-2embedding">2.Embedding</h1>
<p><strong>Embedding is the process of converting tokens into vectors.</strong> While a token is just a single number, its vector holds a lot of information about that token 😺</p>
<p><mark>The first layer in an LLM is the embedding layer</mark></p>
<p>Eg <strong>King=623 → tokenisation</strong></p>
<p><strong>623 =[0.12, -0.55, 1.03, ...]→embedding</strong></p>
<p><strong>This vector contains a lot of information about king</strong></p>
<p><strong>semantic info - king is near queen, he is masculine etc…</strong></p>
<p><strong>context info - king is used in contexts like “historic time, war, palace“</strong></p>
<p><strong>grammar info - noun etc…</strong></p>
<p>The best thing is that you don’t create all this info by hand; the LLM learns it during training😁</p>
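<p>Under the hood an embedding layer is just a lookup table: one row of numbers per token ID. A minimal sketch in plain Python, assuming a toy vocabulary size and vector length (the names <code>VOCAB_SIZE</code>, <code>EMBED_DIM</code>, <code>embedding_table</code> and <code>embed</code> are made up here; in PyTorch this job is done by <code>torch.nn.Embedding</code>):</p>
<pre><code class="lang-python">import random

VOCAB_SIZE = 1000   # assumed toy vocabulary size
EMBED_DIM = 4       # assumed toy vector length

rng = random.Random(0)  # fixed seed so the table is reproducible

# One row of EMBED_DIM numbers per token ID. They start random;
# training is what turns them into meaningful vectors.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([623, 12])  # "King", "is" with the illustrative IDs
print(len(vectors), len(vectors[0]))  # 2 4
</code></pre>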
<h1 id="heading-3positional-encoding-123">3.Positional Encoding 1️⃣2️⃣3️⃣</h1>
<p>•An LLM reads <strong>all words at once.</strong> So the model <strong>cannot know which word comes first, second, third…</strong> unless we <strong>add positions</strong>.</p>
<p>•We can add positional info in various ways: <strong>sinusoidal (use sin and cos functions to generate numeric values for each dimension of the vector) or learned positional embeddings (let the LLM learn the positions by itself)</strong></p>
<p><strong>Positional encoding is the process of adding positional information about a token to its embedding vector 💗💗</strong></p>
<p>Ex : cat sat on mat</p>
<ul>
<li><p>“cat” is at <strong>position 0</strong>, “sat” is at <strong>position 1</strong>, “on” is at <strong>position 2</strong>, “mat” is at <strong>position 3</strong></p>
</li>
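<li><p>Each word is converted into a vector:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Word</td><td>Embedding vector (example)</td></tr>
</thead>
<tbody>
<tr>
<td>cat</td><td>[0.3, -0.1, 0.7, ...]</td></tr>
<tr>
<td>sat</td><td>[0.1, 0.4, -0.2, ...]</td></tr>
<tr>
<td>on</td><td>[-0.5, 0.8, 0.1, ...]</td></tr>
<tr>
<td>mat</td><td>[0.6, 0.2, 0.9, ...]</td></tr>
</tbody>
</table>
</div>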
</li>
</ul>
<p>The transformer adds a position vector to each word's embedding, using one of the methods mentioned above:</p>
<p>Example (illustrative values):</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Position</td><td>Positional Vector</td></tr>
</thead>
<tbody>
<tr>
<td>0</td><td>[0.99, 0.12, -0.88, ...]</td></tr>
<tr>
<td>1</td><td>[0.84, 0.52, -0.41, ...]</td></tr>
<tr>
<td>2</td><td>[0.65, 0.78, 0.20, ...]</td></tr>
<tr>
<td>3</td><td>[0.35, 0.93, 0.77, ...]</td></tr>
</tbody>
</table>
</div><h2 id="heading-final-input-vector-embedding-positional-encoding">Final Input Vector = Embedding + Positional Encoding</h2>
<h3 id="heading-for-cat-at-position-0">For “cat” at position 0:</h3>
<pre><code class="lang-python">[<span class="hljs-number">0.3</span>, <span class="hljs-number">-0.1</span>, <span class="hljs-number">0.7</span>, ...]   (meaning)
+
[<span class="hljs-number">0.99</span>, <span class="hljs-number">0.12</span>, <span class="hljs-number">-0.88</span>, ...] (position)
=
[<span class="hljs-number">1.29</span>, <span class="hljs-number">0.02</span>, <span class="hljs-number">-0.18</span>, ...]
</code></pre>
<p>This new vector tells the model:</p>
<blockquote>
<p>"cat" + "I am the first word"</p>
</blockquote>
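<p>Here is a minimal sketch of the sinusoidal method in plain Python. The 4-number embedding for "cat" is made up, and <code>sinusoidal_position</code> is a helper name invented for this post. The position numbers it produces will differ from the illustrative table above, since those were only example values.</p>
<pre><code class="lang-python">import math


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:dim]


cat_embedding = [0.3, -0.1, 0.7, 0.2]  # toy "meaning" vector (made up)
pos0 = sinusoidal_position(0, 4)       # position vector for the first word

# Final input vector = embedding + positional encoding (element by element).
final_input = [e + p for e, p in zip(cat_embedding, pos0)]

print(pos0)  # [0.0, 1.0, 0.0, 1.0]
print(final_input)
</code></pre>
<p>Calling <code>sinusoidal_position(1, 4)</code>, <code>sinusoidal_position(2, 4)</code> and so on gives a different vector for every position, which is what lets the model tell the words apart by order.</p>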
<h1 id="heading-4attention-mechanism">4.Attention Mechanism🩷</h1>
<p><strong>“For the current word I am processing,<br />which other words in the sentence should I pay attention to?”</strong></p>
<p>Ex : cat sat on mat</p>
<p>Suppose the word the LLM is currently processing is cat. Its embedding vector already holds its meaning and its position. But while learning about cat, which other words are most related or necessary to it?</p>
<p><strong>cat - sat</strong>: important, as it is the action of the cat, so we give it <mark>0.6</mark></p>
<p><strong>cat - on</strong>: grammatically, "on" feels less important to cat, so <mark>0.1</mark></p>
<p><strong>cat - cat</strong>: both hold exactly the same info, so <mark>1</mark>, and so on.</p>
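<p>In a real model these raw scores are passed through softmax, so they become positive weights that add up to 1. A small sketch using the scores above; the 0.3 for "mat" is an assumed value, since the example does not give one:</p>
<pre><code class="lang-python">import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["cat", "sat", "on", "mat"]
# Scores for the word "cat" against each word; 0.3 for "mat" is an assumed value.
scores = [1.0, 0.6, 0.1, 0.3]
weights = softmax(scores)

for word, weight in zip(words, weights):
    print(word, round(weight, 3))
</code></pre>
<p>The order stays the same (cat, then sat, then mat, then on), but now the numbers can be used directly as "how much attention to pay".</p>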
<h1 id="heading-5multi-head-attention">5.<strong>Multi-Head Attention☺️🤩😐😏😣🥱</strong></h1>
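<p>Above, we computed attention only once, from the point of view of one word (cat). Multi-head attention runs several attention "heads" in parallel, and each head can focus on a different kind of relationship. For example:</p>
<p>cat sat on mat</p>
<p>head1 - for nouns/subject (mainly concentrates on cat) - [<mark>0.9</mark>,0.1,0.1,0.7]</p>
<p>head2 - for the object (mainly concentrates on mat) - [0.6,0.2,0.1,<mark>0.9</mark>]</p>
<p>head3 - for verb/action (mainly concentrates on sat) - [0.1,<mark>1.2</mark>,0.3,0.1]</p>
<p>These numbers are illustrative; in a trained model the heads learn by themselves what to focus on. Both single-head and multi-head attention follow the <strong>QKV mechanism</strong>:</p>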
<ul>
<li><p><strong>Query (Q) → “What am I looking for?”</strong></p>
</li>
<li><p><strong>Key (K) → “What do I offer?”</strong></p>
</li>
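<li><p><strong>Value (V) → “What information should be passed?”</strong></p>
</li>
<li><p>scores = Q · transpose(K) / sqrt(d_k), where d_k is the length of each key vector</p>
</li>
<li><p>weights = softmax(scores)</p>
</li>
<li><p>Just as an ordinary layer computes output = input*weight + bias, attention computes a weighted sum of the values:</p>
</li>
<li><p><strong>y = weights · V</strong> (any bias lives in the linear layers that produce Q, K, V and the final output, not in this formula)</p>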
</li>
</ul>
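<p>The QKV steps can be sketched for a single head in plain Python. This is a simplified illustration, not the notebook's code: the Q, K, V numbers are made up, the function name <code>attention</code> is invented here, and it leaves out the causal mask that a GPT-style model adds so a token cannot look at later tokens.</p>
<pre><code class="lang-python">import math


def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # scores = q . k / sqrt(d_k), one score per key
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # output = weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs


# Toy Q, K, V for 3 tokens with 2 numbers each (made-up values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

result = attention(Q, K, V)
for row in result:
    print([round(x, 2) for x in row])
</code></pre>
<p>Multi-head attention repeats this with separate Q, K, V for each head and then joins the head outputs together.</p>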
<h1 id="heading-6feed-forward"><strong>6.Feed Forward🍓🍒</strong></h1>
<p>Inside a transformer block:</p>
<ul>
<li><p><strong>Attention = TALK</strong> (token talks with other tokens)</p>
</li>
<li><p><strong>Feed Forward = THINK</strong> (token thinks alone)</p>
</li>
</ul>
<p>👉 FFN works <strong>on each token individually</strong>, no talking between tokens.</p>
<p>It is simply a <strong>tiny 2-layer neural network</strong> applied to <em>every token’s vector</em>.</p>
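<p>A rough sketch of that tiny 2-layer network, applied token by token. The weights are made-up numbers (a real model learns them), the helper names are invented for this post, and ReLU is used as the non-linearity only to keep the arithmetic easy; GPT-style models often use GELU instead.</p>
<pre><code class="lang-python">def linear(x, weights, bias):
    """One linear layer: each row of `weights` feeds one output unit."""
    return [
        sum(xi * w for xi, w in zip(x, row)) + b
        for row, b in zip(weights, bias)
    ]


def relu(x):
    """Keep positive values, replace negative values with 0."""
    return [max(0.0, v) for v in x]


def feed_forward(x, w1, b1, w2, b2):
    """Expand, apply the non-linearity, then project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# Toy weights: 2 inputs, 3 hidden units, 2 outputs (made-up numbers).
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b2 = [0.0, 0.0]

tokens = [[5.0, 2.0], [2.0, 9.0]]
# The same network runs on each token separately: no talking between tokens.
outputs = [feed_forward(tok, w1, b1, w2, b2) for tok in tokens]
print(outputs)  # [[8.0, 5.0], [2.0, 9.0]]
</code></pre>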
<h1 id="heading-7residual-connection">7.Residual connection 🤝</h1>
<p>Residual connection = <strong>Memory + New Knowledge</strong></p>
<p>Residual connection does:</p>
<pre><code class="lang-python">output = original + new_output
</code></pre>
<p>That means:</p>
<ul>
<li><p>Keep the original meaning <strong>as it is</strong></p>
</li>
<li><p>Add only the improvements</p>
</li>
</ul>
<hr />
<h2 id="heading-super-simple-example">🧠 <strong>Super Simple Example</strong></h2>
<p>Input vector (simplified):</p>
<pre><code class="lang-python">x = [<span class="hljs-number">5</span>,<span class="hljs-number">2</span>]
</code></pre>
<p>Attention output:</p>
<pre><code class="lang-python">attn_out = [<span class="hljs-number">2</span>,<span class="hljs-number">9</span>]
</code></pre>
<p>Residual connection:</p>
<pre><code class="lang-python">x = [a + b <span class="hljs-keyword">for</span> a, b <span class="hljs-keyword">in</span> zip(x, attn_out)]  <span class="hljs-comment"># element-wise sum: [7, 11]</span>
</code></pre>
<h1 id="heading-8layer-normalization-layernorm">8.Layer Normalization (LayerNorm)💙💙</h1>
<p>👉 <strong>Simple Meaning: “Clean and balance the values before processing.”</strong></p>
<p>🎯 <strong>Make values have mean = 0 and variance = 1</strong></p>
<p>Example: [87, 90, 22, 45, 100, 10] becomes roughly [0.80, 0.88, -1.05, -0.40, 1.17, -1.40] (the mean is 59 and the standard deviation is about 35.1).</p>
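<p>The normalisation can be checked by hand for the six values used above. A minimal sketch in plain Python; it leaves out the learned scale and shift that a real LayerNorm also applies, and <code>layer_norm</code> is a helper written for this post:</p>
<pre><code class="lang-python">import math


def layer_norm(values, eps=1e-5):
    """Shift to mean 0 and scale to variance 1 (learned scale/shift omitted)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


normed = layer_norm([87, 90, 22, 45, 100, 10])
print([round(v, 2) for v in normed])  # [0.8, 0.88, -1.05, -0.4, 1.17, -1.4]
</code></pre>
<p>Values above the mean (59) come out positive and values below it come out negative, all on a similar small scale.</p>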
<h1 id="heading-9logits">9.Logits</h1>
<p>In the output layer, the model finally gives a prediction score (a logit) to every token in the vocabulary, based on the previous tokens. A softmax turns these scores into probabilities, and the highest-scoring tokens are the most likely next tokens.</p>
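<p>A small sketch of going from logits to a next token. The 4-word vocabulary and the scores are made up, and picking the single best token ("greedy" choice) is only one option; generation code often samples from the probabilities instead.</p>
<pre><code class="lang-python">import math

vocab = ["cat", "sat", "on", "mat"]   # toy vocabulary (made up)
logits = [2.0, 0.5, 0.1, 1.0]         # one raw score per vocabulary entry (made up)

# Softmax: turn scores into probabilities that sum to 1.
exps = [math.exp(score - max(logits)) for score in logits]
probs = [e / sum(exps) for e in exps]

# Greedy choice: pick the token with the highest probability.
best_index = probs.index(max(probs))
print(vocab[best_index])  # cat
</code></pre>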
<h1 id="heading-10what-happens-in-forward-and-backward-pass"><strong>10.What happens in forward and backward pass➡️⬆️➡️</strong></h1>
<h3 id="heading-forward-pass"><strong>Forward pass</strong></h3>
<ul>
<li><p>The model predicts logits for the next token</p>
</li>
<li><p>Compares the logits with the target tokens Y</p>
</li>
<li><p>Computes <strong>loss</strong></p>
</li>
<li><p>Loss = “how wrong the predictions were”</p>
</li>
</ul>
<h3 id="heading-backward-pass"><strong>Backward pass</strong></h3>
<ul>
<li><p>Model computes <strong>gradients</strong> (how much each weight contributed to the error)</p>
</li>
<li><p>This is the model <strong>learning from mistakes</strong></p>
</li>
<li><p>Optimizer uses the gradients to <strong>update weights in a direction that reduces loss</strong></p>
</li>
</ul>
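<p>One forward pass, backward pass and update can be sketched on a toy example. This is a big simplification: the three logits themselves are treated as the values being trained, while in the real model the gradients flow back into the weights through <code>loss.backward()</code>. All numbers and names below are made up for illustration.</p>
<pre><code class="lang-python">import math


def softmax(logits):
    exps = [math.exp(v - max(logits)) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Loss = -log(probability given to the correct token)."""
    return -math.log(softmax(logits)[target])


logits = [1.0, 2.0, 0.5]   # toy scores for a 3-token vocabulary (made up)
target = 0                 # index of the correct next token (Y)
learning_rate = 0.5        # assumed step size

# Forward pass: compute the loss ("how wrong the prediction was").
loss_before = cross_entropy(logits, target)

# Backward pass: for softmax + cross-entropy, gradient = probs - one_hot(target).
probs = softmax(logits)
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Update: move each value a small step against its gradient.
logits = [v - learning_rate * g for v, g in zip(logits, grads)]
loss_after = cross_entropy(logits, target)

print(round(loss_before, 4), round(loss_after, 4))
</code></pre>
<p>The second number printed is smaller than the first: after one update the correct token gets a higher probability, so the loss goes down.</p>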
]]></content:encoded></item></channel></rss>