Before we start with LLMs, we first need to talk about AI (Artificial Intelligence), because if an LLM is a kid, then AI is its mom 😅

AI

When a machine does things that normally need human intelligence (understanding speech, recognising emotions, making decisions), we call it AI.

Eg: Your friend can tell whether you are happy or sad from your voice and the way you speak. Nowadays many AI bots can do the same kind of emotion detection from an audio clip and say whether you are happy or sad. AI is in its peak era, and it is all happening 😅😅

What is LLM

•A model (a program: simply a set of code with a few algorithms and a lot of learned numbers called parameters) that can understand and generate human language. It is a part of AI, as it behaves like a human. Technically, it is a neural network trained to predict the next token (word-piece / subword) given the prior tokens.💻💻💗💗

In earlier days we learnt that humans can speak Tamil, Telugu, Malayalam, Kannada, English, Spanish, etc., but a machine knows only 0️⃣ and 1️⃣. Now we can see that a single LLM can work with many of these languages, probably more than any one human can, I bet😁

SLM vs LLM 😉

| Feature | 🤖 LLM | ⚡ SLM (Small Language Model) |
| --- | --- | --- |
| Size | Massive: billions to trillions of parameters. | Small: millions to a few billion parameters. |
| Knowledge | Generalist: broad knowledge across many topics. | Specialist: usually tuned for a specific, narrow topic. |
| Hardware | Huge: typically needs a data center with many powerful GPUs. | Local: can run on a laptop, smartphone, or edge device. |
| Speed | Slower: higher latency (takes time to "think"). | Very fast: low latency (ideal for real-time apps). |
| Cost | Expensive: costs millions to train and is costly to run. | Cheap: inexpensive to train and run. |
| Privacy | Lower: you usually send your data to a company's cloud. | Higher: can run 100% on your local device, with no data shared. |
| Best for... | Complex reasoning, creative content, broad research. | Specific tasks: chatbots, summarization, on-device AI. |

Some real-world examples of LLM

The joke is that we use LLMs every day without knowing they are LLMs😺

ChatGPT, DeepSeek, Gemini, Perplexity, Meta AI, GitHub Copilot

Different methods of building LLM😁

| Method | Primary Tool / Example | Effort | Cost | Customization |
| --- | --- | --- | --- | --- |
| 1. The API Method | Google Gemini API, OpenAI API | Low | Pay-per-use (can get expensive) | Low (limited to prompt engineering) |
| 2. The Local/Fine-Tuning Method | Ollama, Hugging Face, Unsloth | Medium | Free (requires a good GPU) | High (you can change the model's knowledge) |
| 3. The "From Scratch" Method | PyTorch, TensorFlow, custom GPU clusters | Extremely high | $10M to $100M+ | Total (you define everything) |

Characteristics of LLM 🎀

•Mainly uses Python

•Frameworks to simplify our job: TensorFlow (TF) and PyTorch

•Get pretrained transformer models from the Hugging Face hub (for example, the Llama family of models)

•A library that helps us easily connect an LLM with our application: LangChain

•LLMs are pre-trained on an enormous and diverse dataset, typically a large portion of the internet, books, Wikipedia, and code.

Pytorch 🐦‍🔥

It is a large framework that provides what we need to build, train, and deploy an AI model. It has lots of modules or packages within it. Instead of PyTorch you can also use TF (TensorFlow).

Eg modules and classes within PyTorch: nn, Tensor, nn.functional, optim, autograd 🩷

Creating an LLM is like baking a cake, with a lot of processes within it. We need to understand them before stepping into coding, so let's see them one by one🎂🍰

1.Tokenisation

To train an LLM to understand and generate human language, we give it large datasets to learn from and find patterns in. Usually these are books, news articles, Wikipedia, code, etc. But the model cannot take raw text directly as input; it needs numerical values to analyse patterns and learn to generate💻💻💻

•**Tokenisation is the process of converting text into tokens, each mapped to a number (token ID).** Ready-made tokenisers work like a dictionary that maps many thousands of words and word-pieces to numbers. We can use one instead of manually building a vocabulary, which would take years… Ex: Tiktoken (illustrative IDs: King=623, Queen=598, is=12)
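To make the idea concrete, here is a minimal sketch of a toy word-level tokeniser. The vocabulary `TOY_VOCAB`, the `<unk>` entry, and the helper names `encode` and `decode` are made up for illustration; the IDs are the illustrative ones from the example above, not real Tiktoken output. Real tokenisers split text into subword pieces instead of whole words.

```python
# Toy word-level tokeniser (illustration only, not how Tiktoken works inside).
# IDs are the illustrative ones from the text; "<unk>" is a hypothetical
# fallback for words that are not in the vocabulary.
TOY_VOCAB = {"<unk>": 0, "is": 12, "queen": 598, "king": 623}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(text):
    """Turn text into a list of token IDs, one per whitespace-separated word."""
    return [TOY_VOCAB.get(word, TOY_VOCAB["<unk>"]) for word in text.lower().split()]


def decode(token_ids):
    """Turn token IDs back into (lower-cased) text."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in token_ids)


token_ids = encode("King is Queen")
print(token_ids)          # [623, 12, 598]
print(decode(token_ids))  # king is queen
```

Any word outside the toy vocabulary falls back to the `<unk>` ID, which is one reason real tokenisers prefer subword pieces: almost any text can then be built from known pieces.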

2.Embedding

Embedding is the process of converting a token ID into a vector. A token ID is just a single number, while its vector holds a lot of information about that token 😺

The first layer in an LLM is the embedding layer

Eg King=623 → tokenisation

623 = [0.12, -0.55, 1.03, ...] → embedding

This vector contains a lot of information about king

semantic info - king is near queen, is masculine, etc…

context info - king appears in contexts like “historic time, war, palace”

grammar info - noun etc…

The best thing is you don’t create all this info by hand; the LLM learns these vectors during training😁
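An embedding layer is basically a lookup table: one row of numbers per token ID. A minimal sketch, assuming a tiny 4-dimensional embedding and the illustrative token IDs from earlier (the names `EMBED_DIM`, `embedding_table`, and `embed` are hypothetical). The values start random here; in a real model, training adjusts them until they carry meaning.

```python
import random

random.seed(0)  # fixed seed so the toy table is reproducible

EMBED_DIM = 4                 # real models use hundreds or thousands of dimensions
VOCAB_IDS = [12, 598, 623]    # illustrative IDs for "is", "queen", "king"

# One row per token ID, randomly initialised. Training would update these rows.
embedding_table = {
    token_id: [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for token_id in VOCAB_IDS
}


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([623, 12, 598])  # "king is queen"
for token_id, vector in zip([623, 12, 598], vectors):
    print(token_id, [round(v, 2) for v in vector])
```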

3.Positional Encoding 1️⃣2️⃣3️⃣

•An LLM reads all the words at once. So the model **cannot know which word comes first, second, third…** unless we add positions.

•We can add positional info in various ways: sinusoidal (use sin and cos functions to generate a numeric value for each dimension of the vector) or learned positional embeddings (let the LLM learn the position vectors by itself).

Positional encoding is the process of adding positional information about a token to its embedding vector 💗💗

Ex : cat sat on mat

“cat” is at position 0, “sat” is at position 1, “on” is at position 2, “mat” is at position 3

Each word is converted into a vector:

| Word | Embedding vector (example) |
| --- | --- |
| cat | [0.3, -0.1, 0.7, ...] |
| sat | [0.1, 0.4, -0.2, ...] |
| on | [-0.5, 0.8, 0.1, ...] |
| mat | [0.6, 0.2, 0.9, ...] |

The transformer adds a position vector for each word, using one of the methods mentioned above:

Example (illustrative values):

| Position | Positional vector |
| --- | --- |
| 0 | [0.99, 0.12, -0.88, ...] |
| 1 | [0.84, 0.52, -0.41, ...] |
| 2 | [0.65, 0.78, 0.20, ...] |
| 3 | [0.35, 0.93, 0.77, ...] |

Final Input Vector = Embedding + Positional Encoding

For “cat” at position 0:

[0.3, -0.1, 0.7, ...]   (meaning)
+
[0.99, 0.12, -0.88, ...] (position)
=
[1.29, 0.02, -0.18, ...]

This new vector tells the model:

"cat" + "I am the first word"

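The sinusoidal method can be sketched in a few lines. This follows the commonly used sin/cos formula with a base of 10000; the function names and the 4-dimensional "cat" embedding (its fourth value in particular) are assumptions for illustration. Note that the positional vectors in the example table are illustrative, so the numbers computed here will differ from them.

```python
import math


def sinusoidal_position(pos, dim):
    """Sinusoidal position vector for one position (dim is assumed even).

    Even indices use sin, odd indices use cos, with wavelengths that grow
    along the vector.
    """
    vector = []
    for i in range(0, dim, 2):
        angle = pos / (10000 ** (i / dim))
        vector.append(math.sin(angle))
        vector.append(math.cos(angle))
    return vector


def add_position(embedding, pos):
    """Final input vector = embedding + positional encoding."""
    position_vector = sinusoidal_position(pos, len(embedding))
    return [e + p for e, p in zip(embedding, position_vector)]


cat_embedding = [0.3, -0.1, 0.7, 0.2]  # hypothetical 4-dim embedding for "cat"
for pos in range(4):
    print(pos, [round(v, 3) for v in sinusoidal_position(pos, 4)])

print([round(v, 3) for v in add_position(cat_embedding, 0)])
```

Because every position gets a different vector, the same word at two different positions ends up with two different final input vectors.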
4.Attention Mechanism🩷

“For the current word I am processing,
which other words in the sentence should I pay attention to?”

Ex : cat sat on mat

Suppose the word the LLM is currently processing is cat. Its vector already holds its meaning (embedding) and its position. But while the model learns about cat, which other words are most related or necessary to it?

cat - sat: important, as it is the action of the cat, so we give 0.6

cat - on: grammatically, “on” feels less important to “cat”, so 0.1

cat - cat: both hold exactly the same info, so 1, etc. (These are illustrative raw scores; a softmax step turns them into weights that sum to 1.)
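Here is a small sketch of how raw scores like these become attention weights. The scores for cat, sat, and on are the illustrative ones above; the score of 0.3 for "mat" is a made-up value, since the example does not give one.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["cat", "sat", "on", "mat"]
cat_scores = [1.0, 0.6, 0.1, 0.3]  # 0.3 for "mat" is a hypothetical value
cat_weights = softmax(cat_scores)

for word, weight in zip(words, cat_weights):
    print(f"cat -> {word}: {weight:.2f}")
```

The ordering stays the same (cat, then sat, then mat, then on), but now the numbers can be used as proportions when mixing information from the other words.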

5.Multi-Head Attention☺️🤩😐😏😣🥱

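Above, we looked at the sentence in only one way (a single head), from the point of view of cat. Here we create several heads, and each head looks at the same sentence in a different way, like:

cat sat on mat

head1 - for nouns/subject (mainly concentrates on cat) - [0.9, 0.1, 0.1, 0.7]

head2 - for object (mainly concentrates on mat) - [0.6, 0.2, 0.1, 0.9]

head3 - for verb/action (mainly concentrates on sat) - [0.1, 1.2, 0.3, 0.1]

These roles and scores are only for intuition. We do not assign a role to each head by hand; each head learns what to focus on during training.

Both single-head and multi-head attention follow the QKV mechanism

Query (Q) → “What am I looking for?”
Key (K) → “What do I offer?”
Value (V) → “What information should be passed?”

scores = Q * transpose(K) / sqrt(d_k)
weights = softmax(scores)
output = weights * V

Here d_k is the size of each key vector. In multi-head attention, every head does this with its own Q, K and V; the outputs of all heads are then joined together and passed through one more linear layer (weight and bias).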
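A minimal sketch of one attention head in plain Python, using the standard scaled dot-product form: scores are Q times K transposed, divided by the square root of the key size, then softmax, then multiplied by V. The tiny Q, K, V matrices are hypothetical numbers for the four tokens of "cat sat on mat"; in a real model they come from learned linear layers.

```python
import math


def softmax(row):
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix multiplication for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]   # one row of weights per token
    return matmul(weights, v), weights


# Hypothetical 2-dimensional Q, K, V for the tokens: cat, sat, on, mat
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

output, weights = attention(Q, K, V)
for token, row in zip(["cat", "sat", "on", "mat"], weights):
    print(token, [round(w, 2) for w in row])
```

Multi-head attention would run `attention` several times with different Q, K, V per head and then join the outputs; that part is left out here to keep the sketch short.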

6.Feed Forward🍓🍒

Inside a transformer block:

Attention = TALK (token talks with other tokens)
Feed Forward = THINK (token thinks alone)

👉 FFN works on each token individually, no talking between tokens.

It is simply a tiny 2-layer neural network applied to every token’s vector.
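A minimal sketch of such a 2-layer network, applied to each token separately. The weights `W1`, `B1`, `W2`, `B2` are made-up numbers, and ReLU is used as the activation only to keep things simple (many real models use other activations such as GELU).

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def linear(vector, weights, bias):
    """One linear layer: weights has one row per output feature."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def feed_forward(token_vector, w1, b1, w2, b2):
    """Expand to a hidden size, apply the activation, project back."""
    hidden = relu(linear(token_vector, w1, b1))
    return linear(hidden, w2, b2)


# Hypothetical weights: 2 inputs -> 4 hidden units -> 2 outputs
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.5, 0.0]]
B2 = [0.1, -0.1]

tokens = [[5.0, 2.0], [-1.0, 3.0]]

# Each token goes through the same network on its own: no talking between tokens.
outputs = [feed_forward(token, W1, B1, W2, B2) for token in tokens]
print(outputs)
```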

7.Residual connection 🤝

Residual connection = Memory + New Knowledge

Residual connection does:

output = original + new_output

That means:

Keep the original meaning as it is
Add only the improvements

🧠 Super Simple Example

Input vector (simplified):

x = [5,2]

Attention output:

attn_out = [2,9]

Residual connection:

x = x + attn_out = [7,11]
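The same example as a tiny sketch in code. The helper `with_residual` and the stand-in `fake_attention` function are hypothetical; in a real transformer block the sublayer would be the attention or feed-forward computation.

```python
def with_residual(x, sublayer):
    """Return x + sublayer(x): keep the original and add the new output."""
    new_output = sublayer(x)
    return [original + new for original, new in zip(x, new_output)]


def fake_attention(x):
    # Stand-in for a real attention layer; always returns the example output.
    return [2, 9]


x = [5, 2]
x = with_residual(x, fake_attention)
print(x)  # [7, 11]
```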

8.Layer Normalization (LayerNorm)💙💙

👉 Simple Meaning: “Clean and balance the values before processing.”

🎯 Make the values of each token's vector have mean = 0 and variance = 1 (the model then applies a learned scale and shift)

[87, 90, 22, 45, 100, 10] becomes roughly [0.80, 0.88, -1.05, -0.40, 1.17, -1.40]
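The normalisation itself is just "subtract the mean, divide by the standard deviation". A minimal sketch on the example values, leaving out the learned scale and shift that a real LayerNorm also has; the small `eps` value is a common choice to avoid dividing by zero and is an assumption here.

```python
import math


def layer_norm(values, eps=1e-5):
    """Normalise one vector to mean 0 and variance 1 (no learned scale/shift)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


normalized = layer_norm([87, 90, 22, 45, 100, 10])
print([round(v, 2) for v in normalized])  # [0.8, 0.88, -1.05, -0.4, 1.17, -1.4]
```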

9.Logits

In the output (OP) layer, the model finally predicts the next token from the previous tokens: for every token in the vocabulary it gives a raw prediction score, called a logit. Softmax turns these logits into probabilities, and the highest-scoring tokens are the most likely next words.
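A small sketch of going from logits to a predicted next token. The three-word vocabulary and the logit values are made up for illustration; a real model produces one logit for every token in a vocabulary of many thousands.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["mat", "sofa", "moon"]   # hypothetical tiny vocabulary
logits = [2.0, 1.0, 0.1]          # hypothetical scores after "cat sat on"

probs = softmax(logits)
best_index = max(range(len(vocab)), key=lambda i: probs[i])
next_token = vocab[best_index]

for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")
print("predicted next token:", next_token)
```

Picking the single highest-scoring token is the simplest choice; models can also sample from the probabilities to get more varied text.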

10.What happens in forward and backward pass➡️⬆️➡️

Forward pass

The model predicts the logits for the next token
Compares the logits with Y (the correct next tokens from the training data)
Computes the loss
Loss = “how wrong the predictions were”

Backward pass

Model computes gradients (how much each weight contributed to the error)
This is the model learning from mistakes
Optimizer uses the gradients to update weights in a direction that reduces loss
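One forward pass, one backward pass, and one update can be sketched by hand for a single prediction. To keep it tiny, this sketch treats the three logits themselves as the trainable numbers (a real model updates the weights that produce the logits, and a framework like PyTorch computes the gradients automatically). The values, the target index, and the learning rate are all made up.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Forward pass: loss is low when the correct token gets high probability."""
    return -math.log(softmax(logits)[target])


def gradient(logits, target):
    """Backward pass: gradient of the loss w.r.t. each logit (probs minus one-hot)."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


logits = [2.0, 1.0, 0.1]   # hypothetical trainable values
target = 1                 # index of the correct next token (Y)
learning_rate = 0.5

loss_before = cross_entropy(logits, target)
grads = gradient(logits, target)

# Optimizer step: move each value a little in the direction that reduces loss.
logits = [z - learning_rate * g for z, g in zip(logits, grads)]
loss_after = cross_entropy(logits, target)

print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

Training repeats this loop over a huge number of examples, and the loss slowly goes down.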

Unveiling Myths: Understanding and Building a Large Language Model💻
