Unveiling Myths: Understanding and Building a Large Language Model💻
Understanding: The Beginning
I am Sudharshini, a dynamic force in the world of technology and creativity, currently pursuing my MSc in Software Systems. With a passion for problem-solving, I have not only honed my skills but emerged victorious in numerous hackathons, showcasing my prowess in the ever-evolving realm of software development.
Beyond the lines of code, I am also a wordsmith, having penned and published two captivating books that reflect my diverse interests. My ability to weave narratives demonstrates a depth of creativity that extends beyond the digital domain.
I am also a national-level champion in both Silambam and Adimurai, showcasing my physical prowess and discipline. Whether it's mastering the intricacies of software architecture or gracefully wielding traditional weapons, I embody a perfect blend of the modern and the traditional.
In a world where versatility is key, I stand out as a multifaceted individual, seamlessly navigating the realms of technology, literature, and martial arts with finesse. My journey is not just a narrative of achievements but a testament to the limitless possibilities that arise when one embraces a holistic approach to life.
To start with LLMs, we first need to talk about AI (Artificial Intelligence), because if an LLM is a kid, it came from its mom, and that is AI😅
AI
When machines behave like humans, we call it AI
Eg: Your friend can understand whether you are happy or sad from your voice and way of speaking. Nowadays many AI bots can do the same emotion detection from an audio clip and say whether you are happy or sad. AI is in its peak era, and that's all happening😅😅
What is an LLM
•A model or a program (simply a set of code with a few algorithms) that can understand and generate human language. It is absolutely a part of AI (as it behaves like a human). More precisely, it is a neural network trained to predict the next token (word-piece / subword) given the prior tokens.💻💻💗💗
In earlier days we learnt that humans can talk in Tamil, Telugu, Malayalam, Kannada, English, Spanish etc., but a machine knows only 0️⃣ and 1️⃣. Now we can see that LLMs handle many languages, more than most humans can, I bet😁
SLM vs LLM 😉
| Feature | 🤖 LLM | ⚡ SLM |
| --- | --- | --- |
| Size | Massive: Billions to Trillions of parameters. | Small: Millions to a few Billion parameters. |
| Knowledge | Generalist: Broad knowledge across many topics. | Specialist: Expert on a specific, narrow topic. |
| Hardware | Huge: Needs a data center with many powerful GPUs. | Local: Can run on a laptop, smartphone, or edge device. |
| Speed | Slower: High latency (takes time to "think"). | Very Fast: Low latency (ideal for real-time apps). |
| Cost | Expensive: Costs millions to train and is costly to run. | Cheap: Inexpensive to train and run. |
| Privacy | Low: You usually send your data to a company's cloud. | High: Can run 100% on your local device. No data shared. |
| Best For... | Complex reasoning, creative content, broad research. | Specific tasks: chatbots, summarization, on-device AI. |
Some real-world examples of LLMs
The joke is we use LLMs without knowing they are LLMs😺
ChatGPT, DeepSeek, Gemini, Perplexity, Meta AI, GitHub Copilot
Different methods of building an LLM😁
| Method | Primary Tool / Example | Effort | Cost | Customization |
| --- | --- | --- | --- | --- |
| 1. The API Method | Google Gemini API, OpenAI API | Low | Pay-per-use (can get expensive) | Low (Limited to prompt engineering) |
| 2. The Local/Fine-Tuning Method | Ollama, Hugging Face, Unsloth | Medium | Free (requires a good GPU) | High (You can change the model's knowledge) |
| 3. The "From Scratch" Method | PyTorch, TensorFlow, Custom GPU Clusters | Extremely High | $10M - $100M+ (Millions) | Total (You define everything) |
Characteristics of LLM 🎀
•Mainly uses Python
•Frameworks that simplify our job: TensorFlow (TF) and PyTorch
•Get pretrained models/transformers from Hugging Face (for example, the Llama family of models)
•A library that helps us easily connect an LLM with our application: LangChain
•LLMs are pre-trained on an enormous and diverse dataset, typically a large portion of the internet, books, Wikipedia and code.
PyTorch 🐦🔥
It is a large framework which provides a system to build, train and deploy AI models. It has lots of modules and packages within it. Instead of PyTorch you can also use TF (TensorFlow)
Eg modules and building blocks within PyTorch: nn, Tensor, nn.functional, optim, autograd 🩷
Creating an LLM is like baking a cake, with a lot of processes within it. We need to understand them before stepping into coding. Let's see them one by one🎂🍰
1.Tokenisation
To train the LLM and make it understand and generate human language, we give it large datasets to learn from and find patterns in. Usually these are books, news articles, Wikipedia, code etc. But the model cannot directly take raw text as input; it needs numerical values to analyse patterns and learn to generate💻💻💻
•**Tokenisation is the process of converting text into tokens/numbers.** We have various ready-made tokenisers that work like a dictionary mapping words and word-pieces to numbers (real vocabularies usually hold tens of thousands to a few hundred thousand entries). We can use one of them instead of manually creating tokens, which would take years. Ex: Tiktoken (illustrative IDs: King=623, Queen=598, is=12)
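To make the idea concrete, here is a minimal sketch of a word-level tokeniser. It is only a toy: real tokenisers such as Tiktoken split text into subword pieces and come with a ready-made vocabulary. The function names `build_vocab`, `encode` and `decode`, and the IDs they produce, are assumptions made up for this illustration.

```python
# Toy word-level tokeniser (illustration only; real tokenisers work on subwords).

def build_vocab(corpus):
    """Give every distinct word in the corpus its own integer ID."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for words we have never seen
    for word in corpus.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids, vocab):
    """List of token IDs -> text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab("the king is here and the queen is there")
ids = encode("the queen is here", vocab)
print(ids)                 # [1, 6, 3, 4]
print(decode(ids, vocab))  # the queen is here
```

Any word that is not in the vocabulary maps to `<unk>`. That is exactly the weakness subword tokenisers avoid, because they can build unknown words out of smaller known pieces.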
2.Embedding
Embedding is the process of converting a token into a vector. While a simple token holds just a numerical value, its vector holds a lot of information about that token 😺
The first layer in an LLM is the embedding layer
Eg King=623 → tokenisation
623 = [0.12, -0.55, 1.03, ...] → embedding
This vector contains a lot of information about king
semantic info - king is near queen, he is masculine etc…
context info - king is used in contexts like “historic times, war, palace”
grammar info - noun etc…
The best thing is you don’t create all this info yourself; the LLM learns it during training😁
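Under the hood, an embedding layer is basically a big lookup table: one row of numbers per token ID. Here is a minimal sketch in plain Python. The vocabulary size, the vector length and the name `embedding_table` are assumptions for illustration, and the values stay random here because nothing is being trained.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

VOCAB_SIZE = 1000  # assumed vocabulary size
EMBED_DIM = 4      # assumed vector length (real models use hundreds or thousands)

# One row of EMBED_DIM numbers per token ID. In a real model these values
# start random and are adjusted during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[t] for t in token_ids]

vectors = embed([623, 598, 12])  # the illustrative IDs for King, Queen, is
print(len(vectors), len(vectors[0]))  # 3 4
```

In PyTorch the same job is done by an embedding module, and its rows are updated by the optimizer like any other weight.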
3.Positional Encoding 1️⃣2️⃣3️⃣
•An LLM reads all words at once. So the model **cannot know which word comes first, second, third…** unless we add positions.
•We can add positional info in various ways: sinusoidal (use sin and cos functions to generate numeric values based on the position and the dimension of the vector) or learned positional embeddings (let the LLM learn the positions by itself)
Positional Encoding is the process of adding positional information about a token to its embedding vector 💗💗
Ex : cat sat on mat
“cat” is at position 0, “sat” is at position 1, “on” is at position 2, “mat” is at position 3
Each word is converted into a vector:
| Word | Embedding vector (example) |
| --- | --- |
| cat | [0.3, -0.1, 0.7, ...] |
| sat | [0.1, 0.4, -0.2, ...] |
| on | [-0.5, 0.8, 0.1, ...] |
| mat | [0.6, 0.2, 0.9, ...] |
The transformer adds a position vector for each word, using one of the methods mentioned above:
Example:
| Position | Positional Vector |
| --- | --- |
| 0 | [0.99, 0.12, -0.88, ...] |
| 1 | [0.84, 0.52, -0.41, ...] |
| 2 | [0.65, 0.78, 0.20, ...] |
| 3 | [0.35, 0.93, 0.77, ...] |
Final Input Vector = Embedding + Positional Encoding
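The sinusoidal method and the addition step can be sketched as follows. This assumes the common sin/cos formula with a base of 10000, and the 4-dimensional embeddings are made up, so the numbers it prints will differ from the illustrative values in the table above. The function names are hypothetical.

```python
import math

def sinusoidal_encoding(position, dim):
    """Positional vector: sin on even indices, cos on odd indices."""
    vector = []
    for i in range(dim):
        # Each pair of indices (0,1), (2,3), ... shares one frequency.
        angle = position / (10000 ** ((i - i % 2) / dim))
        vector.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vector

def add_position(embedding, position):
    """Final input vector = embedding + positional encoding."""
    pos_vec = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pos_vec)]

# Made-up 4-dimensional embeddings for "cat sat on mat".
embeddings = [
    [0.3, -0.1, 0.7, 0.2],   # cat
    [0.1, 0.4, -0.2, 0.5],   # sat
    [-0.5, 0.8, 0.1, -0.3],  # on
    [0.6, 0.2, 0.9, 0.0],    # mat
]
inputs = [add_position(vec, pos) for pos, vec in enumerate(embeddings)]
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because every position gets a different pattern of sin and cos values, the same word ends up with a different input vector depending on where it sits in the sentence.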
For “cat” at position 0:
[0.3, -0.1, 0.7, ...] (meaning)
+
[0.99, 0.12, -0.88, ...] (position)
=
[1.29, 0.02, -0.18, ...]
This new vector tells the model:
"cat" + "I am the first word"
4.Attention Mechanism🩷
“For the current word I am processing,
which other words in the sentence should I pay attention to?”
Ex : cat sat on mat
Say the word the LLM is currently processing is cat. Its vector already holds its meaning (embedding) and its position. But while learning about cat, which other words are most related or necessary to it?
cat-sat: important, as it is the action of the cat, so we give 0.6
cat-on: grammatically, “on” feels less important to “cat”, so 0.1
cat-cat: both hold exactly the same info, so 1 etc. (these are illustrative raw scores; softmax later turns them into weights that sum to 1)
5.Multi-Head Attention☺️🤩😐😏😣🥱
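Above we used only one set of attention scores. Multi-head attention runs several attention “heads” in parallel, and each head can focus on a different kind of relationship (in a real model the heads learn their own roles; the labels below are only for intuition), like
cat sat on mat
head1 - for nouns/subject (mainly concentrates on cat) - [0.9, 0.1, 0.1, 0.7]
head2 - for object (mainly concentrates on mat) - [0.6, 0.2, 0.1, 0.9]
head3 - for verb/action (mainly concentrates on sat) - [0.1, 1.2, 0.3, 0.1]
Both single-head and multi-head attention follow the QKV mechanism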
Query (Q) → “What am I looking for?”
Key (K) → “What do I offer?”
Value (V) → “What information should be passed?”
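scores = Q * transpose(K) / sqrt(d_k)
weights = softmax(scores)
output = weights * V
(d_k is the size of each key vector. There is no bias in the attention formula itself; any bias lives in the linear layers that create Q, K and V.)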
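Here is a minimal sketch of the Q, K, V recipe for a single head, in plain Python. It follows the standard scaled dot-product form: scores are divided by the square root of the key size, and there is no bias term. The tiny Q, K, V values are made up. In a real model they come from learned linear layers applied to the token vectors.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.

    Q, K, V are lists of vectors, one vector per token.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # How well does this query match every key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Made-up 2-dimensional Q, K, V for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for row in attention(Q, K, V):
    print([round(x, 3) for x in row])
```

Multi-head attention repeats this with a separate Q, K, V per head and then joins the head outputs together.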
6.Feed Forward🍓🍒
Inside a transformer block:
Attention = TALK (token talks with other tokens)
Feed Forward = THINK (token thinks alone)
👉 FFN works on each token individually, no talking between tokens.
It is simply a tiny 2-layer neural network applied to every token’s vector.
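A minimal sketch of that 2-layer network, applied to each token on its own. The sizes, the random weights and the choice of ReLU as the activation are assumptions for illustration (many LLMs use GELU or similar instead), and the helper names are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

D_MODEL = 4   # size of each token vector (assumed)
D_HIDDEN = 8  # size of the hidden layer (assumed; real models often use about 4x D_MODEL)

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W1, b1 = random_matrix(D_MODEL, D_HIDDEN), [0.0] * D_HIDDEN
W2, b2 = random_matrix(D_HIDDEN, D_MODEL), [0.0] * D_MODEL

def linear(x, W, b):
    """y = x @ W + b for a single vector x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(token_vector):
    """Expand, apply a non-linearity, then project back to D_MODEL."""
    return linear(relu(linear(token_vector, W1, b1)), W2, b2)

tokens = [[0.3, -0.1, 0.7, 0.2], [0.1, 0.4, -0.2, 0.5]]
# The same network is applied to each token separately: no mixing between tokens.
outputs = [feed_forward(t) for t in tokens]
print(len(outputs), len(outputs[0]))  # 2 4
```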
7.Residual connection 🤝
Residual connection = Memory + New Knowledge
Residual connection does:
output = original + new_output
That means:
Keep the original meaning as it is
Add only the improvements
🧠 Super Simple Example
Input vector (simplified):
x = [5,2]
Attention output:
attn_out = [2,9]
Residual connection:
x = x + attn_out = [7,11]
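The same example as a tiny sketch in code (the function name `residual_add` is just for illustration):

```python
def residual_add(x, sublayer_output):
    """output = original + new_output, element by element."""
    return [a + b for a, b in zip(x, sublayer_output)]

x = [5, 2]
attn_out = [2, 9]
x = residual_add(x, attn_out)
print(x)  # [7, 11]
```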
8.Layer Normalization (LayerNorm)💙💙
👉 Simple Meaning: “Clean and balance the values before processing.”
🎯 Make values have mean = 0 and variance = 1
[87, 90, 22, 45, 100, 10] into roughly [0.8, 0.9, -1.1, -0.4, 1.2, -1.4]
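A minimal sketch of the normalisation itself, run on the same input list. It leaves out the learned scale and shift that real LayerNorm layers usually apply afterwards, and the small `eps` value is an assumption to avoid dividing by zero.

```python
import math

def layer_norm(values, eps=1e-5):
    """Shift to mean 0 and scale to variance 1 (no learned scale/shift here)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

normalised = layer_norm([87, 90, 22, 45, 100, 10])
print([round(v, 1) for v in normalised])
```

Values above the mean (here 59) come out positive and values below it come out negative, so big and small inputs end up on the same balanced scale.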
9.Logits
In the output layer, when the model predicts the next token from the previous tokens, it gives a prediction score (a logit) for every token in the vocab. Softmax turns these scores into probabilities, and the highest-scored tokens are the most likely next words
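A minimal sketch of going from logits to a next word. The five-word vocabulary and the logit values are made up, and picking the single highest score (greedy decoding) is just one simple way of choosing; real chatbots often sample from the probabilities instead.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up vocabulary and logits for the prompt "cat sat on".
vocab = ["cat", "sat", "on", "mat", "dog"]
logits = [0.5, 0.1, 0.3, 3.2, 1.0]

probs = softmax(logits)
# Greedy decoding: pick the token with the highest score.
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # mat
```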
10.What happens in forward and backward pass➡️⬆️➡️
Forward pass
The model predicts the next-token logits
Compares the logits with Y (the actual next tokens)
Computes loss
Loss = “how wrong the predictions were”
Backward pass
Model computes gradients (how much each weight contributed to the error)
This is the model learning from mistakes
Optimizer uses the gradients to update weights in a direction that reduces loss
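The whole loop can be sketched with a deliberately tiny "model" whose only weights are four logits. Everything here is an assumption for illustration: the 4-token vocab, the learning rate, and the hand-written gradient (for softmax with cross-entropy loss it is probabilities minus the one-hot target). In PyTorch, autograd computes the gradients and an optimizer from `optim` does the update.

```python
import math

def softmax(logits):
    biggest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Loss = how wrong the prediction was (lower is better)."""
    return -math.log(probs[target])

# Toy "model": the logits themselves are the trainable weights.
weights = [0.0, 0.0, 0.0, 0.0]  # one score per token in a 4-token vocab
target = 3                      # Y: index of the correct next token
learning_rate = 0.5

losses = []
for step in range(5):
    # Forward pass: predict, compare with Y, compute the loss.
    probs = softmax(weights)
    losses.append(cross_entropy(probs, target))
    # Backward pass: gradient of the loss for each logit is probs - one_hot(Y).
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    # Optimizer step (plain gradient descent): move against the gradient.
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print([round(loss, 3) for loss in losses])
```

The printed losses shrink step by step, which is the model "learning from its mistakes" in miniature.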