Demystifying LLMs: From Theory to Hands-On with Llama-3

A Step-by-Step Workshop for Students

Session Goal

  • ✅ Open a Google Colab notebook
  • ✅ Load a 4-bit quantized Llama-3 or Phi-3 model
  • ✅ Write a Python function to interact with the model
  • Focus: From understanding LLM internals to prompting a running model yourself in Colab

10:00 – 11:00 AM: The Mechanics (How LLMs Work)

  • Goal: Understand that LLMs are not knowledge bases, but next-token predictors.
  • Visual: The Transformer architecture (Llama-3 and Phi-3 are decoder-only variants of it).
  • Core idea: Context determines meaning (e.g., 'Bank' in different sentences).
  • Output: Predicts the most probable next token, one step at a time (see the sketch below).
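
To make "next-token predictor" concrete, here is a minimal sketch. It assumes the small, openly downloadable gpt2 checkpoint purely for illustration; the workshop models behave the same way, just at a larger scale:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Tiny open model used only to illustrate next-token prediction.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "I went to the bank to deposit my"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # (batch, seq_len, vocab_size)

# Turn the logits at the last position into a probability distribution
# over the *next* token, then show the top 5 candidates.
probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r:>12}  p={p.item():.3f}")
```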

Tokenization & The Math Problem

  • Interactive Demo: Use the OpenAI Tokenizer.
  • Example: 'Lollipop' is seen as whole tokens, not as individual letters (try it in the snippet below).
  • Reason for errors: Models operate on tokens, not characters.
  • Lesson: Break problems down for better results.
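
The same experiment can be reproduced inside the notebook; here is a small sketch using the tiktoken library instead of the web page (the encoding name cl100k_base is one common choice, the lesson is identical):

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models
enc = tiktoken.get_encoding("cl100k_base")

word = "Lollipop"
token_ids = enc.encode(word)
print(token_ids)

# Show which chunk of bytes each token id maps back to,
# since the model never sees the word letter by letter.
for tok in token_ids:
    print(tok, enc.decode_single_token_bytes(tok))
```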

The Hardware Problem: Quantization

  • Challenge: Llama-3-8B in FP16 needs ~16GB of VRAM; the Colab Free Tier T4 offers only ~15GB.
  • Solution: Quantization – storing the model weights at lower precision to shrink them.
  • Analogy: FP16 = high-res pizza; INT4 = pixelated pizza.
  • Takeaway: 4-bit quantization makes large models runnable in Colab (config sketch below).
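
In code, "compressing the weights" boils down to one configuration object. A minimal sketch using transformers' BitsAndBytesConfig (the nf4 and float16 choices are common settings, not requirements):

```python
import torch
from transformers import BitsAndBytesConfig

# Ask transformers + bitsandbytes to store the weights in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights instead of 16-bit
    bnb_4bit_quant_type="nf4",             # "normal float 4" quantization scheme
    bnb_4bit_compute_dtype=torch.float16,  # do the actual math in FP16
)
```

Rough arithmetic: 8B parameters at 16 bits each is about 16GB of weights; at 4 bits it is roughly 4–5GB with overhead, which fits comfortably on a 15GB T4.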

11:00 – 12:00 PM: The Hello World (Hands-On)

  • Step 1: Change runtime → T4 GPU.
  • Step 2: Install dependencies: torch, transformers, bitsandbytes, accelerate.
  • Step 3: Configure quantization and load the pre-trained model (a full example cell is sketched below).
  • Troubleshoot: Common Colab errors and GPU disconnects.
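
Steps 2 and 3 as Colab cells, sketched end to end. Note the assumptions: meta-llama/Meta-Llama-3-8B-Instruct is gated behind a Hugging Face license acceptance, and microsoft/Phi-3-mini-4k-instruct is an open alternative that fits even more easily:

```python
# Step 2 (run in its own Colab cell):
# !pip install -q torch transformers bitsandbytes accelerate

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"   # or "microsoft/Phi-3-mini-4k-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # let accelerate place the layers on the T4 GPU
)
```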

First Inference: Why 'Hi' Repeats

  • Common mistake: Feeding a bare 'Hi' prompt straight into model.generate with no chat template (see below).
  • Result: Repetitive or meaningless output.
  • Lesson: Models need chat templates for structure.
  • Next: Learn to use Llama-3’s chat format.
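
The mistake and its symptom, sketched with the model loaded earlier (the exact output varies per run):

```python
# Naive call: raw text with no chat structure around it
inputs = tokenizer("Hi", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Instruct models are trained on structured chat turns; a bare "Hi"
# often produces repetitive or meaningless continuations.
```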

12:00 – 12:20 PM: Engineering Prompts

  • Concept: Templates tell the model who is speaking.
  • Format includes system, user, and assistant tags.
  • Activity: Build ask_llama(prompt) with tokenizer.apply_chat_template (sketched after this list).
  • Outcome: Consistent, context-aware model responses.
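
A minimal version of ask_llama, assuming the tokenizer and model loaded above; the generation settings are illustrative, not prescribed by the slides:

```python
def ask_llama(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Wrap a user prompt in the model's chat template and return only the reply."""
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
    # apply_chat_template inserts the special role tags the model was trained on
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,   # end with the assistant tag so the model answers
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Slice off the prompt tokens so only the newly generated text is returned
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

print(ask_llama("Explain tokenization in one sentence."))
```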

12:20 – 12:40 PM: Few-Shot Prompting

  • Challenge: Translate Telugu text to English and return it as JSON.
  • Bad Prompt: The model wraps the answer in conversational fluff.
  • Fix: Provide worked examples in the prompt (few-shot), as in the sketch below.
  • Output: Precise, code-ready JSON response.
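
One way to structure the few-shot prompt as chat messages, reusing the tokenizer and model from before (the Telugu/English pairs are illustrative placeholders; swap in the workshop's real examples and query):

```python
few_shot_messages = [
    {"role": "system",
     "content": 'Translate the Telugu text to English. Reply with JSON only, '
                'exactly in the form {"english": "..."} and nothing else.'},
    # Worked example 1 (illustrative pair)
    {"role": "user", "content": "నమస్కారం"},
    {"role": "assistant", "content": '{"english": "Hello"}'},
    # Worked example 2 (illustrative pair)
    {"role": "user", "content": "ధన్యవాదాలు"},
    {"role": "assistant", "content": '{"english": "Thank you"}'},
    # The real query goes last
    {"role": "user", "content": "<Telugu sentence to translate>"},
]

input_ids = tokenizer.apply_chat_template(
    few_shot_messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```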

12:40 – 1:00 PM: Mini Hack – The Strict Librarian

  • Task: Write a system prompt that makes the model return JSON with the genre and year for a given book title (one possible harness is sketched below).
  • Condition: If the book doesn't exist → return null.
  • Goal: Output must be parsable via json.loads() in Python.
  • Why: Real-world AI apps depend on reliable JSON output.
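
One possible harness for the hack, reusing the ask_llama helper from earlier; the system prompt wording is yours to tune, and the json.loads check is the success criterion:

```python
import json

librarian_system = (
    "You are a strict librarian. For the book title the user gives, reply with "
    'JSON only, exactly in the form {"genre": "...", "year": 1234}. '
    "If the book does not exist, reply with exactly: null"
)

reply = ask_llama("The Hobbit", system=librarian_system)

try:
    data = json.loads(reply)   # a dict for real books, None if the model said null
    print("Parsed:", data)
except json.JSONDecodeError:
    print("Contract broken, raw reply was:", reply)
```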

Wrap-Up & Key Takeaways

  • ✔ LLMs = Predictors, not databases.
  • ✔ Tokenization explains model quirks.
  • ✔ Quantization enables on-device AI.
  • ✔ Prompt engineering = Real control over output.