Sushant Kumar

The AI Blog

My notes on AI research, machine learning, and software development.

Whisper: Transformer for Speech Recognition

Understand the Transformer model and its architecture for speech recognition. Learn how Transformers are used in ASR systems and how they compare to traditional RNNs and LSTMs

Phi 3: Highly Capable Language Model on Phone

Phi 3, developed by Microsoft AI, is a highly capable language model that can run on a phone

LLaVA: Large Multimodal Model

LLaVA is a large multimodal model that connects a vision encoder to a large language model. It is trained on a large-scale multimodal dataset and can generate text conditioned on images

Llama 3: SOTA open-weights LLM

Llama 3 by Meta AI is trained on more than 15T tokens of data. It is currently the SOTA open-weights LLM available, beating out the competition by a fair margin. It will be released in three different variants: 8B, 70B and 400B.

Grouped Query Attention

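Grouped Query Attention is an attention mechanism in which groups of query heads share a single key and value head. This shrinks the key-value cache and speeds up inference in transformer models while keeping quality close to standard multi-head attention.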

Transformers: Attention is All You Need

Transformers have become the building blocks of many state-of-the-art AI models. This post provides a detailed explanation of the Transformer architecture which was introduced in the paper 'Attention is All You Need'

BERT: Bidirectional Encoder Representations from Transformers

Understand the BERT model and its architecture. Learn how BERT is pre-trained and fine-tuned for various NLP tasks

Multi Query Attention

Understand the Multi Query Attention mechanism in the context of Transformers and how sharing a single key and value head across all query heads reduces memory use and speeds up decoding

Gemma: Google's family of Open LLMs

Gemma is a state-of-the-art family of models developed by Google DeepMind. Gemma models are released with open weights and can be used for a variety of tasks.

How Stable Diffusion works

A deep dive into the architecture and inner workings of Stable Diffusion, the generative model released by Stability AI

VAE: Variational Autoencoder

A gentle introduction to the Variational Autoencoder (VAE), a type of generative model that learns efficient representations of data by learning the parameters of a probability distribution representing the data. We explore the architecture, training process, and the reparameterization trick used in VAEs

Mixtral of Experts

An in-depth guide to the architecture of Mixtral of Experts, Mistral AI's state-of-the-art Sparse Mixture of Experts (SMoE) model

MoE: Mixture of Experts

An introduction to the Mixture of Experts (MoE) architecture used in Large Language Models (LLMs). Let's explore how MoE routes each input to a small set of specialized expert networks, increasing model capacity without a proportional increase in compute

RAG Triad - Evaluating Quality of Response from LLMs

RAG has become the standard architecture for providing LLMs with relevant context to reduce hallucinations and improve performance. In this blog post, we will look at the RAG triad and how it can be used to evaluate the quality of responses from LLMs

DDPM: Denoising Diffusion Probabilistic Models

A deep dive into the basic building blocks of Denoising Diffusion Probabilistic Models, with a code implementation in PyTorch.

GPT-2: A Deep Dive

An in-depth look at GPT-2, a large language model developed by OpenAI, and its architecture, training, and applications.

Tokenization in Large Language Models

A detailed guide to tokenization in large language models. This post covers the basics of tokenization, the different tokenization strategies, and how they are implemented in large language models

RoPE: Rotary Positional Embedding

An in-depth look at RoPE, a positional encoding method that rotates query and key vectors by position-dependent angles so that attention scores capture the relative positions of tokens in a sequence.

CLIP: Bridging Vision and Language in AI

An in-depth and practical guide to how AI models bridge the gap between vision and language using contrastive learning, as seen in CLIP, released by OpenAI in 2021.