Sushant Kumar

The AI Blog

My notes on AI research, reinforcement learning, and software engineering

Training LLMs on GPUs

How GPUs are used to train Large Language Models (LLMs), why they are so effective, and how TFLOPS figures relate to training throughput.

How AI monitors calls of 5,000+ sales representatives for actionable insights

A deep dive into the engineering behind scaling a production-ready, AI-powered sales monitoring system that leverages a pipeline of LLMs, ASR, and diarization to monitor calls from more than 5,000 salespeople.

The Path to the Ultimate Prize

A critique of the current AI narrative and its limitations, and a call for a more comprehensive approach to understanding and replicating human intelligence.

Vision Transformers

An in-depth introduction to Vision Transformers and their potential to revolutionize the field of computer vision.

PaliGemma: A Versatile Vision-Language Model (VLM)

A deep dive into PaliGemma, Google's vision-language model. Learn about its architecture, capabilities, and potential applications in AI and machine learning.

SigLIP: Sigmoid Loss for Language-Image Pretraining

How SigLIP works and how its sigmoid loss addresses the shortcomings of the contrastive loss used in Language-Image Pretraining (LIP).

Whisper: Transformer for Speech Recognition

Understand the Transformer model and its architecture for speech recognition. Learn how Transformers are used in ASR systems and how they compare to traditional RNNs and LSTMs

Phi-3: A Highly Capable Language Model on Your Phone

Phi-3, developed by Microsoft, is a highly capable small language model that can run locally on a phone.

LLaVA: Large Multimodal Model

LLaVA is a large multimodal model that connects a vision encoder to a large language model. It is trained on large-scale multimodal data and can generate text conditioned on images.

Llama 3: SOTA open-weights LLM

Llama 3 by Meta AI is trained on more than 15T tokens of data. It is currently the SOTA open-weights LLM, beating out the competition by a fair margin. It will be released in three variants: 8B, 70B, and 400B.

Grouped Query Attention

Grouped Query Attention is an attention mechanism that shares key and value heads across groups of query heads, shrinking the KV cache and speeding up inference in transformer models.

RoPE: Rotary Positional Embedding

An in-depth look at RoPE, a positional encoding method that rotates query and key vectors by position-dependent angles so that attention scores depend on the relative positions of tokens in a sequence.

Multi Query Attention

Understand the Multi-Query Attention mechanism in Transformers, where all query heads share a single key and value head, reducing memory use and speeding up inference.

Gemma: Google's family of Open LLMs

Gemma is a state-of-the-art family of models developed by Google. Gemma models are released with open weights and can be used for a variety of tasks.

How Stable Diffusion works

A deep dive into the architecture and inner workings of Stable Diffusion, the revolutionary generative model developed by Stability AI.

CLIP: Bridging Vision and Language in AI

An in-depth and practical guide to how AI models bridge the gap between vision and language using contrastive learning approaches, as seen in CLIP, released by OpenAI in 2021.

Mixtral of Experts

An in-depth guide to the architecture of Mistral AI's state-of-the-art Sparse Mixture of Experts (SMoE) model, Mixtral of Experts.

MoE: Mixture of Experts

An introduction to the Mixture of Experts (MoE) architecture and its role in Large Language Models (LLMs). We explore how MoE improves efficiency and quality by routing each token to a small set of specialized expert networks instead of activating the full model.

RAG Triad - Evaluating Quality of Response from LLMs

RAG has become the standard architecture for providing LLMs with relevant context to reduce hallucinations and improve performance. In this blog post, we will look at the RAG triad and how it can be used to evaluate the quality of responses from LLMs.

DDPM: Denoising Diffusion Probabilistic Models

A deep dive into the basic building blocks of Denoising Diffusion Probabilistic Models, with a code implementation in PyTorch.

GPT-2: A Deep Dive

An in-depth look at GPT-2, a large language model developed by OpenAI, and its architecture, training, and applications.

BERT: Bidirectional Encoder Representations from Transformers

Understand how the widely used BERT model works and how its architecture relates to the Transformer model. Also, learn how BERT is pre-trained and fine-tuned for various NLP tasks.

Transformers: Attention is All You Need

Transformers have become the building blocks of many state-of-the-art AI models. This post provides a detailed explanation of the Transformer architecture, which was introduced in the paper 'Attention Is All You Need'.

Tokenization in Large Language Models

A detailed guide to how tokenization works in large language models. This post covers the basics of tokenization, the main tokenization strategies, and how they are implemented in practice.

VAE: Variational Autoencoder

A gentle introduction to the Variational Autoencoder (VAE), a generative model that learns efficient representations of data by learning the parameters of a probability distribution over the data. We explore the architecture, the training process, and the reparameterization trick used in VAEs.