GLM-5: From Vibe Coding to Agentic Engineering

Tue 30 June 2026

Background

In the previous post, DeepSeek-V3.2 Lightning Indexer, we described in detail how the Lightning Indexer in DeepSeek Sparse Attention (DSA) achieves token-level sparse attention through top-k selection. GLM-5, as Zhipu's …

DeepSeek-V4: Million-Token Context

Fri 19 June 2026

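Background

In previous posts, we described in detail the MLA mechanism of DeepSeek-V3 and the Lightning Indexer of DeepSeek-V3.2. MLA addresses the memory-bandwidth problem at inference time, and DSA uses sparse attention to bring the computational …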

DeepSeek-V3.2 Lightning Indexer

Wed 18 February 2026

DeepSeek Lightning Indexer plot

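Background

In the previous post, DeepSeek V3 Multi-head Latent Attention (MLA), we described in detail how DeepSeek-V3's MLA mechanism reduces the memory footprint of the KV cache through low-rank compression. MLA solves the inference …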

DeepSeek Expert Parallelism Load Balancer (EPLB) Code Reading

Sun 20 April 2025

Introduction

In the previous Introduction to DeepSeek-V3, one crucial component highlighted was DeepSeekMoE. With Expert Parallelism, different experts are assigned to different GPUs. Because the load on each expert varies with the current workload, keeping the load balanced across GPUs is critical. DeepSeekMoE addresses this …

DeepSeek V3 MoE

Thu 20 March 2025

MLA plot

1. Structure of DeepSeek MoE

DeepSeek-V3's MoE architecture continues and refines its distinctive DeepSeekMoE design, and introduces an auxiliary-loss-free load balancing strategy.

1. Overall structure formula …

DeepSeek V3 Multi-head Latent Attention (MLA)

Thu 20 March 2025

MLA plot

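In the previous blog post, we discussed how DeepSeek-V3 adopts the MLA (Multi-Head Latent Attention) architecture, whose core purpose is to shrink the KV cache through low-rank joint compression of keys and values. In particular, during inference …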

DeepSeek V3 learning notes

Sun 23 February 2025

1. What is the problem to solve?

When I went home for the Spring Festival, DeepSeek released a new model. For a while, media of all kinds discussed it heatedly, elevating it almost to a matter of national destiny. Two points dominated the discussion: the first is the …

DeepSeek V3

Sun 16 February 2025

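1. What is the problem to solve?

When I went home for the Spring Festival, DeepSeek happened to release a new model. For a while, media of all kinds discussed it heatedly, elevating it almost to a matter of national destiny. The most important points discussed should …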

Prediction in decoder and KV-Cache

Sun 21 April 2024

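1. Prediction in Decoder

The earlier GPT summary gave an overall introduction to the GPT model. Here a fake example is used to explain, step by step, what GPT does, how self-attention is computed, and how the KV cache works.

GPT is a decoder-only model that predicts the next token from the preceding tokens. For example, take the sentence "it is sunny today."; we now have the initial input …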

Image Generation 2: Latent Diffusion model / Stable Diffusion

Sun 01 October 2023

In the previous blog we introduced the diffusion model (DDPM), which learns the noise function (an NN model) at each step (time \(t\)) by adding Gaussian noise to an image step by step and then reversing the process, denoising from Gaussian noise back to an image. The diffusion model is the most …

pydata: Huiming's learning notes

Keep Looking, Don't Settle
