Background
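In the previous article, DeepSeek V3 Multi-head Latent Attention (MLA), we described in detail how DeepSeek-V3's MLA mechanism reduces the memory footprint of the KV cache through low-rank compression. MLA addresses inference …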
In the previous Introduction to DeepSeek-V3, a crucial component highlighted was the use of DeepSeekMoE. When employing Expert Parallelism, different experts are assigned to different GPUs. Since the load on each expert may vary with the current workload, maintaining load balance across GPUs is critical. DeepSeekMoE addresses this …
DeepSeek-V3's MoE architecture continues and refines its distinctive DeepSeekMoE design, and introduces an Auxiliary-loss-free Load Balancing strategy.
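In the previous blog post, we discussed how DeepSeek-V3 adopts the MLA (Multi-Head Latent Attention) architecture, whose core purpose is to shrink the KV cache through low-rank joint compression of the Keys and Values. Especially in inference …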
When I went home for the Spring Festival, DeepSeek happened to release a new model. For a while, media outlets of every kind discussed it intensely, almost elevating it to a matter of national destiny. Two points seemed to dominate the discussion: the first is the …
The earlier GPT summary gave an overall introduction to the GPT models. Here, a fake example is used to explain step by step how GPT works, how self-attention is computed, and what the KV cache is.
GPT is a decoder-only model that predicts the next token from the preceding tokens. For example, take the sentence "it is sunny today."; given the initial input …
In the previous blog we introduced the diffusion model (DDPM), which learns the step (time \(t\)) and the noise function (an NN model) by adding Gaussian noise to an image step by step and then reversing the process, denoising from Gaussian noise back to an image. The diffusion model is the most …
The previous notes introduced the text generation models (the GPT family). This reading note covers image generation papers.
Similar to a text generator, which generates the next token, OpenAI has image-GPT, a large transformer trained on next-pixel prediction in which the pixels are concatenated into a vector to …
Improving Language Understanding by Generative Pre-Training
Before GPT-1, NLP usually relied on supervised models. For each task, some labeled data was collected, and a supervised model was then developed on that labeled data. There are several problems with this approach: First, labeled data is …