Σ-GenAI for Open Research

This repo collects some latest research work of Generative AI. It provides simple implementations to understand the ideas and some follow-up discussions to inspire future work.

Attention

Native Sparse Attention

Native Sparse Attention (NSA) was proposed by DeepSeek for efficient language models. The techniques can also be applied to vision generation. Since video generation is usually trained with diffusion or flow matching under full attention, the sliding-window attention in NSA can be skipped. Token compression and selection can be done in 1D or 3D, depending on the latent space.
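
Below is a minimal sketch of 1D token compression and block selection in the spirit of NSA; the block size, top-k, and mean-pooling compressor are my own illustrative choices, not the paper's exact design.

```python
# A minimal sketch (not the official NSA code) of 1D token compression and
# block selection. Block size, top-k, and mean pooling are assumptions.
import torch
import torch.nn.functional as F

def compress_and_select(q, k, v, block_size=16, top_k=4):
    """q: (heads, 1, d) single query; k, v: (heads, seq, d)."""
    h, seq, d = k.shape
    n_blocks = seq // block_size
    k_blocks = k[:, :n_blocks * block_size].reshape(h, n_blocks, block_size, d)
    v_blocks = v[:, :n_blocks * block_size].reshape(h, n_blocks, block_size, d)

    # Token compression: mean-pool each block into one compressed key.
    k_cmp = k_blocks.mean(dim=2)                                 # (h, n_blocks, d)

    # Score the query against compressed keys to rank the blocks.
    scores = (q @ k_cmp.transpose(-1, -2)) / d ** 0.5            # (h, 1, n_blocks)
    top = scores.squeeze(1).topk(min(top_k, n_blocks), dim=-1).indices

    # Token selection: gather uncompressed tokens of the chosen blocks
    # and attend over them with full attention.
    idx = top[:, :, None, None].expand(-1, -1, block_size, d)
    k_sel = torch.gather(k_blocks, 1, idx).reshape(h, -1, d)
    v_sel = torch.gather(v_blocks, 1, idx).reshape(h, -1, d)
    attn = F.softmax((q @ k_sel.transpose(-1, -2)) / d ** 0.5, dim=-1)
    return attn @ v_sel                                          # (h, 1, d)
```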

Follow-up

Token compression provides an overview of the entire sequence. Uncompressed token blocks are then selected based on their attention scores. A recent work, DART, proposes to consider duplication in token selection: it first selects a few pivotal tokens, and tokens similar to the pivotal ones are then ignored in the selection, which allows more diverse tokens to be selected. Another work, DivPrune, formulates the problem as a Max–Min Diversity Problem (MMDP). It will be interesting to see how these techniques work for generation tasks.
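
The sketch below illustrates duplication-aware selection in the spirit of DART (not the paper's exact algorithm): tokens are ranked by attention score, and candidates too similar to already-selected pivotal tokens are skipped. The similarity threshold and the greedy loop are my assumptions.

```python
# A minimal sketch of duplication-aware token selection; threshold and
# greedy order are illustrative assumptions.
import torch
import torch.nn.functional as F

def select_diverse_tokens(tokens, attn_scores, num_keep, sim_thresh=0.9):
    """tokens: (seq, d), attn_scores: (seq,). Returns indices of kept tokens."""
    order = attn_scores.argsort(descending=True)   # candidates by importance
    feats = F.normalize(tokens, dim=-1)
    kept = []
    for i in order.tolist():
        if len(kept) == num_keep:
            break
        if kept:
            # Skip tokens that duplicate an already-selected pivotal token.
            sim = (feats[i] @ feats[torch.tensor(kept)].T).max()
            if sim > sim_thresh:
                continue
        kept.append(i)
    return torch.tensor(kept)
```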

Multi-Token Attention

Multi-Token Attention adds convolutions when computing attention scores. Convolutions are applied over queries, keys, and heads. I use group convolution to implement the query-key (qk) pre-softmax and post-softmax convolutions, which are independent across heads, and a linear layer to implement the head-mixing convolution, applied within each group of heads.
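
A minimal sketch of these two pieces is shown below, operating on attention scores of shape (batch, heads, q_len, k_len); the kernel size and head-group size are illustrative assumptions, not the paper's settings.

```python
# A minimal sketch (assumptions, not the paper's reference code) of the
# qk convolution and head mixing described above.
import torch
import torch.nn as nn

class MultiTokenAttnConv(nn.Module):
    def __init__(self, num_heads=8, head_group=4, kernel=3):
        super().__init__()
        # Group convolution with groups=num_heads: each head convolves its
        # own query-key score map independently (pre- or post-softmax).
        self.qk_conv = nn.Conv2d(num_heads, num_heads, kernel,
                                 padding=kernel // 2, groups=num_heads)
        # Head mixing as a linear layer applied within each group of heads.
        self.head_group = head_group
        self.head_mix = nn.Linear(head_group, head_group)

    def forward(self, scores):                     # (b, h, q_len, k_len)
        scores = self.qk_conv(scores)
        b, h, ql, kl = scores.shape
        g = self.head_group
        # Mix heads inside each group of size head_group.
        s = scores.view(b, h // g, g, ql, kl).permute(0, 1, 3, 4, 2)
        s = self.head_mix(s)                       # linear over the group dim
        return s.permute(0, 1, 4, 2, 3).reshape(b, h, ql, kl)
```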

Consideration

Since convolution is local, multiple tokens are only grouped within a local neighborhood. Also, the convolutions are computed on tokens flattened from 3D.

MoE

DeepSeekMoE

DeepSeekMoE uses fine-grained experts and shared experts on top of standard MoE. An MLP layer typically uses a large intermediate dimension, e.g. 6 times the input dimension. DeepSeekMoE uses a smaller intermediate dimension per expert to get more experts without increasing computation. Using fine-grained experts enhances the combinatorial flexibility of the activated experts.
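
The sketch below shows the idea under assumed dimensions: many small routed experts plus a few always-active shared experts, with top-k routing. It is not DeepSeek's implementation.

```python
# A minimal sketch of fine-grained plus shared experts; the dimensions,
# expert counts, and top-k are illustrative assumptions.
import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    def __init__(self, dim=512, expert_dim=128, num_experts=16,
                 num_shared=2, top_k=4):
        super().__init__()
        def make_expert():
            # A small intermediate dimension keeps each expert cheap, so more
            # experts fit in the same compute budget.
            return nn.Sequential(nn.Linear(dim, expert_dim), nn.GELU(),
                                 nn.Linear(expert_dim, dim))
        self.experts = nn.ModuleList(make_expert() for _ in range(num_experts))
        self.shared = nn.ModuleList(make_expert() for _ in range(num_shared))
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                          # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)     # (tokens, num_experts)
        weight, idx = gates.topk(self.top_k, dim=-1)
        out = sum(e(x) for e in self.shared)       # shared experts: always on
        routed = torch.zeros_like(out)
        for e_id, expert in enumerate(self.experts):
            mask = (idx == e_id)                   # (tokens, top_k) bool
            rows = mask.any(dim=-1)                # tokens routed to this expert
            if rows.any():
                w = (weight * mask).sum(dim=-1, keepdim=True)[rows]
                routed[rows] += w * expert(x[rows])
        return out + routed
```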

Follow-up

MoE is efficient at inference when the batch size is 1. With a larger batch size, more experts are activated, and the model can take more memory than a dense model. To address this problem, ByteDance released Comet, an efficient infrastructure for MoE.

Auto-Encoder

Modern image and video generation models use CNN-based auto-encoders (AEs), which have the following limitations:

  • Uniform compression. AEs compress an image or a video uniformly in the spatial and temporal domains.
  • Not aligned with representations for understanding.
  • Efficiency. Interestingly, CNN-AEs require large dimensions at high resolutions, which is inefficient.

Some recent work has started to use ViTs for auto-encoders, including TiTok, ViTok, and MAEToK. If we can encode an image or a video into a 1D sequence of tokens in a coarse-to-fine manner, auto-regressive methods may fit better, and the latent space can be aligned with text for better multi-modal understanding. Semanticist moves one step further in this direction.
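
As a rough sketch of the 1D-token idea (my assumptions, not TiTok/ViTok/MAEToK code), learnable latent queries can be appended to the patch tokens of a ViT encoder and read out as a 1D latent sequence:

```python
# A minimal sketch of encoding an image into a 1D sequence of latent tokens
# with a ViT. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class OneDTokenizerEncoder(nn.Module):
    def __init__(self, dim=256, num_latent=32, patch=16, in_ch=3, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(in_ch, dim, patch, stride=patch)
        self.latents = nn.Parameter(torch.randn(1, num_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, img):                                       # (b, 3, H, W)
        patches = self.patchify(img).flatten(2).transpose(1, 2)   # (b, N, dim)
        lat = self.latents.expand(img.shape[0], -1, -1)
        x = torch.cat([patches, lat], dim=1)
        x = self.encoder(x)
        return x[:, -lat.shape[1]:]        # 1D latent tokens: (b, num_latent, dim)
```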

MAEToK

MAEToK uses MAE to learn latent tokens for generation. In its implementation, the token masking is different from the original MAE: MAE discards masked tokens and only sends the remaining tokens to the transformer, whereas MAEToK sends all pixel tokens to the transformer with masked tokens replaced by a learnable token. I followed the original MAE masking in my implementation. Another thing to note is that MAEToK builds its AE and VAE on top of VQ.
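
For reference, the original MAE random masking (which my implementation follows) drops the masked tokens before the encoder rather than replacing them with a mask token; a minimal version looks like this, with the mask ratio as an illustrative choice:

```python
# A minimal sketch of MAE-style random masking: masked tokens are dropped
# before the encoder instead of being replaced by a learnable mask token.
import torch

def random_masking(tokens, mask_ratio=0.75):
    """tokens: (batch, seq, dim). Returns kept tokens, mask, and restore ids."""
    b, seq, dim = tokens.shape
    len_keep = int(seq * (1 - mask_ratio))
    noise = torch.rand(b, seq, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)              # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse permutation
    ids_keep = ids_shuffle[:, :len_keep]
    kept = torch.gather(tokens, 1, ids_keep[..., None].expand(-1, -1, dim))
    mask = torch.ones(b, seq, device=tokens.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)       # 1 marks masked positions
    return kept, mask, ids_restore                  # only `kept` goes to the encoder
```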

Diffusion Sampling

Inductive Moment Matching

Inductive Moment Matching (IMM) is a new class of generative models. The training is based on mathematical induction: for s < r < t, it matches the two distributions mapped from r and t to s by minimizing their divergence. To make training stable, it uses stable sample-based divergence estimators, i.e. moment matching. With single-stage training, the model can be sampled efficiently with one or a few steps. The official implementation does not provide training code, and some parameters are unspecified in the paper. My implementation tries to follow the algorithms provided in the paper and aims to help people understand the work.
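
As a loose illustration of a sample-based moment-matching divergence, an RBF-kernel MMD between the two batches pushed to time s could look like the sketch below; the kernel choice and bandwidth are my assumptions, not the paper's exact estimator.

```python
# A minimal sketch of an MMD^2 estimator with an RBF kernel, which matches
# moments of two sample sets implicitly through the kernel.
import torch

def mmd_loss(x, y, bandwidth=1.0):
    """x, y: (n, d) samples mapped to time s from t and from r, respectively."""
    def rbf(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * bandwidth ** 2))
    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()
```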
