ConvNeXt: PyTorch Implementation

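## 0 Introduction

In an era dominated by Transformers, the authors propose a backbone built purely from convolutional layers. Every technique used to build the network comes from earlier convolutional networks; the contribution lies in considering how these factors add up as a whole. The network also borrows heavily from Swin Transformer in its structure, activation function, and training recipe, and ends up outperforming Swin-T.

If you are not familiar with ConvNeXt, first read the post [A ConvNet for the 2020s](http://62.234.186.27:8090/archives/38/ "A ConvNet for the 2020s") to get an overview of the architecture, then come back to this article.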

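## 1 Overall Network Architecture

The figure below compares the architectures of the original ResNet-50, ConvNeXt-T, and Swin-T. The authors take a plain ResNet-50 as the baseline for the whole optimization and modify it step by step into ConvNeXt-T. They then compare models at different scales and show that the architecture performs well across sizes. The rest of this section walks through the ConvNeXt structure.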

![Overall architecture comparison](/usr/uploads/2022/03/4098708356.png "Overall architecture comparison")

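Building the ConvNeXt network breaks down into three parts:
- building the stem
- building the block
- assembling the whole network

Each part is covered in turn below.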

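### 1.1 Building the stem

In ResNet-50, the stem is a 7×7 convolution with stride 2 followed by a 3×3 max-pooling layer with stride 2; together these two layers downsample the input by 4×. The authors observe that Swin-T instead treats the image as a grid of patches, with no elaborate convolution and pooling. Borrowing this design, ConvNeXt uses a single 4×4 convolution with stride 4, which acts like splitting the image into non-overlapping patches.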
```python
# stem
self.stem = nn.Conv2d(in_channels=in_c, out_channels=in_channels[0], kernel_size=4, stride=4)
```
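
As a quick check of the 4× downsampling claim, the sketch below computes output sizes with the standard convolution size formula in plain Python. The helper name `conv_out_size` is made up for illustration, and the padding values (3 for the 7×7 convolution, 1 for the max-pooling layer) are assumptions taken from common ResNet implementations, since the text above does not state them.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


input_size = 224

# ResNet-50 stem: 7x7 conv, stride 2 (padding 3 assumed),
# then 3x3 max-pool, stride 2 (padding 1 assumed)
after_conv = conv_out_size(input_size, kernel=7, stride=2, padding=3)
resnet_stem = conv_out_size(after_conv, kernel=3, stride=2, padding=1)

# ConvNeXt stem: a single 4x4 conv with stride 4 and no padding ("patchify")
convnext_stem = conv_out_size(input_size, kernel=4, stride=4)

print(after_conv, resnet_stem, convnext_stem)  # 112 56 56
```

Both stems reduce a 224×224 input to 56×56, but the ConvNeXt stem does it in one non-overlapping step.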

### 1.2 Building the Block
Before building the block, we again compare the block designs of the three networks to see how they differ.

![Block design](https://www.luoyang.ink/usr/uploads/2022/03/70495927.png "Block design")

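**Swin-T:** The block consists of two residual structures. The core of the first is the MSA (multi-head self-attention) module; the second holds two fully connected layers. Swin-T uses LayerNorm (LN) as its normalization layer and GELU as its activation function.

**ResNet:** The block is a single residual structure with a bottleneck design: a 1×1 convolution first reduces the channel dimension, a 3×3 convolution follows, and a final 1×1 convolution restores the original dimension. Every convolution is followed by BN and a ReLU activation.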

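**ConvNeXt:** When designing the block, the authors considered the following points:
1. They first adopt the inverted residual structure of MobileNet or EfficientNet: a 1×1 convolution expands the channel dimension, a 3×3 convolution follows, and a 1×1 convolution reduces the dimension again.
2. They then note that Swin-T uses 7×7 windows and performs local attention inside each window, so they also consider a 7×7 convolution kernel. This raises a problem: if the 7×7 convolution sits in the middle, after the expansion, it brings a large number of parameters. The 7×7 convolution is therefore moved to the start of the block.
3. ResNet stacks convolution, BN, and ReLU repeatedly, so it has many normalization and activation layers. The authors observe that a Swin-T block uses only one GELU activation and two LN layers, so they reduce the count of both components accordingly, which gives a further improvement.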
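
The parameter argument in point 2 can be checked with simple arithmetic. The sketch below assumes the 7×7 convolution is depthwise (one filter per channel, with bias), as in the Block code later in this article, and uses C = 96 as an example width; `depthwise_conv_params` is a hypothetical helper.

```python
def depthwise_conv_params(channels: int, kernel: int) -> int:
    """Weights plus biases of a depthwise conv (groups == channels)."""
    return channels * kernel * kernel + channels


C = 96          # example width (first stage of ConvNeXt-T)
expansion = 4   # the inverted bottleneck widens the channels by 4x

# 7x7 conv placed in the middle, after the 1x1 expansion: it sees 4C channels
params_middle = depthwise_conv_params(C * expansion, kernel=7)
# 7x7 conv moved to the start of the block: it sees only C channels
params_front = depthwise_conv_params(C, kernel=7)

print(params_middle, params_front)  # 19200 4800
```

Under these assumptions, moving the large kernel in front of the expansion cuts its parameter count by the expansion factor.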

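Before building the block, two components have to be built separately: the LN layer and DropPath.
LayerNorm is needed because the built-in LN layer normalizes over the trailing dimensions, for example the final channel dimension of a [b, n, c] tensor. Image tensors are usually laid out as [B, C, H, W], so we build our own LN layer.
DropPath is needed because torch 1.8 does not ship an implementation.

Sections 1.2.1 and 1.2.2 build these two components; Section 1.2.3 then uses them to implement the Block.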

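#### 1.2.1 Implementing LayerNorm

**BatchNorm recap**

Before implementing LayerNorm, let us recall BatchNorm, which is common in CV. A BN layer splits the input by channel, computes the mean and variance of each channel, and standardizes with them. In PyTorch, BN keeps exponentially weighted moving averages: a momentum term slowly updates the running mean and variance so that they approximate the statistics of the whole dataset.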
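
The running-statistics update can be written out in a few lines. This is a plain-Python sketch of the exponential moving average, assuming a momentum of 0.1 (the default of PyTorch's BatchNorm layers) and made-up per-batch means; it is not the library's actual code.

```python
def update_running_stat(running: float, batch_stat: float, momentum: float = 0.1) -> float:
    """One exponential-moving-average step: keep (1 - momentum) of the old value."""
    return (1 - momentum) * running + momentum * batch_stat


running_mean = 0.0                   # initial running mean
batch_means = [2.0, 2.0, 2.0, 2.0]   # made-up per-batch means of one channel

for batch_mean in batch_means:
    running_mean = update_running_stat(running_mean, batch_mean)
    print(round(running_mean, 4))    # 0.2, 0.38, 0.542, 0.6878
```

The running value approaches the true mean of 2.0 only gradually, which is what "slowly updates" refers to.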

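**LayerNorm**
In NLP tasks, sentences differ in length, so the sequence length varies from batch to batch. Because the number of samples contributing at each position keeps changing, the statistics are unstable and BN is hard to apply. This led to normalizing across channels instead.

Suppose a feature map has shape [b, c, h, w]. For every pixel, compute the mean and variance over its c channels and standardize along c. Because each sample is normalized on its own, there is no need to track dataset-wide mean and variance; the normalization is applied directly.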
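
To make the per-pixel computation concrete, here is a plain-Python sketch that normalizes the channel vector of a single pixel. The values and the helper name `layer_norm_pixel` are made up; the learnable scale and shift are left out, and `eps` is set to 1e-6 to match the implementation below.

```python
import math


def layer_norm_pixel(channels, eps=1e-6):
    """Standardize the channel values of one pixel (no learnable weight or bias)."""
    mean = sum(channels) / len(channels)
    # Biased variance (divide by the number of channels), as LayerNorm does
    var = sum((v - mean) ** 2 for v in channels) / len(channels)
    return [(v - mean) / math.sqrt(var + eps) for v in channels]


pixel = [1.0, 2.0, 3.0, 4.0]   # c = 4 channel values at one (h, w) location
normed = layer_norm_pixel(pixel)

print([round(v, 3) for v in normed])  # [-1.342, -0.447, 0.447, 1.342]
```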

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerNorm(nn.Module):
    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super(LayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape), requires_grad=True)
        self.bias = nn.Parameter(torch.zeros(normalized_shape), requires_grad=True)
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise ValueError("data_format should be channels_last or channels_first")
        self.normalized_shape = (normalized_shape,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.data_format == "channels_last":
            # [batch_size, ..., channels]: the built-in layer_norm handles this layout
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # [batch_size, channels, height, width]: normalize over dim 1 by hand
            mean = x.mean(1, keepdim=True)
            var = (x - mean).pow(2).mean(1, keepdim=True)
            x = (x - mean) / torch.sqrt(var + self.eps)
            # broadcast the per-channel scale and shift over height and width
            x = x * self.weight[:, None, None] + self.bias[:, None, None]
            return x
```

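#### 1.2.2 Implementing DropPath

DropPath belongs to the same family as dropout and DropBlock: it also drops part of what propagates through the network. DropPath acts on the batch dimension: for an input x, it sets all values of some samples in the batch to zero. It is used as shown below. A shortcut must be added right after this layer, so that samples that were zeroed keep their original values and the remaining samples go through the usual residual update.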

```python
def forward(self, x):
    shortcut = x
    ...
    x = self.drop_path(x)
    return x + shortcut
```

When using this function, **always add** the shortcut afterwards; otherwise some samples in the batch will be entirely zero.
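```python
import torch
import torch.nn as nn


def drop_path(x, drop_prob: float = 0., training: bool = False):
    # No-op at inference time or when nothing is dropped
    if drop_prob == 0 or not training:
        return x

    keep_prob = 1 - drop_prob
    # One random value per sample, broadcast over all remaining dimensions
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # 1 with probability keep_prob, otherwise 0
    # Rescale kept samples by 1 / keep_prob so the expected value is unchanged
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    def __init__(self, drop_prob):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, drop_prob=self.drop_prob, training=self.training)
```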
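
Dividing by `keep_prob` keeps the expected output equal to the input. The sketch below imitates the masking logic per sample with Python's `random` module instead of tensors; the helper name `drop_path_scalar`, the sample count, and the seed are arbitrary choices for illustration.

```python
import math
import random


def drop_path_scalar(x: float, drop_prob: float, rng: random.Random) -> float:
    """Per-sample version of drop_path: keep x (rescaled) or zero it."""
    keep_prob = 1 - drop_prob
    mask = math.floor(keep_prob + rng.random())   # 1 with probability keep_prob, else 0
    return x / keep_prob * mask


rng = random.Random(0)
outputs = [drop_path_scalar(1.0, drop_prob=0.2, rng=rng) for _ in range(100_000)]

kept_fraction = sum(1 for o in outputs if o != 0) / len(outputs)
mean_output = sum(outputs) / len(outputs)
print(kept_fraction, mean_output)   # close to 0.8 and 1.0 respectively
```

Each output is either 0 or the input scaled by 1 / 0.8, so the average over many samples stays near the original value of 1.0.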

#### 1.2.3 Building the Block

```python
import torch.nn as nn


class Block(nn.Module):
    # Uses the LayerNorm and DropPath classes defined in 1.2.1 and 1.2.2
    def __init__(self, in_channel, drop_path_ratio):
        super(Block, self).__init__()

        # 7x7 depthwise conv (groups == channels), placed at the start of the block
        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=in_channel, groups=in_channel,
                               kernel_size=7, padding=3, stride=1)
        self.layer_norm = LayerNorm(normalized_shape=in_channel, data_format="channels_first")
        # 1x1 convs: expand to 4x the channels, then project back
        self.conv2 = nn.Conv2d(in_channels=in_channel, out_channels=in_channel*4, kernel_size=1)
        self.gelu = nn.GELU()
        self.conv3 = nn.Conv2d(in_channels=in_channel*4, out_channels=in_channel, kernel_size=1)

        # stochastic depth on the residual branch
        self.drop_path = DropPath(drop_path_ratio) if drop_path_ratio > 0. else nn.Identity()

    def forward(self, x):
        shortcut = x
        x = self.conv1(x)
        x = self.layer_norm(x)
        x = self.conv2(x)
        x = self.gelu(x)
        x = self.conv3(x)
        x = self.drop_path(x)
        return x + shortcut
```
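
For a sense of scale, the parameters of one block can be counted layer by layer. The sketch assumes the layer shapes of the `Block` class above (bias terms included, LayerNorm with one weight and one bias per channel) and uses C = 96 as an example; `block_params` is a hypothetical helper.

```python
def block_params(c: int, kernel: int = 7, expansion: int = 4) -> dict:
    """Parameter count per layer of one ConvNeXt block with c channels."""
    hidden = c * expansion
    return {
        "conv1 (7x7 depthwise)": c * kernel * kernel + c,
        "layer_norm": 2 * c,
        "conv2 (1x1, C -> 4C)": c * hidden + hidden,
        "conv3 (1x1, 4C -> C)": hidden * c + c,
    }


counts = block_params(96)
for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))  # total: 79200
```

The two 1×1 convolutions hold most of the parameters; the 7×7 depthwise convolution accounts for only about 6% of the block.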

### 1.3 Assembling the whole network
```python
from typing import List

import torch
import torch.nn as nn


class ConvNeXt(nn.Module):
    def __init__(self, in_c: int, num_classes: int, in_channels: List[int], num_blocks: List[int],
                 drop_path_ratio: float = 0.):
        super(ConvNeXt, self).__init__()

        # stem
        self.stem = nn.Conv2d(in_channels=in_c, out_channels=in_channels[0], kernel_size=4, stride=4)

        # downsample
        self.downsamples = nn.ModuleList()
        for i in range(3):
            self.downsamples.append(
                nn.Sequential(
                    LayerNorm(normalized_shape=in_channels[i], data_format="channels_first"),
                    nn.Conv2d(in_channels=in_channels[i], out_channels=in_channels[i+1], kernel_size=2, stride=2)
                )
            )

        # block
        self.blocks = nn.ModuleList()

        # one drop-path rate per block, increasing linearly with depth
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_ratio, sum(num_blocks))]
        for i in range(4):
            stage = list()
            for j in range(num_blocks[i]):
                stage.append(Block(in_channel=in_channels[i], drop_path_ratio=dp_rates[sum(num_blocks[:i]) + j]))
            self.blocks.append(nn.Sequential(*stage))

        # out
        self.norm = LayerNorm(normalized_shape=in_channels[-1], data_format="channels_first")  # final norm layer
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.head = nn.Linear(in_channels[-1], num_classes)

        self.apply(self._init_weight)

    def _init_weight(self, m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out')
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.stem(x)
        # three stages, each followed by a downsampling layer
        for i in range(3):
            x = self.blocks[i](x)
            x = self.downsamples[i](x)
        # the last stage has no downsampling after it
        x = self.blocks[-1](x)
        x = self.norm(x)
        x = self.avgpool(x)
        x = torch.flatten(x, start_dim=1)
        x = self.head(x)
        return x
```
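
The indexing `dp_rates[sum(num_blocks[:i]) + j]` assigns each block its own drop-path rate, increasing linearly with depth across all stages. The plain-Python sketch below reproduces that schedule for the ConvNeXt-T block counts without `torch.linspace`; the helper `linspace` is a stand-in written for this example.

```python
def linspace(start: float, stop: float, num: int) -> list:
    """Evenly spaced values from start to stop inclusive (stand-in for torch.linspace)."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + step * k for k in range(num)]


num_blocks = [3, 3, 9, 3]   # ConvNeXt-T
drop_path_ratio = 0.1

dp_rates = linspace(0.0, drop_path_ratio, sum(num_blocks))

# Slice the flat schedule into stages, mirroring dp_rates[sum(num_blocks[:i]) + j]
stage_rates = []
for i, depth in enumerate(num_blocks):
    offset = sum(num_blocks[:i])
    stage_rates.append([round(dp_rates[offset + j], 4) for j in range(depth)])

for i, rates in enumerate(stage_rates):
    print(f"stage {i}: {rates}")
```

The first block gets a rate of 0 and the last block gets the full `drop_path_ratio`, so deeper blocks are dropped more often than shallow ones.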

With the network defined, the configurations of the different variants are as follows, where C is the channel width of each stage and B the number of blocks per stage:
- ConvNeXt-T: C = (96, 192, 384, 768), B = (3, 3, 9, 3)
- ConvNeXt-S: C = (96, 192, 384, 768), B = (3, 3, 27, 3)
- ConvNeXt-B: C = (128, 256, 512, 1024), B = (3, 3, 27, 3)
- ConvNeXt-L: C = (192, 384, 768, 1536), B = (3, 3, 27, 3)
- ConvNeXt-XL: C = (256, 512, 1024, 2048), B = (3, 3, 27, 3)
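
Combining these channel widths with the downsampling layers gives the feature-map shape at every stage. The sketch below assumes a 224×224 input (the article does not fix an input size) and applies the stride-4 stem followed by three stride-2 downsampling layers, as in the `ConvNeXt` class above; `stage_shapes` is a hypothetical helper.

```python
def stage_shapes(input_size: int, channels: list) -> list:
    """(channels, height, width) of the feature map in each of the four stages."""
    size = input_size // 4              # stem: 4x4 conv, stride 4
    shapes = []
    for i, c in enumerate(channels):
        if i > 0:
            size //= 2                  # downsample layer: 2x2 conv, stride 2
        shapes.append((c, size, size))
    return shapes


shapes_t = stage_shapes(224, [96, 192, 384, 768])   # ConvNeXt-T widths
print(shapes_t)  # [(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]
```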

```python
def convnext_t(num_classes):
    return ConvNeXt(in_c=3, num_classes=num_classes,
                    in_channels=[96, 192, 384, 768], num_blocks=[3, 3, 9, 3], drop_path_ratio=0.1)
```
