RevNet

RevNet

2017년도에 나온 논문
딥러닝 논문 읽기 모임에서 h-transformer-1d 논문을 뜯어보다 발견한 기법
찾아보니 lucidrains님이 reformer를 구현하며 사용한 _ReversibleFunction을 이후 transformer 구현체에 사용하는 것을 발견
residual connection의 단점인 Memory consumption을 개선한 기법

Abstract

Residual Conncetion의 bottleneck: Memory Consumption
- 역전파 시 gradient 계산을 위해서 필요하다고 한다.
위 문제를 개선한 ResNet의 variant인 RevNet을 소개
- 각 layer의 activation (output)들을 다음 layer의 activation으로 reconstruct 가능
- 때문에 역전파 동안에 모든 layer의 activation을 저장할 필요가 없게 된다.
RevNet을 CIFAR-10, CIFAR-100, ImageNet에서 실험
- 동일한 accuracy
- depth와 무관한 activation storage 요구

Introduction

Residual connection: key architecture innovation
- 더 깊게 쌓아도 gradient vanishing/exploding없이 information이 쭉 전달될 수 있게 만들어 줌
최근 거의 모든 신경망은 Backpropagation으로 학습됨

backpropagation 과정 중 network의 activation을 메모리에 저장함

아래 코드는 Matrix Multiplication을 torch autograd 모듈로 작성한 코N

  class Matmul(torch.autograd.Function):

      @staticmethod
      def forward(ctx, x, W):
          ctx.save_for_backward(x, w) # backward 연산을 위해 값을 저장
          return x @ W

      @staticmethod
      def backward(ctx, grad_outputs):
          x, W, = ctx.saved_tensors # 미분 연산에 사용될 값 호출
          return grad_outputs @ W.T, x.T @ grad_outputs

Neural Network는 깊게 쌓을수록 성능이 향상된다고 이미 보고된 바가 있고 ResNet 또한 정보가 backward 과정에서 잘 흐를 수 있도록 residual connection을 붙인 것
ResNet은 information flow를 개선한 것이지 backward 연산의 bottleneck을 해결한 것이 아님
GPU 자원은 한정되어 있음
이러한 연유로 Large model을 학습시킬 때 multiple GPU 환경에서 Parallelism이 필수지만 이는 비싸고 구현하기 복잡함
따라서 본 논문에서 RevNet을 제안함
- A variant of ResNets which is reversible in the sense that each layer’s activations can be computed from the next layer’s activations
- 위 문장이 핵심이라고 생각함. 각 layer의 activation이 다음 layer의 activation으로 계산이 가능하다는 말(reversible)
- 잘 이해가 안되면 3장의 수식 부분을 읽는 것이 빠름
신기하게 이러한 구조를 사용했을 때 성능 하락이 발생하지 않았다고 보고함
- Surprisingly, constraining the architecture to be reversible incurs no noticeable loss in performance

Background

Backpropagation

$\theta=\theta-\alpha\nabla_\theta f_\theta$
핵심은 chain rule
torch, tensorflow, theano에선 automatic differentiation 모듈로 구현이 되어있음
RevNet의 memory savings은 위 모듈의 도움을 받는 것이 아니라(?) backprop computation 부분을 직접 구현함
- 공식 repo에서는 tensorflow 구현을 수행 (tf.gradients를 사용해서)
- 본 논문 리뷰에서는 lucidrains님의 torch 구현체를 사용할 예정 (torch.autograd.Function)
Let $v_1,\dots,v_k$ be topological ordering of the nodes in the network’s computation graph $\mathcal{G}$, where $v_k$ denotes the cost function $\mathcal{C}$
- neural net을 topological graph로 보고 각 node를 $v_i$로 두자는 얘기
- 마지막 노드는 loss를 계산하는 부분일 테니 $v_k$는 cost function이 될 것
각 노드는 parents 노드의 함수 $f_i$에 의해 정의됨
Backprop이 원하는 것? total derivative $\frac{d\mathcal{C}}{dv_i}$를 계산하는 것!
- total derivative는 계산 그래프에서 $v_k$의 자손을 통한 간접 효과를 고려하여 $v_i$에 대한 극미한 변화가 $\mathcal{C}$에 미치는 영향을 측정
- [Note that] total derivative $\frac{d\mathcal{C}}{dv_i}$와 partial derivative $\frac{\partial f}{\partial x_i}$는 다르다는 것을 명심
  - partial derivative $\frac{\partial f}{\partial x_i}$는 다른 argument들에 대해 $x_i$가 바뀌는 영향을 고려하지 않는다.
Let $\bar{v_i}=\frac{d\mathcal{C}}{dv_i}$ be total derivative
Backprop은 topological order의 역순으로 계산 그래프의 node들을 돈다
각 노드 $v_i$에 대해 아래의 식을 활용하여 total derivative를 계산한다.
$\bar{v_i}=\sum_{j\in \text{Child}(i)}\bigg(\cfrac{\partial f_j}{\partial v_i}\bigg)^\top \bar{v_j}$
- where $\text{Child}(i)$는 $\mathcal{G}$의 node $v_i$의 child이고 $\cfrac{\partial f_j}{\partial v_i}$는 Jacobian matrix
- chain rule을 써놓은 식이다.

Deep Residual Networks

Deep networks는 수많은 nonlinear function들로 구성된 합성함수(composition function)
때문에 각기 다른 layer별 종속성은 많이 복잡함
- 이는 gradient computation을 불안정하게 만듦
더 깊게 쌓으려면 이를 해결해야할 필요성이 있었음
- Highway networks에서 skip connection을 소개하여 이 문제를 피했고
  - 더 정확히는 이 논문은 LSTM에서 영감을 받음
- ResNet은 functional form을 사용하여 computations stable을 유지하며 information flow를 개선! $y=x+\mathcal{F}(x)$
- 여기서 $\mathcal{F}$는 shallow neural network로 self-attention 모듈 등이 될 수 있음
ResNet은 exploding, vanishing gradient 문제에 강건함
He 연구진은 두 residual block을 사용했다고 함
- basic residual function $Basic(x)=c_3(c_3(x))$
- bottleneck residual function $Bottleneck(x)=c_1(c_3(c_1(x)))$
- where $\begin{aligned}\\ a(x)=ReLU(BN(x))\\ ac_k(x)=Conv_{k\times k}(a(x)) \end{aligned}$

Reversible Architectures

본 논문의 동기와는 다르지만 reversible neural net architecture들에 대해 몇몇 paper들이 있다고 함
RevNet은 그 중 NICE라 불리는 nonlinear independent components estimation에 영감을 받아 탄생했다고 언급됨
- NICE: an approach to unsupervised generative modeling
- https://deepseow.tistory.com/45
- https://deepakbaby.in/post/nice-keras/
NICE는 data space와 latent space 사이의 non-linear bijective transformation를 학습
- 비선형 1-1 corresponding mapping을 학습
- 양방향으로 복원되고 + 비선형
- RevNet의 아이디어와 유사함

\begin{aligned}\\ y_1&=x_1\\ y_2&=x_2+\mathcal{F}(x_1) \end{aligned}

모델이 invertible하고 Jacobian이 unit determinant를 가지기 때문에 log-likelihood와 gradient 계산이 다루기 쉬움
- Invertible? $x_1=y_1$, $x_2=y_2-\mathcal{F}(y_1)$
- Jacobian이 unit determinant를 가지는 이유? just add!

RevNet

Abstract

Introduction

Background

Backpropagation

Deep Residual Networks

Reversible Architectures

Jacobian이 unit determinant를 가지는 이유? just add!