9 Dec 2019

BERT에 대해서 자세히 알아보기 (2) - Transformer, 논문 요약

BERT는 Bidirectional Encoder Representation from Transformer의 약자입니다.

Transformer라는 모델을 이용해서 Bidirectional한 성질을 갖는 Encoder 형태의 모델을 만들었다는 뜻인데요,

일단 Transformer라는 모델에 대해서 먼저 알아보도록 합시다.

0) Transformer

참고링크 : https://nlp.seas.harvard.edu/2018/04/03/attention.html

0-1. Transformer의 특징

Transformer의 논문 제목은 Attention is all you need.

말 그대로 Attention 구조 만으로도 좋은 성능의 모델을 구현할 수 있다는 뜻입니다.

기존에는 문맥을 파악하기 위해서 RNN, CNN기반의 모델을 많이 사용했었는데, 이러한 모델들 없이 오직 Attention 구조 만으로 좋은 성능을 이룩했다는 것이 특징입니다.

0-2. Attention

Attention 네트워크는 문장의 각 단어에 대한 연관성을 파악하는 방법 중 하나입니다.

Fully connected layer 기반으로 만들어졌기 때문에 Maximum Path length가 낮다는 장점이 있는데요. 이는 Transformer 모델의 큰 장점이 됩니다.

0-3. Self-Attention의 장점

In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.

Transformer 모델은 Input sentence의 두 위치에 따른 related signal을 계산하기 위한 연산량이 매우 낮은 편에 속합니다. 논문의 저자는 이를 Path length라고 불렀습니다.

그렇기 때문에 Transformer가 문장의 단어들간의 연관성을 잘 파악할 수 있다는 뜻인데요,

문장에서 어떤 단어가 어떠한 뜻으로 쓰였는지를 판단하기 위해서는 다른 단어들을 보고 파악해야하는데 Transformer는 그러한 문맥 파악에 특출난 성능을 가지고 있습니다.

0-4. 대략적인 모델 구조도

Attention

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn

간단하게 요약하면 Softmax(masked_fill(Q * K^T)) * V

Q,K,V 총 3개의 레이어를 합친 구조를 함수에 표현했다고 생각하시면 됩니다.

LayerNorm : x -> normalize() -> Ax+b
- input normalization -> layer를 결합한 구조
SublayerConnection : x -> x + dropout(LayerNorm(x))
- Residual network + dropout -> layerNorm을 결합한 구조
EncoderLayer : x -> SublayerConnection(x, self_attn) -> SublayerConnection(x, feed_forward)
1. 주어진 input을 attention network에 통과시킨다.
2. 그 다음에 feed_forward network에 통과시킨다.
3. 단, 각 네트워크를 통과시킬때마다 Residual connection과 input normalization을 취한다.
DecoderLayer : x -> SublayerConnection(x, self_attn) -> SublayerConnection(x, src_attn) -> SublayerConnection(x, feed_forawrd)
- EncoderLayer랑 비슷한데, src_attn 구조가 하나 더 추가됨.
Encoder, Decoder : 주어진 input x에 대해서 같은 EncoderLayer, DecoderLayer를 N번 반복한 이후에 LayerNorm 레이어를 통과시킨다.
EncoderDecoder : Encoder 레이어의 결과물에 Decoder 레이어를 적용한다.

1) Abstract

새로운 Language representation model
BERT의 장점 : Pretrain deep bidirectional representation
- 기존 모델 (비교군)
  1. ELMO
  2. OpenAI GPT
BERT를 통해서 여러 방면의 task를 수행할 수 있음. (SOTA)

2) Introduction

2-1. NLP tasks

우리가 해결해야 할 NLP Task는 두 부류로 나뉘어져 있음.

Sentence-level task
- Natural language inference
- paraphrasing
Token-level task
- Named entity recognition
- Q&A

2-1. 기존 Pre-training language representation들의 예시

이 NLP Task들을 해결하기 위한 pre-training language representation 방법으로는 이것들이 있습니다.

2-3. 기존 Pre-training language representation 의 분류

Feature-based : (ex, ELMO)
- 특정 Task를 위한 아키텍쳐를 만들기 위해 pre-training representation을 사용함
Fine-tuning : (ex, OpenAI GPT)
- 특정 Task를 위한 파라미터는 큰 비중을 차지하지 않으나, 특정 Task에 대한 적절한 답을 학습하기 위해서 기존의 pre-training representation을 모두 Fine-tuining함.

=> 하지만, 두 방법은 모두 같은 unidirectional pre-training을 수행함(목적함수 동일),

2-4. 기존 Pre-training language representation 의 문제점

: Unidirectional.

OpenAI GPT의 경우, Transformer 기반의 Left-to-right 형식의 구조를 사용하다보니

각 토큰은 이전 토큰들만을 통해서 예측되는 문제점이 있음.
ELMO는 Left-to-right 구조와 Right-to-left 구조를 둘 다 사용하지만, 독립적으로 훈련되기 때문에 Bidirectional이라고 보기는 어렵다.
Sentence-level task에 대해서는 이러한 점이 크게 문제가 될 수 있음.

2-5. 해결책 : BERT

Fine-tuning approach의 일종
Masked language model을 pre-training 시의 목적함수로 사용함
- Measuring readability, 즉 Masked language model은 문장의 가독성을 측정할 수 있는 measure에서 비롯하였음.
- 생략된 단어가 담긴 인풋을 통해서 원래 인풋을 예측하는 모델을 통해서 pre-training representation을 학습함.
- Bidirectional

3) Releated Work

3-1. Unsupervised Feature-based Approaches

Pre-trained word embeddings
- Semi-supervised learning
- A Scalable Hierarchical Distributed Language Model
  
  Left-to-right 형식의 목적함수를 이용해서 word embedding을 학습
- Skip-gram
Pre-trained Sentence embeddings, Paragraph embeddings
- Discourse-Based Objectives, Logeswaran and Lee, 2018
  
  바로 다음에 들어갈 적절한 문장을 판단하는 Object function을 이용함
- Skip-Thought Vectors
  
  이전 문장의 representation을 이용해서 다음 문장의 word를 left-to-right로 생성하는 모델
- Learning Distributed Representations of Sentences
  
  Denoising auto-encoder 기반의 목적함수를 사용함
ELMO
- Left-to-right representation과 Right-to-left representation을 병합하여 Contextual representation을 만드는 방식.
- 기존에 가지고 있던 task-specific architecture에 위에서 만든 Contextual representation을 병합함으로써, 문맥을 잘 반영하여 특정 task를 잘 해결할 수 있음.

3-2. Unsupervised Fine-tuning Approaches

Sentence/Document encoders

대량의 unlabeled data를 통해서 pre-trained contextual token representation을 만들고, downstream task에 대한 지도학습으로 fine-tuning을 하는 방식들.
- Semi-supervised Sequence Learning
- Universal Language model fine-tuning
- OpenAI GPT

4) BERT

Pre-train 단계 + Fine tuning 단계

4-2. Model architecture

: Multi-layer bidirectional Transformer encoder

4-3. Input/Output Representations

Input representations : [CLS] + Tokens + [SEP] + Tokens
- 이렇게 하면, 두 문장이 필요할 때 & 한 문장만 필요할 때를 모두 표현할 수 있음.
Token embedding : 다음 임베딩들의 합으로 이루어짐.
- WordPiece embedding
- Segment embedding : 첫번째 문장의 토큰인지 or 두번째 문장의 토큰인지
- Position embedding : 토큰의 위치에 따른 값을 부여

4-4. Pre-training BERT

두 가지 Unsupervised task를 통해서 BERT를 pre-training함.

Masked LM
- Input token들의 15%를 변환한다.
  - 80% : [MASK]로 바꿈
  - 10% : random token으로 바꿈
  - 10% : 바꾸지 않음
  (예측할 토큰을 전부 [MASK]로 바꾸지 않는 이유는 Fine-tuning시의 Input과 Pre-training시의 Input의 차이를 줄이기 위함이다.)
- Denoising autoencoder와 다르게, 복원되기 이전의 Input 그자체를 예측하는 것이 아니라, Masked된 토큰만 예측한다는 차이점이 있음.
Next Sentence Prediction
- (A, B) 문장 순서쌍을 Input으로 넣을 때, 50%의 문장은 실제로 A 다음에 B 문장이 들어가는 케이스이고, 나머지 50%는 그렇지 않은 케이스를 넣어서, 실제로 A 다음에 B가 들어가는지를 학습함.
- 이는 Question ansering과 Natural language inference의 성능을 높이는 효과가 있음.
Pre-training data : 효과적으로 긴 문장들을 얻기 위해서 Sentence-level corpus 대신에 위키피디아를 비롯한 Document-level corpus를 사용함.

4-5. Fine-tuning BERT

Input data : 각 케이스에 맞게 데이터를 배치
- Paraphrasing : Sentence pairs
- Text entailment : hypothesis - premise
- Question answering : Question-passage
- Text classification을 비롯한 하나의 문장만 필요한 경우 : 한 문장 - 아무것도 없는 문장
Output data
- Token level task (Sequence tagging & Q/A) : Token representation을 직접 output layer에 feed한 결과값을 사용함.
- Classification (Text entailment & Sentiment analysis) : [CLS] 토큰의 값을 output layer에 feed한 결과물을 사용함.