20 Jun 2019

Studyjam 1) Lab 3 - tf.estimator

Lab3부터 본격적으로 텐서플로우 코딩에 들어가는데,

텐서플로우가 high level api를 제공하면서 기존에 봤었던 코드와는 많이 다른 형태를 가지고 있었습니다.

그래서 어떤 식으로 high level api를 다루는지 알아보기 위해서 포스팅을 했습니다.

1) Data & Model

1-1. Data

산모 - 아기 데이터를 통해서, 새로운 아기의 weight를 예측하는 작업을 수행할것입니다.

features
- weight_pounds
- is_male
- mother_age
- plurality
- gestation_weeks
- key

1-2. Model

Wide-and-deep model

DNN : dense, highly-correlated inputs => generalization diversity
Linear : sparse inputs => Memorization relevance

2) Implement by using Tensorflow

2-1. tf.estimator API

High Level API for developing tensorflow model

distributed multi-server environment
CPU, GPU, TPU같은 그래픽환경에 따라서 코드를 바꿀 필요가 없음.
구현이 간단함 (high level API)
기계학습을 안전하게 하기 위한 장치들을 자동으로 마련해줌
- tensor graph
- value initialize
- load data
- exceptions
- checkpoint & recover
- tensorboard

2-2. tf.estimator의 일반적인 절차

Make dataset importing function

이 함수는 두개의 타입 중 하나를 반드시 반환해야함.
- 딕셔너리
  - key : feature name
  - value : tensors
- Tensor
Define the feature columns
- tf.feature_column 을 이용해서 컬럼을 정의함
  
  => normalizer_fn 같은 매개변수로 간단한 전처리도 가능함.
Instantiate the relevant pre-made Estimator
- tf.estimator.모델명의 형태로 된 녀석을 선언함.
  - 이 때, feature column이 사용됨.
Call a training, evaluation or inference method
- 3번에서 정의한 Estimator 객체의 메서드 train 같은것을 사용해도 되고,
  
  아니면 tf.estimator.train_and_evaluate 같은 함수를 사용해도됨.

2-3. Tensorflow 코드 가이드라인 (1) - DNN

라이브러리

import shutil # 파일, 디렉터리 작업을 수행하는데 필요한 모듈
import numpy as np
import tensorflow as tf

shutil 참고링크 : shutil은 파일을 갱신하는 데 쓰입니다.

Define columns

CSV columns : csv파일의 모든 컬럼들
label column : 레이블
key column

default value

DEFAULTS = [[0,0], ['null'], ... , ['nokey']]
# 'null' : '정보가 없음' 을 categorical data로 나타내기 위해서 사용

Read dataset : tf.decode_csv
1. 주어진 csv파일에서 features, label을 딕셔너리형태로 파싱하는 함수를 만듬.
2. tf.gfile.Glob를 이용해서 특정 파일 패턴을 가진 파일들을 가져옴.
3. 파일들을 tf.data.TextLineDataset 함수를 이용해서 읽고, 1번의 함수를 이용해서 파싱함.
```
dataset = tf.data.tf.data.TextLineDataset(file_list).map(decode_csv)
```
4. Train & Test 모드에 따라서 epoch를 구분함
  - train
    - epoch = none -> epoch보다는 미리 정해둔 스텝 수 만큼 학습하기 위해서
    - shuffle
  - test : epoch = 1 -> 한번만 테스트하기 위해서.
5. dataset을 batch마다 구분함
```
dataset = dataset.repeat(num_epochs).batch(batch_size)
```
6. 구분한 dataset을 iterate함.
```
dataset.make_one_shot_iterator().get_next()
```
Define feature columns
- categorical
```
tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(name, values))
```
  1. tf.feature_column.indicator_column : 주어진 categorical_column에 대해서 multi-hot 인코딩을 만들어줌.
    - multi-hot encoding
      
      ex) columns = [‘bob’, ‘george’, ‘wanda’]
      
      data = [‘bob’,’wanda’,’bob’]
      
      => tensor = [[ 2, 0, 1 ]]
  2. tf.feature_column.categorical_column_with_vocabulary_list
    
    주어진 string을 vocabulary_list에 대해서 in-memory mapping하여 카테고리값으로 만들어줌.
- numeric
```
tf.feature_column.numeric_column('컬럼 이름')
```
serving input : 모델에 투입될 feature의 형태를 정해줌.
- ServingInputReceiver
```
tf.estimator.export.ServingInputReceiver(features, feature_placeholders)
```
  feature_placeholder를 2차원 텐서형태를 가진 features로 변환하기 위해서 expand_dims 메서드를 사용하였음
- 이 부분은 자세히 알아봐야할것같습니다.
Train and evaluate :
1. tf.estimator.train_and_evaluate : 추정된 estimator에 대해서 train & evaluate을 수행
  - estimator : pre-made estimator
  - train_spec : train에 관련된 스펙
    - input_fn : 배치간격으로 인풋을 제공할 수 있는 함수
      
      => tf.data.Dataset 객체거나, (feature, labels) 튜플의 형태여야함.
    - max_steps
  - eval_spec : eval에 관련된 스펙
    - 추가적으로 몇가지가 더 필요함
      - exporters
      - start_delay_secs
      - throttle_secs : 측정 간격 (항상 측정하는건 비효율적이니까)
```
# ex) 
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=1000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
```
2. exporter - tf.estimator.LatestExporter
  
  Tensor graph와 checkpoint를 규칙적으로 갱신하는 함수.
  
  (garbage collect를 이용하여 갱신한다고합니다.)
```
exporter = tf.estimator.LatestExporter('exporter', serving_input_fn)
```
Tensorboard
```
from google.datalab.ml import TensorBoard
TensorBoard().start('모델경로')
```
pre-made estimator에서 지정한 경로에는 모델정보가 저장되는데,

이 모델정보를 텐서보드에서 출력할 수 있습니다.

2-4. Tensorflow 코드 가이드라인 (2) - Wide & Deep model

Wide & Deep model은 두개의 feature columns가 필요합니다

feature column을 만드는 함수는 두개의 값을 반환하도록 만들어야합니다.

def get_wide_deep():
  # Define column types
  is_male,mother_age,plurality,gestation_weeks = \
      [\
         tf.feature_column.categorical_column_with_vocabulary_list('is_male', 
                      ['True', 'False', 'Unknown']),
          tf.feature_column.numeric_column('mother_age'),
          tf.feature_column.categorical_column_with_vocabulary_list('plurality',
                      ['Single(1)', 'Twins(2)', 'Triplets(3)',
                       'Quadruplets(4)', 'Quintuplets(5)','Multiple(2+)']),
          tf.feature_column.numeric_column('gestation_weeks')
      ]
   
  # Discretize (나이별로 그룹을 나눔.)
  age_buckets = tf.feature_column.bucketized_column(mother_age, 
                      boundaries=np.arange(15,45,1).tolist())
  gestation_buckets = tf.feature_column.bucketized_column(gestation_weeks, 
                      boundaries=np.arange(17,47,1).tolist())
   
  # Sparse columns are wide, have a linear relationship with the output
  wide = [is_male,
          plurality,
          age_buckets,
          gestation_buckets]
   
  # Feature cross all the wide columns and embed into a lower dimension
  crossed = tf.feature_column.crossed_column(wide, hash_bucket_size=20000)
  embed = tf.feature_column.embedding_column(crossed, 3)
   
  # Continuous columns are deep, have a complex relationship with the output
  deep = [mother_age,
          gestation_weeks,
          embed]
  return wide, deep

tf.feature_column.crossed_column : 여러 feature의 joint effect를 다루기 위해서 합친 column을 제공함.

Crossed features will be hashed according to hash_bucket_size. Conceptually, the transformation can be thought of as: Hash(cartesian product of features) % hash_bucket_size

음.. hash함수를 이용해서 가능한 경우의 수가 너무 많을때를 조절한다고도 합니다.
tf.feature_column.embedding_column

Use this when your inputs are sparse, but you want to convert them to a dense representation (e.g., to feed to a DNN).

sparse column을 DNN에 쓰고 싶을 때 dense representation에 적합하게 바꿔줍니다.

(차원을 낮춰서 continuous value로 임베딩을 해주는것같습니다.)

tf.estimator.DNNLinearCombinedRegressor

wide, deep = get_wide_deep()
EVAL_INTERVAL = 300
run_config = tf.estimator.RunConfig(save_checkpoints_secs = EVAL_INTERVAL,
                                    keep_checkpoint_max = 3)
estimator = tf.estimator.DNNLinearCombinedRegressor(
    model_dir = output_dir,
    linear_feature_columns = wide,
    dnn_feature_columns = deep,
    dnn_hidden_units = [64, 32],
    config = run_config)