
07-2. Implementing Data Preprocessing (Normalization) in TensorFlow

by bsion 2018. 8. 22.

Source: 모두를 위한 머신러닝 (Machine Learning for Everyone, http://hunkim.github.io/ml/)

This post explains, through theory and a worked example, the data preprocessing step used before training TensorFlow models.


Data Preprocessing (Normalization)

For example, when the columns of a two-dimensional dataset have similar ranges, the cost surface has evenly rounded contours, and gradient descent can move straight toward the minimum point.

But when the columns differ greatly in scale, the contours become stretched out along one axis. Since the deviation along one axis is much smaller, the updates keep moving in that direction and can suddenly bounce out of the valley (diverge).

We therefore need a step that brings columns with very different ranges onto a similar scale. Typical operations are zero-centering and normalization, sketched in the example below.
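A minimal NumPy sketch of those two transforms (the array X here is a made-up example, not the data used later in this post):

import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Zero-centering: shift each column so its mean becomes 0.
X_centered = X - np.mean(X, axis=0)

# Normalization (here: standardization): divide each column by its
# standard deviation so all columns have comparable spread.
X_normalized = X_centered / np.std(X, axis=0)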



Example

In the example data below, the values in the third column are noticeably larger than the others. We first train on this data as-is, and then look at how to preprocess it using normalization.

In [3]:
import tensorflow as tf
import numpy as np
tf.set_random_seed(777)  # for reproducibility


xy = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.97998],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]])

x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 4])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([4, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis
hypothesis = tf.matmul(X, W) + b

# Simplified cost/loss function
cost = tf.reduce_mean(tf.square(hypothesis - Y))

# Minimize
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)
train = optimizer.minimize(cost)

# Launch the graph in a session.
sess = tf.Session()
# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer())

for step in range(101):
    cost_val, hy_val, _ = sess.run(
        [cost, hypothesis, train], feed_dict={X: x_data, Y: y_data})
    if step < 10 or step > 95:
        print(step, "Cost: ", cost_val, "\nPrediction:\n", hy_val)
0 Cost:  144463460000.0 
Prediction:
 [[268738.25]
 [540794.25]
 [425464.97]
 [298306.5 ]
 [351536.22]
 [354491.88]
 [324919.34]
 [413630.5 ]]
1 Cost:  1.5871918e+26 
Prediction:
 [[-8.8868127e+12]
 [-1.7890063e+13]
 [-1.4073468e+13]
 [-9.8654266e+12]
 [-1.1626933e+13]
 [-1.1724794e+13]
 [-1.0746178e+13]
 [-1.3682021e+13]]
2 Cost:  inf 
Prediction:
 [[2.9456536e+20]
 [5.9299030e+20]
 [4.6648408e+20]
 [3.2700285e+20]
 [3.8539033e+20]
 [3.8863408e+20]
 [3.5619659e+20]
 [4.5350907e+20]]
3 Cost:  inf 
Prediction:
 [[-9.7637672e+27]
 [-1.9655464e+28]
 [-1.5462245e+28]
 [-1.0838951e+28]
 [-1.2774283e+28]
 [-1.2881802e+28]
 [-1.1806617e+28]
 [-1.5032172e+28]]
4 Cost:  inf 
Prediction:
 [[3.2363322e+35]
 [6.5150685e+35]
 [5.1251692e+35]
 [3.5927163e+35]
 [4.2342085e+35]
 [4.2698469e+35]
 [3.9134624e+35]
 [4.9826152e+35]]
5 Cost:  inf 
Prediction:
 [[-inf]
 [-inf]
 [-inf]
 [-inf]
 [-inf]
 [-inf]
 [-inf]
 [-inf]]
6 Cost:  nan 
Prediction:
 [[nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]
 [nan]]
(steps 7 through 100 print the same output: Cost nan and every prediction nan)

As a result, the cost blows up within a few steps and training ends in nan. The third column is on the order of 10^6 while the other columns sit near 800, so the weight for that column receives a huge gradient and gradient descent overshoots and diverges.
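You can confirm the scale disparity directly by printing the per-column ranges of the xy array from the cell above (a quick check, not part of the original lab code):

print(np.min(xy, axis=0))  # third column: 908100;  every other column is near 800
print(np.max(xy, axis=0))  # third column: 1828100; every other column is near 830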

Data Normalization Using Min-Max Scaling

We normalize the data with min-max scaling, which maps each value x to (x - min) / (max - min) column by column, so every column lands in the [0, 1] range. After scaling, the columns no longer differ by orders of magnitude.

In [4]:
import tensorflow as tf
import numpy as np
tf.set_random_seed(777)  # for reproducibility


def MinMaxScaler(data):
    # Scale each column of 'data' to the [0, 1] range.
    numerator = data - np.min(data, 0)                # subtract the column-wise minimum
    denominator = np.max(data, 0) - np.min(data, 0)   # column-wise range (max - min)
    # noise term prevents the zero division
    return numerator / (denominator + 1e-7)


xy = np.array([[828.659973, 833.450012, 908100, 828.349976, 831.659973],
               [823.02002, 828.070007, 1828100, 821.655029, 828.070007],
               [819.929993, 824.400024, 1438100, 818.97998, 824.159973],
               [816, 820.958984, 1008100, 815.48999, 819.23999],
               [819.359985, 823, 1188100, 818.469971, 818.97998],
               [819, 823, 1198100, 816, 820.450012],
               [811.700012, 815.25, 1098100, 809.780029, 813.669983],
               [809.51001, 816.659973, 1398100, 804.539978, 809.559998]])

xy = MinMaxScaler(xy)
print(xy)
[[0.99999999 0.99999999 0.         1.         1.        ]
 [0.70548491 0.70439552 1.         0.71881782 0.83755791]
 [0.54412549 0.50274824 0.57608696 0.606468   0.6606331 ]
 [0.33890353 0.31368023 0.10869565 0.45989134 0.43800918]
 [0.51436    0.42582389 0.30434783 0.58504805 0.42624401]
 [0.49556179 0.42582389 0.31521739 0.48131134 0.49276137]
 [0.11436064 0.         0.20652174 0.22007776 0.18597238]
 [0.         0.07747099 0.5326087  0.         0.        ]]
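For reference, scikit-learn ships an equivalent transformer, so the helper above does not have to be hand-rolled. A sketch, assuming scikit-learn is installed (it is not used in the rest of this post; the alias avoids shadowing the MinMaxScaler function defined above, and xy_raw stands for the original unscaled array):

from sklearn.preprocessing import MinMaxScaler as SkMinMaxScaler

scaler = SkMinMaxScaler()                 # scales each column to [0, 1]
xy_scaled = scaler.fit_transform(xy_raw)  # xy_raw: the array before the helper above was applied
# scaler.inverse_transform(xy_scaled) later recovers the original units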
In [6]:
x_data = xy[:, 0:-1]
y_data = xy[:, [-1]]

# placeholders for a tensor that will be always fed.
X = tf.placeholder(tf.float32, shape=[None, 4])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([4, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]), name='bias')

# Hypothesis
hypothesis = tf.matmul(X, W) + b

# Simplified cost/loss function
cost = tf.reduce_mean(tf.square(hypothesis - Y))

# Minimize
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-5)
train = optimizer.minimize(cost)

# Launch the graph in a session.
sess = tf.Session()
# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer())

for step in range(101):
    cost_val, hy_val, _ = sess.run(
        [cost, hypothesis, train], feed_dict={X: x_data, Y: y_data})
    if step < 10 or step > 95:
        print(step, "Cost: ", cost_val, "\nPrediction:\n", hy_val)
0 Cost:  0.5647025 
Prediction:
 [[-0.06572413]
 [-0.06795961]
 [-0.13650313]
 [-0.2401303 ]
 [-0.16231954]
 [-0.2708047 ]
 [-0.3281756 ]
 [-0.5223048 ]]
1 Cost:  0.56465966 
Prediction:
 [[-0.06568605]
 [-0.06792277]
 [-0.13647231]
 [-0.24010634]
 [-0.16229123]
 [-0.27077731]
 [-0.32815713]
 [-0.5222866 ]]
2 Cost:  0.5646168 
Prediction:
 [[-0.06564802]
 [-0.06788588]
 [-0.13644147]
 [-0.24008232]
 [-0.16226292]
 [-0.27075   ]
 [-0.32813865]
 [-0.5222684 ]]
3 Cost:  0.564574 
Prediction:
 [[-0.06560999]
 [-0.06784901]
 [-0.13641068]
 [-0.24005836]
 [-0.16223463]
 [-0.27072266]
 [-0.32812017]
 [-0.5222503 ]]
4 Cost:  0.5645312 
Prediction:
 [[-0.0655719 ]
 [-0.0678122 ]
 [-0.13637996]
 [-0.24003434]
 [-0.16220632]
 [-0.2706952 ]
 [-0.32810166]
 [-0.5222321 ]]
5 Cost:  0.5644884 
Prediction:
 [[-0.06553382]
 [-0.06777531]
 [-0.13634914]
 [-0.24001038]
 [-0.16217801]
 [-0.2706679 ]
 [-0.32808316]
 [-0.522214  ]]
6 Cost:  0.56444556 
Prediction:
 [[-0.06549579]
 [-0.06773847]
 [-0.1363183 ]
 [-0.23998642]
 [-0.16214967]
 [-0.2706405 ]
 [-0.32806468]
 [-0.5221958 ]]
7 Cost:  0.5644028 
Prediction:
 [[-0.06545776]
 [-0.06770164]
 [-0.13628748]
 [-0.2399624 ]
 [-0.16212136]
 [-0.27061316]
 [-0.3280462 ]
 [-0.5221777 ]]
8 Cost:  0.5643599 
Prediction:
 [[-0.06541967]
 [-0.06766474]
 [-0.13625664]
 [-0.23993844]
 [-0.16209304]
 [-0.27058583]
 [-0.32802773]
 [-0.5221595 ]]
9 Cost:  0.5643171 
Prediction:
 [[-0.06538159]
 [-0.06762791]
 [-0.13622582]
 [-0.23991442]
 [-0.16206473]
 [-0.27055842]
 [-0.32800925]
 [-0.52214134]]
96 Cost:  0.5606077 
Prediction:
 [[-0.062078  ]
 [-0.06442851]
 [-0.13355166]
 [-0.237833  ]
 [-0.15960744]
 [-0.26818383]
 [-0.3264055 ]
 [-0.52056646]]
97 Cost:  0.5605652 
Prediction:
 [[-0.06204009]
 [-0.06439185]
 [-0.13352093]
 [-0.23780912]
 [-0.15957919]
 [-0.26815653]
 [-0.32638708]
 [-0.5205484 ]]
98 Cost:  0.5605226 
Prediction:
 [[-0.06200218]
 [-0.06435508]
 [-0.13349023]
 [-0.23778522]
 [-0.15955096]
 [-0.2681293 ]
 [-0.32636866]
 [-0.5205303 ]]
99 Cost:  0.5604801 
Prediction:
 [[-0.06196421]
 [-0.06431836]
 [-0.13345951]
 [-0.23776132]
 [-0.15952271]
 [-0.26810205]
 [-0.32635024]
 [-0.5205122 ]]
100 Cost:  0.56043756 
Prediction:
 [[-0.06192625]
 [-0.06428158]
 [-0.13342884]
 [-0.23773736]
 [-0.15949458]
 [-0.26807475]
 [-0.3263318 ]
 [-0.5204941 ]]
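This time the cost decreases steadily (from 0.5647 to 0.5604 over 100 steps) instead of diverging to nan. Since the inputs are now all in [0, 1], a learning rate much larger than 1e-5 would also be safe and would converge far faster. One caveat: the predictions live on the scaled [0, 1] axis, so to report them in the original units the scaling must be undone. A sketch of the inverse transform (xy_orig stands for a copy of the raw array saved before calling MinMaxScaler; this post does not keep one):

y_min = np.min(xy_orig[:, -1])   # min/max of the ORIGINAL label column
y_max = np.max(xy_orig[:, -1])
pred = hy_val * (y_max - y_min + 1e-7) + y_min  # exact inverse of MinMaxScaler, including its 1e-7 term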

