
Model Validation

Reference: https://www.kaggle.com/learn/intro-to-machine-learning

What is Model Validation?

  • As the name suggests, it is the process of checking whether the model you built performs adequately (in ML, this usually means predictive performance).

Metrics Used for Validation

  • MAE(Mean Absolute Error)
  • MSE(Mean Squared Error)
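For reference, the standard definitions, where $y_i$ is the actual value, $\hat{y}_i$ the model's prediction, and $n$ the number of samples:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$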

Implementing MAE

import pandas as pd

# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
# Filter rows with missing price values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)
# Output
DecisionTreeRegressor()

  • What is a Decision Tree Regressor?

[Figure: example Decision Tree]

  • It is a regression method that fits the data based on a decision tree like the one above: each internal node splits the data on a feature, and each leaf predicts a value.
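One way to peek at the tree the model actually learned is scikit-learn's export_text helper; a minimal sketch (max_depth here only limits how many levels get printed):

from sklearn.tree import export_text

# Print the first two levels of split rules from the fitted tree
print(export_text(melbourne_model, feature_names=melbourne_features, max_depth=2))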
from sklearn.metrics import mean_absolute_error

# Predict on the SAME data the model was trained on (an "in-sample" score)
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)
# Output
434.71594577146544
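To make the metric concrete, the same number can be reproduced by hand; a quick sketch of what mean_absolute_error computes:

import numpy as np

# MAE is just the mean absolute difference between actual and predicted prices
manual_mae = np.mean(np.abs(y - predicted_home_prices))
print(manual_mae)  # same value as mean_absolute_error(y, predicted_home_prices)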

Caution 🚨

  • When validating a model, do not validate it with the same data that was used to build it (the training data)
    → this gives "In-Sample" scores
  • Validation must use data the model has never seen, i.e. Validation Data (test data).

Train_Test_Split

from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
# Output
261425.9928986443
Unlike the earlier validation that reused the training data, validating with held-out Validation Data (test data) produced a much higher MAE (about 261,426 vs. 435).
This shows that a validation score computed on the same data the model was trained on is not trustworthy.
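Note that train_test_split holds out 25% of the rows for validation by default; the proportion can be set explicitly with the test_size argument. A minimal sketch:

# Re-split with an explicit 80/20 train/validation ratio
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.2, random_state=0)
print(len(train_X), len(val_X))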

Practice (from the Kaggle Course exercises)

1. Split the data

# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split features and target into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

2. Create and fit the model

# DecisionTreeRegressor was imported in the previous exercise; the import is
# repeated here so this step can run on its own
from sklearn.tree import DecisionTreeRegressor

# Specify the model
iowa_model = DecisionTreeRegressor(random_state = 1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)

3. Make predictions with the Validation Data (test data)

# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)

# print the top few validation predictions
print(val_predictions[:5])

# print the top few actual prices from validation data
print(val_y.head())
# Output

# predictions
[186500. 184000. 130000.  92000. 164500.]

# val_y (actual prices from the validation data)
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64

You can see that the predictions differ from the actual values.
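One way to eyeball that gap is to line the two up side by side; a small sketch using standard pandas:

# Put the first few actual and predicted prices next to each other
comparison = pd.DataFrame({'actual': val_y[:5].values,
                           'predicted': val_predictions[:5]})
print(comparison)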


4. Compute the MAE

from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)

# Print the validation MAE
print(val_mae)
# Output
29652.931506849316
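To see the "in-sample" effect from earlier on this dataset as well, the validation MAE can be compared with the MAE on the training data; a quick sketch (an unconstrained tree usually scores far better on the data it was fit to):

# "In-sample" MAE: evaluated on the same data the model was trained on
train_predictions = iowa_model.predict(train_X)
train_mae = mean_absolute_error(train_y, train_predictions)

print(train_mae)  # typically much lower than val_mae
print(val_mae)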

