예측 성능의 판정

이제 모델을 테스트 데이터에 적용하여 각 모델의 예측 성능을 비교할 수 있습니다. 의사 결정 트리 모델이 과다 적합(over-fitting)되어 만들어 진 것이 맞다면, 다른 두 모델과 비교하여 테스트 데이터에서도 예측 결과가 잘못 형성되는 것을 확인할 수 있어야 합니다. 랜덤 포리스트가 선형 모델에서는 누락되어 있는 데이터의 일부 고유 신호를 포착한다고 생각된다면, 테스트 데이터에서 선형 모델보다 성능이 우수한 모습을 확인할 수 있어야 합니다.

살펴볼 첫 번째 측정 기준은 제곱된 잔차의 평균(the average of the squared residuals)입니다. 이 값을 통하여 예측값이 관측값과 얼마나 가까운 지를 알 수 있습니다. 일반적으로 0% ~ 20%의 좁은 범위에 있는 팁 비율을 예측하기 때문에, 좋은 모델은 잔차가 평균 2 ~ 3 % 포인트를 넘지 않아야 할 것으로 예상해야 합니다.

rxPredict(trained.models$linmod, data = mht_split$test, outData = mht_split$test, predVarNames = "tip_percent_pred_linmod", overwrite = TRUE)
rxPredict(trained.models$dtree, data = mht_split$test, outData = mht_split$test, predVarNames = "tip_percent_pred_dtree", overwrite = TRUE)
rxPredict(trained.models$dforest, data = mht_split$test, outData = mht_split$test, predVarNames = "tip_percent_pred_dforest", overwrite = TRUE)

rxSummary(~ SE_linmod + SE_dtree + SE_dforest, data = mht_split$test,
          transforms = list(SE_linmod = (tip_percent - tip_percent_pred_linmod)^2,
                            SE_dtree = (tip_percent - tip_percent_pred_dtree)^2,
                            SE_dforest = (tip_percent - tip_percent_pred_dforest)^2))

Call:
rxSummary(formula = ~ SE_linmod + SE_dtree + SE_dforest, data = mht_split$test, 
    transforms = list(SE_linmod = (tip_percent - tip_percent_pred_linmod)^2, 
        SE_dtree = (tip_percent - tip_percent_pred_dtree)^2, 
        SE_dforest = (tip_percent - tip_percent_pred_dforest)^2))

Summary Statistics Results for: ~SSE_linmod + SSE_dtree + SSE_dforest
Data: mht_split$test (RxXdfData Data Source)
File name: C:\Data\NYC_taxi\output\split\train.split.train.xdf
Number of valid observations: 43118543 
 
 Name        Mean     StdDev   Min                 Max      ValidObs MissingObs
 SE_linmod  82.66458 108.9904 0.00000000005739206 9034.665 43118542 1         
 SE_dtree   82.40040 109.1038 0.00000251589457986 8940.693 43118542 1         
 SE_dforest 82.47107 108.0416 0.00000000001590368 8606.201 43118542 1

살펴볼 가치가 있는 또 다른 측정 기준은 상관 행렬(correlation matrix)입니다. 이는 서로 다른 모델의 예측이 어느 정도 서로 가깝고 어느 정도까지 실제 또는 관찰된 팁 비율에 근접하는지를 판단하는 데 도움이 될 수 있습니다.

rxc <- rxCor( ~ tip_percent + tip_percent_pred_linmod + tip_percent_pred_dtree + tip_percent_pred_dforest, data = mht_split$test)
print(rxc)

                          tip_percent pred_linmod pred_dtree pred_dforest
tip_percent               1.0000000   0.1391751   0.1500126  0.1499031
tip_percent_pred_linmod   0.1391751   1.0000000   0.8580617  0.9084119
tip_percent_pred_dtree    0.1500126   0.8580617   1.0000000  0.9404640
tip_percent_pred_dforest  0.1499031   0.9084119   0.9404640  1.0000000

Data Insight Study Blog

4. Clustering and Modeling > Predictive Modelling > Judging Predictive Performance

예측 성능의 판정

예측 성능의 판정

관계된 작성글