Construction and Evaluation of a House Price Appraisal System Using Machine Learning Algorithms: Based on the XGBoost, LightGBM, and CatBoost Algorithms

홍정의^*^,^†

An Application of XGBoost, LightGBM, CatBoost Algorithms on House Price Appraisal System

Jengei Hong^*^,^†

^*Lead author, Professor, School of Management & Economics, Handong Global University, Ph.D. in Economics, hwgh024@handong.edu

^*School of Management & Economics, Handong Global University, Assistant Professor, hwgh024@handong.edu

^†Corresponding author: hwgh024@handong.edu

© Copyright 2020 Housing Finance Research Institute, Korea Housing Finance Corporation. This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: Oct 17, 2020; Revised: Nov 30, 2020; Accepted: Dec 11, 2020

Published Online: Dec 30, 2020

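Summary

Because a house price appraisal model can appraise a large number of properties simultaneously at very low cost, it can be used in any field that must frequently carry out large-scale asset valuation, such as estimating mortgage collateral value, constructing house price indices, and estimating property taxes. Recently, a growing number of studies have applied rapidly advancing data collection and analysis techniques to improve the accuracy of house appraisal models. The purpose of this paper is to build house appraisal models using three algorithms known for their efficiency and predictive power (XGBoost, LightGBM, CatBoost) and to analyze their performance, characteristics, and practical use. Using 620,617 apartment sale transactions in Seoul from 2009 to 2019, this paper compares the predictive power of a hedonic model with that of machine-learning-based house appraisal models. The results are as follows. First, the predictive power of the machine learning models was considerably high not only in relative terms (compared with the hedonic price model) but also in absolute terms (the feasibility of practical use). For the hedonic model, the mean percentage error in predicting market prices was about 11.5%, whereas for XGBoost, LightGBM, and CatBoost it was only 3.7%, 3.8%, and 3.6%, respectively. Second, the CatBoost algorithm outperformed the other two algorithms in both average predictive power and the frequency of outliers. Third, we confirmed that building an ensemble model of the three algorithms through soft voting can raise predictive power beyond that of the individual algorithms.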

ABSTRACT

This paper compares the predictive power of a conventional hedonic pricing model with that of models based on three machine learning algorithms (XGBoost, LightGBM, CatBoost), using 620,617 apartment transaction records in Seoul from 2009 to 2019. The results are summarised as follows. First, the predictive power of the machine learning models is considerably high not only relative to the conventional model but also in absolute terms, at a level of accuracy relevant to practical use. The mean percentage errors of XGBoost, LightGBM, and CatBoost were only 3.7%, 3.8%, and 3.6%, respectively, while that of the hedonic model was around 11%. Second, we found that the CatBoost algorithm slightly outperforms the other two algorithms in terms of both overall predictive power and the frequency of outliers. Third, this paper shows that an ensemble model of the three algorithms can raise predictive power further.

Keywords: Machine Learning; XGBoost; LightGBM; CatBoost; Mass House Appraisal

Keywords: Machine Learning; XGBoost; LightGBM; CatBoost; Mass Appraisal
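
The two quantities the abstract relies on, the mean percentage error and the ensemble of the three algorithms, can be illustrated with a minimal sketch. It rests on two assumptions not confirmed by the text above: that the mean percentage error is the mean absolute percentage error, and that the ensemble averages the predicted prices of the individual models. The function names and all numbers are hypothetical and are not taken from the paper's data.

```python
from statistics import fmean


def mean_abs_pct_error(actual, predicted):
    """Mean absolute percentage error, in percent (assumed metric)."""
    return 100.0 * fmean(abs(p - a) / a for a, p in zip(actual, predicted))


def average_ensemble(*model_predictions, weights=None):
    """Weighted average of the models' predictions for each observation.

    With weights=None every model counts equally. This averaging is an
    assumed reading of how the three models are combined.
    """
    if weights is None:
        weights = [1.0] * len(model_predictions)
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, preds)) / total
        for preds in zip(*model_predictions)
    ]


# Hypothetical sale prices and per-model predictions (illustrative only).
actual = [500.0, 800.0, 1200.0]
pred_a = [520.0, 780.0, 1150.0]
pred_b = [490.0, 830.0, 1230.0]
pred_c = [505.0, 790.0, 1220.0]

ensemble = average_ensemble(pred_a, pred_b, pred_c)

for name, preds in [("model A", pred_a), ("model B", pred_b),
                    ("model C", pred_c), ("ensemble", ensemble)]:
    print(f"{name}: {mean_abs_pct_error(actual, preds):.2f}%")
```

In this toy data the individual errors partly offset one another, so the averaged prediction has a lower error than any single model. That is not guaranteed in general; the gain depends on how correlated the models' errors are.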