Predicting motor insurance claim incidence using generalized and tree-based models: A comparative statistical approach
DOI: http://dx.doi.org/10.21511/ins.16(2).2025.04
Article Info: Volume 16 2025, Issue #2, pp. 38-53
This work is licensed under a Creative Commons Attribution 4.0 International License.
Type of the article: Research Article
Abstract
Accurate prediction of motor insurance claim frequency is essential for efficient risk management, underwriting, and policy pricing. This study compares the predictive performance of Poisson Generalized Linear Models (GLMs), Generalized Additive Models (GAMs), and Decision Trees on 108,699 motor third-party liability (TPL) insurance contracts from the French Motor TPL dataset in the CASdatasets R package, which is widely used in actuarial research. The models' predictive accuracy, explainability, and flexibility are compared on training and test sets using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Poisson Deviance. Although the GLM offers an interpretable and accurate baseline, the GAM performs slightly better than both the GLM and the Decision Tree on every metric, with the lowest MSE (0.0506), RMSE (0.2251), and Poisson Deviance (36.41% training, 37.76% test), compared to the GLM (MSE: 0.0509, RMSE: 0.2257, Poisson Deviance: 36.83% training, 38.08% test) and the Decision Tree (MSE: 0.0582, RMSE: 0.2413, Poisson Deviance: 37.12% training, 38.31% test). Based on MSE, the GAM reduces prediction error by approximately 0.6% relative to the GLM and 13.1% relative to the Decision Tree. These findings indicate that, among the models compared, GAMs strike the best balance between explainability and predictive flexibility, making them well suited to insurers that want to refine risk segmentation without compromising regulatory compliance or business transparency. The study adds to research calling for interpretable, state-of-the-art statistical techniques in insurance analytics and offers practical observations for actuaries and data scientists seeking to improve motor insurance frequency models.
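The relative error reductions quoted in the abstract can be checked directly from the reported MSE values. The sketch below does that arithmetic and also gives plain-Python versions of the three metrics named above. The function names are illustrative and not taken from the paper, and the Poisson deviance shown is the standard unit-weighted mean deviance; the abstract does not state how its percentage figures are scaled, so that function is an assumption about the general form only.

```python
import math

# MSE values exactly as reported in the abstract.
reported_mse = {"GAM": 0.0506, "GLM": 0.0509, "Decision Tree": 0.0582}


def relative_reduction(baseline, improved):
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted claim counts."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error (square root of the MSE)."""
    return math.sqrt(mse(y_true, y_pred))


def mean_poisson_deviance(y_true, y_pred):
    """Unit-weighted mean Poisson deviance.

    Uses 2 * (y * log(y / mu) - (y - mu)) per observation, with the
    convention y * log(y / mu) = 0 when y = 0. Predictions must be > 0.
    Assumption: the paper's exact scaling is not given in the abstract.
    """
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total / len(y_true)


# Reproduce the relative MSE reductions quoted in the abstract.
gam_vs_glm = relative_reduction(reported_mse["GLM"], reported_mse["GAM"])
gam_vs_tree = relative_reduction(reported_mse["Decision Tree"], reported_mse["GAM"])
print(f"GAM vs GLM: {gam_vs_glm:.1%}")            # 0.6%
print(f"GAM vs Decision Tree: {gam_vs_tree:.1%}")  # 13.1%
```

Both figures agree with the abstract to one decimal place. The metric functions take plain sequences of observed counts and positive predicted means, so they can be applied to the output of any of the three model types.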
JEL Classification: C25, C53, G22, C14, C52
Figures
- Figure 1. Claim frequency by vehicle age and Bonus-Malus level
- Figure 2. Claim frequency by driver age and Bonus-Malus level
- Figure 3. Decision Tree for claim frequency
- Figure 4. GAM smooth functions for vehicle age and driver age
- Figure 5. GAM smooth functions for Bonus-Malus by driver age group
Tables
- Table 1. Summary of dataset variables and their measurement scales
- Table 2. Summary of descriptive statistics for key variables
- Table 3. Decision Tree splits for claim frequency, with node sample sizes, deviances, and mean claim frequencies
- Table 4. Poisson GLM regression results: estimated coefficients for claim frequency
- Table 5. Parametric coefficient estimates from the GAM for claim frequency
- Table 6. Approximate significance of smooth terms in the GAM
- Table 7. Model performance comparison for claim frequency prediction