“Drivers of potential policyholders’ uptake of insurance in Kenya using Random Forest”

The low adoption of insurance by potential policyholders in developing countries like Kenya is a cause for concern for insurers, regulators, and other marketing stakeholders. To design targeted marketing strategies that boost insurance adoption, it is crucial to determine the factors that affect insurance uptake among potential policyholders. In this study, data from the 2021 FinAccess Survey, which interviewed sampled individuals above 16 years in Kenya, were analyzed with machine learning techniques, including Random Forest, XGBoost, and Logistic Regression, to uncover the factors driving insurance uptake and the reasons for the low adoption of insurance among potential policyholders. Random Forest was the most robust of the three classifiers based on Kappa score, recall score, F1 score, precision, and area under the receiver operating characteristic curve (approaching 1). The paper explores eight reasons why people currently do not have insurance policies. The results indicated that affordability was the primary driver of uptake, with 68.67% of respondents expressing a desire to have insurance but being unable to afford it. The highest level of education was the next most significant factor. Cultural and religious beliefs and mistrust of insurance providers were found to have a minimal impact on uptake. These findings imply that offering affordable insurance products and conducting awareness campaigns are critical to increasing insurance adoption.


INTRODUCTION
Insurance is essential for risk management and providing financial protection (Rumson & Hallett, 2019). However, low insurance penetration and fraudulent claims, particularly in developing countries and other resource-constrained environments such as Kenya, remain a concern (Salmi & Atif, 2022; Tessema et al., 2021). Despite a rise in the number of insurance providers and agencies, the percentage of the population that has insurance coverage has remained stagnant at 3.01%. It is important to find a model that could robustly predict the drivers of insurance uptake among potential policyholders using machine learning techniques. The lack of insurance coverage hinders the growth and development of the country, both economically and socially (Mutembei, 2022; Mwongela, 2022). Therefore, identifying the reasons for low uptake, and the resulting low penetration, is imperative for insurers, regulators, and other marketing stakeholders seeking to design effective strategies to promote insurance uptake.

LITERATURE REVIEW
The review first discusses the factors that affect insurance uptake, then examines how Random Forest is used in insurance, and ends by stating the purpose of this study.

Drivers of insurance
The factors that drive potential policyholders to purchase insurance are varied and can be analyzed from both the demand and supply perspectives. A demand-side analysis examines factors from the policyholder's viewpoint, such as economic, social, cultural, and regulatory factors. One postulation is that the age demographic of a population affects the insurance industry. A supply-side analysis, on the other hand, considers the insurer's perspective and identifies factors such as affordability of premiums, price regulations, claims processing procedures, distribution channels, and product regulations as having a significant impact on insurance adoption (Dragotă et al., 2022; Mutembei, 2022; Mwongela, 2022; Sibiko & Qaim, 2020). Other supply-side factors, such as earnings and profitability, reinsurance and actuarial issues, capital adequacy, liquidity, asset quality, and management soundness, affect the financial soundness of insurers and, indirectly, the supply of products (Salameh, 2022). Cultural beliefs and superstitions have been shown to affect the uptake of life insurance in certain environments (Liu et al., 2021).
The reasons for low insurance uptake among potential policyholders are complex and multi-faceted, spanning economic, social, cultural, and regulatory dimensions. Understanding the drivers of potential policyholders' likelihood of taking up insurance is essential to promoting uptake. Such an analysis could be conducted from either the demand side or the supply side. In a supply-side analysis, factors such as price regulations, the process of settling insurance claims, the distribution of insurance products, and regulations concerning insurance products have all been identified as highly influential in determining the chance of insurance being adopted (Ankrah et al., 2021; Mwongela, 2022).
However, a demand-side analysis would provide additional information necessary for data-based decision-making, since it helps to identify the insurer's brand-owned touchpoints. Identifying these touchpoints is essential, since they are the only contact points with potential and existing policyholders that insurers can directly influence. Analyzing the monetary impact of these touchpoints is also vital to insurers (Kumar et al., 2023; Zimmermann & Auinger, 2022).
Tessema et al. (2021) emphasize that understanding risks, potential perils, and the role of insurance in mitigating them is important in the adoption process. They also stress the importance of understanding how the product is received and adopted by potential customers. However, when they implemented a video intervention, the outcome varied with the gender of the household head: the intervention increased the use of index insurance by 2-3% in households headed by men but decreased uptake by 6% in households headed by women.
A phenomenon known as "charity hazard" occurs when households in risk-prone locations decide not to buy insurance because they anticipate receiving assistance in the event of a disaster. Research combining household-level survey information with instrumental variables to investigate flood insurance uptake found that coastal families who expected to be eligible for disaster help were 25-42% less likely to have flood insurance. This indicates that the expectation of receiving aid may be an important contributor to underinsurance in these communities (Landry et al., 2021).
Haamukwanza (2021) used thematic analysis (TA) to gain a greater understanding of how government and insurance policies, along with management, might increase insurance consumption in Zambia. Financial literacy, excellent service, and regulation of the insurance sector were found to be the three key issues essential for encouraging insurance consumption, and the study showed that resolving these issues would increase the uptake of insurance. The study also suggested that simplifying insurance messages for better understanding, providing incentives for insurers to operate in rural areas, and subsidizing certain insurance products would lead to increased insurance consumption. Other studies have shown that, in the insurance sector, customer retention is influenced by reputation, performance, and affect. Additionally, it has been postulated that customer inertia plays a crucial role in moderating the negative effect on health insurance policy customer retention (Iacobucci et al., 2019).
Insurance, being a service industry, requires a strong emphasis on delivering high-quality services, increased recognition of the potential benefits for both the company and the customer, and the integration of advanced technology. Technology, in particular, has had a significant impact on shaping marketing and promotional strategies in the insurance industry (Yadav & Pavlou, 2020). Marketing itself is being disrupted by the abundance of data and the increasing use of marketing analytics. Fortunately, the disruption could be for the better if the correct tools and strategies are employed. Insurers, as well as other businesses, should therefore increase their ability to use marketing analytics and metrics as an effective means to gain market insights and to monitor and enhance performance. The COVID-19 pandemic is said to have stimulated the digitalization of the insurance industry in Ukraine and other countries (Polinkevych et al., 2022). To remain competitive, insurers need to utilize internet marketing and the various tools that have come with the big data revolution (Iacobucci et al., 2019; Prymostka, 2018).
The literature suggests that the drivers of insurance uptake can be analyzed from both the demand-side and supply-side perspectives. Understanding the risks and potential perils, as well as the role of insurance in mitigating them, is crucial in the adoption process. However, the effectiveness of interventions, such as video interventions, may vary depending on factors such as the gender of the household head. The "charity hazard" phenomenon also draws attention to the possible detrimental effect of disaster relief expectations on insurance uptake in hazard-prone populations. Finally, research indicates that addressing financial literacy, service quality, and insurance industry regulation may have a favorable effect on the uptake of insurance, as may providing incentives for insurers to operate in rural areas and subsidizing certain insurance products.

Use of Random Forest in insurance
The use of machine learning in this study is based on current and past research that has shown it to be the most robust method for analysis and prediction on similar data (Blier-Wong et al., 2021). Random Forest belongs to the family of ensemble tree-based learners, which have shown better performance than standalone machine learning models such as SVM and other classification methods, even when working with imbalanced data. Guelman et al. (2012) used the Random Forest technique to develop models that forecast the effectiveness of marketing plans intended to reduce client churn in the insurance industry. The proposed model also covers the type of policy portfolio databases that can be used for a similar function. The study recommended locating target customers who are likely to respond favorably to focused marketing and retention efforts rather than concentrating on those with a high risk of departing.
Shehadeh et al. (2016) used Random Forest to examine data from 130,000 life insurance applications and found that stratified sampling was crucial for utilizing the algorithm effectively. Random Forest suited the data because they were highly imbalanced, with claims making up less than 5% of the total dataset.
According to Guo et al. (2019), Random Forest is the most efficient model for building a recommender algorithm that suggests insurance products to potential policyholders, compared with ID3, C4.5, Naive-Bayes, and K-Nearest Neighbors (KNN). The study evaluated each algorithm using prediction error, which measures the difference between the predicted and actual values, and found that the prediction error of Random Forest was lower than that of ID3, C4.5, Naive-Bayes, and KNN. This suggests that Random Forest predicts the insurance products a customer may be interested in more accurately than the other algorithms evaluated, and it supports the use of Random Forest as a suitable algorithm for creating a recommender system for insurance products.
Hanafy and Ming (2021) examined the adoption of machine learning in the field of automotive insurance and also explored its potential applications for handling large amounts of data. To forecast the occurrence of claims, the study used a variety of machine learning techniques, including logistic regression, Extreme Gradient Boosting (XGBoost), Random Forest, decision trees, Naive Bayes, and K-Nearest Neighbors (KNN). These models' performances were assessed and contrasted, and the findings revealed that Random Forest performed better than the alternative techniques in terms of accuracy, kappa, and AUC values.
It has been postulated that for insurers to be successful and competitive in a market that is currently undergoing a big data revolution, they must utilize the increasingly large amounts of data in their decision-making. One such application is targeted marketing to customize insurance policies (Porrini, 2017). This paper builds on this concept by using recent sociodemographic data to find the optimal model for analyzing the drivers of policyholders' uptake.
Salmi and Atif (2022) proposed a data mining methodology to identify false claims by addressing class imbalance and experimenting with two alternative feature subsets. It used two sampling techniques: SMOTE and ROSE. The findings demonstrated that the models that were created utilizing the second feature selection performed marginally better, with a higher percentage of false claims accurately detected. Random Forest outperformed logistic regression, according to the study.
In conclusion, studies discussed in the literature have shown that Random Forest could be an effective algorithm for a variety of tasks related to the insurance industry, such as predicting customer churn, identifying target customers for targeted marketing and retention efforts, creating a recommender system for insurance products, and detecting fraudulent claims. The studies have also shown that in terms of accuracy and prediction error, Random Forest surpasses other algorithms such as logistic regression, ID3, C4.5, Naive-Bayes, and K-Nearest Neighbors. Additionally, the studies have highlighted the importance of addressing class imbalance and utilizing effective sampling methods, such as SMOTE and ROSE, when working with imbalanced datasets. Overall, the literature supports the use of Random Forest as a suitable algorithm for various tasks in the insurance industry.
This study aims to uncover the drivers of the insurance uptake and reasons behind the low uptake of insurance among potential policyholders in Kenya from a demand-side perspective using the optimal classifier.

METHODS
The study determines the factors driving insurance uptake using data from the 2021 FinAccess Survey. The study utilized the most recent FinAccess data, which is part of a series of national surveys conducted to evaluate the access, usage, and impact of financial inclusion. The initial step involved using frequency analysis to identify the reasons why people currently do not have insurance policies. Subsequently, a machine learning model was trained and tested to extract the most important variables affecting insurance uptake. The feature importance was then extracted from the model to determine the variables that have the greatest influence on insurance uptake.
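The frequency analysis step can be illustrated with a minimal sketch. The responses and reason labels below are hypothetical and are not FinAccess data; the helper `reason_shares` is likewise an illustrative name, not part of the study's code.

```python
from collections import Counter


def reason_shares(responses):
    """Return each reason's share of all responses as a percentage."""
    counts = Counter(responses)
    total = sum(counts.values())
    # most_common() orders reasons from most to least frequently cited.
    return {reason: round(100 * n / total, 2) for reason, n in counts.most_common()}


# Hypothetical answers to "why do you not have insurance?" (illustrative only).
responses = (
    ["cannot afford"] * 7
    + ["does not know where to get it"] * 2
    + ["does not need it"] * 1
)

shares = reason_shares(responses)
for reason, pct in shares.items():
    print(f"{reason}: {pct}%")
```

A table of this kind gives the share of respondents citing each reason, which is the form in which the reasons for non-uptake are reported.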
The 2021 FinAccess Survey is the sixth in a series that began in 2006. The survey includes measures of consumer protection to assess not only access to and use of finance but also the impact of financial inclusion on people's financial wellbeing. The study used a cross-sectional design at the household level and targeted individuals above 16 years of age.
In the current study, Random Forest was used to classify potential policyholders based on their likelihood of taking up insurance. Various socio-demographic variables, such as age, income, and education level, were employed to train the model, which was subsequently used to predict which potential policyholders were more likely to take up insurance. The algorithm's capacity to handle a large number of features and missing values, together with its robustness to noise, makes it well suited to this research, since high-dimensional data and missing values are prevalent in the FinAccess datasets (Hou et al., 2020; Ren et al., 2023).
The first step was to perform a frequency analysis to determine the reasons why individuals did not have insurance policies. Frequency tables were created from the answers to survey questions on why respondents did not have insurance at the time of the survey. The models were then evaluated with k-fold cross-validation, an approach that has been applied successfully in related work (Quan et al., 2023). Training was first performed on the imbalanced data, after which SMOTE and random oversampling were applied to balance the classes. During k-fold cross-validation, the validation set was used to assess the models during training, and the held-out test set was used for the final evaluation of the models' performance. The reported metrics refer to the results from the test set.
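The two balancing strategies named above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the study's implementation: the function names and toy rows are hypothetical, and the SMOTE-style helper interpolates between two given minority rows, whereas full SMOTE first selects the second row from the k nearest minority neighbours of the first.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    n_extra = majority_count - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    return rows + extra, labels + [minority] * len(extra)


def smote_point(a, b, rng):
    """SMOTE-style synthetic sample: a random point on the segment between a and b."""
    gap = rng.random()  # position along the segment, in [0, 1)
    return [x + gap * (y - x) for x, y in zip(a, b)]


# Toy data (hypothetical): 8 majority rows (label 0) and 2 minority rows (label 1).
rows = [[float(i), float(i % 3)] for i in range(10)]
labels = [0] * 8 + [1] * 2

rows_bal, labels_bal = random_oversample(rows, labels)
synthetic = smote_point(rows[8], rows[9], random.Random(1))
print(labels_bal.count(0), labels_bal.count(1))
print(synthetic)
```

Random oversampling only repeats existing minority rows, while SMOTE creates new rows between existing ones, which is one reason the two approaches can lead to different model rankings.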

RESULTS
The results of the frequency analysis are presented first, followed by the performance of the three models under the various data sampling techniques and, finally, the feature importance results. Table 1 shows that 68.67% of respondents (potential policyholders from the insurers' perspective) reported that they would like to have insurance but cannot afford it, while 0.21% believed that buying health or life insurance brings bad luck. Additionally, 10.93% said that they do not know where to obtain insurance, 0.86% believed that insurance companies are dishonest, and 0.48% believed that insurance agents are dishonest. A further 0.86% stated that they do not need insurance, and 0.41% reported that they save for emergencies instead of purchasing insurance. Lastly, 0.47% cited religious or cultural reasons for not having insurance.
Table 2 compares the performance of the three models (Logistic Regression, Random Forest, and XGBoost) on the imbalanced data set using four evaluation metrics: Kappa, Recall, F1 score, and Accuracy. A higher Kappa score indicates stronger agreement between predicted and actual classes beyond what chance alone would produce. Random Forest outperforms the other two models, with the highest Kappa score (0.158534), recall score (0.548966), F1 score (0.569272), and accuracy score (0.926755). Random Forest therefore provides the best overall performance on the imbalanced data set across all four metrics.
Table 3 compares the three models (Logistic Regression, Random Forest, and XGBoost) on the SMOTE balanced data set using the same four metrics. XGBoost has the highest Kappa score (0.869929), followed by Random Forest (0.847329). XGBoost also has the highest recall score (0.934906), followed by Random Forest (0.923784).
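The four evaluation metrics can be computed from a binary confusion matrix, as the sketch below shows. The counts are hypothetical and chosen only to show how accuracy can be high while Kappa stays low on imbalanced data; recall and F1 are computed for the positive class here, which is an assumption, since the averaging convention used for the reported scores is not stated.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, F1, and Cohen's kappa from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the marginal totals of each class.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "kappa": kappa}


# Hypothetical counts: 50 positives among 1,000 cases, most positives missed.
m = binary_metrics(tp=5, fp=5, fn=45, tn=945)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts the classifier is right on 95% of cases simply because negatives dominate, yet Kappa is close to 0.15, which is why Kappa is informative alongside accuracy when classes are imbalanced.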

Model metrics on SMOTE balanced data
The XGBoost model has the best balance between precision and recall, as measured by the F1 score (0.934963), followed by Random Forest (0.923655). In terms of accuracy, XGBoost also has the highest score (0.934983), followed by Random Forest (0.923655). Overall, XGBoost performs better than Logistic Regression and Random Forest in terms of Kappa, Recall, F1, and Accuracy on the SMOTE balanced data set.
Table 4 indicates that, on the oversampled data set, Random Forest has the highest Kappa score (0.992121), followed by XGBoost (0.820974). Random Forest also has the highest recall score (0.996128), followed by XGBoost (0.911058). Additionally, Random Forest has the highest F1 score (0.99606) and accuracy score (0.996061), followed by XGBoost (0.910341 and 0.910389, respectively). As a result, Random Forest outperforms Logistic Regression and XGBoost in terms of Kappa, Recall, F1, and Accuracy on the oversampled data set.
Given the favorable performance of both XGBoost and Random Forest, the AUC metric was introduced to determine the optimal model. Random Forest had an AUC value of 1.0 for the prediction of insurance uptake among potential policyholders, which corresponds to perfect separation on the evaluated data: every positive case received a higher score than every negative case. XGBoost had an AUC value of 0.9704, indicating good but not perfect discrimination.
Figure 2 displays the relative importance of the variables in predicting the uptake of insurance by potential policyholders in Kenya, as determined by the Random Forest model. Wealth quintile level and poverty vulnerability were found to be the most important factors, with relative importance of 0.2014 and 0.1361, respectively.
Other factors such as average monthly income, financial health score, and highest level of education attained were also found to be significant, but to a lesser extent. Other variables such as marital status, number of children in the household, and gender of the respondent were found to have low relative importance in predicting the uptake of insurance. The results of the model suggest that the financial status and vulnerability of potential policyholders play a significant role in determining their uptake of insurance in Kenya.
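The AUC values used to choose between Random Forest and XGBoost can be read as the probability that a randomly chosen positive case is scored above a randomly chosen negative case. The sketch below computes AUC directly from that definition; the scores and labels are hypothetical, and the pairwise loop is meant for illustration rather than for large data sets.

```python
def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
perfect = auc([0.1, 0.2, 0.8, 0.9], labels)      # every positive outranks every negative
overlapping = auc([0.1, 0.6, 0.5, 0.9], labels)  # one positive is outranked by a negative
print(perfect, overlapping)
```

An AUC of 1.0 therefore means that no negative case was scored at or above any positive case in the data being evaluated.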

DISCUSSION
The performance of three models (Logistic Regression, Random Forest, and XGBoost) is compared on three data sets (imbalanced, SMOTE balanced, and random oversampled) using four evaluation metrics (Kappa, Recall, F1 score, and Accuracy). The best-performing model differs across the data sets. The four most important factors that drive insurance uptake, according to the variable importance, are wealth index, vulnerability to poverty, average monthly income, and financial health score. These four features all point to a single key factor: affordability. The results of the analysis into why individuals currently do not have insurance policies reveal that affordability is the main obstacle, with 68.67% of respondents reporting that they would like to have insurance but cannot afford it. While other factors may play a role in insurance uptake, affordability appears to be the most critical. It has been previously observed that in the insurance industry, potential policyholders seek insurance products and services that are both cost-effective and of high quality (Ali & Tausif, 2018; Kumar et al., 2023). This supports the findings of Sibiko and Qaim (2020), who posited that providing premium subsidies had a direct positive effect on uptake, even if knowledge of the product did not necessarily lead to increased uptake of index-based livestock insurance. However, in the current findings, the results depicted affordability as the most important factor that the potential policyholder considers in product selection. This highlights the importance of designing affordable insurance products.

Figure 2. Feature importance. Extracted feature importance for Random Forest.

The highest level of education attained by the respondent is the next most important factor in determining uptake, according to the Random Forest model. This confirms previous findings that insurance uptake increases with education due to increased awareness of insurance and financial products (Mutembei, 2022). The analysis of individuals without insurance policies found that 10.93% do not know where to obtain insurance, highlighting a need for increased awareness. Programs to increase awareness of insurance and its importance to individuals should be intensified.
Gender, internet usage, and mobile phone ownership are less important than wealth index, vulnerability to poverty, average monthly income, financial health score, and level of education. This suggests that the gender disparity in uptake has narrowed and that, because a high portion of the population owns phones, phone ownership says little about the demand for insurance.
Only 0.21% of respondents in the study believed that buying health or life insurance brings bad luck, and 0.47% cited religious or cultural reasons for not having insurance. This contrasts with the results of Liu et al. (2021), who found superstition to be a significant factor in the uptake of life insurance. It suggests that cultural superstition may not be a hindrance to insurance uptake in Kenya and highlights the need to understand the cultural and societal factors that influence insurance uptake in different settings. This information is important for developing effective strategies to promote insurance coverage.
Previously, it was reported that insurance companies and their agents were not honest (Barnes et al., 2010), but only 0.86% of respondents believed that insurance companies are dishonest and 0.48% believed that insurance agents are dishonest. Given these low figures, perceived dishonesty among insurers and agents may not contribute to the low uptake of insurance. It is possible that regulatory programs have significantly reduced instances of dishonesty among insurers and agents.
The study's findings on the drivers of low insurance uptake among potential policyholders in Kenya will be useful for insurers, regulators, and other stakeholders to design effective policies and strategies to promote insurance uptake. The research may have broader implications for other developing countries and resource-constrained environments facing similar challenges in increasing insurance uptake. The study's specific insights into the drivers of low insurance uptake in Kenya can inform policies and strategies in other countries with similar characteristics, allowing policymakers to design targeted marketing interventions that will increase insurance uptake and improve financial security for citizens.

CONCLUSION
The study sought to discover the factors influencing insurance purchases among prospective policyholders and to find a model that could robustly predict those factors using machine learning techniques. Random Forest presented the best results among the three models that were tested and was identified as the most effective model for predicting the factors influencing potential policyholders' uptake of insurance based on the Kappa score, Recall score, F1 score, Accuracy, and Area Under the ROC Curve (AUC) metric. As a result, the Random Forest model would be an ideal choice for developing an algorithm for targeted marketing aimed at increasing insurance uptake among potential policyholders. Based on the results, affordability was found to be the primary driver of insurance uptake in Kenya, with 68.67% of respondents indicating that they cannot afford insurance but would like to have it. The next most significant factor was the respondents' level of education, which was associated with increased awareness of insurance and financial products. While a small percentage of respondents cited cultural and religious reasons or superstitions as barriers to uptake, the data suggested that cultural and societal factors may not be significant barriers in Kenya. Additionally, the low perception of dishonesty among insurers and agents implies that this is not a significant factor in the low uptake of insurance. These findings emphasize the need to design affordable insurance products, increase awareness of insurance and its importance through targeted programs, and understand the cultural and societal factors that influence insurance uptake in different settings in order to promote insurance coverage effectively.