Using textual analysis in bankruptcy prediction: Evidence from Indian firms under IBC


Abstract
Identifying and managing credit risk is vital for all lending institutions. Historically, credit risk is assessed using financial data from published financial statements. However, research indicates that the ability to detect financial hardship may be improved by textual analysis of firms' disclosed records. This study aims to establish an association between themes and words from Management Discussion and Analysis (MDA) reports of firms and corporate failures. The study took a sample of 57 Indian listed firms declared bankrupt under the Insolvency and Bankruptcy Code (IBC) along with a matched sample of 55 solvent firms (matched by industry and size) for the period of FY2011-2019. The first part of analysis identifies negative words from the published reports and compares them with the negative words of the Loughran-McDonald dictionary. Then a thematic analysis is done to identify the key themes from the MDA reports and the significant themes are validated with their corresponding financial ratios in the third step using a panel logistic regression. Word analysis results show that IBC firms have significantly greater negative tone (2.21 percent) as against 1.30 percent of solvent firms. Thematic analysis results show that manageability, activity and performance are significant themes for predicting financial distress. Financial variables such as ownership pattern, promoters' shares pledged, return on capital employed, asset utilization are some of the ratios in sync with the key themes. The study recommends that lenders and other stakeholders should look beyond financial statements which may be 'window dressed' by firms to qualitative disclosures in annual reports which may forewarn against impending financial distress.

Vandana Gupta (India), Aditya Banerjee (India)
INTRODUCTION
Credit risk arises from the possibility of loss when the borrower is unwilling or unable to repay the lender in full, leading to an economic loss to the lending bank. Banks may not have the necessary information to assess a borrower's default probability, which can lead to financial losses for the banks and, subsequently, cause a systemic crisis. Identifying red flags in borrowers' credit behavior can help banks mitigate credit risk. A firm's credit risk impacts all its stakeholders: the investors, lenders, and management; therefore, assessing and managing credit risk is essential.
Usually, credit risk is modeled with either linear discriminant analysis (LDA) or logistic regression (LR). However, these models require published financial data from financial reports as input. Bankruptcy prediction based on published financial reports is challenging due to the low frequency (quarterly or annual) of published information. Due to this, default prediction in the short term is difficult. Although accounting-based and market-based models continue to be applied for credit risk and default prediction, financial data can be subject to 'creative accounting' practices or window dressing, as was evident in Enron and WorldCom cases and Satyam Computers in India (Gandhi, 2019).
Textual analysis of firms' published documents has gained prominence in recent times. Textual analysis refers to techniques that are applied to elicit information from relevant texts. The information collected from texts can be used for research, analysis, and business intelligence (Loughran & McDonald, 2016). In the published annual reports of listed Indian companies, Management Discussion and Analysis (MDA) is a statement where firm executives comment on the past performance of their company, address the issues of compliance and risk, and present the outlook and proposed plan of action of a firm. In conjunction with financial statements, MDA provides valuable insights that could provide enhanced information about a firm's credit risk and can be valuable input in credit risk models (Nguyen & Huynh, 2020). Therefore, textual analysis of specific sections of published financial documents has been adopted in this study as a key methodology for predicting financial distress.

LITERATURE REVIEW
The pioneers in developing accounting models were Beaver (1966, 1968) and Altman (1968). Beaver (1966) applied a univariate statistical analysis to predict corporate failure, while Altman (1968) applied multiple discriminant analysis and proposed the Z-Score model that could distinguish between distressed and solvent companies with financial ratios as input data. Other models included the logistic regressions of Ohlson (1980) and Zavgren (1988). Zmijewski (1984) developed a model by applying probit analysis, while Shumway (2001) pointed to the merits of applying panel data in bankruptcy prediction. Several models of bankruptcy prediction have been developed using machine learning (e.g., Yeh & Lien, 2009).
The application of textual analysis techniques in finance and accounting is a novel and emerging field. Given the volume of documents in financial disclosure, there is significant scope for extracting critical information from the published text. These disclosures can complement the information present in published financial statements (Abrahamson & Amir, 1996). The tone of the documents published in a disclosure process can reveal some underlying facts about a firm that may not be evident in its financial statements (Kearney & Liu, 2014; Amani & Fadlalla, 2017). The style and content of the disclosures in these reports often contain more relevant information than can be obtained from financial statements alone. Thus, text-based predictors can serve as potential signals of financial distress among firms.
One of the first studies that applied textual analysis in finance was by Kohut and Segars (1992), which discriminated between companies with good and bad financial performance based on president's letters. Swales (1988) applied thematic analysis and demonstrated that factors like growth, anticipated profits and losses, and management optimism could help differentiate between financially strong and weak firms. Abrahamson and Amir (1996) explained the importance of analyzing the annual report text for investors. Bryan (1997) and others argue that businesses' MD&A disclosures can help evaluate their short-term prospects.
Smith and Taffler (1992) and Tennyson et al. (1990) used content analysis to examine the relationship between firms' disclosures and bankruptcy. Frazier (1980, 1983) conducted a descriptive study on the content of firms' environmental disclosures for three different industries, and found a relationship between narrative disclosures and firm performance. Frazier et al. (1984) explored using the WORDS package (computer software) for analyzing accounting disclosures. Their study found that the content of the documents is related to firms' market performance. D'Aveni and MacMillan (1990) used content analysis to examine how the management teams of surviving and bankrupt enterprises responded to demand crises. Previts et al. (1994) revealed that when making investment recommendations, analysts take into account both financial and nonfinancial information from the MD&A and the president's letter.
According to Zhang et al. (2010), dictionary-based textual analysis is related to the 'bag of words' approach, since texts are treated as unordered sets of words. Four distinct word lists are popular in accounting and finance research: the Henry (2008) list, the Harvard General Inquirer (GI), Diction, and the Loughran and McDonald (2011) list. The main weakness of the Henry list is that it was originally used for examining earnings press releases. A significant portion of the material might not be evaluated because the Henry list contains a limited number of words. The Harvard dictionary provides a software-based mapping function for text files based on several dictionary classes, using an algorithm to classify words in each category. The Harvard dictionary has been widely used for analyzing press news, corporate disclosures, and initial prospectuses due to its earlier availability.
Another dictionary used is 'Diction', which includes data on the overall word count, character count, average word length, number of distinct word types, counts of special characters, and high-frequency word counts. Neither the Harvard Dictionary nor Diction has any specific vocabulary for financial applications. As a result, terms that are considered unfavorable in a general sense may have a different meaning in a financial sense. Li (2010b) found that tone classifications based on conventional dictionaries were not accurate enough.
Loughran and McDonald (2011) pioneered textual analysis by building a financial dictionary that offered more accuracy than the traditional Harvard Dictionary. The dictionary was built specifically for analyzing financial communication and contains six word lists, each representing a different type of sentiment in that context (negative, positive, uncertainty, litigious, strong modal, and weak modal). The work was extended in 2014 and 2016 (Loughran & McDonald, 2014, 2016). The role of the dictionary in financial research was extended by Engelberg et al. (2012), who revealed that short sellers skilled at processing information use publicly available news as signals for their trades. A new readability metric for financial reporting was developed by Bonsall et al. (2017). Their study demonstrated that firms can reduce their cost of debt and achieve better credit ratings if their filings are more readable. Similar research was carried out by Haralambie (2016), who examined the issues related to credit risk management considered by commercial banks when analyzing corporate clients. Hu et al.
The significance of thematic analysis, one of the most popular research methodologies in qualitative research, was highlighted by Braun and Clarke (2006). The key objective of thematic analysis is to establish different themes, assign headings to them from individual studies, and present them coherently (Thorne et al., 2004). Research questions and concepts are developed and explored using the key concepts and claims embodied by themes (Liñán & Fayolle, 2015). A mixed-method approach can be adopted for the validation of textual analysis. Several works on mixed methods in finance have been done, including those by Creswell (1999), Östlund (2011), and Shorten and Smith (2017). All these authors have advocated that numbers and words together are essential to convey the results.
This study implements textual analysis through a mixed-method approach to identify the credit risk of Indian firms. To the best of the authors' knowledge, such an approach has not yet been explored in the context of Indian firms. This study can be broadly divided into three parts. The first part analyzes the difference in negative words in MDA reports between insolvent and healthy firms using the Loughran and McDonald (2016) dictionary. The study then proceeds to thematic analysis, where keywords from MDA reports are divided among existing themes (Smith & Taffler, 2000; Tate et al., 2010) and a new theme validated by professionals from the credit rating industry and the head of credit of an Indian bank. Thus, the validation by two industry experts covered both sides of the spectrum: the lender and the rating industry. The objective of the thematic analysis is to study whether MDA reports of insolvent firms, classified into themes, can indicate the possibility of bankruptcy in firms. In the third part, a quantitative analysis of the panel data of the financials of sample firms is carried out to determine the significant variables and thereby validate the themes analyzed in the second part.
To the best of the authors' knowledge, the association between textual disclosures (word analysis and thematic analysis) and corporate bankruptcy has not been examined in the Indian context. This study extends prior works by testing whether an association exists between firm failure and key 'words' and 'themes' for firms in India that are bankrupt under the IBC (Insolvency and Bankruptcy Code). The findings from textual analysis are validated with quantitative analysis to see if the financial variables map to the significant themes.
This study aims to establish an association between themes and words from Management Discussion and Analysis (MDA) reports of firms and corporate failures.
Based on the extant literature discussed above, three hypotheses are framed. The first hypothesis is:
H1: Financial distress is associated with greater use of negative words in the annual report. The corresponding null hypothesis is that the proportions of negative words are statistically the same for IBC and solvent firms.
The second hypothesis is related to thematic analysis:
H2: Themes generated from MDA reports of firms can discriminate between bankrupt and solvent firms.
The third hypothesis is related to the quantitative analysis used to validate our selected themes:

Mixed method approach
The mixed method approach is used in applied research and combines quantitative and qualitative methods. Researchers have advanced several theories and rationales for using this approach. Dewasiri et al. (2018) argue that confirming the results with two separate techniques yields more complete, rational, and sound findings than a single approach. Hesse-Biber (2010) and Greene et al. (1989) emphasize that a mixed method approach can increase validity while minimizing bias, thereby allowing analysis from different perspectives and expanding the overall rigor of the study.
This study uses mixed methods following the opinion of Jick (1979) and Creswell (2003), and looks for convergence across qualitative and quantitative methods within social science research on credit risk. A mixed-method approach is used in mapping the key financial indicators with the significant themes generated to validate the themes. The validation approach consists of a quantitative analysis using financial ratios mapped to the most significant themes identified from the selected themes under study.

Textual analysis: Word and thematic analysis
The initial analysis in the study aims to un-

Quantitative analysis
The previous section's thematic analysis aims to identify the most important themes from the MDA reports that can differentiate between bankrupt and solvent companies. Assuming τ themes out of the six under study are relevant ( )

Findings from word analysis
In the first step of the analysis, NVIVO 12 is run for word frequency analysis. The words generated from the MDA reports for both sets of firms are compared with the negative word list from Loughran and McDonald's (2011) dictionary. The results indicate that negative words account for 2.21 percent of words for bankrupt firms but only 1.30 percent for solvent firms. Results from the two-sample test of proportions are available in Table 1. The test statistic exceeds the critical z-value at the 5 percent level. Thus, the first hypothesis (H1) is confirmed: more negative words in corporate disclosures are indeed associated with financial distress in companies.
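A two-sample test of proportions of this kind can be sketched as follows. This is a minimal illustration, not the study's own computation: the total word counts (100,000 per group) are hypothetical, chosen only so that the shares match the reported 2.21 and 1.30 percent, and the function name `two_proportion_z` is an assumption introduced here. The statistic is the standard pooled z-test for the null hypothesis that the two proportions are equal.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test for H0: p1 == p2.

    x1, x2 are counts of negative words; n1, n2 are total word counts.
    Returns the z statistic and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical totals (NOT from the paper); only the shares are reported there.
ibc_total, solvent_total = 100_000, 100_000
ibc_negative = 2_210       # 2.21 percent of the hypothetical IBC total
solvent_negative = 1_300   # 1.30 percent of the hypothetical solvent total

z, p_value = two_proportion_z(ibc_negative, ibc_total,
                              solvent_negative, solvent_total)
print(f"z = {z:.2f}, two-sided p = {p_value:.3g}")
print("reject H0 at the 5 percent level:", abs(z) > 1.96)
```

The size of the statistic depends directly on the total word counts, so the value obtained with these placeholder totals should not be read as the figure in Table 1.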

Findings from thematic analysis
The initial analysis results in the MDA contents being classified into 12 broad themes. Rather than relying entirely on the themes generated from the software, the themes are reclassified based on those available in the literature and validated by experts from rating agencies. Three categories for describing content qualities were established by Osgood et al. (1957): evaluative (positive/negative), potency (strong/weak), and activity (active/passive). From an intertemporal perspective, Houghton (1988) adds a fourth dimension for accounting, giving evaluative (beneficial/unfavorable), potency (tangible/intangible), activity (dynamic/static), and manageability (expected/unexpected). The study also refers to Smith and Taffler's (2000) classification of the Chairman's comments of corporations into four categories: evaluative (beneficial/adverse), potency (tangible/intangible), activity (dynamic/static), and manageability (expected/unexpected). The critical sub-themes under these broad themes are classified based on judgment and after discussion with eminent executives in the risk management domain. These themes shed light on the sentiments expressed in the qualitative disclosures of annual reports, indicating any financial distress.
The evaluative, activity, and management quality themes are retained based on prior works on thematic analysis, as stated above. Three further dimensions are added: outlook, performance, and strategic risks. The 'outlook' and 'strategic risks' themes are validated by industry experts and by prior research. The 'performance' theme is identified on the assumption that good or bad performance is always stated as part of MDA reporting, and it has also been classified as a relevant theme in prior similar research. While the strategic risks theme considers keywords such as 'risks' and 'concern', the outlook theme gives insight into prospects of 'investment,' 'projects,' 'infrastructure,' and 'developments' (Hanley & Hoberg, 2017).
The keywords are extracted for each theme before the computation of scores. A description of the number of keywords found for each theme and the weight of each theme is provided in Table 2. The results, reported in Table 3, show that the coefficients of only three themes are significant: activity, management quality, and performance. The coefficients of all three significant themes are negative, implying an inverse relationship between financial distress and these themes. In Table 3, the dependent variable is binary and takes the value of 1 for IBC firms and 0 for solvent firms; '***' refers to significance at the 1% level, '**' to significance at the 5% level, and '*' to significance at the 10% level.
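The sign interpretation can be made explicit with a generic logit specification. The notation is assumed here for illustration and is not taken from the paper's own equations: $Y_{it}$ is the binary IBC indicator for firm $i$ in year $t$, $S_{k,it}$ is the score of theme $k$ among the six themes, and $\alpha$ and $\beta_k$ are the intercept and theme coefficients.

```latex
\Pr(Y_{it}=1 \mid S_{it})
  = \frac{1}{1+\exp\!\left[-\left(\alpha + \sum_{k=1}^{6}\beta_k S_{k,it}\right)\right]},
\qquad
\frac{\partial}{\partial S_{k,it}}
  \log\frac{\Pr(Y_{it}=1 \mid S_{it})}{\Pr(Y_{it}=0 \mid S_{it})} = \beta_k .
```

Under this form, a negative $\beta_k$ means that a higher score on theme $k$ lowers the log-odds of a firm being an IBC firm, which is the inverse relationship described above.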
The next step in this process is to identify the financial variables that correspond to these themes and verify the robustness of these themes with financial information.

Results from quantitative analysis
Results from the previous section show three significant themes: activity, management quality, and performance. The financial performance of a company can be assessed using a combination of financial ratios based on activity, management capability, and performance. This study identifies 17 ratios under this broad classification. The ratios were chosen because: (i) they have been identified in the literature as indicators of default; (ii) previous empirical works have applied them in predicting insolvency; (iii) they are easy to compute with data from existing financial databases; and (iv) the multi-collinearity among them is not significant (variance inflation factor below 5). The size (log of total assets) and age of firms are controlled for. A description of the variables and the variance inflation factor (VIF, for multi-collinearity) is provided in Table 4. Table 5 contains the findings of the panel logistic regression. Results show that 9 out of 17 variables are significant. Two activity ratios (gross fixed asset utilization ratio and export/sales), four management quality ratios (market capitalization/debt, debt/equity, promoters' shareholding, and promoters' shares pledged), and two performance ratios (price/book value and return on total assets) are statistically significant. Both control variables are also significant. Based on the results, the third hypothesis (H3) is partially validated.
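The VIF screen in criterion (iv) can be illustrated with a small sketch. It rests on stated assumptions: it covers only the special case of exactly two predictors, where the VIF of either one equals 1 / (1 - r^2) with r their Pearson correlation; the study's 17-ratio case would instead regress each ratio on all the others. The data values and the names `pearson_r` and `vif_two_predictors` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)


# Hypothetical firm-year observations (NOT from the paper).
debt_equity = [0.5, 1.0, 1.5, 2.0, 2.5]
shares_pledged = [10, 30, 20, 50, 40]   # percent of promoter shares pledged

vif = vif_two_predictors(debt_equity, shares_pledged)
print(f"VIF = {vif:.2f}")
print("passes the VIF < 5 screen:", vif < 5)
```

A ratio whose VIF exceeded the threshold would be a candidate for exclusion, since its information is largely carried by the other predictors.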
Activity ratios demonstrate how dynamic a company's operations are. Better operational efficiency is important for delivering quality goods or services to customers, leading to a more profitable firm with higher credit ratings and thus greater creditworthiness (Smith & Taffler, 2000). The performance of a firm has a direct bearing on its operational and financial risk, as is evident from the profitability ratios (EPS, P/BV, PAT/Capital Employed, and ROTA). The more profitable a company is, the better its ability to service debt and thereby reduce its financial risk. Management quality is the most significant theme, as is reflected in the ownership pattern, the quality of shares pledged, the market value of the firm, its available cash resources, and the ability of the firm to manage its contingent liabilities. These ratios in the table above are in sync with the management theme.

DISCUSSION
Results reveal a significant difference in negative words in MDA between insolvent and solvent firms. The proportion of negative words in the MDA reports of companies under IBC is substantially higher than in the reports of healthy companies. The results thus provide evidence to support the first hypothesis (H1). It is also clear that the report's tone should be considered when distinguishing bankrupt from solvent enterprises, and the results confirm that the negative words from the Loughran and McDonald dictionary apply in the Indian context. A significant difference in thematic keywords between the two groups of firms is also observed, which is further validated by a quantitative analysis of panel data. The results strongly support using textual analysis to identify insolvency and bankruptcy in firms early. The findings are unique for Indian firms, and applying the mixed method reaffirms the robustness of the results.
The results provide evidence to support the second hypothesis (H2) of this study, and confirm that certain themes in MDA reports (activity, manageability, and performance) can distinguish between IBC and solvent firms before the actual filing of bankruptcy. The negative signs of the coefficients of activity and management scores suggest that fewer keywords in these themes might increase the likelihood that a business eventually declares bankruptcy. However, a sudden increase in performance-based keywords (in conjunction with less frequent activity and management quality-based keywords) may indicate future default. The three significant themes jointly represent over 70 percent of the keywords in MDA reports. Findings further suggest that strong corporate governance, as reflected in strong management, leads to higher credit quality and creditworthiness. The results from management quality ratios suggest that higher market capitalization relative to debt indicates better financial health of a firm. However, an increase in the debt/equity ratio is a strong indicator of future financial distress. Declining promoter shareholding is also a strong indicator of future distress and might indicate a loss in promoters' confidence. Likewise, more shares pledged by promoters may indicate an impending crisis in the company.
Highly profitable firms perform better in the market and have low credit risk and high credit quality. The firm's credit rating may be significantly affected by profitability, according to some theories. Among the performance ratios, the coefficient of price to book value is negative and significant, indicating that higher equity market performance reflects a stronger financial position. The likelihood of distress in a company is significantly and negatively affected by return on total assets, which demonstrates that higher profitability lowers the likelihood of default. Among the activity ratios, the gross fixed asset utilization ratio has a positive and significant effect on a company's likelihood of defaulting. In the recent past, several steel companies in India expanded their investment in fixed assets in anticipation of higher demand and subsequently increased activity levels. However, the higher demand never materialized, which resulted in unsustainable debt levels for these companies. The positive sign of the coefficient of the gross fixed asset utilization ratio results from several steel companies filing for IBC during the sample period included in the study. It signifies the debt burden and subsequent financial distress these companies faced despite having high asset utilization. The negative sign of the coefficient of export/sales shows that a decline in exports as a percentage of sales may indicate a loss in revenues and subsequent financial stress for companies.
The results of the quantitative analysis show that the significant themes in the MDA report can indicate financial distress. The quantitative variables and qualitative analysis of keywords in MDA can be combined to obtain a precise forecast of an impending financial crisis in any company.
From the findings of the study, it can be inferred that combining the approach taken by regulators to evaluate financial distress with the tone of financial reports offers advantages. Most financial distress models use some linear combination of financial ratios. Lenders may "window dress" financial data to meet or exceed minimal regulatory standards if they are aware of the financial ratios that regulators use. Additionally, firm managers could propagate false information regarding the quality of their real estate holdings and regulatory capital through the exercise of their accounting discretion. As a result, regulators or investors may end up using erroneous financial data to forecast financial trouble. Extreme events may be difficult to foresee using financial data, given the unprecedented losses suffered by banks that met or exceeded regulatory standards (Gandhi et al., 2019).

CONCLUSION
The research study explores if the qualitative disclosures as given in the MDA reports of firms in India can preempt firm failure and bankruptcy. For this purpose, word analysis and theme analysis are performed using NVIVO. Words generated from MDA reports are matched with the negative words of the Loughran and McDonald dictionary, and significant themes are identified from the reports. The themes are then validated by applying a mixed-method approach, using quantitative techniques of panel logistic regression.
It is observed that predicting financial distress and bankruptcy requires going beyond financial information. This study contributes in that it is the first to combine quantitative and qualitative approaches in the context of companies under IBC in India. It improves upon the existing literature by applying a mixed-method approach to map the findings of quantitative research to content analysis. Using two different approaches to confirm findings leads to greater completeness, validity, complementarity (analysis from different perspectives), and generalizability of findings than is possible with a single methodology. In terms of managerial implications, this study allows lenders to integrate qualitative variables in their credit risk models.
Textual analysis can help in eliciting additional information over existing financial data thereby enhancing the quantity and quality of financial distress analysis. The textual analysis method could be used by practitioners to complement their analysis of financial data. This could be useful for practitioners in valuing equity and assessing investor sentiment. Auditors can use textual information to identify accounting irregularities and fraudulent activities and predict bankruptcy. To track financial health and identify difficulty, regulators and lenders could examine the frequency of negative phrases and words in all companies' annual reports. In addition to financial variables, greater accuracy in forecasting financial difficulty may be achieved by analyzing the relative frequency of negative terms. As a result, the findings of this study can be used by regulators, auditors, and lenders to create early-warning models that could assist in individually identifying problematic enterprises in the future.

ACKNOWLEDGMENTS
The infrastructural support provided by FORE School of Management, New Delhi in completing this paper is gratefully acknowledged.