“Stock returns are not always from the same distribution: Evidence from the Great Recession”

Portfolio allocation strategies, notably the mean-variance approach, use past returns to assign optimal weights. Although this requires that past and expected returns come from the same distribution, a formal test of whether this holds in practice has not yet been conducted. This study therefore examines whether the daily returns of 242 companies with continuous trading in the S&P index come from the same distribution, using the Kolmogorov-Smirnov, Cramér-Von Mises, and Wilcoxon rank-sum tests. The tests suggest that stock returns generally do come from the same distribution. However, the hypothesis is rejected during the Great Recession, with the rejection rate increasing with the forecast horizon. Regressions of the rejection rate on an array of macroeconomic variables show that it is highly persistent. Although macroeconomic variables were not found to be statistically significant determinants of the rejection rate, market distress has a small but significant effect.


INTRODUCTION
The exact specification of the distribution of stock market returns has intrigued both academics and practitioners, as it holds a prominent role in portfolio construction. In the most popular portfolio selection theory, one of the basic assumptions of Markowitz's seminal 1952 paper is that investors base their portfolio selection decisions on expected risk and return. Implicit in this assumption is that these two factors are known, or at least that they can be approximated with relative accuracy.
The most common way to gauge risk and return in the financial world is through past performance, which implies the use of an empirical distribution based on historical returns (Bodie, Kane, & Marcus, 2008). Given the large available sample of stock returns and, most importantly, if past and future returns belong to the same distribution, practitioners can find the optimal weights for their desired portfolio mix of risk and return based on the Markowitz procedure. The procedure therefore rests on the assumption that past and future returns come from the same distribution.
To examine whether this basic assumption holds, this study employs the nonparametric two-sample Kolmogorov-Smirnov test for the equality of continuous probability distributions. The test is sensitive to differences in both the location and shape of the empirical cumulative distribution functions of two samples. If returns are from the same distribution, then any portfolio allocation ap- [...]
The results, using daily data for 242 companies in the S&P 500 index, show that the Markowitz procedure can be a good approximation during most of the sample period; however, it should not be viewed as a universal attribute. During the period of the financial crisis, the vast majority of stock-specific return distributions reject the equality hypothesis. The results are confirmed using two additional nonparametric tests, the Cramér-Von Mises and the Wilcoxon rank-sum tests, which point to qualitatively similar conclusions. Testing for the macroeconomic drivers of the rejection rate using a variety of regression models with GARCH errors shows that, while the rate is highly persistent, common explanatory variables such as financial uncertainty and the interest rate are not significant drivers of its behavior. Overall market distress appears to have a statistically significant, albeit economically small, impact.
To our knowledge, this is the first study in the literature that specifically examines this basic, albeit implicit, assumption (see footnote 1). Thus far, empirical studies that examine the fit of statistical distributions to the data tend neither to use individual stock data nor to distinguish between estimation and forecast samples. Existing studies (e.g., Egan, 2007 [...]
The findings of this study provide a rationale for why previous research (e.g., DeMiguel, Garlappi, & Uppal, 2007; Kirby & Ostdiek, 2012) found that the mean-variance approach does not offer important out-of-sample benefits. Similarly, the findings support the view that portfolio performance is sensitive to changes in asset means (Chopra & Ziemba, 1993; Best & Grauer, 1991), which are a defining feature of the distribution.
The remainder of this paper is organized as follows: section 1 provides an overview of the related literature, section 2 presents the tests for the equality of distributions and the relationship between equality rejection and macroeconomic conditions, section 3 presents the results for both the main analysis and the robustness tests, section 4 discusses the findings, and the final section provides a brief summary and conclusions.
1 The study most similar to this is Chae and Lee (2018), who examine the significance of differences in the return distribution (distribution uncertainty) in the cross-sectional pricing of stocks.

LITERATURE REVIEW
Markowitz (1952) introduced the framework of investment portfolio selection and construction. In his seminal paper, he argues that investors base their portfolio selection decisions on expected risk and return. Investors therefore choose to maximize expected portfolio returns while simultaneously minimizing risk. Implicit in this assumption is that these two factors are known, or at least that they can be approximated with relative accuracy.
The use of an empirical distribution based on historical returns enables the investment community to gauge risk and return through past performance (Bodie, Kane, & Marcus, 2008). Based on the Markowitz procedure, investors can rely on the large available sample of stock returns to find the optimal weights for their desired portfolio mix of risk and return only if past and future returns belong to the same distribution. Otherwise, inference based on past returns can lead to wrong investment decisions.
While prior empirical studies have focused primarily on examining the fit of statistical distributions to the data, the assumption that past and future returns come from the same distribution remains largely unexplored. For example, Egan (2007) and Malevergne, Pisarenko, and Sornette (2005) show that non-normal distributions (Stretched Exponential and Pareto) are better at capturing returns data, using data spanning up to 100 years for the NASDAQ and the S&P indices. Similarly, Aparicio and Estrada (2001) find that the hypothesis of normality is rejected for 13 European securities markets. Nonetheless, studies such as the above use aggregate indices and do not focus on portfolio allocation in individual stocks. As such, they do not require the distribution to be the same in the estimation and forecast samples.
The burgeoning literature on examining the equality of continuous probability distributions proposes the nonparametric two-sample Kolmogorov-Smirnov test. The form of the Kolmogorov-Smirnov test and its asymptotic distribution under the null hypothesis were first published by Kolmogorov (1933), while a table of the distribution was offered by Smirnov (1948). The test was first presented to the English-speaking audience by Massey (1951). Its sensitivity to differences in both the location and shape of the empirical cumulative distribution functions makes it well suited to the purpose of this paper, which is to examine whether two samples come from the same distribution.
Another stream of literature on examining the equality of distributions uses the Cramér-Von Mises criterion, which is widely used for comparing two empirical distributions, as an alternative to the Kolmogorov-Smirnov test. The criterion is named after Harald Cramér and Richard Edler von Mises, who first proposed it in 1928-1930 (Cramér, 1928; Von Mises, 1928). The generalization to two samples is due to Anderson (1962).
To further strengthen the validity of the findings, the existing literature suggests another test to determine whether two independent samples, selected from two populations, have the same distribution.
In particular, one employs the Wilcoxon rank-sum test, a nonparametric statistical hypothesis test named after Wilcoxon (1945), who proposed the rank-sum test for two independent samples. It tests the null hypothesis that a randomly selected value from one sample is equally likely to be less than or greater than a randomly selected value from a second sample.
The possible correlation between macroeconomic changes and the equality of distributions should also hold when using any portfolio strategy that uses the past to predict the future. Prior literature suggests that macroeconomic variables can affect future firm performance (Issah & Antwi, 2017). For instance, according to Humpe and Macmillan (2009), the long-term interest rate should contain an indication of the long-term perception of the economy regarding the discount rate, while Andreou (2015) supports the view that perceptions of potential market default could also play a role. Therefore, to shed more light on the main findings of the paper, it was essential to examine whether the series of rejections bears any relationship with the underlying macroeconomic conditions, following prior literature.

The Kolmogorov-Smirnov test
In the two-sample Kolmogorov-Smirnov setup, suppose that there are two samples of sizes $n$ and $m$ with empirical distribution functions $F_{1,n}(x)$ and $F_{2,m}(x)$. The test statistic is the largest absolute difference between the two, $D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|$. If the null is true, $D_{n,m}$ has a distribution that does not depend on the common distribution $F_1(x) = F_2(x)$ but depends on $n$ and $m$. An important point of the two-sample test is that it does not specify the nature of that common distribution. In other words, the test does not examine whether the common distribution is Normal, Student's t, or Weibull, but only whether the two samples come from the same distribution. As such, failure to reject the hypothesis does not ensure that a method for constructing the mean-variance portfolio under any particular distribution is correct. Not rejecting the hypothesis, however, suggests that since the two distributions are likely the same, one could likely obtain good out-of-sample performance if inference is based on the estimation distribution.
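To make the mechanics of the statistic concrete, the following is a minimal sketch in plain Python, not the code used in the study. The function names `ks_two_sample` and `ks_asymptotic_pvalue` are hypothetical, and the p-value relies on the limiting Kolmogorov distribution, which is an assumption that is reasonable only for large samples.

```python
from bisect import bisect_right
import math


def ks_two_sample(x, y):
    """Return the two-sample Kolmogorov-Smirnov statistic D for samples x and y."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both ECDFs are step functions, so the supremum is reached at an observed value.
    for v in xs + ys:
        f1 = bisect_right(xs, v) / n  # ECDF of the first sample at v
        f2 = bisect_right(ys, v) / m  # ECDF of the second sample at v
        d = max(d, abs(f1 - f2))
    return d


def ks_asymptotic_pvalue(d, n, m, terms=100):
    """Approximate p-value from the limiting Kolmogorov distribution (large n, m)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The survival function is numerically equal to 1 this close to zero.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))
```

Equality would be rejected for a given pair of samples when the p-value falls below the chosen significance level; the level itself is a choice of the analyst and is not fixed by this sketch.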
In this study, the Kolmogorov-Smirnov statistic for equality of distributions is estimated using a sample of daily stock returns for 242 firms, which have been actively trading in the S&P 500 index from January 1, 2000 to December 31, 2014, with a total of 3773 observations for each stock. All data were obtained from the Center for Research in Security Prices (CRSP) database. For estimation, the in-sample window size is set at 1000 rolling observations with a daily step, in line with most other studies in the literature (e.g., Yilmaz, 2012; Kim, Shamsuddin & Lim, 2011).
The 1000-observation estimation distribution is compared with the distribution obtained from a window of 100 observations after the end of the estimation sample. Notation-wise, the estimation sample, $F_1(x)$, starts at $t$ and ends at $t + 1000$, while observations $t + 1001$ to $t + 1101$ comprise $F_1(y)$, i.e., the sample against which distribution equality is tested. For simplicity, $F_1(x)$ is referred to as the estimation sample and $F_1(y)$ as the forecast sample. A similar-sized window for the evaluation of forecasts was also employed by Pesaran and Pick (2011).
For robustness, results for forecast samples comprising 50 and 200 observations are also estimated. The results from the estimations can be found in subsection 3.1.
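The rolling scheme described above can be sketched as a generator of index windows. This is an illustrative sketch only: the name `rolling_windows` is hypothetical, and the half-open slices with exactly `est_size` and `fc_size` observations are an assumption about the endpoint convention, which the text states only loosely.

```python
def rolling_windows(n_obs, est_size=1000, fc_size=100, step=1):
    """Yield (estimation_slice, forecast_slice) pairs over n_obs observations."""
    start = 0
    # Stop once the forecast window would run past the end of the data.
    while start + est_size + fc_size <= n_obs:
        est = slice(start, start + est_size)                        # estimation sample
        fc = slice(start + est_size, start + est_size + fc_size)    # forecast sample
        yield est, fc
        start += step                                               # daily step by default
```

For a list of returns, each pair of slices selects the two samples that are passed to the chosen equality test; setting `fc_size` to 50 or 200 gives the robustness variants.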

The Cramér-Von Mises test
Suppose that the observed values of the two samples are arranged in increasing order, and that the test statistic $T$ is computed from them. If the value of $T$ is larger than the tabulated values, one can reject the hypothesis that the two samples come from the same distribution. The results from the estimations can be found in subsection 3.2.
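The formula for $T$ is not legible in this version of the text. As an illustration, the sketch below uses the rank form of the two-sample criterion commonly attributed to Anderson (1962): with $r_i$ and $s_j$ the pooled-sample ranks of the ordered observations in samples of sizes $n$ and $m$, $U = n \sum_i (r_i - i)^2 + m \sum_j (s_j - j)^2$ and $T = U / (nm(n+m)) - (4mn - 1) / (6(m+n))$. That this matches the authors' exact expression is an assumption, the function name `cvm_two_sample` is hypothetical, and ties are assumed away.

```python
def cvm_two_sample(x, y):
    """Two-sample Cramer-von Mises statistic T in rank form (assumes no ties)."""
    n, m = len(x), len(y)
    # Rank every observation in the pooled sample (1 = smallest).
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    r = [rank for rank, (_, src) in enumerate(pooled, start=1) if src == 0]
    s = [rank for rank, (_, src) in enumerate(pooled, start=1) if src == 1]
    # r and s are increasing, so r[i-1] is the pooled rank of the i-th smallest x.
    u = (n * sum((ri - i) ** 2 for i, ri in enumerate(r, start=1))
         + m * sum((sj - j) ** 2 for j, sj in enumerate(s, start=1)))
    return u / (n * m * (n + m)) - (4 * m * n - 1) / (6 * (m + n))
```

The resulting value is then compared with tabulated critical values, as described in the text.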

The Wilcoxon rank-sum test
The test involves the calculation of a statistic, denoted as $U$. An equivalent statistic is the sum of ranks in one of the samples, which can be used in place of $U$ itself. The calculation first assigns numeric ranks to all the observations and then adds up the ranks of the observations that came from sample 1. The sum of ranks in sample 2 is then determined, since the sum of all the ranks equals $T(T+1)/2$, where $T$ is the total number of observations. $U$ is obtained from the sample sizes and the rank sums, where $M$ and $N$ are the sizes of the first and second samples, respectively, and $r$ and $s$ are the sums of the ranks in the first and second samples, respectively.
Thus, smaller values of $U$ support the research hypothesis, and larger values of $U$ support the null hypothesis. For any $U$ test, the theoretical range of $U$ is from 0 (complete separation between groups, $H_0$ most likely false and $H_1$ most likely true) to $N \times M$ (little evidence in support of $H_1$). In every test, one must determine whether the observed $U$ supports the null or the research hypothesis. This is done following the same approach used in parametric testing.
Specifically, one determines a critical value of $U$ such that if the observed value of $U$ is less than or equal to the critical value, one rejects $H_0$ in favor of $H_1$, and if the observed value of $U$ exceeds the critical value, one does not reject $H_0$. To determine the appropriate critical value, one needs the sample sizes ($N$ and $M$) and the two-sided level of significance. The results from the estimations can be found in subsection 3.3.
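The ranking procedure can be sketched as follows. Since the expression for $U$ is not legible in this version of the text, the sketch assumes one common convention, $U = \min(U_1, U_2)$ with $U_1 = r - M(M+1)/2$ and $U_2 = s - N(N+1)/2$; the function name `rank_sum_u` is hypothetical, and tied values receive their average rank.

```python
def rank_sum_u(x, y):
    """Return (U, r, s): the U statistic and the rank sums of samples x and y."""
    m, n = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sums = [0.0, 0.0]
    i = 0
    while i < len(pooled):
        # Find the block of tied values and give each of them the average rank.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1, ..., j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        i = j
    r, s = rank_sums
    u1 = r - m * (m + 1) / 2.0  # U computed from the first sample's rank sum
    u2 = s - n * (n + 1) / 2.0  # U computed from the second sample's rank sum
    return min(u1, u2), r, s
```

With complete separation between the two samples the function returns $U = 0$, and the two rank sums always add up to $T(T+1)/2$.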

Equality rejection and macroeconomic conditions
To examine whether the series of rejections bears any relationship with the underlying macroeconomic conditions, one uses a regression model with GARCH errors, as first presented by Bollerslev (1986). Formally, the GARCH(p, q) model is formulated through a mean equation in which $Y_t$ is the percentage of firms that reject the hypothesis of equality on a given day (the rejection rate), $M_t$ is a vector of macroeconomic variables, and $\varepsilon_t$ is the error term. The variance of $\varepsilon_t$ evolves according to a process driven by past squared errors (the ARCH terms) and by past values of the error variance (the GARCH terms, interpreted as the persistence of shocks to the error variance; see footnote 3). $p$ and $q$ are the orders of the ARCH and GARCH terms, respectively. As in other studies in the literature, one limits the scope of the estimation to the GARCH(1,1), which has been shown to perform well in financial markets (Andersen & Bollerslev, 1998; Hansen & Lunde, 2005). A similar setup for rejection rates has been used by Michail (2019).
3 For an overview of the interpretation of ARCH and GARCH terms, see Campbell, Lo, and MacKinlay (1997, p. 483) or Alexander (2008, p. 283).
4 The MDL index was obtained from the personal website of Panayiotis C. Andreou (https://www.pandreou.com/).
To account for potential asymmetries in volatility, Glosten, Jagannathan and Runkle (1993) introduced an additional term to the GARCH specification. The GJR-GARCH model replaces equation (3) with equation (4), which adds an asymmetry term for negative shocks. If the coefficient on this term in equation (4) is statistically significant, negative shocks have a distinct impact on the volatility of the rejection rate. If the sign is negative, then positive shocks have a different impact on rejection volatility.
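Because equations (3) and (4) are not legible in this version of the text, the sketch below uses the textbook (1,1) forms as an assumption: $\sigma_t^2 = \omega + (\alpha + \gamma I_{t-1}) \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$, where $I_{t-1}$ equals one when $\varepsilon_{t-1} < 0$ and $\gamma = 0$ recovers the plain GARCH(1,1). The parameter names and the function `conditional_variance` are hypothetical, and the sketch only filters a variance path for given parameters; it does not estimate them.

```python
def conditional_variance(resid, omega, alpha, beta, gamma=0.0):
    """Variance path of a GARCH(1,1), or of a GJR-GARCH(1,1) when gamma != 0."""
    # Start the recursion at the sample variance of the residuals (a common choice).
    var0 = sum(e * e for e in resid) / len(resid)
    sigma2 = [var0]
    for e in resid[:-1]:
        neg = 1.0 if e < 0 else 0.0  # indicator for a negative shock
        sigma2.append(omega + (alpha + gamma * neg) * e * e + beta * sigma2[-1])
    return sigma2
```

In practice the parameters would be estimated jointly with the mean equation by maximum likelihood, typically with an econometrics package rather than by hand.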
To examine whether macroeconomic developments play a role, the estimation uses the daily 10-year bond yield (see also Humpe and Macmillan (2009) and Campbell and Thompson (2007) for more on the relationship between stock markets and the long-run bond yield), while the CBOE VIX index, which is based on option-implied volatility, is used as a measure of market fear and risk. In addition, one employs the Market Default Likelihood Index (MDLI) of Andreou (2015) to account for market-wide distress and the overall probability of default. All data were obtained from the St Louis Fed (see footnote 4).
To account for September 2008, the month of the Lehman collapse, a dummy variable taking the value of one during that period is also used. In addition, to examine whether abnormal behavior in the rejection rate took place during the whole recession period, a dummy that takes the value of 1 from December 2007 to June 2009, and 0 otherwise, was also included in the estimation. Recession dates were obtained from the National Bureau of Economic Research (NBER). To avoid issues related to causality through the use of contemporaneous terms, all macroeconomic variables are included with a lag. This is also more practical, since their usefulness in predicting the rejection rate can then be examined. The results of the estimation can be found in Table 4, panels (a) to (d).

The Kolmogorov-Smirnov test
Figure 1 shows the test results for each of the 242 firms under examination. In particular, Figure 1 indicates the percentage, out of a total of 2,773 (3,773 minus the 1,000-observation window) estimation days, in which the estimation distribution was not the same as the forecast distribution, for each of the 242 firms under study. At the two ends of the spectrum, some forecast distributions were not equal to the estimation sample 25% of the time, while there are also cases in which forecast distributions were always equal to the estimation ones. On average, estimation and forecast distributions were different on only 5% of total days (Table 1).
The fact that distributions were unequal on only 5% of total trading days would support the use of the Markowitz procedure. However, this result hides an important caveat: the number of stocks whose returns differed from their forecast distribution has not been uniform across time. Figure 2 shows that the number of stocks that rejected the hypothesis of equality on each given day, as a percentage of total stocks, skyrocketed during the financial crisis.
In particular, starting from September 2008, when the share of stocks with different forecast distributions stood at 5%, the percentage rose to more than 60% in February 2009, returning to 5% in September 2009, a year after the upswing began. During that period, an average of 32.6% of stock returns rejected the hypothesis of equality (Table 1). Excluding these 12 months, the average hypothesis rejection drops to just 2.1%.
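The two summaries discussed above, the per-firm share of rejecting days and the per-day share of rejecting stocks, can both be derived from one table of test outcomes. The sketch below assumes a hypothetical layout in which `reject[d][k]` is True when stock `k` rejected equality for the window ending on day `d`; the function name is likewise an assumption.

```python
def rejection_rates(reject):
    """Return (per_day, per_stock) rejection percentages from a boolean table."""
    n_days, n_stocks = len(reject), len(reject[0])
    # Share of stocks rejecting on each day (the series plotted against dates).
    per_day = [100.0 * sum(row) / n_stocks for row in reject]
    # Share of days on which each stock rejected (the per-firm percentages).
    per_stock = [100.0 * sum(row[k] for row in reject) / n_days
                 for k in range(n_stocks)]
    return per_day, per_stock
```

Averaging `per_day` over a subset of dates gives period averages of the kind quoted in the text for the crisis months and for the remaining sample.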
Robustness checks (Figure 3), [...]
An interesting implication of the test results is that the longer the forecast horizon, the larger the average rejection rate appears to be. This makes intuitive sense, since a longer horizon allows for more values that can change the empirical distribution of the returns; it nevertheless provides a rationale against long-run forecasts. In the case of statistical modeling, practitioners can expect that short-term realizations will come from the same distribution as past returns. However, as the horizon lengthens, realizations will most likely not belong to the same empirical distribution, with all the implications this entails for stock market forecasting.

The Cramér-von Mises test
Robustness tests are conducted using the Cramér-Von Mises test. Figure 4 shows the Cramér-Von Mises test results for each of the 242 firms under study. In particular, Figure 4 indicates the percentage, out of a total of 2,773 (3,773 minus the 1,000-observation window) estimation days, in which the estimation distribution was not the same as the forecast distribution, for each of the 242 firms under study. At the two ends of the spectrum, some forecast distributions were not equal to the estimation sample 79% of the time, while there are also cases in which forecast distributions were not equal to the estimation ones only 14% of the time. On average, estimation and forecast distributions were different on 38% of total days.
Note: Figure 2 depicts the percentage of individual stock returns which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 100. The dates depict the end-date of each rolling sample.
The results of the Cramér-Von Mises test strengthen the caveat raised by the Kolmogorov-Smirnov test, namely that the number of stocks whose returns differed from their forecast distribution has not been uniform across time. Figure 5 shows that the number of stocks that rejected the hypothesis of equality on each given day, as a percentage of total stocks, increased rapidly during the financial crisis.
In particular, before September 2008, an average of 38% of stock returns rejected the hypothesis of equality. Between September 2008 and September 2009, on average, 87% of stock returns rejected the hypothesis of equality. During these 12 months, 100% of stock returns rejected the hypothesis of equality on 85 days. After the crisis, the share of stock returns rejecting the hypothesis of equality returns to an average of 29%.
Checks using a forecast sample of 50 (Figure 6, panel (a)) and 200 (Figure 6, panel (b)) observations show a picture that is, by and large, similar to the one obtained with the 100-observation forecast sample. In particular, panel (a) suggests that the impact is more concentrated between October 2008 and [...]
Note: Figure 3 depicts the percentage of individual stock returns, which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 50 and 200 for panels (a) and (b), respectively. The dates depict the end-date of each rolling sample.
Note: Figure 5 depicts the percentage of individual stock returns, which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 100. The dates depict the end-date of each rolling sample.

The Wilcoxon rank-sum test
Robustness tests are conducted using the Wilcoxon rank-sum test. Figure 7 shows the Wilcoxon rank-sum test results for each of the 242 firms under examination. Specifically, Figure 7 shows the percentage, out of a total of 2,773 (3,773 minus the 1,000-observation window) estimation days, in which the estimation distribution was not the same as the forecast distribution, for each of the 242 firms under study. At the two ends of the spectrum, some forecast distributions were not equal to the estimation sample 14% of the time, while there are also cases in which forecast distributions were always equal to the estimation ones. On average, estimation and forecast distributions were different on 3% of total days.
Note: Figure 6 depicts the percentage of individual stock returns, which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 50 and 200 for panels (a) and (b), respectively. The dates depict the end-date of each rolling sample.
While the overall percentages are weaker using this statistical test, the number of stocks that rejected the hypothesis of equality on each given day, as a percentage of total stocks, rises steeply during the financial crisis. In particular, as presented [...]
Note: Figure 7 depicts the percentage of rejections of each stock, i.e., the number of samples for which the hypothesis of equal variance was rejected by the variance ratio test, over the total number of samples for each stock. The x-axis reflects each stock available from January 1, 2000 till December 31, 2014.
Note: Figure 8 depicts the percentage of individual stock returns, which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 100. The dates depict the end-date of each rolling sample.
The robustness checks (Figure 9) that have been implemented using a forecast sample of 50 (panel [...]
Note: Figure 9 depicts the percentage of individual stock returns, which rejected the equality hypothesis at each given date, measured as the number of stocks for which the hypothesis was rejected at every given sample, over the total number of stocks. The rolling sample size was set at 1,000 daily observations and the forecast sample at 50 and 200 for panels (a) and (b), respectively. The dates depict the end-date of each rolling sample.

Equality rejection and macroeconomic conditions
As the estimation results suggest, the macroeconomic variables do not appear to have had an impact on the level of the rejection rate, since neither the interest rate nor the VIX index is found to be statistically significant. In contrast, the significance of the Lehman dummy illustrates that the increase in the percentage of unequal distributions during that period cannot be associated with any other development. The results also underline the connection between overall market distress and the number of rejections. The impact is statistically significant throughout all the estimations, while the positive sign suggests that as market distress increases, the rejections of the equality hypothesis also increase. As the standard deviation of the MDLI is approximately 3.67, a one-standard-deviation shock would mean an increase of 0.11 percentage points in the rejection rate. Although the number is small, it should be borne in mind that, given the magnitude of the AR terms, the shock will be highly persistent (footnote 5).
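The size of the MDLI effect can be unpacked with a line of arithmetic. The coefficient shown is merely the value implied by the two figures quoted above, and the AR(1) form with coefficient $\phi$ is a hypothetical simplification used only to illustrate persistence; the estimated values are those reported in Table 4.

```latex
\hat{\beta}_{\text{MDLI}} \times \sigma_{\text{MDLI}} \approx 0.11
\;\Rightarrow\;
\hat{\beta}_{\text{MDLI}} \approx \frac{0.11}{3.67} \approx 0.03,
\qquad
\sum_{k=0}^{\infty} \phi^{k} \times 0.11 = \frac{0.11}{1-\phi}, \quad |\phi| < 1 .
```

Under this simplification, the closer $\phi$ is to one, the larger the cumulative response to a one-off shock, which is why a small impact coefficient can still matter when the rejection rate is highly persistent.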
Moving to the volatility equation, the ARCH and GARCH terms are statistically significant throughout all panels. Moreover, the asymmetry term also appears to be statistically significant and negative, suggesting that positive shocks have a different impact on volatility. Finally, while the housing boom and recession dummies are not significant in the mean equation, the conditional volatility of the rejection rate decreased during the period of the housing boom and increased during the recession (panel (d)), again suggesting that macroeconomic conditions have at least some impact on the rejection rate.

DISCUSSION
This study examines whether the daily returns of 242 companies with continuous trading in the S&P index during 2000-2014 come from the same distribution, using the Kolmogorov-Smirnov, Cramér-Von Mises, and Wilcoxon rank-sum tests. The findings suggest that although the distribution of returns generally tends to be equal, during the Great Recession the equality hypothesis is frequently rejected, and the rejection rate tends to increase with the forecast horizon. The longer the forecast horizon, the larger the average rejection rate appears to be, which argues against long-run forecasts. The choice of a shorter forecast period is not a panacea, though: in times of distress, the percentage of firms whose forecast sample does not belong to the same distribution as the estimation sample increases by much more when the horizon is shorter. The reaction is nonetheless shorter-lived compared to longer horizons. As such, while short-term forecasting is likely to be more accurate most of the time, it can be more prone to errors in periods of turmoil.
The results of the Kolmogorov-Smirnov test come with the caveat that the number of stocks whose returns differed from their forecast distribution has not been uniform across time. Specifically, rejections of the equality hypothesis increased rapidly during the financial crisis. The robustness checks conducted using both the Cramér-Von Mises test and the Wilcoxon rank-sum test qualitatively replicate this picture.
The results are robust across all three statistical tests and the different forecast windows. We can conclude that our results are not driven either by the method used to test distribution equality or by the number of observations used in the forecast sample. Regardless of the level of the rejection percentage or the peak share of stocks that reject the equality hypothesis, the picture is consistent. The findings suggest that while, in general, the equality of distribution hypothesis holds, the percentage of rejections increases rapidly during the financial crisis.
A further implication of the results is that macroeconomic developments should play an important role, since the percentage of rejections rises significantly during the Great Recession. The possible correlation between macroeconomic changes and the equality of distributions should hold when using any portfolio strategy that uses the past to predict the future. As such, it is not limited to the mean-variance optimal portfolio procedure.
The findings suggest that the rejection rate is highly persistent and does not appear to have a relationship with the usual macroeconomic variables. Nonetheless, a market distress index can capture, to some extent, the developments in the rejection rate. This finding, along with the result that conditional volatility is related to macroeconomic states such as the housing boom and the recession, suggests that the rejection of equality of distribution is, to some extent, correlated with changes in the macroeconomic environment.

CONCLUSION
The main conclusion is that the usual assumption that the distribution of past returns equals that of future returns holds in general. However, timing is important when it comes to using the mean-variance portfolio in practice: for example, during the Great Recession, stock returns rejected the hypothesis of equality more than one out of five times. In normal conditions the rejection rate increases with the forecast horizon, while in times of market distress it increases substantially more in the short run. Macroeconomic variables were not found to be statistically significant determinants of the rejection rate, with a large unexplained part remaining during the month of the Lehman collapse, while market distress has a small but significant effect.
The above findings bear important implications. First, the results from this empirical study on the goodness-of-fit of the "past returns can be used for optimal future allocation of resources" doctrine suggest that while this holds in general, the profitability of the mean-variance portfolio (or any other strategy which allocates weights based on past returns) will be time-dependent. As the results further suggest, it could be the case that the optimality of portfolio selection may even be macro-dependent. Second, the