“Creating better tracking portfolios with quantiles”

Tracking error is a ubiquitous tool among active and passive portfolio managers, widely used for fund selection, risk management, and manager compensation. This paper shows that traditional measures of the tracking error are incapable of detecting variations in skewness and kurtosis. As a solution, this paper introduces a new class of Quantile Tracking Errors (QuTE), which measures differences in the quantiles of return distributions between a tracking portfolio and its benchmark. Through an extensive simulation study, this paper shows that QuTE is six times more sensitive than traditional tracking measures to skewness and three times more sensitive to kurtosis. The QuTE statistic is robust to various calibrations and can easily be customized. By using the QuTE tracking measure during the Dot Com bubble and the Great Recession, this paper finds differences between the DIA and its benchmark, the DJIA, that otherwise would have gone undetected. Quantile-based tracking provides a robust method for relative performance measurement and index portfolio construction.


INTRODUCTION
Index fund managers and risk managers use inadequate tools to track a portfolio's relative performance. The most commonly used tracking error measures are cast as squared deviations between a tracking portfolio and its benchmark, and thus are focused only on the mean and variance of returns. This type of quadratic structure is inconsistent with the linear performance fees through which most managers are compensated (Kritzman, 1987). Instead, managers are incentivized to avoid extreme return deviations (Rudolf et al., 1999), which implies that higher order moments, such as kurtosis, are relevant. Moreover, Beasley et al. (2003) suggest that managers are incentivized to avoid consistently underperforming their benchmark, suggesting that skewness is also relevant.
Dorocáková (2017) and Blume and Edelen (2004) point out that the goal of a tracking error is to measure how closely a portfolio replicates its associated benchmark. There is a preponderance of evidence that asset returns are non-Gaussian (Mills, 1995; Chung et al., 2006). Therefore, tracking only the first two moments, as conventional measures do, is insufficient.
Other shortcomings of traditional tracking error measures have been cited. For instance, Pope and Yadav (1994) illustrate the bias in the tracking error due to serial correlation in returns. Moreover, Ammann and Tobler (2000) recognize that tracking error variance is subject to sampling error.
This paper makes two contributions to the literature on portfolio tracking. First, this paper details a previously undocumented shortcoming of traditional tracking errors. Through a simulation study, this paper shows that traditional tracking errors (such as average tracking error and tracking error volatility) fail to detect situations in which the skewness and/or kurtosis of the tracking portfolio differs from that of the associated benchmark.
The second contribution of this paper is to introduce a class of quantile-based tracking errors (QuTE). As this paper will discuss in Section 2.1, there are many variants of the tracking error. Some have symmetric loss functions, structured via absolute or squared deviations. Other variants incorporate asymmetries via semi-standard deviations, which are aligned with downside risk. Each has an analogue within the quantile-based measures. This paper shows that even the most basic of these QuTE measures can detect deviations in higher order moments of returns.
This paper begins with a detailed accounting of the traditional measures of tracking error alongside the newly proposed quantile-based measures. Then, the paper conducts an extensive simulation study to explore the relative merits of QuTE. Finally, this paper documents historical episodes where QuTE was able to detect important differences between a tracking portfolio and its benchmark, while the traditional measures were unresponsive.

LITERATURE REVIEW
The term "Tracking Error" has evolved over time and is used in myriad contradictory ways by academics and practitioners. To facilitate the discussion, this paper attempts to standardize the terminology and to provide a comprehensive list of many variants of the tracking error. Define the price at time $t$ as $P_t$, and the return from $t-1$ through $t$ as $r_t$. Denote $r^P$ as the return on the tracking portfolio, $r^B$ as the return on the associated benchmark, and $T$ as the sample size (e.g., days) over which the portfolio is being tracked.
where (x)_ indicates taking only the positive elements of x. This can be annualized by multiplying the above measures by √M, where M is the number of periods per year.
Equation (1) first appeared in the academic literature in Franks (1992), which defined it simply as "excess of benchmark returns". Among practitioners, the object in Equation (1) is sometimes referred to as Tracking Difference¹. Roll (1992) refers to this object as "Tracking Error", a usage that is common in the subsequent academic literature, and this paper reserves that terminology throughout the balance of the text. Note that the object in Equation (2) is simply an average of the Tracking Error over a period.
The object in Equation (3) is the next most used variant of the term Tracking Error. Franks (1992) refers to this object as Tracking Error, whereas Roll (1992) refers to it as Tracking Error Volatility (TEV). Many subsequent academic studies (Jorion, 2004) use the TEV terminology. Moreover, Equation (3) is commonly referred to as Tracking Error among practitioners². Often this is reported as an annualized value³. Equation (4) is subtly distinct and is used less often in the literature than Equation (3). Used by Ammann and Tobler (2000), it captures the square root of the sum of the squared tracking errors. Root Mean Squared Tracking Error (RMSTE) in Equation (5) was used by Chincarini and Kim (2006) to capture both the variability and the level of the tracking errors.
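The verbal descriptions of Equations (1) through (5) can be sketched in Python as follows. This is a minimal illustration rather than the paper's own code: the function name `tracking_measures`, the sign convention (portfolio minus benchmark), and the use of the sample standard deviation (`ddof=1`) for TEV are assumptions, since the exact normalizations are not restated in the text.

```python
import numpy as np

def tracking_measures(r_p, r_b):
    """Traditional tracking measures for portfolio returns r_p and benchmark returns r_b.

    Hypothetical helper: sign convention and ddof are assumptions.
    """
    r_p = np.asarray(r_p, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    te = r_p - r_b  # Equation (1): per-period tracking error (assumed portfolio minus benchmark)
    return {
        "TE": te,
        "ATE": te.mean(),                     # Equation (2): average tracking error
        "TEV": te.std(ddof=1),                # Equation (3): tracking error volatility (sample std assumed)
        "TER": np.sqrt(np.sum(te ** 2)),      # Equation (4): root of the sum of squared tracking errors
        "RMSTE": np.sqrt(np.mean(te ** 2)),   # Equation (5): root mean squared tracking error
    }

if __name__ == "__main__":
    measures = tracking_measures([0.02, -0.01, 0.03], [0.01, 0.00, 0.01])
    for name in ("ATE", "TEV", "TER", "RMSTE"):
        print(name, measures[name])
```

Under these assumptions TER and RMSTE differ only by a factor of √T, which is one reason the two measures tend to behave alike.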
As noted by Kritzman (1987), portfolio managers are rewarded by linear performance fees based upon the differences between their portfolio and the corresponding benchmark. Rudolf et al. (1999) argue that, because of this, linear deviations between the portfolio and benchmark give a more accurate description of the investors' risk attitudes than do squared deviations. As such, tracking measures based on absolute, rather than squared differences,

METHOD AND SIMULATION STUDIES
This section introduces a class of tracking errors based on the difference in the quantiles of the tracking portfolio and its benchmark, referred to as the Quantile Tracking Error (QuTE). After introducing QuTE, this paper explores the differences between QuTE and traditional TE tracking measures by conducting simulation studies. Of particular importance, in subsection 2.2, is the sensitivity of each measure to differences in the empirical distributions of the benchmark and tracking portfolio. Subsections 2.3 and 2.4 focus on the robustness of QuTE to various calibrations.

Method
Set a grid of returns whose points delimit groups with equal probability of occurring, so that the number of groups is one less than the number of grid points. Then denote r(τ) as the τ-th quantile of a return distribution on this grid.
This paper defines several tracking error variants within the QuTE class. Intuitively, QuTE compares two assets via differences in the quantiles of their respective return distributions. This is especially useful in finance, given the preponderance of returns with excess skew and kurtosis, and the ability of quantile-based methods to capture these distributions (Rostek, 2010). Moreover, a quantile-based approach is consistent with the utility maximization via quantile maximization of Rostek (2010), as well as with Giovannetti (2013), who builds an asset pricing model consistent with CRRA preferences via quantile maximization.
Since the Value-at-Risk (VaR) is merely a quantile of a return distribution, QuTE can be seen as matching on the space of VaRs at various levels. Yamai and Yoshiba (2002) show that portfolio ranking via VaR is consistent with expected utility maximization and is free of tail risk. This paper adapts the findings of Rostek (2010), who characterizes the behavior of an agent who evaluates different (investment) alternatives by the τ-th quantile of the implied (return) distributions and selects the one with the highest quantile payoff. Investors' preferences can thus be represented via the quantiles of the associated return distribution. In the context of benchmark tracking, the investor's preferences for deviations from their benchmark can be cast via the differences in the quantiles of the portfolio and benchmark. Portfolio construction with VaR-based objective functions is increasingly common (see Gaivoronski & Pflug (2005) for recent examples). Moreover, a quantile-based approach⁴ is especially attractive, given the prevalence of VaR for portfolio risk management (Föllmer & Leukert, 1999).
Notice the similarities with the tracking error measures defined in Section 1. Importantly, the averaging in the QuTE class is not done over time T, but rather across quantile levels. The QuTE measures never force portfolio managers to compare their portfolios to the benchmark on a daily basis⁵. This might mitigate the problem of "short termism" as indicated by Ma
In the weighted QuTE variants, λ(τ) is the importance of quantile τ to the overall tracking error measure. Beasley et al. (2003) do not discuss weighting schemes, but given that they direct the weightings over time, any of the numerous time series lag functions might suffice (e.g., Almon). In this paper, the importance weights are linked to the area of the return distribution the user finds most important. Analogous to choosing the quantile level for risk buffers in Basel (e.g., 5% VaR), the importance of specific quantiles can be designated. For tractability and interpretation, this paper recommends scaling the weights such that they sum to one. Section 2.4 offers two approaches to scaling: equal quantile weight and total return attribution.
⁴ Note that a natural analogue to QuTE is moment-based matching, rather than quantile-based matching. One could use a method of moments type estimator to match a select set of empirical moments between the benchmark and optimal portfolio. Although potentially attractive, a moment-based approach lacks the flexibility of a nonparametric quantile-based method.
⁵ The one-to-one mapping between returns and quantile levels permits leveraging the distribution matching literature and casting QuTER within the Fidelity family of similarity measures.
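The idea of averaging across quantile levels rather than across time can be sketched in Python. This is an illustration under stated assumptions, not the paper's definition: the function name `quter`, the evenly spaced interior quantile grid, and the functional form (root of the weighted sum of squared quantile differences, chosen by analogy with TER) are all hypothetical.

```python
import numpy as np

def quter(r_p, r_b, n_levels=100, weights=None):
    """Sketch of a quantile tracking error (assumed form, by analogy with TER)."""
    # Assumed grid of interior quantile levels: 1/(n+1), ..., n/(n+1).
    taus = np.arange(1, n_levels + 1) / (n_levels + 1)
    q_p = np.quantile(r_p, taus)  # portfolio quantiles r^P(tau)
    q_b = np.quantile(r_b, taus)  # benchmark quantiles r^B(tau)
    if weights is None:
        # Equal importance weights lambda(tau), scaled to sum to one.
        weights = np.full(n_levels, 1.0 / n_levels)
    weights = np.asarray(weights, dtype=float)
    return float(np.sqrt(np.sum(weights * (q_p - q_b) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r_b = rng.normal(size=438)
    shuffled = rng.permutation(r_b)  # same distribution, different timing
    print("TER of shuffled series  :", np.sqrt(np.sum((shuffled - r_b) ** 2)))
    print("QuTER of shuffled series:", quter(shuffled, r_b))
    print("QuTER after a 1% shift  :", quter(r_b + 0.01, r_b))
```

Because the comparison is made on sorted returns, a portfolio that delivers the benchmark's returns in a different order has a zero quantile tracking error in this sketch, even though its period-by-period tracking error is not zero.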

Sensitivity to differences in return distributions
In this subsection, a simulation study is conducted to evaluate the traditional tracking error measures of Section 1, as well as the QuTE-based measures of Section 2.1. A toy exercise is crafted that, while simple in nature, permits highlighting the sensitivity of the tracking errors to differences in the underlying return distributions. Given the preponderance of evidence citing skewness and kurtosis (see Chung et al. (2006) and Mills (1995), among others) in asset returns, coupled with the calls for linear performance measures (Rudolf et al., 1999; Kritzman, 1987), this paper considers deviations in these "higher order" moments.
The simulations begin by creating a benchmark portfolio. For simplicity, assume the returns of the benchmark follow a standard Normal distribution. Then, calibrate the length and empirical moments of the benchmark to match those of the monthly returns on the Dow Jones Industrial Average over the period 1985 through June 2021. This same index is used in a Case Study detailed in Section 3.1. The simulations contain 10,000 paths, each of length 438 months.
Next, generate a tracking portfolio that follows one of five distinct distributions depicted in Table 1. In Case 0, the tracking portfolio has the same distribution as the benchmark portfolio. In Case 1, they differ only in the mean. Similarly, Case 2 varies in terms of variance, Case 3 in terms of skewness, and Case 4 in terms of kurtosis⁶.
This exercise explores the ability of the various traditional tracking measures to detect differences in the mean (standard deviation, skewness, kurtosis) of the tracking portfolio and benchmark. As noted in Section 1, the TEV depicted in Equation (3) is the most used tracking measure among academics and practitioners. The TEV is compared to ATE, TER, and RMSTE⁷.
⁶ Each series was simulated within Matlab using the pearsrnd function for a Pearson system of random numbers with moments calibrated to match the mean, standard deviation, skewness, and kurtosis of the monthly return of the Dow Jones Industrial Average over the period 1985 through June 2021.
⁷ The measures of absolute and semi tracking error are beyond the scope of this paper.
⁸ This paper also considers excess standard deviation in the range -0.9 to 4, excess skewness in the range -1.4 to 1.4, and excess kurtosis in the range -1.5 to 4.5.
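The five cases can be mimicked in Python with NumPy alone. This sketch deliberately swaps in simpler generators for Matlab's pearsrnd: a standardized gamma draw for skewness (skewness 2/√k) and a standardized Student t draw for kurtosis (excess kurtosis 6/(df - 4)). The function name `simulate_case`, the shift of 0.5, the scale of 2.0, and the shape and degrees-of-freedom values are illustrative assumptions, not the paper's DJIA calibration.

```python
import numpy as np

def simulate_case(case, n_paths=1000, n_obs=438, seed=0):
    """Simulate tracking-portfolio returns for Cases 0-4 (illustrative magnitudes only)."""
    rng = np.random.default_rng(seed)
    size = (n_paths, n_obs)
    if case == 0:   # same distribution as a standard Normal benchmark
        return rng.standard_normal(size)
    if case == 1:   # mean differs (hypothetical shift of 0.5)
        return 0.5 + rng.standard_normal(size)
    if case == 2:   # variance differs (hypothetical scale of 2.0)
        return 2.0 * rng.standard_normal(size)
    if case == 3:   # skewness differs: standardized gamma, skewness 2 / sqrt(k) = 0.5
        k = 16.0
        return (rng.gamma(shape=k, scale=1.0, size=size) - k) / np.sqrt(k)
    if case == 4:   # kurtosis differs: unit-variance Student t, excess kurtosis 6 / (df - 4) = 1
        df = 10.0
        return rng.standard_t(df, size=size) / np.sqrt(df / (df - 2.0))
    raise ValueError("case must be an integer from 0 to 4")

if __name__ == "__main__":
    benchmark = simulate_case(0, seed=1)   # use a different seed than the tracking portfolio
    tracking = simulate_case(3, seed=2)
    print(benchmark.shape, tracking.shape)
```

Cases 3 and 4 keep the mean at zero and the variance at one, so any change in a tracking measure relative to Case 0 can be attributed to the third or fourth moment.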
First, vary the mean return of the tracking portfolio in excess of the benchmark (i.e., excess mean) in the range S ∈ {-5% to 5%}⁸. Next, compute the ATE, TER, RMSTE and TEV for each of these values of excess mean by taking the average over simulation paths. Finally, scale⁹ the values for each of the cases for ease of visual comparison. Panel A of Figure 1 depicts the ATE, TER, RMSTE and TEV values over the range of excess mean values. Panels B, C, and D similarly reflect excess standard deviation, skewness, and kurtosis.
A desirable measure of tracking error should achieve a minimum at an excess mean (standard deviation, skewness, kurtosis) of 0, i.e., when there is no difference between the tracking portfolio and benchmark, the tracking error measure should be at its low point. According to Figure 1, ATE is unable to detect changes in any of the four moments. Meanwhile, TEV performs similarly to TER and RMSTE across Cases 2 through 4. In this sense, TEV is roughly equivalent to TER and RMSTE.
Next, compare the traditional and quantile-based tracking measures in terms of their abilities to detect differences in the underlying statistical distributions of the benchmark and tracking portfolios. The comparison is centered around the TER of Equation (4) and the QuTER of Equation (12). Note the prior finding that TER is roughly equivalent to the popular TEV, which makes this comparison relevant. Moreover, QuTER is a direct analogue of TER, providing a fair comparison. Table 2 explores these relative sensitivities by computing the percent change in the (Qu)TER statistic relative to Case 0. The greater the percent change in the (Qu)TER in Case 1 relative to Case 0, the more sensitive that measure is to variations in the means of the two series.
Table 2 reports the sensitivity of TER and QuTER to variations in the distributions of the tracking portfolio and benchmark. Each cell represents the percent change in the associated tracking measure relative to Case 0, averaged over 10,000 simulated paths. The row labeled PVal reports the p-value from a two-tailed test of equal means.
The p-value of 0 for Case 1 in Table 2 implies that the percent change in the QuTER statistic for Case 1 relative to Case 0 is not equal to the percent change in the TER statistic for Case 1 relative to Case 0. In fact, QuTER and TER have unequal sensitivities to differences in each of the first four statistical moments. Moreover, one-tailed t-tests suggest that the QuTER is in fact more sensitive than TER in all Cases.
These findings are explored further by conducting a sensitivity analysis similar to the exercise above. Again, vary the degree of mean returns in the tracking portfolio in excess of the benchmark (i.e., excess mean) in the range S ∈ {-5% to 5%}¹⁰. Next, compute the TER and QuTER for each of these values of excess mean, simulated and averaged over 10,000 paths. Finally, scale the values for each of the cases for visual comparison. Panel A of Figure 2 depicts the TER and QuTER values over the range of excess mean values. Panels B, C, and D similarly reflect excess standard deviation, skewness, and kurtosis. Again, a desirable measure of the tracking error should achieve a minimum at an excess mean (standard deviation, skewness, kurtosis) of 0, i.e., when there is no difference between the tracking portfolio and benchmark, the tracking error measure should be at its low point.
Panel A of Figure 2 suggests that TER and QuTER are both sensitive to variations in the mean return of the tracking portfolio and benchmark. To facilitate a statistical comparison between TER and QuTER, this paper conducts a regression that projects the standardized tracking errors upon the absolute moment differences (excess moments) between the tracking portfolio and benchmark. An error group dummy variable and an interaction term with the moment difference are added to the regression to explore whether the two tracking errors respond differently.
Consider Case 3 as an illustration. Define the response variable as $Y_i = \text{z-score}(\mathrm{Error}_i)$ for $i \in S$, where $\mathrm{Error}_i$ is the QuTER or TER value. The tracking errors are standardized within each error group to ease comparison. For example, each QuTER value is standardized using the mean and standard deviation of the QuTER values alone. Now, define the excess moment, excess skewness here, as $Em_i = |\mathrm{skew}_i^P - \mathrm{skew}_i^B|$ for $i \in S$. The excess mean, standard deviation, and kurtosis are all defined analogously. The error group dummy $D_{\mathrm{QuTER}}$ is set to 1 if the QuTER is used to measure the tracking error.
The regression is specified as follows: $Y_i = \alpha + \beta_1 Em_i + \beta_2 D_{\mathrm{QuTER},i} + \beta_3 D_{\mathrm{QuTER},i} Em_i + e_i$, with typical assumptions on the error term. The object of interest is testing whether $\beta_3 = 0$, which would imply that the tracking error measures behave similarly as the excess moment rises. Further, the directionality can be gauged from the sign of the estimated coefficient. For instance, a positive $\beta_3$ from the skewness regression (Case 3) would imply that QuTER is more sensitive than TER to variations in skewness between the tracking portfolio and the benchmark.
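The dummy-and-interaction specification can be sketched with ordinary least squares in NumPy. This is a minimal illustration, assuming the TER and QuTER values have already been computed on a common grid of excess moments; the function name `interaction_regression` and the toy inputs are hypothetical, and no inference (standard errors, t-tests) is included.

```python
import numpy as np

def interaction_regression(em, err_ter, err_quter):
    """Return (alpha, beta1, beta2, beta3) from Y = a + b1*Em + b2*D + b3*D*Em + e."""
    em = np.asarray(em, dtype=float)

    def zscore(values):
        values = np.asarray(values, dtype=float)
        return (values - values.mean()) / values.std(ddof=1)

    # Standardize within each error group, then stack TER first and QuTER second.
    y = np.concatenate([zscore(err_ter), zscore(err_quter)])
    em_stacked = np.concatenate([em, em])
    d_quter = np.concatenate([np.zeros_like(em), np.ones_like(em)])  # dummy: 1 for QuTER rows
    design = np.column_stack(
        [np.ones_like(em_stacked), em_stacked, d_quter, d_quter * em_stacked]
    )
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

if __name__ == "__main__":
    em = np.linspace(0.0, 1.0, 50)                 # toy grid of excess moments
    err_quter = 2.0 * em                           # toy error that rises with the excess moment
    err_ter = (-1.0) ** np.arange(50)              # toy error that barely responds
    alpha, b1, b2, b3 = interaction_regression(em, err_ter, err_quter)
    print("beta_3 =", b3)
```

With a full set of dummy and interaction terms, $\beta_1$ is the slope for the TER group and $\beta_1 + \beta_3$ is the slope for the QuTER group, so the sign of $\beta_3$ indicates which measure responds more strongly.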
Table 3 demonstrates that the estimated $\beta_3$ is positive and statistically significant for Cases 3 and 4. Specifically, the QuTER statistic is roughly six times more sensitive than TER to deviations in skewness and three times more sensitive than TER to deviations in kurtosis. This finding aligns with Figure 2, where QuTER appears to detect changes in the third and fourth moments, while TER is unable to do so.
Table 3 reports regression results of the standardized QuTER and TER upon absolute excess statistical moments across each of the 100 percentiles. Case 1 captures excess mean as $|\mathrm{mean}_i^P - \mathrm{mean}_i^B|$. Case 2 captures excess standard deviation as $|\mathrm{stddev}_i^P - \mathrm{stddev}_i^B|$. Case 3 captures excess skewness as $|\mathrm{skew}_i^P - \mathrm{skew}_i^B|$. Case 4 captures excess kurtosis as $|\mathrm{kurtosis}_i^P - \mathrm{kurtosis}_i^B|$. Table entries refer to the slope estimate averaged across 10,000 paths.

Robustness to quantile grid granularity
This subsection explores whether the granularity of the quantile grid for the QuTE statistics impacts their ability to detect differences in the distributions of the tracking portfolio and the benchmark.
The exercise of Section 2.2 is repeated by simulating the benchmark returns as simple Gaussian noise and then varying the tracking portfolio in four ways: Case 1 alters the mean, Case 2 alters the variance, Case 3 alters the skewness, and Case 4 alters the kurtosis. Figure 3 depicts the percentage change in the QuTER statistic in a given Case relative to Case 0. The x-axis varies the size of the quantile grid. The reported values are the median across 10,000 simulated paths.
The percentage change in the QuTER statistic falls as the number of quantiles in the grid rises. The relationship appears to plateau near 10 quantiles, indicating that the QuTER measure is robust to the choice of quantile grid.

Impact of varying quantile weights
This section explores whether variations in the quantile weighting scheme affect QuTE's ability to detect deviations between the distributions of the tracking portfolio and benchmark.
Blitz and Hottinga (2001) illustrate how to compare various investment strategies via a Tracking Error framework. In a similar vein, various quantiles are weighted by whatever criterion is most important to the investor. Four different weighting schemes are considered: equal weight, tail risk weight, downside risk weight, and total return attribution.
For the equal weight scheme, each quantile has equal importance. For the tail risk weighting scheme, set λ = 0 for quantiles 1-5% and quantiles 95-100% and λ = 1/90 for all other quantiles. For the downside risk weighting scheme, set λ equal among all quantiles with downside deviations. This scheme is inspired by loss aversion à la Kahneman and Tversky (1979), and is closely connected to the Semi-Standard Deviation based (quantile) tracking errors. Finally, consider a total return attribution weighting scheme, wherein each quantile is weighted according to its contribution to the portfolio's total return. Specifically, the relative frequency of return observations that fall within each bin is computed. Then, the average bin return times the relative frequency is divided by the total portfolio return¹¹ to compute the attribution of any given bin. By design, these attributions sum to 1, and thus are viable choices for quantile weights λ.
Figure 4 illustrates how the QuTER objective function varies with the four aforementioned weighting schemes using the structure from Section 2.2. The height of each bar is the associated QuTER averaged over 10,000 paths. The number above each bar is the gross change of that average QuTER statistic relative to Case 0. For instance, the 1.1 above the first bar in Case 1 implies that the QuTER value for the equal weight scheme in Case 1 is 1.1 times as large as the equal weight QuTER statistic for Case 0. The legend can be read as follows: EW = Equal Weight, TR = Total Return Attribution, Tail = Tail Risk, and Down = Downside Risk.
Figure 4 reports the value of QuTER in five cases: Case 0, tracking and benchmark portfolio come from the same distribution; Case 1, means differ; Case 2, variances differ; Case 3, skewness differs; and Case 4, kurtosis differs. See Section 2.2 for details. The height of each bar marks the QuTER value averaged over 10,000 simulated paths. The number on top of each bar represents the gross change of that QuTER value relative to Case 0.
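Three of the weighting schemes can be sketched in Python on a grid of 100 percentile levels. The function names are hypothetical, and two choices are assumptions made for illustration: the tail scheme zeroes percentiles up to 5 and above 95 so that exactly 90 levels keep the weight 1/90, and a "downside deviation" is read as a quantile at which the portfolio falls below the benchmark. Total return attribution is omitted because its binning details are not fully specified here.

```python
import numpy as np

def equal_weights(n_levels=100):
    """Equal importance for every quantile level; weights sum to one."""
    return np.full(n_levels, 1.0 / n_levels)

def tail_risk_weights():
    """Zero weight in both tails, 1/90 elsewhere (assumed cut-offs: <=5 and >95)."""
    pct = np.arange(1, 101)              # percentile levels 1, 2, ..., 100
    keep = (pct > 5) & (pct <= 95)       # the 90 interior percentiles
    return keep / keep.sum()

def downside_weights(q_p, q_b):
    """Equal weight on quantiles where the portfolio quantile is below the benchmark's."""
    down = np.asarray(q_p, dtype=float) < np.asarray(q_b, dtype=float)
    n_down = down.sum()
    if n_down == 0:
        return np.zeros(down.shape)      # no downside deviations at all
    return down / n_down

if __name__ == "__main__":
    print(equal_weights(4))
    print(np.count_nonzero(tail_risk_weights()))
    print(downside_weights([0.0, 0.1, -0.2], [0.05, 0.05, 0.0]))
```

Each scheme returns nonnegative weights that sum to one whenever at least one quantile is selected, so the resulting tracking errors remain comparable across schemes.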
A quantile weighting scheme of equal weight or total return attribution is robust to a wide array of

EMPIRICAL RESULTS AND DISCUSSION
In this section, two small case studies are conducted to illustrate the behavior of QuTE alongside a traditional measure of tracking error. The first case regards tracking the DJIA, while the second focuses on tracking the MSCI Emerging Markets Index. QuTER and TER measures are applied in both an unconditional and conditional setting.

Tracking the DJIA
The Dow Jones Industrial Average (DJIA) is the benchmark and the DIA SPDR ETF is the tracking portfolio. The DJIA is a leading index of equity market returns in the USA, launched on May 26, 1896, and with approximately 1,876.70 dollars indexed to its performance. The DIA is among the largest of the DJIA ETF tracking portfolios, with an average of 6,912,000 USD in daily volume since the inception date. It is also one of the oldest ETFs to track the DJIA portfolio, with an inception date of January 13, 1998¹³.
¹² These findings are similar for AQuTE and AAQuTE.
¹³ The adjusted close prices are from Yahoo Finance.
The dataset contains monthly log returns for both the DJIA (benchmark) and the DIA (tracking portfolio) over the period January 1998 to June 2021. Figure 5 depicts the time variation of these two return series overlaid upon one another. Simple visual inspection suggests they are quite similar. In fact, the correlation between these two return series is almost 1.
Figure 5 displays the relationship between the DIA and the DJIA monthly returns from January 1998 to June 2021.
Table 4 contains basic descriptive statistics such as mean, standard deviation, skewness, and kurtosis, as well as select quantiles of these two series. These statistics are complemented by overlaid histograms of the tracking portfolio and benchmark in Panel A, and a two-way QQ plot in Panel B, of the accompanying figure. In addition, Table 5 presents various measures of (quantile) tracking errors. Note that the TE and QuTE values are not directly comparable, given the different scaling of each measure.
Taken together, the above results reveal that the DIA has distributional properties that are remarkably similar to the DJIA, thereby supporting the visual inspection. Each of the moments and quan-
In a similar fashion, TER and QuTER statistics are computed between the benchmark and tracking portfolio. Panel A of Figure 9 depicts the tracking measures computed over rolling three-year windows, while Panel B depicts the month-to-month percent change in each tracking measure.
These movements in the QuTER reflect its sensitivity to differences in return distributions that were not detected by TER.
Another episode of interest is the Great Recession. The TE swings wildly from 0.72 to 0.83 over the period 2008 to 2010. The mean return differences, as depicted in Panel A of Figure 8, vary between 0.17 and 0.24, and with them TER varies between 0.72 and 0.88. Notice that skewness changed from 0.04 to 0.05 and kurtosis from -0.08 to -0.11 over that period¹⁴. QuTER captured these movements, increasing by almost 29 percent over that period, from 0.87 to almost 1.12, and outpacing the roughly 15 percent change in TER.
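The rolling computation described above (three-year windows of monthly returns, followed by month-to-month percent changes) can be sketched in Python. The function names are hypothetical, the QuTER form is the same assumed analogue of TER with equal weights, and the data in the demo are synthetic rather than the DIA and DJIA series.

```python
import numpy as np

def rolling_measures(r_p, r_b, window=36, n_levels=100):
    """Rolling TER and an assumed equal-weight QuTER over fixed-length windows."""
    r_p = np.asarray(r_p, dtype=float)
    r_b = np.asarray(r_b, dtype=float)
    taus = np.arange(1, n_levels + 1) / (n_levels + 1)   # assumed quantile grid
    ter, quter = [], []
    for end in range(window, len(r_p) + 1):
        p, b = r_p[end - window:end], r_b[end - window:end]
        ter.append(np.sqrt(np.sum((p - b) ** 2)))         # TER within the window
        dq = np.quantile(p, taus) - np.quantile(b, taus)  # quantile differences
        quter.append(np.sqrt(np.mean(dq ** 2)))           # equal-weight quantile analogue
    return np.array(ter), np.array(quter)

def pct_change(series):
    """Month-to-month percent change of a tracking measure."""
    series = np.asarray(series, dtype=float)
    return 100.0 * (series[1:] / series[:-1] - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r_b = rng.normal(0.005, 0.04, size=282)          # synthetic benchmark, 282 months
    r_p = r_b + rng.normal(0.0, 0.001, size=282)     # synthetic tracking portfolio
    ter, quter = rolling_measures(r_p, r_b)
    print(len(ter), pct_change(ter)[:3], pct_change(quter)[:3])
```

A sample of 282 monthly observations yields 247 overlapping 36-month windows, so each rolling series starts three years after the first observation.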

Tracking the MSCI Emerging Markets Index
In this case study, the tracking of the MSCI Emerging Markets Index is carefully investigated to exemplify the differences between TER and QuTER. The dataset consists of monthly simple returns over the period January 2013 through June 2021¹⁵.
The correlation between these two return series is 0.97 during this sample period. As depicted in Figure 10, the empirical distributions are similar. Nonetheless, as depicted in Figure 11, there are differences between the two series. Analogous to Figure 8, Figure 12 illustrates the time variation in the differences of the first four empirical moments of the benchmark and tracking portfolio. Panels B, C, and D show stark time variation in the differences of standard deviation, skewness, and kurtosis.
Figure 12 plots the differences in the moments of the benchmark and tracking portfolio. Panel A plots the mean differences. Panel B plots the standard deviation differences. Panel C plots the skewness differences. Panel D plots the kurtosis differences. All differences in moments were computed over 3-year rolling windows. The sample period is from January 2013 to June 2021.
TER is little changed during this period, as seen in Figure 13, ranging from approximately 0.9 to 1.2. Meanwhile, QuTER is able to detect these variations in the series, ranging between 0.8 and 1.7. The relative sensitivity of QuTER is even more stark in Panel B of Figure 13.

CONCLUSION
The purpose of this study is to develop better tracking portfolios. A key ingredient is having a robust measure of the differences between a candidate portfolio and its benchmark. Traditional tracking error measures like TEV and TER are insufficient. The QuTE class of tracking measures introduced in this study can detect important differences between two portfolios that are seemingly identical.
The simulations suggest that tracking performance relative to a benchmark with QuTER is statistically more powerful than using traditional measures. The QuTER statistic is robust to various calibrations, such as the choice of quantiles to match. Moreover, the quantiles chosen for matching can be weighted to reflect directions of deviation that are most important to the investor. The case studies illustrate this power in Emerging market and Developed market equities during the turbulent episodes of the Dot Com crash and the Great Recession.
Performance measurement and portfolio evaluation might benefit from including quantile-based measures alongside traditional tracking errors. Moreover, given the success exhibited by the case studies, managers of index and tracking portfolios should consider leveraging the QuTE class for portfolio construction.
[Figure panels: (A) Indexed TER and QuTER; (B) TER and QuTER monthly percent change]