“Automated trading systems’ evaluation using d-Backtest PS method and WM ranking in financial markets”

Given the popularity and propagation of automated trading systems in financial markets among institutional and individual traders in recent decades, this work attempts to compare and evaluate ten such systems based on different popular technical indicators, combined for the first time with the d-Backtest PS method for parameter selection. The systems use the technical indicators of Moving Averages (MA), Average Directional Index (ADX), Ichimoku Kinko Hyo, Moving Average Convergence/Divergence (MACD), Parabolic Stop and Reverse (SAR), Pivot, Turtle and Bollinger Bands (BB), and are enhanced by Stop Loss strategies based on the Average True Range (ATR) indicator. Improvements in the speed of the back-testing computations used by the d-Backtest PS method over weekly intervals allowed all systems to be examined over a 3.5-year trading period for 7 assets in financial markets, namely EUR/USD, GBP/USD, USD/JPY, USD/CHF, XAU/USD, WTI, and BTC/USD. To evaluate the systems more holistically, a weighted metric is introduced and examined, which, apart from profit, takes into account further factors after normalization, namely the Sharpe Ratio, the Maximum Drawdown and the Expected Payoff, as well as a newly introduced Extended Profit Margin factor. Among the automated systems examined and evaluated using the weighted metric, the Adaptive Double Moving Average (Ad2MA) system stands out, followed by the Adaptive Pivot (AdPivot) and the Adaptive Average Directional Index (AdADX) systems.


INTRODUCTION
In the past couple of decades, the scientific and financial community has tackled an ever-increasing amount of work to develop software systems that trade automatically in the financial and stock markets around the world. Thus, these systems have become ubiquitous in the global trading scene, and a need to evaluate them becomes apparent.
The current research examines and attempts to evaluate ten automated trading systems that are based on widely used technical indicators. For this purpose, the d-Backtest PS method is used to select the systems' parameters, and a weighted metric based on five factors regarding the consistency, risk efficiency, and profitability of the systems is introduced.
Taking the above into consideration and in order to evaluate the systems, weekly backtests were conducted on 7 assets of financial markets for a total period of 3.5 years. Along with the d-Backtest PS method, a

LITERATURE REVIEW
As financial data became more accessible and consistent, and information technology tools became more capable, automated trading systems could utilize various techniques and indicators. Some systems use machine learning techniques to process the time series data of an asset and decide on trading actions (Henrique, Sobreiro, & Kimura, 2018; Booth, Gerding, & McGroarty, 2014), while other systems use technical indicators and rules based on these technical indicators to make decisions when trading an asset (Chong & Ng, 2008; Wilder, 1978). There are also more complex systems that combine two or more technical indicators, including artificial neural networks, fuzzy logic, or other advanced machine learning techniques (Silva et al., 2014; Osunbor & Egwali, 2016). Apart from technical data and indicators, automated trading systems can also utilize information from outside the financial markets captured in news articles or social media trends (Azhikodan, Bhat, & Jadhav, 2019).
In parallel with the development of automated trading systems, there has been a search for better means of evaluating such systems. The most commonly used criterion for evaluating trading strategies has been the return on investment or the similar growth rate and profit rate (Y.-H. Chou . Apart from profitability, the combination of these metrics can give valuable insights regarding the overall behavior of an automated trading system, such as the risk taken or the consistency of its returns. Different metrics can be weighted and combined to form a single metric for evaluating an automated trading system, as done by Svoboda (2012), who created a 5F metric using 5 metrics with equal weights. In the current work, 5 different metrics were combined into a new metric, examining the use of different weights for each one to compare and rank several automated trading systems, continuing on the work done by .
Since most automated trading systems utilize rules and components with various parameters, choosing optimal values for each parameter is of great importance. One of the simplest ways to determine an optimal value for a parameter is to test an automated trading system with different values of each parameter over a set period and then use the best values for a future period. This forward testing technique has been employed by Marcus (2013). Automated trading systems that use neural networks, or other components that have to be trained, usually allow for a training period for training and configuring the system and an evaluation period for evaluating it (Silva, Castilho, Pereira, & Brandao, 2014; Trippi & Desieno, 1991). Another approach is to choose a variable period of time as the back-testing period, as introduced with the d-Backtest PS method (Vezeris, Schinas, & Papaschinopoulos, 2016). In the current work, both a constant back-testing period and the d-Backtest PS method at weekly intervals (Vezeris, Kyrgos, & Schinas, 2018a) are used to optimize the parameters for all the systems studied; the back-testing phase of the method has been refined to make it less computationally intensive.
With this work, a weighted metric is introduced and examined that can be used for evaluating the trading systems, as well as a new factor, the Extended Profit Margin, which is used in the weighted metric mentioned above. Another contribution consists of the improvement of the back-testing phase of the d-Backtest PS method so that it becomes much less computationally demanding, without loss of the d-Backtest's edge, which provided us with data on time, for all 10 systems and 7 assets of financial markets over a trading period of 3.5 years. To the best of our knowledge, this is the first time that the d-Backtest PS method has been used to effectively compare different automated trading systems, paired with a newly introduced metric for their evaluation.

Trading systems assessment factors
When it comes to evaluating the performance of automated trading systems, some qualities such as return, volatility, and risk are important (Ilić & Brtka, 2011; Kumiega & Vliet, 2012). Some metrics that are related to the net profit, the consistency, and the security of the trading strategy will be described in the following sections. Because the available data from the methods used are weekly results, some metrics had to be revised to summarize the data of 183 weeks into a single value.
The total net profit is calculated by subtracting the gross loss of all losing trades from the gross profit of all winning trades.
While certainly a valuable metric, total net profit alone can be deceptive, as it cannot determine whether a trading system is performing efficiently, nor can it normalize the results of a trading system based on the amount of risk sustained. Total net profit should be viewed in concert with other performance metrics. Henceforth, Total Net Profit will be referred to as Profit.
The extended Profit Margin (xPM) is a variant of profit margin. Profit margin indicates the profitability of a product, service, or business. It is expressed as a percentage; the higher the number, the more profitable the business.
In foreign exchange (FOREX) trading, profit margin is equal to the ratio of the final net profit divided by the gross profit during the examined period.
As discussed in Kim et al. (2010), although profit margin hedging is not the optimal rule for mean reversion, it can still be profitable if prices are mean-reverting. The need for a measure that represents both profit and loss has led us to an extended equation. When the final net profit is negative, the ratio of the net profit to the gross loss is calculated. When the net profit amounts to zero, xPM is also zero. So the following equation can describe xPM:

\[
xPM =
\begin{cases}
\dfrac{Profit}{GrossProfit}, & Profit > 0 \\
0, & Profit = 0 \\
\dfrac{Profit}{GrossLoss}, & Profit < 0
\end{cases}
\]

The Sharpe Ratio (SR) is a popular tool that investors and fund managers use to calculate the risk-adjusted return in stocks, and it was first introduced by Sharpe (1975). SR is otherwise referred to as the reward-to-variability ratio. Essentially, the ratio shows how much excess return one receives in return for the extra volatility endured as the 'price' for holding a riskier asset:

\[
SharpeRatio = \frac{AHPR - RFR}{StandardDeviation(HPRs)}
\]
where RFR is the risk-free rate, which is normally assumed to be 0% in FOREX trading, and AHPR is the average holding period return on investment or, simply put, the arithmetic mean of the relative gain per trade. AHPR is calculated as the sum of the HPRs divided by the total number of HPRs, and each HPR is calculated as the ratio of the balance after an out or in-out operation to the previous balance (after balance/previous balance).
In general, the higher the SR, the more risk-efficient the trading system is and the smoother its returns over time. A negative SR means either that the risk-free rate is greater than the portfolio's return, or that the expected return is likely to be negative.
To evaluate the performance of the systems, the SR of the weekly data was calculated by treating each week as one trade, because the data for each individual trade were not available. At the start of each week the balance is $10,000, so Profit + $10,000 is the balance at the end of the week, and the holding period return of a week follows from the definition of HPR:

\[
HPR_{week} = \frac{Profit + 10000}{10000}
\]
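The xPM definition and the weekly Sharpe Ratio described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, gross profit and gross loss are assumed to be passed as non-negative magnitudes, each weekly return is expressed as a relative gain (the balance ratio minus 1) so that AHPR minus RFR is an excess return, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

START_BALANCE = 10_000.0  # weekly starting balance stated in the text


def extended_profit_margin(gross_profit: float, gross_loss: float) -> float:
    """xPM: net/gross profit if net > 0, net/gross loss if net < 0, else 0.

    Assumption: both arguments are non-negative magnitudes.
    """
    net = gross_profit - gross_loss
    if net > 0:
        return net / gross_profit
    if net < 0:
        return net / gross_loss
    return 0.0


def weekly_sharpe_ratio(weekly_profits, risk_free_rate: float = 0.0) -> float:
    """Sharpe Ratio where every week counts as a single trade.

    Assumption: returns are relative gains, i.e. Profit / 10,000.
    """
    returns = [profit / START_BALANCE for profit in weekly_profits]
    spread = pstdev(returns)  # population standard deviation (assumed)
    if spread == 0:
        return 0.0
    return (mean(returns) - risk_free_rate) / spread


if __name__ == "__main__":
    print(extended_profit_margin(300.0, 100.0))
    print(weekly_sharpe_ratio([200.0, 0.0]))
```

With these conventions xPM always falls in the range [-1, 1], because the net result can never exceed the gross profit or fall below the negative gross loss.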
A maximum drawdown (MDD) is the maximum observed loss from a peak to a trough of a portfolio before a new peak is attained. MDD is an indicator of downside risk over a specified time period. It is an important measure of risk for a trading strategy (Pardo, 2012). The formula for MDD is as follows: .
It is important to note that it only measures the size of the largest loss without considering the frequency of large losses.
MDD is an indicator used to assess the relative riskiness of one strategy versus another as it focuses on capital preservation, which is a key concern for most investors. A low maximum drawdown is preferable, as this indicates that losses from investment were minimal.
The expected payoff (EP) is a statistically calculated index representing the average profit/loss factor of a trade. It can also be considered when it comes to displaying the expected return of the next trade. It is calculated as total net profit divided by total trades.
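As a rough sketch of the two measures just described, the following Python functions compute a maximum drawdown from a series of balances and an expected payoff from totals. The names are hypothetical, and expressing the drawdown as a positive fraction of the preceding peak is an assumed convention chosen for illustration; it is not necessarily the convention used in the study.

```python
def maximum_drawdown(balances):
    """Largest peak-to-trough decline, as a positive fraction of the peak.

    Assumption: balances are positive and ordered in time.
    """
    peak = balances[0]
    worst = 0.0
    for balance in balances:
        peak = max(peak, balance)  # a new peak resets the reference level
        worst = max(worst, (peak - balance) / peak)
    return worst


def expected_payoff(total_net_profit, total_trades):
    """Average profit or loss per trade; 0.0 when there were no trades."""
    if total_trades == 0:
        return 0.0
    return total_net_profit / total_trades


if __name__ == "__main__":
    print(maximum_drawdown([100.0, 120.0, 90.0, 130.0, 104.0]))
    print(expected_payoff(500.0, 20))
```

In the example series the drop from 120 to 90 dominates the later drop from 130 to 104, which illustrates that only the single largest decline is reported, regardless of how often large losses occur.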
All the measures above are useful, and each of them shows something different about the automated trading systems. To evaluate all factors simultaneously, it was decided to examine a weighted metric for five factors. A similar effort was made by Svoboda (2012), who integrated five indices into one index with equal weights. The sub-indices represent yield, liquidity, success factor, stability, and financial default. In our version of the weighted metric, each factor used in the equation can have a different weight. The following basic assumptions were used as the starting point that can help minimize the space of the weight combinations to be examined:
• Profit and Extended Profit Margin are the most important because they represent the profitability of the system and its safety, so their weights should be greater than those of other factors.
• Drawdown is the next most important because it represents the risk of a strategy, but its weight must be negative as lower values are more desirable.
• Given the correlation of the Expected Payoff metric with the Profit metric, its weight has to be relatively small, smaller than the weights for the other metrics, and smaller than the Sharpe Ratio's weight, which is a measure of the consistency of a system's returns.
• Normalization must take place for higher quality results because of the wide range of values between the different metrics of each system, which allows for the sum of the Weights to be equal to 1.
For the normalization of each metric, each value was divided by the sum of the absolute values of the measure:

\[
\hat{x}_i = \frac{x_i}{\sum_{j=1}^{n} \left| x_j \right|}
\]

where \(x_i\) is the value of a measure for the i-th system and \(n\) is the number of the trading systems being evaluated.
Using the normalized metrics, the value of the combined Weighted Metric (WM) was examined:

\[
WM = w_{Profit} \cdot Profit + w_{xPM} \cdot xPM + w_{SR} \cdot SR + w_{MDD} \cdot MDD + w_{EP} \cdot EP
\]

where each \(w\) is the weight of the corresponding factor and each measure is normalized as described above. To avoid arbitrary weight setting and to examine the robustness of the Weighted Metric as a method of classifying systems, the classification of each system was examined and summarized using all the possible combinations of weights that fulfill the criteria mentioned in the above bullet points, using a value space of [0, 1] with a step of 0.02 for each of the weights.
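The normalization and the weighted ranking can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the function names, the three systems A, B and C, their metric values, and the single weight vector, which merely respects the ordering described in the bullet points and applies the Drawdown weight with a negative sign.

```python
def normalize(values):
    """Divide each value by the sum of absolute values across all systems."""
    total = sum(abs(v) for v in values)
    if total == 0:
        return [0.0 for _ in values]
    return [v / total for v in values]


def weighted_metric(metrics, weights):
    """metrics: {factor: [value per system]}, weights: {factor: weight}."""
    normalized = {name: normalize(vals) for name, vals in metrics.items()}
    n_systems = len(next(iter(metrics.values())))
    return [
        sum(weights[name] * normalized[name][i] for name in metrics)
        for i in range(n_systems)
    ]


def rank(names, scores):
    """System names ordered from best to worst weighted metric."""
    order = sorted(range(len(names)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order]


if __name__ == "__main__":
    names = ["A", "B", "C"]  # hypothetical systems
    metrics = {  # hypothetical values, one per system
        "profit": [300.0, 100.0, -100.0],
        "xpm": [0.5, 0.25, -0.25],
        "sr": [0.2, 0.1, -0.1],
        "mdd": [0.1, 0.1, 0.2],
        "ep": [3.0, 1.0, -1.0],
    }
    # Illustrative weights only; the Drawdown weight is negative.
    weights = {"profit": 0.3, "xpm": 0.3, "sr": 0.15, "mdd": -0.2, "ep": 0.05}
    print(rank(names, weighted_metric(metrics, weights)))
```

Repeating the ranking over a grid of admissible weight vectors, and counting how often each system lands in each position, corresponds to the robustness check described above.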

Automated trading strategies
For this research, 10 popular automated trading strategies based on well-known indicators were implemented, and their efficiency was examined using the metrics described in sub-section 2.1. Each auto-
The Adaptive Ichimoku (AdIchimoku) ATS uses the well-known Ichimoku indicator (Elliott, 2007). The indicator comprises five lines called the tenkan-sen, kijun-sen, chikou span and, lastly, senkou span A and senkou span B, which together form the "kumo cloud".
When the tenkan-sen line is above the kijun-sen line, the price is above the kumo cloud, the kumo cloud's width is not zero and the chikou span is above the equivalent price of the past, these conditions constitute a buy signal, and a long position is opened.
When the tenkan-sen line is below the kijun-sen line, the price is below the kumo cloud, the kumo cloud's width is not zero and the chikou span is below the equivalent price of the past, these conditions constitute a sell signal, and a short position is opened.
When the price breaks above the kumo cloud, this constitutes an exit signal and any short positions are closed. Similarly, when the price breaks below the kumo cloud, this constitutes another exit signal, and any long positions are closed.
When the close price crosses above the SAR value, this constitutes a buy signal, and any short positions are closed, and a long position is opened. When the close price crosses below the SAR value, this constitutes a sell signal, and any long positions are closed, and a short position is opened.
To confirm that a trend has really reached its end and a reversal is underway, the system can delay the exit of a position before entering a new one.
The Adaptive Pivot (AdPivot) ATS uses Support and Resistance levels created around Pivot Points as its indicator.
When the close price crosses above a chosen Resistance level, this constitutes a buy signal, and a long position is opened. When the close price crosses below a chosen Support level, this constitutes a sell signal, and a short position is opened.
The Take Profit can be set either on the next Support level for short positions or on the level after that. Similarly, for long positions, it can be set either on the next Resistance level or on the level after that. There is also an option where, instead of closing the whole position at the Take Profit level, only half is closed. In this way, a portion of the profits is secured, while the rest of the volume continues to ride the potential trend and is closed only by the trailing Stop Loss.
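The entry rule of a pivot-based system can be sketched as follows. The text does not state which pivot formula AdPivot uses, so the classic floor-trader formulas below are an assumption, as are the function names and the choice of R1 and S1 as the "chosen" levels.

```python
def classic_pivot_levels(high, low, close):
    """Classic floor-trader pivot levels from the previous period's prices.

    Assumption: this formula set is used only for illustration.
    """
    pivot = (high + low + close) / 3.0
    spread = high - low
    return {
        "S2": pivot - spread,
        "S1": 2.0 * pivot - high,
        "P": pivot,
        "R1": 2.0 * pivot - low,
        "R2": pivot + spread,
    }


def pivot_signal(prev_close, close, levels, resistance="R1", support="S1"):
    """'buy' when the close crosses above the chosen Resistance level,
    'sell' when it crosses below the chosen Support level, else None."""
    if prev_close <= levels[resistance] < close:
        return "buy"
    if prev_close >= levels[support] > close:
        return "sell"
    return None


if __name__ == "__main__":
    levels = classic_pivot_levels(high=110.0, low=90.0, close=100.0)
    print(levels)
    print(pivot_signal(109.0, 111.0, levels))
```

For a long position opened on a cross of R1, the next Resistance level (R2 in this sketch) would be the natural Take Profit candidate described in the text.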
The Adaptive Turtle (AdTurtle) ATS uses various Donchian Channels and is formulated based on the strategy described in Curtis (2007). In addition to these rules, the AdTurtle ATS keeps adding volume to the initial open position as the price moves to more profitable levels. The maximum number of permitted additions to the initial position is five, and each time it invests less than it previously did.
The Adaptive Bollinger Bands Anti-trend (AdBBAntiTrend) ATS uses the well-known Bollinger Bands indicator, created by Bollinger (2001). The indicator consists of 3 bands: the middle band is a simple moving average, while the upper and lower bands are typically placed two standard deviations above and below it.
When the close price first breaks out above the upper band and then closes below it, this constitutes a sell signal, and any long positions are closed, and a short position is opened. When the close price first breaks out below the lower band and then closes above it, this constitutes a buy signal, and any short positions are closed, and a long position is opened.
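The anti-trend rule above can be sketched in plain Python. This is an illustrative reading and not the AdBBAntiTrend code: the function names are hypothetical, the breakout and the re-entry are assumed to happen on consecutive bars, and the population standard deviation of the closing prices is an assumed choice.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, period=20, width=2.0):
    """Per-bar (lower, middle, upper) bands; None until enough bars exist."""
    bands = []
    for i in range(len(closes)):
        if i + 1 < period:
            bands.append(None)
            continue
        window = closes[i + 1 - period : i + 1]
        middle = mean(window)
        deviation = width * pstdev(window)
        bands.append((middle - deviation, middle, middle + deviation))
    return bands


def anti_trend_signals(closes, period=20, width=2.0):
    """(bar index, 'sell' | 'buy') pairs for the anti-trend rule.

    Assumption: the breakout bar and the re-entry bar are consecutive.
    """
    bands = bollinger_bands(closes, period, width)
    signals = []
    for i in range(1, len(closes)):
        if bands[i - 1] is None:
            continue
        prev_lower, _, prev_upper = bands[i - 1]
        lower, _, upper = bands[i]
        if closes[i - 1] > prev_upper and closes[i] < upper:
            signals.append((i, "sell"))  # broke out above, closed back below
        elif closes[i - 1] < prev_lower and closes[i] > lower:
            signals.append((i, "buy"))  # broke out below, closed back above
    return signals


if __name__ == "__main__":
    # Short period and narrow bands so the toy series can trigger a signal.
    print(anti_trend_signals([10, 10, 10, 13, 10], period=3, width=1.0))
```

The toy call uses a period of 3 and a width of 1 only because a window of three closes can never sit more than about 1.41 population standard deviations from its own mean; realistic settings need far longer series.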
An example of each strategy is shown in Appendix A. Each figure shows the EUR/USD time series, from the Metatrader 5 trading terminal, along with the respective indicator of each ATS. The horizontal axis represents time, while the vertical axis represents the price of EUR/USD. Finally, the red downward arrows indicate sell signals, while the blue upward ones indicate buy signals. A green dot indicates an exit signal.

Back-testing
The Metatrader 5 trading platform by MetaQuotes was utilized to carry out the backtests to examine the automated trading systems. In addition, Microsoft SQL Server was used to collect and handle the findings.
Backtests were carried out on 7 assets of financial markets, specifically on EUR/USD, GBP/USD, USD/CHF, USD/JPY, XAU/USD, WTI, and BTC/USD, with weekly tests over a three-and-a-half-year period, from 24/01/2016 to 27/07/2019, with data from the ForexTime, FxPro, and Alpari brokers. The authors refrained from choosing assets from categories such as equities or rates because their trading sessions are brief in comparison.
The initial capital for each test is set at $10,000. Each ATS risks losing 20% of the equity with every trade over a 1% movement of price on every asset apart from WTI and BTC/USD. These two assets are considerably more volatile than the rest; therefore, the price movement was adjusted to 3% for WTI and 20% for BTC/USD.
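One way to read the risk rule above is as a position-sizing formula: the volume is chosen so that an adverse price move of the stated percentage costs 20% of the equity. The sketch below follows that reading under explicit assumptions: the function and constant names are hypothetical, the account currency is taken to equal the quote currency of the symbol, and broker lot sizes and rounding are ignored.

```python
RISK_FRACTION = 0.20  # share of equity put at risk on every trade
DEFAULT_MOVE = 0.01  # 1% price movement for most assets
ADJUSTED_MOVE = {"WTI": 0.03, "BTC/USD": 0.20}  # more volatile assets


def position_size(equity, price, symbol):
    """Units to trade so that the stated adverse move loses 20% of equity.

    Assumptions: profit and loss are in the account currency and no lot
    rounding is applied.
    """
    move = ADJUSTED_MOVE.get(symbol, DEFAULT_MOVE)
    loss_per_unit = move * price  # loss on one unit after the adverse move
    return RISK_FRACTION * equity / loss_per_unit


if __name__ == "__main__":
    print(position_size(10_000.0, 1.25, "EUR/USD"))
    print(position_size(10_000.0, 50.0, "WTI"))
```

Widening the reference move for WTI and BTC/USD shrinks the position for the same equity, which keeps the money at risk per trade comparable across assets with very different volatility.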
Backtesting processes are widely used today in forecasting experiment tests. The d-Backtest PS method dynamically finds the best back-testing period that will be considered for the following week.
The d-Backtest PS method requires several backtests, typically 30, for each week for which it has to choose parameters. This means that many backtests have to be run to provide enough input for the d-Backtest PS method over a long period of trading time, which makes this first phase the most computationally expensive one. In this research, the authors have devised a new way to shorten the backtests' phase by running only the backtests for the one-week periods instead of all the back periods up to 30 weeks. Using the data from the one-week tests, a backtest of any number of consecutive weeks can then be derived by combining the resulting metrics for each combination of parameters across the consecutive one-week backtests, using the techniques described in sub-section 2.1. The tradeoff of this technique is that compounding effects within the back-testing periods are not considered, but larger combinations of parameters can be examined for each system over a much longer period. The comparison of this new version of the d-Backtest PS method with the best BTs and the 6-month back-testing methods described next offers an assessment of the new methodology's performance.
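The idea of deriving a multi-week backtest from one-week results can be illustrated with profit alone, which combines by simple summation when compounding is ignored. The sketch below is an assumption-laden illustration: the function name, the data layout and the parameter tuples are hypothetical, and the step in which the d-Backtest PS method chooses among the candidate back-testing periods is not reproduced here.

```python
def best_parameters(weekly_profit, end_week, lookback):
    """Parameter set with the highest summed profit over the `lookback`
    weeks that end just before `end_week`.

    weekly_profit: {parameter set: [profit of week 0, week 1, ...]},
    taken from one-week backtests (hypothetical layout).
    """
    start = end_week - lookback
    if start < 0:
        raise ValueError("not enough one-week backtests for this lookback")
    totals = {
        params: sum(profits[start:end_week])  # no compounding, plain sum
        for params, profits in weekly_profit.items()
    }
    return max(totals, key=totals.get)


if __name__ == "__main__":
    weekly_profit = {  # hypothetical (fast, slow) moving-average periods
        (10, 50): [100.0, -50.0, 30.0, 20.0],
        (20, 100): [-20.0, 80.0, 10.0, 60.0],
    }
    # One candidate parameter set per back-testing period length.
    for lookback in (1, 2, 4):
        print(lookback, best_parameters(weekly_profit, 4, lookback))
```

Because every lookback reuses the same stored one-week results, evaluating 30 candidate periods costs 30 cheap aggregations rather than 30 separate backtests.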
An artifact of the d-Backtest PS method is that tests have to be run and enough data collected for a few weeks before the examined period, in order for the d-Backtest PS method to initialize the different methods it uses for classifying the back-testing periods.
The best back-testing periods, which will be referred to as best BTs, are the ideal version of the d-Backtest PS method. The best BTs are determined using historical data computed ex post to select the optimal back-testing period and are presented in this work to show the absolute limit that the d-Backtest PS method could theoretically achieve.
To compare the d-Backtest PS method with more traditional, widely used methods, a simple back-testing method was also examined, which uses a constant back-testing period of 6 months (26 weeks). With that simple method, each system selects the best parameters of the previous period of 26 consecutive weeks and uses them in the next adjacent week.

RESULTS
The results for each system can be presented now that the automated trading systems and the measures used to evaluate them have been described. Apart from the data resulting from the d-Backtest PS method, the results of the best back-testing periods that the d-Backtest PS method could have chosen are also presented. In addition, the data of a simple 6-month back-testing method are presented.
Excluding the results of the best BTs method, which was expected to outperform the other two methods, the results of the d-Backtest PS method far exceed those of the simple 6-month back-testing method, which suffers losses in all systems. The detailed results can be found in the Appendix.

Weighted metric classification
The WM metric was calculated for all the combinations of weights with values from 0 to 1 with a step of 0.02 fulfilling the requirements in paragraph 2.6, and the ranking of each system in all of these combinations (a total of 224,956 combinations) was examined.
The following tables summarize the classification occurrences of each system for every different weight combination examined. The d-Backtest PS method's results can be seen in Table 1, while the results for the best BTs and the 6-month constant BT can be seen in Tables 2 and 3, respectively. For the d-Backtest PS method, the Ad2MA stands out clearly as the best system in any of the WMs used, with the AdPivot and AdADX coming next. For the best BTs, the first 3 systems remain the same, but this time AdPivot comes first with Ad2MA and AdADX following. For the 6-month constant BT, there is a different ranking with the AdBBAntiTrend coming first, followed by the AdTurtle, with the AdPivot in third place.

Compounding capital diagrams
Portfolio management is also a key technique for stock trading (Petropoulos et al., 2017; Yao et al., 2007). In the testing round, each system started each week with $10,000 for each asset. However, there could be value in examining what a system would have done had it reinvested its profits from the previous week in the next, compounding its capital. To examine this, the authors started with a balance of $10,000 and scaled each week's actual profit by the starting balance of that week:

\[
EstimatedProfit = StartingBalance \cdot \frac{ActualProfit}{10000} \tag{7}
\]

Thus, a full picture of the 3.5-year testing period using parameters generated by the d-Backtest PS method can be observed in Figure 1.
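The compounding estimate can be sketched as a short loop in which each week's actual profit, measured on a $10,000 balance, is rescaled to the balance carried over from the previous week. The function name is hypothetical, and the sketch assumes that profit scales linearly with the starting balance.

```python
BASE_BALANCE = 10_000.0  # balance on which each weekly backtest was run


def compounded_balances(actual_weekly_profits, starting_balance=10_000.0):
    """Balance after each week when weekly profits are reinvested.

    Assumption: a week's profit scales linearly with its starting balance.
    """
    balance = starting_balance
    curve = [balance]
    for actual_profit in actual_weekly_profits:
        estimated_profit = balance * actual_profit / BASE_BALANCE
        balance += estimated_profit
        curve.append(balance)
    return curve


if __name__ == "__main__":
    print(compounded_balances([1_000.0, -500.0]))
```

A 10% gain followed by a 5% loss leaves the balance at $10,450 rather than the $10,500 obtained by simply adding the weekly profits, which is the compounding effect the capital diagrams are meant to expose.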

CONCLUSION
With the current research, the study of the comparison of high-frequency algorithmic trading systems in financial markets is broadened by implementing ten such automated trading systems, each based on widely used technical indicators. To evaluate these systems, three parameter selection approaches were used: the d-Backtest PS method at weekly intervals, the best BT periods, which provide the ideal outcome of the d-Backtest PS method, and a 6-month constant back-testing strategy. To assess the results, consistency and low risk were also taken into consideration. Typical metrics provided by Metatrader 5 were used. Instead of profit factor, the more robust xPM metric was devised, which is an extended version of Profit Margin with a value range of [-1, 1]. Because of the importance of every metric, a weighted metric aggregation was created to evaluate the systems, and the evaluation results of different weights for every metric were examined.
Ultimately, of all the automated trading systems, Ad2MA and AdPivot excel both in the d-Backtest PS method and in the best BT periods. Especially in the d-Backtest PS method, the difference between Ad2MA and the rest of the systems was obvious. The simple 6-month constant back-testing strategy does not yield profitable results for any of the systems. Moreover, a first attempt at portfolio management was made, using seven selected symbols, and Ad2MA was the only system that had an overall upward tendency.
In the future, we plan to study more closely the two automated trading systems that stood out in the current research, Ad2MA and AdPivot, and examine whether variations of them could perform profitably with less dependence on their parameters' values.