“Forecasting the changes in daily stock prices in Shanghai Stock Exchange using Neural Network and Ordinary Least Squares Regression”

The research focuses on finding a superior forecasting technique to predict stock movement and behavior in the Shanghai Stock Exchange. The author's interest is in stock market activity during high volatility, specifically the 14 years from 2002 to 2015. This volatile period, fueled by events such as the dot-com bubble, the SARS outbreak, political leadership transitions, and the global financial crisis, is of particular interest, and the study aims to analyze changes in stock prices during this unstable period. The author used both a Machine Learning approach, a Neural Network trained through information processing, and the traditional statistical approach, a Multiple Linear Regression model fitted by the least squares method. Both techniques are accurate predictors, with Absolute Percent Errors ranging from 1.50% to 1.65%, using data files of 3,283 observations recording the daily close prices of individual Chinese companies. The t-test paired difference experiment shows the superiority of the Neural Network in the finance sector but potentially not in other sectors, where the Multiple Linear Regression model performs equivalently to the Neural Network.


INTRODUCTION
The stock market is volatile and sensitive to various factors such as the economic environment, political policy, industrial development, market news, and natural factors; therefore, predicting stock prices is a difficult task. The ability to predict more accurately is of high interest to those involved in the investment market: accurate stock price predictions give individual investors, stock fund managers, and financial analysts more opportunities to profit in the stock exchange.
Forecasting daily stock prices is challenging because they are influenced by the several factors previously mentioned. The challenge is increased further by the unpredictably high volatility of stock prices in the Chinese markets from 2002 to 2015; during this exceptionally uncertain period, stock prices are expected to be extremely difficult to predict. This work differs from that of Tjung, Kwon, Tseng, and Bradley-Geist (2010) and Tjung, Kwon, and Tseng (2012). The models of Tjung et al. (2010) were designed to forecast daily changes in stock prices using data from September 1, 1998 to April 30, 2008, and Tjung, Kwon, and Tseng (2012) forecasted 37 stocks from eight industries over the same period. Both of those studies focus on the US stock market, whereas the present models are obtained from more recent data covering a much larger number of companies in the Chinese stock market.
The daily stock price data for China in this study cover the period from January 1, 2002 to December 31, 2015. With its unparalleled attributes, the Chinese stock market is one of the largest global markets. The study, as described in the methodology section, looks at two models to determine the best outcomes for training and forecasting. This paper is structured as follows. Section 1 provides a literature review relevant to the Chinese stock market and stock price forecasting. Section 2 describes the methodology utilized in the study; specifically, OLSR models and SPSS Statistics are applied to forecast changes in daily stock prices in China. Furthermore, Section 2 introduces the Alyuda Neural Network (ANN) software and discusses how NN can be used to predict stock prices, detailing the sources and characteristics of the data, the variables, and the data normalization process. Section 3 reports the results of t-test pair-wise hypothesis testing, error measurement by forecast methodology, and other related results. Section 4 provides the relevant discussion of the results from this study. The final section closes the paper with the conclusion and future research directions.

LITERATURE REVIEW
Kwon, Wu, and Zhang (2016) compare the forecasting performance of different versions of BI models in predicting China's stock prices. They discuss the model's ability to extract and explain vast amounts of data and knowledge and how it all relates to enhancing the process of decision making. They conclude that normalized and denormalized data provide similar results.
Liu and Wang (2011) look at the Independent Component Analysis, the NN model, and the BP model. They demonstrate that the NN model outperforms the other two models in analyzing fluctuations in the Chinese stock markets.
Dai, Liu, and Wang (2012) investigate improved accuracy in predicting Shanghai B shares using a combination of Nonlinear Independent Component Analysis and a Neural Network, compared with other models such as LICA-BPN, PCA-BPN, and a single BPN. With data from Shanghai A and B shares, the Nasdaq Composite Index, and the Industrial Average Index, predictions combined with time strength functions may further increase the accuracy of LeNN.
Cao, Ham, and Lam (2013) report that in the stock market movement, the Back Propagation Network was found to be slightly superior to the Radial Basis Function Network.
The Chinese stock market is sensitive to several factors. Government regulations, corruption, predictability, efficiency, and investors' behavior are listed and discussed by numerous scholars as influential factors impacting the Chinese stock and financial markets. Gordon and Li (2003) stated that the banking system in China was required to advance money to state-owned enterprises notwithstanding their financial performance, indicating that government mechanisms effectively designate share values. Additionally, in a study by Riedel, Jin, and Gao (2007), as many as 69% of 1,400 public companies' stocks were non-tradable shares. Yao and Yueh (2009) explained that relatively high fluctuation in market returns is influenced by prescribed pricing by share type, with different types of shares governed by unique trading regulations. Firth, Rui, and Wu (2009) raised a related question: the bureaucratic regulatory style of the Chinese Securities Regulatory Commission (CSRC) reduced investor confidence by limiting access to information, as Sanction Enforcement Information (SEI) was withheld from the public for more than 15 days.
Corruption during China's periods of economic transformation has resulted in a highly manipulated market, in which corrupt authorities allocate resources based on bribes instead of market efficiency. Knight and Yueh (2008) argue that Chinese entrepreneurs may take advantage of questionable relationships; additionally, Chinese culture has historically been known to exploit the legal system. Fan, Wong, and Zhang (2007) also report a tradition in which companies cultivate political connections to further their motives. They argue that former government officials serve as CEOs of companies, resulting in lower stock returns for politically unconnected companies.
Relative to US markets, Chen (2010) reports less predictability for Chinese stock markets, attributing this to informative stock prices and less heterogeneously distributed return predictors. Zhang, Wei, and Huang (2014) provide similar findings when comparing the predictability of S&P 500 stocks with the Shanghai Composite: it was more difficult to predict individual stocks in the Chinese markets than in the U.S. Jiang (2011) indicated that ownership concentration in small-capital firms and stock portfolios aid the predictability of markets; the industries showing the highest predictability in stock portfolios are insurance, finance, real estate, and services. Others argue that effective prediction is difficult in China because the stock market deviates from the assumptions of market efficiency theory (Kang, Cheong, & Yoon, 2010). Li, S. Wang, and X. Wang (2017) examine the impact of social trust, the level of mutual trust among the members of a society, on stock price crash risk. Using a large sample of Chinese A-share firms listed over the period 2001-2015, they find that firms headquartered in regions of high social trust tend to have smaller firm-specific stock price crash risk; thus, social trust is one of the critical predictors of stock price crashes. China's vast diversity in social trust originates from fifty-six ethnic groups within thirty-one provinces and more than eighty native dialects that are not comprehensible to non-native speakers.
Hafezi, Shahrabi, and Hadavandi (2015) propose the Bat-Neural Network Multi-Agent System (BNNMAS) to predict stock prices. To predict eight years of DAX stock prices by quarter, they use (1) BNNMAS in a four-layer multi-agent framework, (2) quarterly data on 17 national indexes and three international indexes that include the oil price, the gold price, and the exchange rate of the German mark against the US dollar, (3) feature selection and time-lag selection in the data preprocessing phase, and (4) a hybrid bat-neural network (BNN) model in the function approximation phase. Their results show significantly more accurate and reliable performance for BNNMAS than for other models such as GANN. BNNMAS, therefore, is suitable for predicting stock prices over long periods.
Hu, Tang, Zhang, and Wang (2018) use a back propagation neural network improved with a sine cosine algorithm (ISCA-BPNN) to predict the direction of opening prices in the S&P 500 and DJIA indices, using index data and Google Trends data from January 1, 2010 to June 16, 2017. They compared it with the BPNN, GWO-BPNN, PSO-BPNN, WOA-BPNN, and SCA-BPNN models. Their results for predicting the direction of the opening price showed that the ISCA-BPNN model outperforms all the other models in the study and that Google Trends can help predict future financial returns. The results indicate that swarm intelligence algorithms can optimize the parameters of other artificial neural networks for prediction and classification.
Wang, Yao, and Yu (2018) extract event-relevant data from Web news and user sentiment from social media and analyze their joint impact on stock price movements using a coupled matrix and tensor factorization framework. They construct two auxiliary matrices, a stock quantitative feature matrix and a stock correlation matrix, and incorporate them to assist the tensor decomposition. The coupled matrix and tensor factorization scheme supports heterogeneous information integration and multi-task learning simultaneously; thus, they predict multiple correlated stocks simultaneously by exploiting the commonalities among stocks. Using Chinese A-share stock market data and Hong Kong stock market data, their proposed model achieves accuracies of 62.5% and 61.7%, respectively.
The essence of the problem can be characterized by the sizable market of the Shanghai Stock Exchange (SSE), one of the world's largest stock markets by market capitalization at USD 5.01 trillion as of May 2019. It is challenged and influenced by government regulations, corruption, predictability, efficiency, investor behavior, and other factors, which has attracted cumulative research work, as described earlier. Previous research found that artificial or business intelligence is a superior predictor to the traditional statistical method. However, this work finds a contradiction in some industrial sectors of the SSE. The accuracy of the forecasting methodologies in this study is verified by the t-test paired difference experiment and is measured by Absolute Percent Error (APE).

METHODOLOGY
This study uses normalized data as input for the NN and OLSR models. It details the sources and characteristics of the data and the variables used to generate and forecast changes in stock prices.

Data and variables
Daily A-share data for 151 companies listed on the Shanghai Stock Exchange from January 2002 to December 2015 were downloaded from Yahoo Finance. These companies are randomly sampled to represent eight industry-wide sectors: basic materials, conglomerates, consumer goods, financial, healthcare, industrial goods, services, and utilities.
For each company of these 151 firms, a data file consisting of 3,283 records of the daily close prices is generated. The author looks at 21 indicators categorized by Macroeconomic and Microeconomic indicators, indicators of market sentiment, and institutional investors (see Appendix B for the complete list of indicators). They are identified as independent variables to predict the movements of the stock prices in the Shanghai Stock Exchange.
The dependent variable is the stock price of the target company selected from each industry sector. All other companies in the same sector, along with the 21 indicators, are arranged as independent variables. Both types of variables are employed for model training and forecasting.
For each stock, this comparative study applies Neural Network and Ordinary Least Squares models to forecast the variability in the Shanghai Stock Exchange. Records of daily closing prices, classified by variable category, are used as input for both models. Dummy variables are used as additional input for model training and forecasting to improve the precision of the forecast; they are added to flag pre-holiday and post-holiday days. The research uses all-variable OLSM, stepwise OLSM, and the Alyuda NeuroIntelligence software to build the NN forecasting models.
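As an illustrative sketch of how such pre-/post-holiday dummy variables might be constructed with pandas (the dates, the holiday, and the 3-day gap threshold below are assumptions for illustration, not the paper's exact rule):

```python
import pandas as pd

# Hypothetical trading days around an assumed holiday break
prices = pd.DataFrame({"close_change": [0.4, -0.2, 0.7, 0.1, -0.5]},
                      index=pd.to_datetime(["2015-09-30", "2015-10-08",
                                            "2015-10-09", "2015-10-12",
                                            "2015-10-13"]))
holidays = pd.to_datetime(["2015-10-01"])  # assumed holiday date

# Pre-holiday dummy: the trading day immediately before a holiday
prices["pre_holiday"] = prices.index.isin(holidays - pd.Timedelta(days=1)).astype(int)

# Post-holiday dummy: first trading day after a gap longer than a normal weekend
gap = prices.index.to_series().diff()
prices["post_holiday"] = (gap > pd.Timedelta(days=3)).astype(int)
```

Here 2015-09-30 is flagged as pre-holiday and 2015-10-08 as post-holiday; the resulting 0/1 columns can then be appended to the predictor set.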
The normalization method of Tjung, Kwon, and Tseng (2012) is adapted to achieve better forecasting performance. They conducted their forecasting analysis of the US stock markets using NN models and Alyuda NeuroIntelligence, and point out that NN models generate a lower standard deviation than traditional regression analysis. It is important to note that normalized data provide superior outcomes to non-normalized data in model learning and forecasting.
To normalize the observations for an individual company, the author pinpoints the minimum value (Minimum) of its daily stock price changes, takes the absolute value (Absolute), and adds 0.1 to lift a zero value to a positive one in the normalized dataset. Consequently, the normalization value (Absolute(Minimum) + 0.1) is added to every daily change of the company's data. For example, if the observations of daily stock changes for Company Fresno have a minimum of -5.03, the normalization value (Absolute(-5.03) + 0.1 = 5.13) is added to every daily stock change of Company Fresno, thus normalizing the observations.
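A minimal sketch of this shift-based normalization (the series of daily changes below is an illustrative assumption):

```python
import numpy as np

def normalize_changes(changes):
    """Shift a series of daily price changes so every value is positive:
    add |minimum| + 0.1 to each observation, as described in the text."""
    changes = np.asarray(changes, dtype=float)
    shift = abs(changes.min()) + 0.1
    return changes + shift, shift

# Hypothetical daily changes with a minimum of -5.03 (as in the Fresno example)
normalized, shift = normalize_changes([-5.03, 0.0, 2.41, -1.17])
```

The smallest normalized value is always 0.1, so all inputs to the models are strictly positive.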
The normalized data are used as input for the Machine Learning software and the OLSR models. The Alyuda NeuroIntelligence software is used to build the NN forecasting models. 95% of the observations are used for training and validation, while the remaining 5% are used for performance testing.

Neural Network (NN) method
Multi-Layer Perceptron (MLP) and the Back Propagation algorithm are the prediction methodologies used here. Numerous network architectures were applied in previous research; the MLP is one of the most popular and widely accepted ANN architectures (Alyuda Research, 2006). The MLP is a feed-forward neural network that can reduce its model errors by iteratively adjusting the interconnection weights among all connections of the input layer, hidden layer, and output layer (Gardner & Dorling, 1998). Because of the uncertainty of the training process, training the Artificial Neural Network more times gives a better chance of achieving better results. The training trials are set to stop when 10,000 iterations are completed (with 10 retrains), when the MSE improvement falls below 0.000001, or when a training error of 0.01 is achieved. This training process is conducted in three trials for each stock, and the network with the lowest Relative Error is selected.
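The paper's models are built with the proprietary Alyuda NeuroIntelligence software; purely as an illustration, a comparable setup (10,000-iteration cap, 1e-6 improvement tolerance, best of three restarts by relative error) might look as follows with scikit-learn's MLPRegressor on synthetic data. The hidden-layer size, the data, and the relative-error proxy are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for one stock's predictor matrix (21 indicators)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 21))
y = X @ rng.normal(size=21) + rng.normal(scale=0.1, size=200)

best_model, best_err = None, np.inf
for seed in range(3):  # three training trials per stock; keep the best network
    mlp = MLPRegressor(hidden_layer_sizes=(16,),  # assumed hidden-layer size
                       max_iter=10_000,           # stop after 10,000 iterations
                       tol=1e-6,                  # ...or when improvement < 0.000001
                       n_iter_no_change=10,       # ...over 10 consecutive iterations
                       random_state=seed)
    mlp.fit(X, y)
    # Relative-error proxy: mean absolute error scaled by the target's spread
    rel_err = np.mean(np.abs(mlp.predict(X) - y)) / np.std(y)
    if rel_err < best_err:
        best_model, best_err = mlp, rel_err
```

The restart loop mirrors the "three trials, keep the lowest relative error" selection rule described above.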

Terms and abbreviations	Descriptions
Training Set	The data that were normalized and used to train NeuroIntelligence to find the best-fit architecture for model optimization and parameter validation
Query Set	The data that were normalized, with 5% manually extracted and separated, for NeuroIntelligence to use in calculating the estimated price changes

Ordinary Least Squares Regression (OLSR) method
The OLSR model takes the form y = β0 + β1x1 + β2x2 + … + βnxn + ε, where y is the actual value, x1, …, xn are the predictors, and ε is the error term. In this study, the actual value that one attempts to predict is the change in daily stock prices; the corresponding predicted value of the change in stock prices is denoted ŷ.
To analyze data using OLSR, the model involves the following assumptions:
• the dependent variable should be interval or ratio, measured on a continuous scale;
• the observations or residuals are independent;
• there is a linear relationship between (i) the dependent variable (y) and each of the independent variables (xi), and (ii) the dependent variable and the independent variables (x1, x2, x3, …, xn) collectively;
• the data show homoscedasticity, in which variances along the line of best fit remain similar or approximately equal;
• the independent variables contain minimal or no multicollinearity with each other;
• the data should not contain significant outliers, high leverage points, or highly influential points;
• the residuals (errors) are normally distributed, or their distribution can be approximated by a normal distribution with a mean of zero.
The author runs OLSR to forecast stock prices in the Chinese markets across all eight industries using SPSS Statistics with both (a) the Enter method, which includes all independent variables of interest, and (b) the stepwise method, which includes some but potentially not all independent variables and thereby reduces the effects of collinearity and multicollinearity.
Multicollinearity may exist in real data. When it exists, the significance of the overall model, tested by the F-statistic, is not jeopardized. However, the significance of any single independent variable for the dependent variable is jeopardized, because the impact of a single independent variable cannot be isolated due to collinearity/multicollinearity among some independent variables. Thus, one solution is to run the regression analysis with the stepwise method, an iterative modeling procedure that reduces collinearity/multicollinearity. There are two major forms of stepwise regression:
• Forward Selection Approach: this model starts with simple linear regression runs to determine the significance of each individual independent variable for the dependent variable.
If some independent variables have partial regression coefficients significant at or below the desired α level, as measured by p-value, the stepwise regression selects the variable with the strongest relationship with the dependent variable (i.e., the smallest p-value). The process is recursive, with each new iteration adding the single most significant independent variable to the regression equation.
The recursive process starts with n candidate independent variables in the first iteration to find x1 to add to the regression equation, then (n−1) candidates in the second iteration to identify x2, and so on. The procedure stops when none of the partial regression coefficients has a p-value at or below the α level.
• Backward Elimination Approach: this modeling starts with a regression analysis run on the full model, including all independent variables and the dependent variable. If the results of the most recent iteration show one or more partial regression coefficients with p-values larger than α, the least significant independent variable is removed, and the next iteration runs without it. The process is recursive, with each iteration removing the least significant independent variable. The procedure stops when all of the independent variables remaining in the analysis are significant, with p-values at or below α.
The author runs the SPSS statistic for both regressions using all independent variables (Regression-Enter) and Stepwise Regression.
Next, the author discusses the hypothesis tests and their results in Section 3 to compare the forecast performance of the two models: NN and OLSR.

RESULTS
The paired difference t-test is typically used to compare two population means where each observation in one sample can be paired with an observation in the other sample. The observations in the paired difference t-test are defined as the differences between the two dependent samples. This statistical procedure is robust and practical in many circumstances. It has the following assumptions:
• the paired differences must be continuous (interval/ratio);
• the paired differences are independent of one another;
• the paired differences should be approximately normally distributed;
• the sample of paired differences should not contain outliers.
In the analysis, the author uses the t-tests for paired difference procedures to compare the error (measured by Absolute Percent Error) from two methods, specifically Neural Networks and OLSR, to determine whether Neural Network is better than OLSR. Both prediction methods attempt to forecast the price changes of the same stocks; this justifies the dependent sample errors. Specifically, the forecast error for price change on a particular day is measured twice: one error from using the Neural Networks and the other error from applying the OLSR, resulting in pairs of differences in errors. Therefore this t-test is an appropriate statistical procedure to determine the mean difference of the errors from Neural Networks and OLSR.
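A minimal sketch of this procedure with SciPy on synthetic forecasts (the price levels and error magnitudes below are assumptions chosen only to illustrate the paired, one-sided test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
actual = rng.uniform(50, 60, size=150)               # hypothetical daily prices
pred_nn = actual * (1 + rng.normal(0, 0.012, 150))   # NN forecasts (assumed noise)
pred_ols = actual * (1 + rng.normal(0, 0.016, 150))  # OLSR forecasts (assumed noise)

# Absolute Percent Error for each method on the same days: a paired sample
ape_nn = 100 * np.abs(pred_nn - actual) / actual
ape_ols = 100 * np.abs(pred_ols - actual) / actual

# One-sided paired t-test; Ha: mean APE of NN is less than mean APE of OLSR
result = stats.ttest_rel(ape_nn, ape_ols, alternative="less")
print(result.statistic, result.pvalue)
```

Rejecting the null at the chosen α would support the claim that NN forecasts carry smaller errors than OLSR forecasts for that stock.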
Both the regression model with all independent variables and the stepwise regression serve as representatives of OLSR in this study. The two sets of null and alternative hypotheses are as follows:

Set 1 of hypotheses statements:
Ho: Neural Network Model is not better than the multiple linear regression models (RegAll) involving all independent variables in forecasting the changes in stock price.
Ha: Neural Network Model is better than the multiple linear regression models (RegAll).

Set 2 of hypotheses statements:
Ho: Neural Network Model is not better than the stepwise regression models (RegStep) in forecasting the changes in stock price.
Ha: Neural Network Model is better than the stepwise regression models (RegStep).
Two types of OLSR models are analyzed in this paper. The first is "Regression All," the regression model that includes all independent variables (all input data) as predictors; the second is "Regression Stepwise," which selects only the significant independent variables.

DISCUSSION
Based on the results shown in Table 2, the t-tests reveal that NN models are not better than the regression models in some industries. The reason OLSR is believed to perform as well as NN is the normalization/denormalization process. This indicates that efficient data normalization allows less costly and less time-consuming traditional methods such as OLSR to deliver forecasts with a level of accuracy similar to that achieved with NN.
A further summary of overall average APE by methodology shows promising prediction accuracy for all the methodologies in this study, with Absolute Percent Errors ranging from 1.50% to 1.65%.
Table 2. Industry-wide summary of APE (Absolute Percent Error) for daily stock price change and p-values of t-tests for paired differences


CONCLUSION
The analysis finds that Machine Learning, in the form of the NN model known for its considerable cost and time consumption, provides superior forecasting of changes in daily stock prices in some industry sectors, such as finance and possibly consumer goods. Conversely, the traditional statistical approaches, such as OLSR models known for their cost and time efficiency, can deliver forecasting of similar quality in many sectors. The success of this finding is attributable to the appropriate selection of critical factors and the data normalization procedure.
The author examines the years 2002 to 2015 and the challenges this period posed for stock market prediction; the collection of highly volatile daily stock price data is what sparked the research interest. The author sets up a system to evaluate and analyze performance outcomes.
Based on the evaluation criterion of industry-wide average APE and the efficient normalization process, the study concludes that both the NN and OLSR forecasting methods secure accurate outcomes. The practical value is evident, with forecast errors measured by APE ranging from 1.50% to 1.65%. The hypothesis testing indicates that the prediction performance of the Neural Network may not be better than Regression with all variables and/or Stepwise Regression under some circumstances, as described in detail above.
Future research could be directed towards an analysis of the US stock market with a cross-comparison between the Chinese and the US stock markets to identify which market is more predictable and whether the two markets' critical factors are somewhat similar or different.