CN108985941A

CN108985941A - An intelligent stock prediction method combining news text

Info

Publication number: CN108985941A
Application number: CN201810791693.9A
Authority: CN
Inventors: 李晓东; 贡诚; 冯钧
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2018-07-18
Filing date: 2018-07-18
Publication date: 2018-12-11

Abstract

The invention discloses an intelligent stock prediction method combining news text. The news text is first preprocessed: Chinese word segmentation and stop-word filtering are performed, and news text without a time tag is deleted. The prediction horizon Δt of the stock is determined, and news text is filtered and selected according to its time tag. A feature representation is computed for the selected news text and combined with the stock data feature vector of the corresponding time to form the feature representation vector for that time. An autoencoder deep learning network is constructed, and the feature representation vector is input into it for compression and feature extraction, yielding a low-dimensional feature representation vector.

Description

An intelligent stock prediction method combining news text

Technical field

The invention belongs to the technical field of data processing, and in particular relates to an intelligent stock prediction method combining news text.

Background art

Financial markets, and stock markets in particular, are closely tied to historical quotes and are also highly susceptible to sudden financial news events. A common approach at present is to model stock market prediction as a regression or classification problem in machine learning. The prior art includes building a prediction system that reduces stock prediction to a classification problem and makes judgments by analyzing the distribution of certain features of the news text that affects the stock market, and performing regression prediction on stocks with multi-kernel machine learning models. However, most existing methods rely on experts to select features, and it is very difficult to find and mine feature representations manually from large, complex stock data and news text. For classification and other machine learning tasks, a poor vector representation of the data set can greatly reduce the prediction accuracy of the classifier.

Deep learned representation (DLR) uses a multilayer neural network whose layered structure can abstract feature representations from samples; this ability to extract feature representations is completely unsupervised. Each layer of a DLR can use a nonlinear but relatively simple activation function to transform its input feature representation vector into a more abstract one. The feature representation abstracted by the multilayer neural network can therefore provide good input for subsequent classification tasks.

Applications of deep learning in finance include: building a model system for portfolio decisions that uses an autoencoder to construct portfolios and a classifier on top of the multilayer deep learning network to decide which stocks will perform well in the future; and analyzing and predicting the stock market with multilayer deep neural networks. However, existing stock prediction methods usually consider the news events that affect stock trends separately from historical quote data, and do not produce a good feature representation when processing the data, which limits prediction accuracy to some extent.

Summary of the invention

To solve the above problems, the present invention proposes an intelligent stock prediction method combining news text, which combines news events with historical quote data and addresses the technical problem of low stock prediction accuracy.

The present invention adopts the following technical solution: an intelligent stock prediction method combining news text, with the following specific steps:

1) preprocess the news text: perform Chinese word segmentation and stop-word filtering, and delete news text without a time tag;

2) determine the prediction horizon Δt of the stock, and filter and select news text according to its time tag;

3) compute a feature representation for the selected news text, and combine it with the stock data feature vector of the corresponding time to form the feature representation vector for that time;

4) construct an autoencoder deep learning network, and input the feature representation vector obtained in step 3) into it for compression and feature extraction to obtain a low-dimensional feature representation vector;

5) construct an ELM neural network model, quantify the degree of change of the stock price, and determine the target output value of the ELM neural network model;

6) using the target output value determined in step 5), and taking the low-dimensional feature representation vector obtained in step 4) as the input of the ELM neural network model, optimize the parameters of the ELM neural network model to obtain the final prediction model.

Preferably, filtering and selecting news text according to its time tag in step 2) comprises the following specific steps:

21) determine the stock prediction horizon Δt;

22) the data available for the stock lie in the period [t_s, t_e]; starting from t_s, for each t₀ ∈ [t_s, t_e − Δt], select the news text nearest to time t₀ + Δt as the news text for time t₀; after the period [t_s, t_e − Δt] has been traversed, all news texts are obtained.

Preferably, forming the feature representation vector for the corresponding time in step 3) comprises the following specific steps:

31) perform a term frequency-inverse document frequency (tf-idf) calculation on the news text, and take the tf-idf values of the words in the news text for time t₀ as the feature representation of that news text; all news texts together contain n distinct words, so the feature representation of the news text for time t₀ is [w₁, w₂, ..., w_n], where w₁, w₂ and w_n are the tf-idf values of the first, second and n-th word respectively;

32) select the required indicators from the specific stock data to form the stock data feature vector; the indicators include the opening price, closing price, trading volume and highest price; if a given stock has m indicators, the stock data feature vector at time t₀ is [q₁, q₂, ..., q_m], where q₁, q₂ and q_m are the values of the first, second and m-th indicator at time t₀ respectively;

33) combine the features obtained in step 31) and step 32) to obtain the feature representation vector that joins the news text and the stock data for the corresponding time:

[w₁, w₂, ..., w_n, q₁, q₂, ..., q_m]

Preferably, obtaining the low-dimensional feature representation vector in step 4) is specifically as follows:

Each layer of the neural network is an autoencoder. The compression ratio of the feature representation vector is set according to the dimension of the feature representation vector obtained in step 3). The compression ratio equals the ratio of the number of neurons in the first layer of the deep neural network to the number in the last layer; the number of first-layer neurons is the dimension of the feature representation vector obtained in step 3), and the number of last-layer neurons is the dimension of the final low-dimensional feature representation vector. The number of layers of the deep learning network and the number of neurons in each layer are then determined, with the number of neurons decreasing linearly from the first layer to the last.

Preferably, determining the target output value of the ELM neural network model in step 5) is specifically as follows: the input of the ELM neural network model is the low-dimensional feature representation vector at time t₀, and the target output value is the target output y_i at time t₀ + Δt:

$$y_i = \begin{cases} +1, & r(x) > \theta \\ 0, & -\theta \le r(x) \le \theta \\ -1, & r(x) < -\theta \end{cases}$$

where r(x) is a comparison function reflecting the degree of change at time t₀ + Δt relative to time t₀, and θ is a threshold set according to the actual situation of the stock. When the rise of the stock at time t₀ + Δt exceeds the threshold θ, the target output y_i is +1; when the fall exceeds the threshold θ, the target output y_i is -1; when the change is smaller than the threshold θ, the stock trend is stable and the target output y_i is 0.

Preferably, obtaining the final prediction model in step 6) comprises the following specific steps:

61) according to the prediction horizon Δt of the stock, split the historical quote data and the news text of the stock to be predicted, and label them to form the following sample set:

{(X₁, Y₁), (X₂, Y₂) ..., (X_l, Y_l)}

The sample set contains l samples, where X_i is the input value, namely the low-dimensional feature representation vector at time t_i obtained in step 4), and Y_i is the output value, namely the target output value at time t_i + Δt obtained in step 5);

62) divide the sample set into a training sample set and a test sample set, train the ordinary ELM and the kernel-function KELM neural networks on the training sample set, cross-validate the two ELMs on the test sample set, and determine the final prediction model.

Beneficial effects achieved by the invention: the present invention is an intelligent stock prediction method combining news text, which combines news events with historical quote data and addresses the technical problem of low stock prediction accuracy.

For classification, the extreme learning machine (ELM) model can achieve high accuracy and can be used to analyze large numbers of data samples. The weights and biases between the layers of an ELM network can be assigned random initial values, which avoids tuning all the parameters inside the model. The present invention combines the historical quotes and the sudden news events that affect stock prices into a feature representation vector, applies DLR to learn its feature representation, converts stock market prediction into a classification problem, and performs classification prediction with the ELM, significantly improving the accuracy of short-term stock prediction.

Brief description of the drawings

Fig. 1 is a flowchart of news text screening in an embodiment of the present invention;

Fig. 2 is a schematic diagram of news text screening in the present invention;

Fig. 3 shows the relationship between the threshold θ and the prediction accuracy acc in the present invention;

Fig. 4 is a flowchart of model training in the present invention.

Detailed description of the embodiments

The technical solution of the present invention is further described below with reference to the drawings and the embodiments.

An intelligent stock prediction method combining news text comprises the following steps:

1) as shown in Fig. 1, preprocess the news text: perform Chinese word segmentation and stop-word filtering, convert the news text into a data format that is convenient for a computer to process, and delete news text without a time tag;

Common text-processing software such as Jieba or NLTK (Natural Language Toolkit) can be used for preprocessing steps such as Chinese word segmentation and stop-word filtering;
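The preprocessing in step 1) can be illustrated with a minimal sketch. The dictionary fields `time` and `text`, the function name `preprocess`, and the tiny stop-word set are assumptions made for illustration and do not come from the patent; the tokenizer is passed in as a parameter, so a Chinese segmenter such as `jieba.lcut` could be supplied in place of the whitespace split used here.

```python
from datetime import datetime
from typing import Callable, Iterable, Optional

# Hypothetical stop-word set; a real system would load a full list from a file.
STOP_WORDS = {"的", "了", "the", "a", "of"}


def preprocess(news: Iterable[dict],
               tokenize: Callable[[str], list] = str.split,
               stop_words: set = STOP_WORDS) -> list:
    """Drop items without a time tag, then segment and remove stop words."""
    cleaned = []
    for item in news:
        ts: Optional[datetime] = item.get("time")
        if ts is None:  # no time tag: the news text is deleted
            continue
        tokens = [w for w in tokenize(item["text"])
                  if w.strip() and w not in stop_words]
        cleaned.append({"time": ts, "tokens": tokens})
    return cleaned


sample = [
    {"time": datetime(2001, 3, 1, 10, 5), "text": "the bank raised profit guidance"},
    {"time": None, "text": "undated rumour of a merger"},
]
result = preprocess(sample)
```

Keeping the tokenizer as an argument separates the filtering logic from the choice of segmentation tool.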

2) determine the prediction horizon Δt of the stock, and filter and select news text according to its time tag;

21) determine the stock prediction horizon Δt;

22) the data available for the stock lie in the period [t_s, t_e]; starting from t_s, for each t₀ ∈ [t_s, t_e − Δt], select the news text nearest to time t₀ + Δt as the news text for time t₀; after the period [t_s, t_e − Δt] has been traversed, all news texts are obtained;

As shown in Fig. 2, t₀ is the current time and t_Δ is the time t₀ + Δt. When there is only one news text in this period, that news text is selected as the news text for time t₀; when there are several news texts in this period, the one nearest to time t₀ + Δt is selected as the news text for time t₀.

The stock data at time t₀ and the news text nearest to time t₀ + Δt together serve as the raw input data, and the stock price to be predicted at time t₀ + Δt serves as the reference data for the output.
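The selection rule of step 22) might be sketched as follows. This is an illustration under two assumptions that the text does not state explicitly: the search window is taken to be [t₀, t₀ + Δt], and "nearest to t₀ + Δt" is read as the latest news item that is not later than t₀ + Δt. The function name `select_news` is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def select_news(news_times, t0, delta):
    """Return the index of the news item in [t0, t0 + delta] closest to
    t0 + delta, or None if the window contains no news.

    news_times must be sorted in ascending order.
    """
    hi = bisect_right(news_times, t0 + delta)  # first index after t0 + delta
    if hi == 0 or news_times[hi - 1] < t0:
        return None  # nothing falls inside the window
    return hi - 1


times = [
    datetime(2001, 3, 1, 10, 1),
    datetime(2001, 3, 1, 10, 3),
    datetime(2001, 3, 1, 10, 9),
]
t0 = datetime(2001, 3, 1, 10, 0)
idx = select_news(times, t0, timedelta(minutes=5))
```

With Δt of five minutes, the items at 10:01 and 10:03 both fall inside the window and the later one is chosen; traversing all t₀ in [t_s, t_e − Δt] amounts to calling the function once per t₀.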

3) compute a feature representation for the selected news text, and combine it with the stock data feature vector of the corresponding time to form the feature representation vector for that time;

31) perform a tf-idf calculation on the news text, and take the tf-idf values of the words in the news text for time t₀ as the feature representation of that news text; all news texts together contain n distinct words {word₁, word₂, ..., word_n}, where word₁, word₂ and word_n denote the first, second and n-th word respectively, so the feature representation of the news text for time t₀ is the n-dimensional vector [w₁, w₂, ..., w_n], where w₁, w₂ and w_n are the tf-idf values of the first, second and n-th word respectively, that is:

w_n = tf-idf(word_n)

32) select the required indicators from the specific stock data to form the stock data feature vector; the indicators include the opening price, closing price, trading volume and highest price; if a given stock has m indicators, the stock data feature vector at time t₀ is [q₁, q₂, ..., q_m], where q₁, q₂ and q_m are the values of the first, second and m-th indicator at time t₀ respectively;

33) combine the features obtained in step 31) and step 32) to obtain the feature representation vector that joins the news text and the stock data for the corresponding time t₀:

[w₁, w₂, ..., w_n, q₁, q₂, ..., q_m]
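Steps 31) to 33) can be sketched in a few lines. The patent does not say which tf-idf variant is used, so the sketch assumes the plain form tf = count / document length and idf = ln(N / df); the function names and the sample stock values are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of tf-idf vectors).

    Assumed variant: tf = count / len(doc), idf = ln(N / df), no smoothing.
    """
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))  # document frequency
    n_docs = len(docs)
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d) or 1
        vectors.append([(counts[w] / total) * math.log(n_docs / df[w])
                        for w in vocab])
    return vocab, vectors


def combine(news_vec, stock_vec):
    """Step 33): concatenate [w_1..w_n] with [q_1..q_m]."""
    return list(news_vec) + list(stock_vec)


docs = [["profit", "rise", "profit"], ["loss", "rise"]]
vocab, vecs = tfidf_vectors(docs)
# Hypothetical indicators: opening price, closing price, volume, highest price.
stock = [10.2, 10.5, 120000.0, 10.6]
x_t0 = combine(vecs[0], stock)
```

Because the stock indicators and the tf-idf values live on very different scales, a real pipeline would probably normalize them before feeding the network; the patent text does not describe this, so it is left out here.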

4) construct an autoencoder deep learning network, and input the feature representation vector obtained in step 3) into it for compression and feature extraction to obtain a low-dimensional feature representation vector;

Compared with the original input, the feature representation vector output by the multilayer neural network not only has a lower dimension but also better abstracts the combined features of the news text and the historical stock data.

It should be noted that many open-source tools in the prior art can be used to build a deep learning network, such as sklearn (scikit-learn), Keras and the MATLAB deep learning toolbox (deep-learning-toolbox). Taking deep-learning-toolbox as an example, its neural network (NN) module is selected and the module parameters, namely the number of network layers and the number of neurons in each layer, are set to build the deep learning network. In this embodiment the network has 10 layers, the first layer has 1011 neurons, the last layer has 860 neurons, and the number of neurons decreases linearly from the first layer to the last.
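The layer sizes of this embodiment can be derived with a short helper. Since 1011 − 860 = 151 is not divisible by the 9 steps between 10 layers, the sketch rounds each size to the nearest integer; that rounding rule and the name `layer_sizes` are assumptions, as the patent only states that the sizes decrease linearly.

```python
def layer_sizes(n_first, n_last, n_layers):
    """Neuron counts decreasing linearly from n_first to n_last."""
    step = (n_first - n_last) / (n_layers - 1)
    # Rounding to the nearest integer is an assumption of this sketch.
    return [round(n_first - k * step) for k in range(n_layers)]


sizes = layer_sizes(1011, 860, 10)
# Compression ratio as defined in the text: first layer over last layer.
compression_ratio = sizes[0] / sizes[-1]
```

The resulting list can be passed directly as the per-layer widths when building the stacked autoencoder in whichever toolkit is chosen.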

5) construct an ELM neural network model, quantify the degree of change of the stock price, and determine the target output value of the ELM neural network model;

The input of the ELM neural network model is the low-dimensional feature representation vector at time t₀ obtained in step 4), and the target output value is the target output y_i at time t₀ + Δt:

$$y_i = \begin{cases} +1, & r(x) > \theta \\ 0, & -\theta \le r(x) \le \theta \\ -1, & r(x) < -\theta \end{cases}$$

where r(x) is a comparison function, which in this embodiment is the stock price at time t₀ + Δt minus the stock price at time t₀, reflecting the degree of change of the stock at time t₀ + Δt relative to time t₀, and θ is a threshold set according to the actual situation of the stock. When the rise of the stock at time t₀ + Δt exceeds the threshold θ, the stock is considered to rise and the target output y_i is +1; when the fall exceeds the threshold θ, the stock is considered to fall and the target output y_i is -1; when the change is smaller than the threshold θ, the stock trend is stable and the target output y_i is 0.
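The three-way labeling rule can be written as a small function. The sketch follows the embodiment in taking r(x) as the price at t₀ + Δt minus the price at t₀; the function name `label` and the sample prices are hypothetical.

```python
def label(price_t0, price_t1, theta):
    """Target output y_i: +1 rise, -1 fall, 0 stable."""
    r = price_t1 - price_t0  # comparison function r(x) of the embodiment
    if r > theta:
        return 1
    if r < -theta:
        return -1
    return 0


labels = [label(10.0, p, 0.1) for p in (10.5, 9.5, 10.05)]
```

If θ is meant as a relative quantity such as a transaction cost rate, r would need to be a relative change instead; the patent text describes a plain difference, so that is what the sketch uses.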

In finance, changes in stock prices follow a normal distribution with a fat-tail effect, and the probabilities of the three cases above are given by:

$$P_{-1} = \int_{-\infty}^{-\theta} \mathrm{pdf}_{Gaussian}(x)\,dx, \qquad P_{0} = \int_{-\theta}^{\theta} \mathrm{pdf}_{Gaussian}(x)\,dx, \qquad P_{+1} = \int_{\theta}^{+\infty} \mathrm{pdf}_{Gaussian}(x)\,dx$$

where pdf_Gaussian(x) denotes the Gaussian probability density function of the stock price, and P_-1, P₀ and P₊₁ denote the probabilities that the stock falls, stays stable or rises, respectively. The formula shows that the threshold θ directly determines the probabilities of the three cases. For a time series with no prior knowledge, the future is predicted according to the probability distribution that the series is assumed to follow, and the relationship between prediction accuracy and the probability distribution is:

$$acc = P_{-1}^{2} + P_{0}^{2} + P_{+1}^{2}$$

Since θ and -θ are symmetric about 0:

P_-1 = P₊₁

P₀ = 1 - 2P₊₁

Combining the above formulas finally gives:

acc = 6P² - 4P + 1

where P = P_-1 = P₊₁. The relationship between the threshold θ and the prediction accuracy acc is shown in Fig. 3.
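The relationship acc = 6P² − 4P + 1 can be checked numerically. The sketch below assumes a standard normal distribution (zero mean, unit variance) for the price change, which the text does not state but which is consistent with the minimum near θ ≈ 0.4 reported for Fig. 3; `accuracy` is a hypothetical helper name.

```python
from statistics import NormalDist


def accuracy(theta, dist=NormalDist()):
    """acc = 6P^2 - 4P + 1 with P = P(x > theta) under the assumed Gaussian."""
    p = 1.0 - dist.cdf(theta)  # P_{+1}, equal to P_{-1} by symmetry
    return 6 * p * p - 4 * p + 1


# The quadratic is minimal at P = 1/3, i.e. where cdf(theta) = 2/3.
theta_min = NormalDist().inv_cdf(2 / 3)
acc_min = accuracy(theta_min)
```

At P = 1/3 all three classes are equally likely, so a predictor that has no prior knowledge cannot do better than one in three, which is why the curve bottoms out there.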

As can be seen from Fig. 3, the prediction accuracy acc (labeled accuracy in the figure) varies with the threshold θ along a curve resembling an inverted sine function. The minimum of acc is reached at P = 1/3, where θ is about 0.4. To the right of the minimum, acc increases as θ increases, but when θ is very large most price changes are predicted as unchanged and the stock prediction becomes meaningless. To the left of the minimum, acc decreases as θ increases, so a smaller θ gives a higher prediction accuracy; however, θ cannot be lower than the per-transaction cost of the stock market, otherwise the fluctuations of the market cannot be reflected. A reasonable value of θ is therefore equal to or slightly greater than the average per-transaction cost of the stock market. Taking the Hong Kong stock market as an example, θ is set to the average per-transaction cost of 0.003 (30 bps, where bps denotes basis points).

6) using the target output value determined in step 5), and taking the low-dimensional feature representation vector obtained in step 4) as the input, optimize the parameters of the ELM neural network model to obtain the final prediction model.

61) according to the prediction horizon Δt of the stock, split the historical quote data and the news text of the stock to be predicted, and label them to form the following sample set:

{(X₁, Y₁), (X₂, Y₂) ..., (X_l, Y_l)}

The sample set contains l samples, where X_i is the input value, namely the low-dimensional feature representation vector at time t_i obtained in step 4), and Y_i is the output value, namely the target output value at time t_i + Δt obtained in step 5).

62) as shown in Fig. 4, the sample set is divided into a training sample set and a test sample set. The basic ELM and the kernel-function KELM neural networks are trained on the training sample set: the input values of the training samples are fed into the neural network to obtain its output, the error between this output and the target output values of the training samples is computed, the error is back-propagated to update the weights of each layer of neurons, and all training samples are iterated over until the error converges to a certain range. For the ELM with a kernel function, two parameters also need to be chosen. Taking the RBF (radial basis function) kernel as an example, the two parameters γ and C must be set; the range of γ is set to {2^-17, 2^-16, ..., 2^2} and the range of C to {2^-5, 2^-4, ..., 2^14}, and the optimal values of the two parameters are chosen on the training sample set by cross-validation using open-source software.
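The parameter ranges for γ and C translate into a simple grid search. In the sketch below, `toy_score` is a made-up stand-in for the cross-validated accuracy of a KELM trained with a given (γ, C) pair; only the two grids come from the text.

```python
from itertools import product

gammas = [2.0 ** e for e in range(-17, 3)]  # 2^-17, 2^-16, ..., 2^2
costs = [2.0 ** e for e in range(-5, 15)]   # 2^-5, 2^-4, ..., 2^14


def grid_search(score, gammas=gammas, costs=costs):
    """Return the (gamma, C) pair with the highest score."""
    return max(product(gammas, costs), key=lambda gc: score(*gc))


def toy_score(gamma, c):
    # Hypothetical score surface, used only to exercise the search.
    return -((gamma - 0.25) ** 2) - ((c - 8.0) ** 2)


best_gamma, best_c = grid_search(toy_score)
```

Each grid has 20 values, so an exhaustive search evaluates 400 parameter pairs per cross-validation round.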

The basic ELM and the kernel-function KELM neural networks are cross-validated on the test sample set, and the final prediction model is determined.

The method of the present invention was used to predict the future trend of stocks, with 33 Hong Kong stocks from 2001 as the test set. Compared with ELM neural network predictions that consider only news text or only stock data, accuracy rises by 0.02 for prediction horizons of 5 to 30 minutes; compared with a principal component analysis (PCA) feature learning method that considers both news text and stock data, accuracy improves by 0.03.
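For orientation, a basic ELM classifier can be sketched with NumPy. Note that this uses the standard ELM formulation, in which the hidden weights are random and fixed and only the output weights are solved by least squares, rather than the iterative weight update described above; the class name, the tanh activation, the hidden size and the synthetic three-cluster data are all assumptions made for illustration and are not the patent's data or settings.

```python
import numpy as np


class ELM:
    """Basic extreme learning machine for multi-class classification."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One-hot targets, one column per class (-1, 0, +1).
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Random input weights and biases stay fixed after initialization.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights by least squares via the pseudo-inverse.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]


# Synthetic, well-separated clusters standing in for low-dimensional features.
rng = np.random.default_rng(1)
centers = {-1: (-3.0, -3.0), 0: (0.0, 0.0), 1: (3.0, 3.0)}
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in centers.values()])
y = np.repeat(list(centers.keys()), 30)
idx = rng.permutation(len(y))
train, test = idx[:60], idx[60:]

model = ELM(n_hidden=20).fit(X[train], y[train])
test_acc = float(np.mean(model.predict(X[test]) == y[test]))
```

A kernel variant (KELM) would replace the random hidden layer with a kernel matrix and add the regularization parameter C; comparing both on held-out data mirrors the model selection described in step 62).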

Claims

1. An intelligent stock prediction method combining news text, characterized by comprising the following steps:

1) preprocess the news text: perform Chinese word segmentation and stop-word filtering, and delete news text without a time tag;

2) determine the prediction horizon Δt of the stock, and filter and select news text according to its time tag;

3) compute a feature representation for the selected news text, and combine it with the stock data feature vector of the corresponding time to form the feature representation vector for that time;

4) construct an autoencoder deep learning network, and input the feature representation vector obtained in step 3) into it for compression and feature extraction to obtain a low-dimensional feature representation vector;

5) construct an ELM neural network model, quantify the degree of change of the stock price, and determine the target output value of the ELM neural network model;

6) using the target output value determined in step 5), and taking the low-dimensional feature representation vector obtained in step 4) as the input of the ELM neural network model, optimize the parameters of the ELM neural network model to obtain the final prediction model.

2. The intelligent stock prediction method combining news text according to claim 1, characterized in that filtering and selecting news text according to its time tag in step 2) comprises the following specific steps:

21) determine the stock prediction horizon Δt;

22) the data available for the stock lie in the period [t_s, t_e]; starting from t_s, for each t₀ ∈ [t_s, t_e − Δt], select the news text nearest to time t₀ + Δt as the news text for time t₀; after the period [t_s, t_e − Δt] has been traversed, all news texts are obtained.

3. The intelligent stock prediction method combining news text according to claim 1, characterized in that forming the feature representation vector for the corresponding time in step 3) comprises the following specific steps:

31) perform a tf-idf calculation on the news text, and take the tf-idf values of the words in the news text for time t₀ as the feature representation of that news text; all news texts together contain n distinct words, so the feature representation of the news text for time t₀ is [w₁, w₂, ..., w_n], where w₁, w₂ and w_n are the tf-idf values of the first, second and n-th word respectively;

32) select the required indicators from the specific stock data to form the stock data feature vector; the indicators include the opening price, closing price, trading volume and highest price; if a given stock has m indicators, the stock data feature vector at time t₀ is [q₁, q₂, ..., q_m], where q₁, q₂ and q_m are the values of the first, second and m-th indicator at time t₀ respectively;

33) combine the features obtained in step 31) and step 32) to obtain the feature representation vector that joins the news text and the stock data for the corresponding time:

[w₁, w₂, ..., w_n, q₁, q₂, ..., q_m]

4. The intelligent stock prediction method combining news text according to claim 1, characterized in that obtaining the low-dimensional feature representation vector in step 4) is specifically as follows:

Each layer of the neural network is an autoencoder. The compression ratio of the feature representation vector is set according to the dimension of the feature representation vector obtained in step 3). The compression ratio equals the ratio of the number of neurons in the first layer of the deep neural network to the number in the last layer; the number of first-layer neurons is the dimension of the feature representation vector obtained in step 3), and the number of last-layer neurons is the dimension of the final low-dimensional feature representation vector. The number of layers of the deep learning network and the number of neurons in each layer are then determined, with the number of neurons decreasing linearly from the first layer to the last.

5. The intelligent stock prediction method combining news text according to claim 1, characterized in that determining the target output value of the ELM neural network model in step 5) is specifically as follows: the input of the ELM neural network model is the low-dimensional feature representation vector at time t₀, and the target output value is the target output y_i at time t₀ + Δt:

$$y_i = \begin{cases} +1, & r(x) > \theta \\ 0, & -\theta \le r(x) \le \theta \\ -1, & r(x) < -\theta \end{cases}$$

where r(x) is a comparison function reflecting the degree of change at time t₀ + Δt relative to time t₀, and θ is a threshold set according to the actual situation of the stock. When the rise of the stock at time t₀ + Δt exceeds the threshold θ, the target output y_i is +1; when the fall exceeds the threshold θ, the target output y_i is -1; when the change is smaller than the threshold θ, the stock trend is stable and the target output y_i is 0.

6. The intelligent stock prediction method combining news text according to claim 1, characterized in that obtaining the final prediction model in step 6) comprises the following specific steps:

61) according to the prediction horizon Δt of the stock, split the historical quote data and the news text of the stock to be predicted, and label them to form the following sample set:

{(X₁, Y₁), (X₂, Y₂) ..., (X_l, Y_l)}

The sample set contains l samples, where X_i is the input value, namely the low-dimensional feature representation vector at time t_i obtained in step 4), and Y_i is the output value, namely the target output value at time t_i + Δt obtained in step 5);

62) divide the sample set into a training sample set and a test sample set, train the ordinary ELM and the kernel-function KELM neural networks on the training sample set, cross-validate the two ELMs on the test sample set, and determine the final prediction model.