CN109658241A - Screw-thread steel futures price rise/fall probability prediction method - Google Patents

Screw-thread steel futures price rise/fall probability prediction method

Info

Publication number
CN109658241A
CN109658241A (application CN201811403947.1A)
Authority
CN
China
Prior art keywords
data
node
feature
screw
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811403947.1A
Other languages
Chinese (zh)
Inventor
周振华 (Zhou Zhenhua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Zhidaochuangyu Information Technology Co Ltd
Original Assignee
Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Zhidaochuangyu Information Technology Co Ltd filed Critical Chengdu Zhidaochuangyu Information Technology Co Ltd
Priority to CN201811403947.1A priority Critical patent/CN109658241A/en
Publication of CN109658241A publication Critical patent/CN109658241A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0278 Product appraisal
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04 Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Game Theory and Decision Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Human Resources & Organizations (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Mathematical Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Technology Law (AREA)
  • Probability & Statistics with Applications (AREA)
  • Tourism & Hospitality (AREA)
  • Algebra (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)

Abstract

The invention discloses a method for predicting the rise/fall probability of screw-thread steel (rebar) futures prices. The method is as follows: screw-thread steel feature data are collected from the internet and from third-party databases; a decision tree is generated using the information gain ratio combined with the square-error minimization criterion, retaining the features with larger information gain; the empirical entropy of each node is then evaluated through a loss function, and the algorithm recursively backs up from the leaf nodes of the tree: if merging all leaf nodes of some parent node reduces the loss function, the tree is pruned and that parent node becomes a new leaf node; this step is repeated until no further merging is possible, which ultimately reduces the probability of over-fitting. The invention increases the speed of screw-thread steel futures price forecasting, saves manual analysis cost, and makes possible multi-dimensional big-data statistical analysis that is difficult to complete manually; at the same time, the model learns continuously, so prediction accuracy keeps improving.

Description

Screw-thread steel futures price rise/fall probability prediction method
Technical field
The present invention relates to the field of futures price forecasting, and in particular to a method for predicting the rise/fall probability of screw-thread steel futures prices.
Background art
Term definitions:
Screw-thread steel: the common name for hot-rolled ribbed bar (rebar). The grade designation of an ordinary hot-rolled ribbed bar consists of the prefix HRB followed by the minimum yield point of the grade. H, R and B are the initial letters of the three English words Hot-rolled, Ribbed and Bars respectively.
Futures: futures are entirely different from stocks. A futures contract does not take physical goods as its subject matter; rather, it is a standardized tradable contract based on a bulk commodity of defined quality, such as cotton, soybeans or petroleum, or on a financial asset such as stocks or bonds. The underlying can therefore be a commodity (such as gold, crude oil or agricultural products) or a financial instrument.
Decision tree: a decision tree (Decision Tree) is a decision-analysis method which, given the known probabilities of various outcomes, builds a tree to obtain the probability that the expected net present value (NPV) is greater than or equal to zero, assesses project risk and judges feasibility; it is an intuitive graphical application of probability analysis. Because the decision branches are drawn like the branches of a tree, it is called a decision tree.
Fitting: figuratively speaking, fitting means connecting a series of points in the plane with a smooth curve. Because there are infinitely many possible curves, there are many fitting methods. The fitted curve can generally be expressed as a function, and different functions correspond to different kinds of fitting.
Machine learning: machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, computational complexity theory and other subjects. It studies how computers simulate or realize human learning behaviour so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
Prediction model: when quantitative forecasting methods are used, the most important task is to establish a predictive mathematical model. A prediction model is a quantitative relationship between things, described in mathematical language or formulas and used for prediction. It reveals, to some extent, the inherent laws between things and serves as the direct basis for computing predicted values, so it has a significant effect on prediction accuracy. Any specific forecasting technique is characterized by its specific mathematical model; there are many kinds of forecasting techniques, each with a corresponding prediction model.
Basis: the basis is the difference between the spot price of a particular commodity at a given time and place and its futures price, calculated as spot price minus futures price. If the spot price is lower than the futures price the basis is negative; if the spot price is higher than the futures price the basis is positive.
Over-fitting: over-fitting means making the hypothesis overly strict in order to fit the training data consistently. Avoiding over-fitting is a core task in classifier design; classifier performance is usually evaluated by increasing the amount of data and using a test sample set.
The prior art "A method and device for forecasting the transaction price trend of plastic raw materials" obtains, for a preset historical period, order data for plastic raw materials, plastics futures price data, crude oil futures price data, bank interest rate data and exchange rate data; filters the order data according to preset conditions; and, from the filtered order data, plastics futures prices, crude oil futures prices, interest rates and exchange rates, computes an estimated transaction price for plastic raw materials over a preset future period.
Shortcomings of the prior art:
1. It does not analyse the characteristics of the screw-thread steel variety and cannot be used to predict the rise and fall of screw-thread steel futures.
2. It does not use machine learning techniques, has no self-learning capability, and requires repeated manual parameter tuning, which is time-consuming and laborious.
3. Analysis is slow, big-data scenarios are not supported, and adding features requires a large amount of work.
Summary of the invention
To solve the above problems, the present invention provides a method for predicting the rise/fall probability of screw-thread steel futures prices. The specific steps of this scheme are as follows:
1. Learning-sample data collection, including crawling data from the internet and purchasing data from third-party databases;
2. Data warehousing: after the data are obtained they are stored in a database; at storage time all feature values are collated and computed, for later use as training and test data;
3. Data feature selection and calculation
Take the data of a continuous period from the database as the training data set D; take another period of data that does not overlap with D as the test data set T; input the training data set D and feature A;
Compute the empirical entropy H(D) of data set D, the empirical conditional entropy H(D|A) of D given feature A, the information gain g(D, A), and the information gain ratio gR(D, A);
4. Decision-tree model generation and pruning
Generate the decision tree using the CART algorithm with the square-error minimization criterion. CART assumes the decision tree is a binary tree; by recursively splitting each feature in two, the feature space is partitioned into a finite number of cells, and the predicted probability distribution is determined on these cells. After the decision tree has been built it is pruned to remove noise nodes; pruning is realized by minimizing the loss function of the whole decision tree;
5. Model testing: input the previously prepared test data set T and compare the error between the model output and the target value to measure the quality of the training result; when the prediction accuracy exceeds 70%, the model is used for the next training step;
6. Rotation training: divide the old data in the data warehouse into multiple groups of training samples and test data to complete multiple rounds of training, and continuously acquire newly generated data as further training samples and test data; repeat steps 2-5, iterating the rotation training of the model until the accuracy reaches the designated value, then output the model;
7. Input the latest data set and output the rise/fall prediction for the future screw-thread steel futures price.
In this scheme, crawling data from the internet means using a timed script to crawl and parse the corresponding pages and storing the parsed data in the database; the timed crawling and parsing script can be implemented with the Python libraries requests, celery and beautifulsoup4. Purchasing data from third-party databases includes both free and paid use.
The above data include port inventory data, registered warehouse receipt data, spot data, futures data and basis data. The data are cleaned, collated and stored in the database after being merged on a daily basis; when the sampling interval of a data series is less than one day, the average of all values for that day is taken; data with a sampling interval greater than one day are not used (a minimal sketch of this daily merge follows).
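Purely as an illustrative sketch (not part of the claimed method), the daily-merge rule above could look as follows in pandas; the column names series, sample_time and value are assumptions made for the example.

```python
import pandas as pd

def merge_to_daily(raw: pd.DataFrame) -> pd.DataFrame:
    """Merge raw samples to one value per data series per day.

    Intraday samples are averaged over the day; series sampled less often
    than once per day are discarded, matching the rule described above.
    Assumed (illustrative) columns: 'series', 'sample_time', 'value'.
    """
    df = raw.copy()
    df["sample_time"] = pd.to_datetime(df["sample_time"])
    df["day"] = df["sample_time"].dt.normalize()

    kept = []
    for name, grp in df.groupby("series"):
        spacing = grp["sample_time"].sort_values().diff().median()
        if pd.notna(spacing) and spacing > pd.Timedelta(days=1):
            continue  # sampled less often than daily: drop per the rule above
        daily = grp.groupby("day", as_index=False)["value"].mean()
        daily["series"] = name
        kept.append(daily)
    return pd.concat(kept, ignore_index=True) if kept else pd.DataFrame()
```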
In this scheme, the feature values are calculated as follows:
Port inventory change = current port inventory - previous port inventory
Registered warehouse receipt change = current registered warehouse receipts - previous registered warehouse receipts
Basis = spot price - futures price
Basis rate = basis / spot price
Relative basis = basis - average basis
Relative basis rate = relative basis / spot price
Other features are taken directly from the database values;
these other features include the 3-day, 7-day, 15-day and 30-day spot average prices.
For the feature selection in step 3, the futures price data serve as the model output and the other feature data serve as the model input. Let the sample size of the current data set D be |D|, with K classes C_k, where |C_k| is the number of samples in class C_k. A feature A has n distinct values a_1, a_2, …, a_n. According to the value of feature A, the data set D can be partitioned into n subsets D_1, D_2, …, D_n, with |D_i| the number of samples in D_i; the set of samples in D_i that belong to class C_k is denoted D_ik, with |D_ik| its number of samples.
The empirical entropy H(D) of data set D is computed as
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
Entropy expresses the randomness, i.e. the degree of disorder, of the data samples.
The empirical conditional entropy H(D|A) of data set D given feature A is computed as
H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}
The conditional entropy expresses the entropy of data set D when the value of feature A is fixed.
The information gain g(D, A) is computed as
g(D, A) = H(D) - H(D|A)
The information gain expresses the degree to which knowing feature A reduces the entropy of the class labels of data set D.
The information gain ratio gR(D, A) is computed as
g_R(D, A) = \frac{g(D, A)}{H_A(D)}
where H_A(D) denotes the empirical entropy of training set D with respect to the values of feature A, that is, the entropy of data set D when the value of A is given, computed as
H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
The information gain ratio of feature A with respect to training data set D is thus defined as the ratio of its information gain to the entropy of training set D with respect to the values of feature A. The larger the information gain ratio, the more effective the feature; when building the tree, the information gain ratio is computed at each node and finally determines which feature each node selects.
Further, in step 4 the decision tree is generated as follows:
(1) Starting from the root node, compute the information gain of every candidate feature at the node, select the feature with the largest information gain as the node's feature, and build child nodes according to the different values of this feature;
(2) Call the above procedure recursively on the child nodes to build the decision tree;
(3) Stop when the information gain of all features is small or no feature remains to be selected.
The square-error minimization criterion is as follows:
Assume the input space is partitioned into M cells R_1, R_2, …, R_M, and that each cell R_m has a fixed output value c_m; the regression tree can then be expressed as
f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)
Once the partition of the input space is determined, the square error
\sum_{x_i \in R_m} (y_i - f(x_i))^2
is used to represent the prediction error of the regression tree on the training data, where y_i denotes the given output feature in the data set.
Also in step 4, the pruning of the decision tree adds, on top of the information gain, a penalty on the model complexity |T|, which gives the definition of the loss function:
C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|
In the formula, N_t denotes the number of leaf nodes below the current node t; H_t(T) denotes the empirical entropy of the test data set computed downward from the current node t; |T| denotes the number of leaf nodes of the whole decision tree, i.e. the model complexity; and the size of α reflects the trade-off between the fit to the training set and the model complexity. H_t(T) is computed as
H_t(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}
where the data set T is partitioned according to the values of feature A into n subsets T_1, T_2, …, T_n, and |T_i| is the number of samples in T_i. The pruning process selects, for a given α, the model with the smallest loss function; the specific algorithm is as follows:
(1) Compute the empirical entropy of every node;
(2) Recursively back up from the leaf nodes of the tree; if merging all leaf nodes of some parent node reduces the loss function, prune and turn that parent node into a new leaf node;
(3) Repeat step (2) until no further merging is possible.
The technical solution of the present invention brings the following beneficial effects:
1. The speed of screw-thread steel futures price forecasting is increased;
2. Manual analysis costs are saved;
3. Multi-dimensional big-data statistical analysis that is difficult to complete manually is made possible;
4. The model learns continuously, so prediction accuracy becomes higher and higher.
Brief description of the drawings
Fig. 1 is the flow chart of this scheme.
Specific embodiments
The present invention is described in more detail below with reference to the accompanying drawing and an implementation method.
Fig. 1 is the flow chart of this scheme; the specific steps are as follows:
1. Learning-sample data collection, including crawling data from the internet and purchasing data from third-party databases;
1.1 Crawling data from the internet:
At present there are websites on the internet that publish commodity data for free, generally refreshed on a rolling schedule. A timed script can be used to crawl and parse the corresponding pages and store the parsed data in the database. The timed crawling and parsing script can be implemented with Python libraries such as requests, celery and beautifulsoup4.
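A minimal sketch of such a crawl-and-parse script is given below; the URL, page structure and column meanings are assumptions for illustration, and only the requests and beautifulsoup4 calls named above are used (the timed execution itself would be handled separately, e.g. by a celery beat schedule or a cron job).

```python
import requests
from bs4 import BeautifulSoup

DATA_URL = "https://example.com/rebar-spot-quotes"  # placeholder URL

def crawl_quotes():
    """Fetch one publicly posted quote page and parse it into records.

    The URL and the CSS selectors are placeholders; a real page will differ.
    """
    resp = requests.get(DATA_URL, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    records = []
    for tr in soup.select("table.quotes tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if len(cells) >= 3:
            records.append({"date": cells[0],
                            "spot_price": float(cells[1]),
                            "futures_price": float(cells[2])})
    return records  # the caller stores these records in the database
```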
1.2 Purchasing data from third-party databases:
Some third-party databases already provide structured data that can be queried free of charge or for a fee.
2. Data warehousing: after the data are obtained they are stored in a database; at storage time all feature values are collated and computed, for later use as training and test data. The feature values are calculated as follows:
Port inventory change = current port inventory - previous port inventory
Registered warehouse receipt change = current registered warehouse receipts - previous registered warehouse receipts
Basis = spot price - futures price
Basis rate = basis / spot price
Relative basis = basis - average basis
Relative basis rate = relative basis / spot price
Other features include the 3-day, 7-day, 15-day and 30-day spot average prices, taken directly from the database values.
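The feature calculations of step 2 could be sketched as follows; the input column names (day, spot_price, futures_price, port_inventory, registered_receipts) are illustrative assumptions for a table holding one row per day.

```python
import pandas as pd

def add_feature_columns(daily: pd.DataFrame) -> pd.DataFrame:
    """Compute the feature values listed in step 2 on a one-row-per-day table.

    Assumed (illustrative) input columns: 'day', 'spot_price',
    'futures_price', 'port_inventory', 'registered_receipts'.
    """
    df = daily.sort_values("day").copy()

    df["port_inventory_change"] = df["port_inventory"].diff()
    df["receipt_change"] = df["registered_receipts"].diff()

    df["basis"] = df["spot_price"] - df["futures_price"]
    df["basis_rate"] = df["basis"] / df["spot_price"]
    df["relative_basis"] = df["basis"] - df["basis"].mean()
    df["relative_basis_rate"] = df["relative_basis"] / df["spot_price"]

    # 3-, 7-, 15- and 30-day spot average prices as additional input features
    for n in (3, 7, 15, 30):
        df[f"spot_ma_{n}"] = df["spot_price"].rolling(n).mean()
    return df
```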
3. Data feature selection and calculation
3.1 Take the data of a continuous period from the database as the training data set D; take another period of data that does not overlap with D as the test data set T; input the training data set D and feature A.
The futures price data serve as the model output and the other feature data serve as the model input. Let the sample size of the current data set D be |D|, with K classes C_k, where |C_k| is the number of samples in class C_k. A feature A has n distinct values a_1, a_2, …, a_n. According to the value of feature A, the data set D can be partitioned into n subsets D_1, D_2, …, D_n, with |D_i| the number of samples in D_i; the set of samples in D_i that belong to class C_k is denoted D_ik, with |D_ik| its number of samples.
3.2 Compute the empirical entropy H(D) of data set D. Entropy expresses the randomness (degree of disorder) of the data samples, i.e.:
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
3.3 Compute the empirical conditional entropy H(D|A) of data set D given feature A. The conditional entropy expresses the entropy of data set D when the value of feature A is fixed, i.e.:
H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}
3.4 Compute the information gain. The information gain expresses the degree to which knowing feature A reduces the entropy of the class labels of data set D, i.e.:
g(D, A) = H(D) - H(D|A)
3.5 Compute the information gain ratio:
Using the information gain alone as the feature-selection criterion tends to favour features with many values; this can be corrected by using the information gain ratio instead. The information gain ratio of feature A with respect to training data set D is defined as the ratio of its information gain to the entropy of training set D with respect to the values of feature A, i.e.
g_R(D, A) = \frac{g(D, A)}{H_A(D)}
where H_A(D) denotes the empirical entropy of training set D with respect to the values of feature A, that is, the entropy of data set D when the value of A is given, computed as
H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
The larger the information gain ratio, the more effective the feature. When building the tree, the information gain ratio is computed at each node and finally determines which feature each node selects.
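A compact sketch of the entropy, information gain and information gain ratio computations of step 3 (for one discretised feature) follows; it is illustrative only and uses base-2 logarithms as in the formulas above.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain_ratio(feature_values, labels):
    """Information gain g(D, A) and gain ratio gR(D, A) for one feature A.

    'feature_values' holds the (discretised) value of A for each sample and
    'labels' the class of each sample; both are plain Python sequences.
    """
    n = len(labels)
    h_d = entropy(labels)

    # Empirical conditional entropy H(D|A): weighted entropy of the subsets D_i
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    h_d_a = sum(len(sub) / n * entropy(sub) for sub in subsets.values())

    gain = h_d - h_d_a                       # g(D, A) = H(D) - H(D|A)
    h_a = entropy(list(feature_values))      # H_A(D), entropy of A's values
    ratio = gain / h_a if h_a > 0 else 0.0   # gR(D, A) = g(D, A) / H_A(D)
    return gain, ratio
```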
4. Decision-tree model generation and pruning
Generate the decision tree using the CART algorithm with the square-error minimization criterion. CART assumes the decision tree is a binary tree; by recursively splitting each feature in two, the feature space is partitioned into a finite number of cells, and the predicted probability distribution is determined on these cells.
The decision tree is generated as follows:
(1) Starting from the root node, compute the information gain of every candidate feature at the node, select the feature with the largest information gain as the node's feature, and build child nodes according to the different values of this feature;
(2) Call the above procedure recursively on the child nodes to build the decision tree;
(3) Stop when the information gain of all features is small or no feature remains to be selected.
The square-error minimization criterion is as follows:
Assume the input space is partitioned into M cells R_1, R_2, …, R_M, and that each cell R_m has a fixed output value c_m; the regression tree can then be expressed as
f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)
Once the partition of the input space is determined, the square error
\sum_{x_i \in R_m} (y_i - f(x_i))^2
is used to represent the prediction error of the regression tree on the training data, where y_i denotes the given output feature in the data set; a brief sketch of this split selection is given below.
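As an illustration of the square-error minimization criterion, the following sketch finds, for a single feature, the binary split point whose two cells (each predicting its mean output, the c_m above) give the smallest squared error; it is a simplified stand-in for the full CART split search.

```python
import numpy as np

def best_split(x: np.ndarray, y: np.ndarray):
    """Binary split point of one feature that minimises the squared error.

    Each of the two resulting cells predicts its mean output, playing the
    role of the fixed value c_m in the regression-tree expression above.
    Returns (threshold, squared_error).
    """
    best_t, best_err = None, float("inf")
    for t in np.unique(x)[:-1]:              # candidate thresholds
        left, right = y[x <= t], y[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```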
After the decision tree has been built it is pruned and noise nodes are removed; the pruning is realized by minimizing the loss function of the whole decision tree.
The pruning of the decision tree adds, on top of the information gain, a penalty on the model complexity |T|, which gives the definition of the loss function:
C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|
In the formula, N_t denotes the number of leaf nodes below the current node t; H_t(T) denotes the empirical entropy of the test data set computed downward from the current node t; |T| denotes the number of leaf nodes of the whole decision tree, i.e. the model complexity; and the size of α reflects the trade-off between the fit to the training set and the model complexity. H_t(T) is computed as
H_t(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}
where the data set T is partitioned according to the values of feature A into n subsets T_1, T_2, …, T_n, and |T_i| is the number of samples in T_i. The pruning process selects, for a given α, the model with the smallest loss function; the specific algorithm is as follows:
(1) Compute the empirical entropy of every node;
(2) Recursively back up from the leaf nodes of the tree; if merging all leaf nodes of some parent node reduces the loss function, prune and turn that parent node into a new leaf node;
(3) Repeat step (2) until no further merging is possible; a sketch of this procedure follows below.
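A minimal sketch of this bottom-up pruning loop is given below, assuming the tree is stored as nested dictionaries holding a children list and the training labels that reach each node; for concreteness the sketch takes N_t as the number of training samples reaching a leaf (the usual CART convention) and prunes whenever merging does not increase the loss.

```python
import math
from collections import Counter

def leaf_entropy(labels):
    """Empirical entropy of the labels reaching one leaf."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cost(node, alpha):
    """C_alpha(T) = sum over leaves of N_t * H_t(T) + alpha * (number of leaves)."""
    if node["children"] is None:                       # leaf node
        return len(node["labels"]) * leaf_entropy(node["labels"]) + alpha
    return sum(cost(child, alpha) for child in node["children"])

def prune(node, alpha):
    """Bottom-up pruning: collapse a parent into a leaf whenever doing so
    does not increase the loss C_alpha(T).

    A node is a dict with keys 'children' (list of nodes, or None for a
    leaf) and 'labels' (the training labels that reach the node).
    """
    if node["children"] is None:
        return node
    node["children"] = [prune(child, alpha) for child in node["children"]]

    as_leaf = {"children": None, "labels": node["labels"]}
    if cost(as_leaf, alpha) <= cost(node, alpha):
        return as_leaf      # merging the leaves does not increase the loss
    return node
```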
5. Model testing: input the previously prepared test data set T and compare the error between the model output and the target value to measure the quality of the training result; when the prediction accuracy exceeds 70%, the model is used for the next training step.
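Purely as an illustration of the 70% gate in step 5, a sketch follows; it assumes a fitted classifier exposing a scikit-learn style predict method and up/down labels.

```python
def passes_accuracy_gate(model, X_test, y_test, threshold=0.70):
    """Step 5: keep the model only if its rise/fall accuracy exceeds 70%.

    'model' is any fitted classifier with a scikit-learn style predict();
    the 70% threshold comes from the text above.
    """
    predictions = model.predict(X_test)
    correct = sum(int(p == y) for p, y in zip(predictions, y_test))
    return correct / len(y_test) > threshold
```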
6. Rotation training: divide the old data in the data warehouse into multiple groups of training samples and test data to complete multiple rounds of training, and continuously acquire newly generated data as further training samples and test data; repeat steps 2-5, iterating the rotation training of the model until the accuracy reaches the designated value, then output the model.
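The rotation training of step 6 could be sketched as below, reusing passes_accuracy_gate from the previous sketch; the batches argument (groups of old warehouse data plus newly collected data) and the make_model factory are assumptions for illustration.

```python
def rolling_train(make_model, batches):
    """Step 6: iterate training over successive groups of data.

    'batches' yields (X_train, y_train, X_test, y_test) tuples built from the
    old warehouse data plus newly collected data; 'make_model' builds a fresh
    decision-tree model. Both names are assumptions made for this sketch.
    """
    model = None
    for X_train, y_train, X_test, y_test in batches:
        model = make_model()
        model.fit(X_train, y_train)
        if passes_accuracy_gate(model, X_test, y_test):
            break            # accuracy reached the designated value
    return model
```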
7. Input the latest data set and output the rise/fall prediction for the future screw-thread steel futures price.

Claims (10)

1. A screw-thread steel futures price rise/fall probability prediction method, characterized by comprising the following steps:
(1) learning-sample data collection, including crawling data from the internet and purchasing data from third-party databases;
(2) data warehousing: after the data are obtained they are stored in a database, and at storage time all feature values are collated and computed for later use as training and test data;
(3) data feature selection and calculation:
taking the data of a continuous period from the database as a training data set D, taking another period of data that does not overlap with D as a test data set T, and inputting the training data set D and a feature A;
computing the empirical entropy H(D) of data set D, the empirical conditional entropy H(D|A) of D given feature A, the information gain g(D, A) and the information gain ratio gR(D, A);
(4) decision-tree model generation and pruning:
generating the decision tree using the CART algorithm with the square-error minimization criterion, wherein CART assumes the decision tree is a binary tree and, by recursively splitting each feature in two, partitions the feature space into a finite number of cells on which the predicted probability distribution is determined; after the decision tree has been built, pruning it to remove noise nodes, the pruning being realized by minimizing the loss function of the whole decision tree;
(5) model testing: inputting the previously prepared test data set T and comparing the error between the model output and the target value to measure the quality of the training result; when the prediction accuracy exceeds 70%, using the model for the next training step;
(6) rotation training: dividing the old data in the data warehouse into multiple groups of training samples and test data to complete multiple rounds of training, and continuously acquiring newly generated data as further training samples and test data; repeating steps (2)-(5) to iterate the rotation training of the model until the accuracy reaches a designated value, and outputting the model;
(7) inputting the latest data set and outputting the rise/fall prediction for the future screw-thread steel futures price.
2. The screw-thread steel futures price rise/fall probability prediction method according to claim 1, characterized in that crawling data from the internet means using a timed script to crawl and parse the corresponding pages and storing the parsed data in the database, the timed crawling and parsing script being implementable with the Python libraries requests, celery and beautifulsoup4, and that purchasing data from third-party databases includes both free and paid use;
the data include port inventory data, registered warehouse receipt data, spot data, futures data and basis data; the data are cleaned, collated and stored in the database after being merged on a daily basis;
when the sampling interval of a data series is less than one day, the average of all values for that day is taken;
data with a sampling interval greater than one day are not used.
3. The screw-thread steel futures price rise/fall probability prediction method according to claim 2, characterized in that the feature values are calculated as follows:
port inventory change = current port inventory - previous port inventory
registered warehouse receipt change = current registered warehouse receipts - previous registered warehouse receipts
basis = spot price - futures price
basis rate = basis / spot price
relative basis = basis - average basis
relative basis rate = relative basis / spot price
other features are taken directly from the database values;
the other features include the 3-day, 7-day, 15-day and 30-day spot average prices.
4. The screw-thread steel futures price rise/fall probability prediction method according to claim 2, characterized in that, for the feature selection, the futures price data serve as the model output and the other feature data serve as the model input; the sample size of the current data set D is |D|, with K classes C_k, where |C_k| is the number of samples in class C_k; a feature A has n distinct values a_1, a_2, …, a_n; according to the value of feature A, the data set D can be partitioned into n subsets D_1, D_2, …, D_n, where |D_i| is the number of samples in D_i, and the set of samples in D_i belonging to class C_k is denoted D_ik, with |D_ik| its number of samples.
5. The screw-thread steel futures price rise/fall probability prediction method according to claim 4, characterized in that the empirical entropy H(D) of the data set D is computed as
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}
wherein the entropy expresses the randomness, i.e. the degree of disorder, of the data samples.
6. The screw-thread steel futures price rise/fall probability prediction method according to claim 4, characterized in that the empirical conditional entropy H(D|A) of the data set D given feature A is computed as
H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}
wherein the conditional entropy expresses the entropy of the data set D when the value of feature A is fixed.
7. The screw-thread steel futures price rise/fall probability prediction method according to claim 4, characterized in that the information gain g(D, A) is computed as
g(D, A) = H(D) - H(D|A)
wherein the information gain expresses the degree to which knowing feature A reduces the entropy of the class labels of data set D.
8. The screw-thread steel futures price rise/fall probability prediction method according to claim 4, characterized in that the information gain ratio gR(D, A) is computed as
g_R(D, A) = \frac{g(D, A)}{H_A(D)}
wherein H_A(D) denotes the empirical entropy of the training set D with respect to the values of feature A, that is, the entropy of data set D when the value of A is given, computed as
H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}
the information gain ratio of feature A with respect to the training data set D is thus defined as the ratio of its information gain to the entropy of the training set D with respect to the values of feature A; the larger the information gain ratio, the more effective the feature; when building the tree, the information gain ratio is computed at each node and finally determines which feature each node selects.
9. The screw-thread steel futures price rise/fall probability prediction method according to claim 8, characterized in that the decision tree is generated as follows:
(1) starting from the root node, computing the information gain of every candidate feature at the node, selecting the feature with the largest information gain as the node's feature, and building child nodes according to the different values of this feature;
(2) calling the above procedure recursively on the child nodes to build the decision tree;
(3) stopping when the information gain of all features is small or no feature remains to be selected;
the square-error minimization criterion is as follows:
assuming the input space is partitioned into M cells R_1, R_2, …, R_M, each cell R_m having a fixed output value c_m, the regression tree can be expressed as
f(x) = \sum_{m=1}^{M} c_m I(x \in R_m)
once the partition of the input space is determined, the square error
\sum_{x_i \in R_m} (y_i - f(x_i))^2
is used to represent the prediction error of the regression tree on the training data, where y_i denotes the given output feature in the data set.
10. The screw-thread steel futures price rise/fall probability prediction method according to claim 9, characterized in that the pruning of the decision tree adds, on top of the information gain, a penalty on the model complexity |T|, giving the definition of the loss function:
C_\alpha(T) = \sum_{t=1}^{|T|} N_t H_t(T) + \alpha |T|
wherein N_t denotes the number of leaf nodes below the current node t; H_t(T) denotes the empirical entropy of the test data set computed downward from the current node t; |T| denotes the number of leaf nodes of the whole decision tree, i.e. the model complexity; the size of α reflects the trade-off between the fit to the training set and the model complexity; H_t(T) is computed as
H_t(T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}
the data set T being partitioned according to the values of feature A into n subsets T_1, T_2, …, T_n, with |T_i| the number of samples in T_i;
the pruning process selects, for a given α, the model with the smallest loss function; the specific algorithm is as follows:
(1) computing the empirical entropy of each node;
(2) recursively backing up from the leaf nodes of the tree; if merging all leaf nodes of some parent node reduces the loss function, pruning and turning that parent node into a new leaf node;
(3) repeating step (2) until no further merging is possible.
CN201811403947.1A 2018-11-23 2018-11-23 Screw-thread steel futures price rise/fall probability prediction method Pending CN109658241A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811403947.1A CN109658241A (en) Screw-thread steel futures price rise/fall probability prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811403947.1A CN109658241A (en) Screw-thread steel futures price rise/fall probability prediction method

Publications (1)

Publication Number Publication Date
CN109658241A true CN109658241A (en) 2019-04-19

Family

ID=66112435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811403947.1A Pending CN109658241A (en) Screw-thread steel futures price rise/fall probability prediction method

Country Status (1)

Country Link
CN (1) CN109658241A (en)


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288482A (en) * 2019-07-02 2019-09-27 欧冶云商股份有限公司 Steel mill's futures exchange method and system
CN110288482B (en) * 2019-07-02 2021-03-30 欧冶云商股份有限公司 Steel mill futures trading method and system
CN111062477A (en) * 2019-12-17 2020-04-24 腾讯云计算(北京)有限责任公司 Data processing method, device and storage medium
CN111062477B (en) * 2019-12-17 2023-12-08 腾讯云计算(北京)有限责任公司 Data processing method, device and storage medium
CN111861750A (en) * 2020-07-22 2020-10-30 北京睿知图远科技有限公司 Feature derivation system based on decision tree method and readable storage medium
CN112329809A (en) * 2020-09-25 2021-02-05 国网辽宁省电力有限公司大连供电公司 High-voltage circuit breaker fault diagnosis method based on decision tree algorithm
CN112489733A (en) * 2020-12-14 2021-03-12 郑州轻工业大学 Octane number loss prediction method based on particle swarm algorithm and neural network
CN112489733B (en) * 2020-12-14 2023-04-18 郑州轻工业大学 Octane number loss prediction method based on particle swarm algorithm and neural network
CN116720356A (en) * 2023-06-08 2023-09-08 中国汽车工程研究院股份有限公司 Design method of active safety module of vehicle based on accident damage prediction of cyclist

Similar Documents

Publication Publication Date Title
CN109658241A (en) Screw-thread steel futures price rise/fall probability prediction method
Li et al. The role of text-extracted investor sentiment in Chinese stock price prediction with the enhancement of deep learning
Ouyang et al. Agricultural commodity futures prices prediction via long-and short-term time series network
Moody et al. Architecture selection strategies for neural networks: Application to corporate bond rating prediction
Zhao et al. Sales forecast in e-commerce using convolutional neural network
Pham et al. Efficient estimation and optimization of building costs using machine learning
Nassar et al. Deep learning based approach for fresh produce market price prediction
Sbrana et al. Short-term inflation forecasting: the META approach
Wang et al. Cryptocurrency price prediction based on multiple market sentiment
Xu et al. An optimized decomposition integration framework for carbon price prediction based on multi-factor two-stage feature dimension reduction
He et al. End-to-end probabilistic forecasting of electricity price via convolutional neural network and label distribution learning
Houetohossou et al. Deep learning methods for biotic and abiotic stresses detection and classification in fruits and vegetables: State of the art and perspectives
Lalwani et al. The cross-section of Indian stock returns: evidence using machine learning
CN117807302B (en) Customer information processing method and device
Schosser Tensor extrapolation: Forecasting large-scale relational data
Feng et al. Predicting book sales trend using deep learning framework
Pattewar et al. Stock prediction analysis by customers opinion in Twitter data using an optimized intelligent model
Taherkhani et al. Intelligent decision support system using nested ensemble approach for customer churn in the hotel industry
O'Leary et al. An evaluation of machine learning approaches for milk volume prediction in Ireland
Kang et al. Predicting Stock Closing Price with Stock Network Public Opinion Based on AdaBoost-IWOA-Elman Model and CEEMDAN Algorithm
CN115841345A (en) Cross-border big data intelligent analysis method, system and storage medium
CN115187312A (en) Customer loss prediction method and system based on deep learning
Devyatkin et al. Neural networks for food export gain forecasting
CN111091410B (en) Node embedding and user behavior characteristic combined net point sales prediction method
Kangane et al. Analysis of different regression models for real estate price prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190419)