CN106407161A

CN106407161A - Distributed calculating method of standard deviation

Info

Publication number: CN106407161A
Application number: CN201611032295.6A
Authority: CN
Inventors: 卓颋; 刘洪明; 殷荣华; 高海军; 何涵
Original assignee: Joan Beijing Innovation Technology Co Ltd; Chongqing University of Post and Telecommunications
Current assignee: Joan Beijing Innovation Technology Co Ltd; Chongqing University of Post and Telecommunications
Priority date: 2016-11-22
Filing date: 2016-11-22
Publication date: 2017-02-15

Abstract

The invention discloses a distributed calculating method of standard deviation. The method comprises the following steps: 1) inputting each local population P_i; 2) calculating the mean μ_i, the standard deviation STD.P_i and the number of data n_i of each local population P_i; 3) calculating the global mean of the collected data according to a formula; and 4) using the formula to calculate the global standard deviation. With the disclosed method, the global standard deviation can be calculated as long as the mean, standard deviation and number of data of each local population are known. The method markedly reduces the amount of calculation, and because the dispersedly stored local populations do not need to be read frequently, a large amount of query and access time is saved and the actual calculation efficiency is greatly improved.

Description

The distributed computing method of standard deviation

Technical field

The present invention relates to the technical field of standard deviation calculation, and in particular to a distributed computing method of standard deviation.

Background art

Standard deviation is defined as the square root of the arithmetic mean of the squared deviations of each value in a population from the population mean. In statistics, the standard deviation is commonly used to measure how much the values in a set differ and how dispersed they are: the larger the standard deviation, the more most values differ from their mean. In physics, for example, when a measurement is repeated, the standard deviation of the set of measured values indicates the precision of the measurements. The prior art mainly provides the following methods for obtaining the standard deviation:

First, the sampling method of standard deviation: a sample is drawn from the population data and the sample standard deviation is calculated as a substitute for the population standard deviation.

However, sampling introduces sampling bias, and in a big-data environment this bias becomes more pronounced.

Second, the traditional calculation method of the population standard deviation:

By definition, the standard deviation is the square root of the mean of the squared differences between each data value and the mean, where

the formula for the mean μ is:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$

and the formula for the standard deviation σ is:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2} \qquad (2)$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2 - \mu^2} \qquad (3)$$

Formula (3) can be derived from formula (2); the derivation is omitted.

In a big-data environment, the traditional method of calculating the standard deviation requires a very large amount of computation and is impractical.
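The definitional computation can be sketched as follows. This is a minimal illustration rather than code from the patent: the function names and the sample values are assumptions, and the second function uses the algebraically equivalent mean-of-squares form.

```python
import math

def population_mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)

def population_std_two_pass(xs):
    """Definition: square root of the mean squared deviation from the mean."""
    mu = population_mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))

def population_std_mean_of_squares(xs):
    """Equivalent form: sqrt(mean(x^2) - mean(x)^2)."""
    mu = population_mean(xs)
    return math.sqrt(sum(x * x for x in xs) / len(xs) - mu * mu)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values, not from the patent
print(population_std_two_pass(data))         # 2.0
print(population_std_mean_of_squares(data))  # 2.0
```

Both forms read every data value, which is why the cost grows with the size of the population.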

Third, the iterative calculation method of standard deviation:

When new data arrive, the traditional method must read all the original data values again and calculate the new standard deviation together with the new data. To address this problem, an iterative calculation method of standard deviation has been proposed:

Assume a time series of data:

x₁, x₂, x₃, x₄, ..., x_n, x_{n+1}, ...

At time point n the datum x_n is obtained, and at time point n+1 the datum x_{n+1} is obtained. Whenever a new datum arrives, the standard deviation of the n values in a time window of length n, including the new datum, must be calculated.

The key steps are as follows. First calculate:

Then, when one new datum is added, calculate the sum X_{n+1} and the standard deviation STD.S_{n+1} iteratively:

$$X_{n+1} = X_n + x_{n+1} - x_1 \qquad (6)$$

Formula (6) iteratively calculates the sum of the data in a window of length n, and formula (7) iteratively calculates the standard deviation of the data in a window of length n.

Replacing the denominator (n - 1) with n gives the iterative calculation method of the population standard deviation:

For streaming data, the iterative calculation method of the population standard deviation handles newly added data simply and conveniently, but when new data arrive the calculation must still be carried out again over all the data, which causes redundant computation.
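The sliding-window update of formula (6) can be sketched as below. One assumption goes beyond the text: formulas (7) and (8) are not reproduced here, so the sketch keeps a running sum of squares alongside the running sum and derives the window's population standard deviation from both. The class name `SlidingWindowStd` is illustrative.

```python
import math
from collections import deque

class SlidingWindowStd:
    """Population std over a fixed-length window, updated per new value."""

    def __init__(self, initial):
        self.window = deque(initial)
        self.total = sum(initial)                     # X_n
        self.total_sq = sum(x * x for x in initial)   # assumed helper quantity

    def push(self, x_new):
        x_old = self.window.popleft()                 # oldest value leaves
        self.window.append(x_new)
        self.total += x_new - x_old                   # X_{n+1} = X_n + x_{n+1} - x_1
        self.total_sq += x_new * x_new - x_old * x_old
        return self.std()

    def std(self):
        n = len(self.window)
        mean = self.total / n
        # Guard against a tiny negative value caused by rounding.
        return math.sqrt(max(self.total_sq / n - mean * mean, 0.0))

# The 20 values of local population P1 from the worked example in the text.
p1 = [4, 16, 14, 13, 16, -7, -3, 16, 10, -19,
      1, -6, 9, -4, 17, 12, 3, 8, 18, 9]
w = SlidingWindowStd(p1)
w.push(-3)
print(w.total)  # 120
```

Note that the result describes only the most recent window, not all data seen so far.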

Fourth, the incremental calculation method of standard deviation

To solve the technical problem that the traditional method of calculating the standard deviation is computationally expensive, an incremental calculation method of standard deviation has also been proposed:

The method first derives the following two relations on the basis of formula (1):

$$x_n - \mu_{n-1} = n(\mu_n - \mu_{n-1}) \qquad (9)$$

And since:

Thus, combining formulas (9) and (10), it is derived that:

$$S_n = S_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n) \qquad (12)$$

Then one obtains:

The incremental calculation method needs only the standard deviation and variance of the previous population data and the single newly added datum to calculate the standard deviation of the new population. However, when facing big data in distributed storage, every value in the other distributed local populations must be substituted in one by one as a subsequent increment, and the existing mean and standard deviation of each local population cannot be used directly, so the computational efficiency is still not high.
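Formulas (10), (11) and (13) are not reproduced in this text. As a hedged sketch, the following is one standard route from the mean update to formula (12), assuming that S_n denotes the sum of squared deviations of the first n values from their mean μ_n and that the standard deviation is recovered as the square root of S_n / n:

```latex
\begin{aligned}
\mu_n &= \frac{(n-1)\mu_{n-1} + x_n}{n}
  \;\Longrightarrow\; x_n - \mu_{n-1} = n(\mu_n - \mu_{n-1}), \\
x_n - \mu_n &= (x_n - \mu_{n-1}) - (\mu_n - \mu_{n-1}) = (n-1)(\mu_n - \mu_{n-1}), \\
S_n &= \sum_{i=1}^{n} (x_i - \mu_n)^2
     = S_{n-1} + (n-1)(\mu_n - \mu_{n-1})^2 + (x_n - \mu_n)^2 \\
    &= S_{n-1} + (x_n - \mu_n)\bigl[(\mu_n - \mu_{n-1}) + (x_n - \mu_n)\bigr] \\
    &= S_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n), \\
\sigma_n &= \sqrt{S_n / n}.
\end{aligned}
```

The third line uses the fact that the deviations of the first n - 1 values from μ_{n-1} sum to zero, so the cross term vanishes.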

In summary, the method for the traditional calculations standard deviation according to standard deviation definition needs the deviation from average of each data value Square calculate, computationally intensive when data volume is a lot, when have new data enter fashionable it is necessary to recalculate overall average and new Sum of sguares of deviation from mean, therefore there is redundancy in its calculating.Though the incremental calculation method of existing standard deviation is all in the past without access Input data thus make use of known condition, but if the data bulk inputting afterwards than larger when, will enter afterwards Each data value carry out incremental computations one by one, then its amount of calculation nor substantially reduced.

Summary of the invention

In view of this, the object of the invention is to provide a distributed computing method of standard deviation that can calculate the standard deviation of the whole population knowing only the mean, standard deviation and number of data of each local population, thereby solving the technical problem that existing methods of calculating the standard deviation are computationally expensive.

The distributed computing method of standard deviation of the present invention comprises the following steps:

1) input each local population P_i;

2) calculate the mean μ_i, the standard deviation σ_i and the number of data n_i of each local population P_i;

3) calculate the mean of the whole input population according to the formula;

4) calculate the standard deviation of the whole input population using the formula.
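The patent's exact expression for step 4) is not reproduced in this text. As a hedged sketch, one standard identity that yields the global values from the local summaries, and which may differ in form from the patent's own formula, is given below; l denotes the number of local populations and n_t the total number of data.

```latex
\begin{aligned}
n_t &= \sum_{i=1}^{l} n_i, \qquad
\mu_t = \frac{1}{n_t}\sum_{i=1}^{l} n_i \mu_i, \\
\sum_{x \in P_i} (x - \mu_t)^2
  &= \sum_{x \in P_i} (x - \mu_i)^2 + n_i(\mu_i - \mu_t)^2
   = n_i\bigl[\sigma_i^2 + (\mu_i - \mu_t)^2\bigr], \\
\sigma_t &= \sqrt{\frac{1}{n_t}\sum_{i=1}^{l} n_i\bigl[\sigma_i^2 + (\mu_i - \mu_t)^2\bigr]}.
\end{aligned}
```

The second line holds because the deviations of the values in P_i from their own mean μ_i sum to zero, so only the within-population spread and the offset of each local mean from the global mean remain.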

Beneficial effects of the present invention:

With the distributed computing method of standard deviation of the present invention, the standard deviation of the whole population can be calculated knowing only the mean, standard deviation and number of data of each local data set. The method significantly reduces the amount of computation, and because the data held in each distributed store need not be read frequently, a large amount of query and access time is saved and practical computational efficiency is considerably improved.

Brief description of the drawings

Fig. 1 is a flow chart of the distributed computing method of standard deviation of the present invention;

Fig. 2 is a diagram of the computation model of the distributed computing method of standard deviation of the present invention.

Detailed description of the embodiments

The invention will be further described with reference to the accompanying drawings and examples.

The distributed computing method of standard deviation of this embodiment comprises the following steps:

1) input each local population P_i;

2) calculate the mean μ_i, the standard deviation σ_i and the number of data n_i of each local population P_i;

3) calculate the mean of the whole input population according to the formula;

4) calculate the standard deviation of the whole input population using the formula. The standard deviation σ_i of each local population can be obtained with the traditional calculation method or the incremental calculation method of standard deviation described in the background art.

In the following, a concrete example is used to compare the distributed computing method of standard deviation with the traditional calculation method, the iterative calculation method and the incremental calculation method of standard deviation in terms of computational complexity, in order to demonstrate the superiority of the distributed computing method of standard deviation of the present invention.

First, input each local population:

Local population P₁: {4,16,14,13,16,-7,-3,16,10,-19,1,-6,9,-4,17,12,3,8,18,9}

Local population P₂: {-3,-12,3,4,7,13,-15,16,-15,19}

Local population P₃: {-18,-7,17,-18,-6,-13,-2,-18,-2,-12,10,0,10,9,20}

Calculate the mean μ_i, standard deviation σ_i and number of data n_i of each input local population P_i:

The means μ_i of the local populations P_i are: μ₁ = 6.35, μ₂ = 1.7, μ₃ = -2

The standard deviations σ_i of the local populations P_i are: σ₁ = 9.763580286, σ₂ = 11.97539143, σ₃ = 12.35043859

The numbers of data n_i of the local populations P_i are: n₁ = 20, n₂ = 10, n₃ = 15.
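The local summaries listed above can be checked with a short script. This is a sketch using Python's standard `statistics` module; the helper name `summarize` is an assumption, and the standard deviation is taken as the population standard deviation (denominator n), which is what reproduces the listed values.

```python
import statistics

P1 = [4, 16, 14, 13, 16, -7, -3, 16, 10, -19,
      1, -6, 9, -4, 17, 12, 3, 8, 18, 9]
P2 = [-3, -12, 3, 4, 7, 13, -15, 16, -15, 19]
P3 = [-18, -7, 17, -18, -6, -13, -2, -18, -2, -12, 10, 0, 10, 9, 20]

def summarize(values):
    """Return (mean, population standard deviation, count) of one local population."""
    return statistics.fmean(values), statistics.pstdev(values), len(values)

for name, part in (("P1", P1), ("P2", P2), ("P3", P3)):
    mu, sigma, n = summarize(part)
    print(f"{name}: mean={mu:.4f} std={sigma:.8f} n={n}")
```

Each local population is read once here; the later comparisons differ only in how these three summaries, or the raw values, are combined.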

Comparison one: calculating the standard deviation of the population formed by local populations P₁, P₂ and P₃ with the traditional calculation method of standard deviation

The total number of data in the population is n_t = n₁ + n₂ + n₃ = 45; this step includes 2 additions.

Calculate the mean of the population:

This step includes 44 additions and 1 division.

Calculate the standard deviation of the population:

This step requires 45 multiplications (or squarings), 1 division, 44 additions, 45 subtractions and 1 square-root operation.

Thus, calculating the standard deviation with the traditional method requires a total of 45 multiplications, 2 divisions, 90 additions, 45 subtractions and 1 square-root operation.
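The operation counts above can be reproduced by instrumenting the definitional computation. The sketch below is an illustration under the counting convention of the text (one operation per arithmetic step); the function name and the tally keys are assumptions.

```python
import math

def traditional_std_with_counts(parts):
    """Population std of the union of `parts`, tallying each arithmetic operation."""
    ops = {"add": 0, "sub": 0, "mul": 0, "div": 0, "sqrt": 0}

    # Total count n_t = n_1 + n_2 + ...: one addition per extra part.
    n_t = len(parts[0])
    for part in parts[1:]:
        n_t += len(part)
        ops["add"] += 1

    data = [x for part in parts for x in part]

    # Mean: n_t - 1 additions and 1 division.
    total = data[0]
    for x in data[1:]:
        total += x
        ops["add"] += 1
    mean = total / n_t
    ops["div"] += 1

    # Squared deviations: n_t subtractions, n_t multiplications, n_t - 1 additions.
    sq_sum = 0.0
    for i, x in enumerate(data):
        d = x - mean
        ops["sub"] += 1
        sq = d * d
        ops["mul"] += 1
        if i == 0:
            sq_sum = sq
        else:
            sq_sum += sq
            ops["add"] += 1

    # Final division and square root.
    std = math.sqrt(sq_sum / n_t)
    ops["div"] += 1
    ops["sqrt"] += 1
    return std, ops

P1 = [4, 16, 14, 13, 16, -7, -3, 16, 10, -19,
      1, -6, 9, -4, 17, 12, 3, 8, 18, 9]
P2 = [-3, -12, 3, 4, 7, 13, -15, 16, -15, 19]
P3 = [-18, -7, 17, -18, -6, -13, -2, -18, -2, -12, 10, 0, 10, 9, 20]

std, ops = traditional_std_with_counts([P1, P2, P3])
print(round(std, 6))  # 11.771151
print(ops)            # {'add': 90, 'sub': 45, 'mul': 45, 'div': 2, 'sqrt': 1}
```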

Comparison two: calculating the standard deviation of the population formed by local populations P₁, P₂ and P₃ with the iterative calculation method of standard deviation

The standard deviation of local population P₁ is known, σ₁ = 9.763580286. Let the length of the data window be 20, the length of the data block containing the most data.

Calculate the sum of the first 20 values according to formula (4):

$$X_{20} = \sum_{i=1}^{20} x_i = 127$$

This step includes 19 additions.

According to formula (6), calculate the sum of the last n values after the 21st datum is added:

$$X_{21} = X_{20} + x_{21} - x_1 = 127 + (-3) - 4 = 120$$

This step includes 1 addition and 1 subtraction.

Calculate the standard deviation of the population according to formula (8):

This step includes a total of 3 multiplications (or squarings), 2 divisions, 3 additions, 2 subtractions and 1 square root.

When 1 new data value is added, the iterative method requires a total of 3 multiplications, 2 divisions, 23 additions, 3 subtractions and 1 square-root operation.

Processing all the subsequently arriving data in turn by the above steps gives a population standard deviation of σ = 13.37310734.

Given the mean and standard deviation of local population P₁, calculating the standard deviation of the population with the iterative method requires a total of 75 multiplications, 50 divisions, 594 additions, 75 subtractions and 25 square-root operations.

The time complexity of this algorithm depends on the amount of data in the data blocks and is O(n - a), where n is the number of data in the whole population and the constant a is the number of data in the first data block. When only one datum is added, this algorithm has an advantage over the traditional method, but when the amount of new data is large, the amount of computation grows in proportion to n and can even exceed that of the traditional method. In addition, the result of this method differs from the correct result, so it serves only as an approximate calculation method.

Comparison three: calculating the standard deviation of the population formed by local populations P₁, P₂ and P₃ with the incremental calculation method of standard deviation

For local population P₁, the mean μ₁ = 6.35, the standard deviation σ₁ = 9.763580286 and the number of data n₁ = 20 are known.

According to formula (11), calculate the sum of squared deviations of local population P₁:

This step includes 2 multiplications.

Calculate the mean after the 21st datum is added:

This step requires 2 multiplications, 1 division and 2 additions.

According to formula (12), calculate the sum of squared deviations after the 21st datum is added:

$$S_{21} = S_{20} + ((-3) - \mu_1)((-3) - \mu_{21}) = 1989.809524$$

This step includes 1 multiplication, 1 addition and 2 subtractions.

According to formula (13), calculate:

This step includes 1 division and 1 square root.

When 1 new data value is added, a total of 5 multiplications, 2 divisions, 3 additions, 2 subtractions and 1 square-root operation are required.

Substituting all the subsequently arriving data into the above steps in turn gives a population standard deviation of σ = 11.77115118.

Given the mean and standard deviation of local population P₁, calculating the standard deviation with the incremental method requires a total of 125 multiplications, 50 divisions, 75 additions, 50 subtractions and 25 square roots.

The result calculated by this method has no error relative to the exact result. When a single new datum arrives, the known conditions can be fully used and redundant computation is reduced. It can be seen, however, that as the amount of new data increases, the amount of computation grows proportionally and may exceed that of the traditional calculation, although it remains less than that of the iterative method. The time complexity of the incremental calculation algorithm depends on the amount of data in the population and is O(n - a), where n is the number of data in the whole population and the constant a is the number of data in the first local population.
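The incremental procedure of this comparison can be sketched as follows. The update of the sum of squared deviations follows formula (12); the starting value S = n·σ² and the final step σ = √(S/n) are assumptions consistent with the operation counts above, since formulas (11) and (13) are not reproduced in this text. The function name is illustrative.

```python
import math

def incremental_std(mu, sigma, n, new_values):
    """Extend a known (mean, population std, count) summary one value at a time."""
    s = n * sigma * sigma                     # sum of squared deviations so far
    for x in new_values:
        n += 1
        mu_prev = mu
        mu = mu_prev + (x - mu_prev) / n      # from x_n - mu_{n-1} = n (mu_n - mu_{n-1})
        s += (x - mu_prev) * (x - mu)         # formula (12)
    return mu, math.sqrt(s / n), n

P2 = [-3, -12, 3, 4, 7, 13, -15, 16, -15, 19]
P3 = [-18, -7, 17, -18, -6, -13, -2, -18, -2, -12, 10, 0, 10, 9, 20]

# Start from the known summary of P1 and feed in the 25 remaining values.
mu_t, sigma_t, n_t = incremental_std(6.35, 9.763580286, 20, P2 + P3)
print(round(mu_t, 4), round(sigma_t, 6), n_t)  # 2.5333 11.771151 45
```

The loop runs once per new value, which is the O(n - a) behaviour described above.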

Comparison four: calculating the standard deviation of the population formed by local populations P₁, P₂ and P₃ with the distributed computing method of standard deviation

According to the formula:

$$\mu_t = \frac{n_1\mu_1 + n_2\mu_2 + n_3\mu_3}{n_1 + n_2 + n_3} = \frac{20 \times 6.35 + 10 \times 1.7 + 15 \times (-2)}{45} \approx 2.5333$$

calculate the global mean μ_t; this step includes 3 multiplications, 1 division and 4 additions.

Using the distributed standard deviation algorithm, calculate the global standard deviation:

This step includes 12 multiplications, 9 divisions, 14 additions, 9 subtractions and 1 square root.

Substituting the above data into the distributed computing method of standard deviation requires a total of 15 multiplications, 10 divisions, 18 additions, 9 subtractions and 1 square root.

The result calculated by this algorithm is exact. When the mean and standard deviation of each local population are known, the global standard deviation can be calculated easily, making full use of the known conditions of each data block and greatly improving computational efficiency. The computational complexity of the method is independent of the number of data and depends only on the number of local populations. The time complexity of this algorithm is O(l), where the constant l is the number of local populations.
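As a hedged sketch of the combination step, the function below merges the three local summaries using the standard identity that the global variance is the count-weighted mean of each local variance plus the squared offset of the local mean from the global mean. The patent's own formula is not reproduced in this text and may be written differently, so the operation counts of this sketch need not match those stated above; the function name `combine` is an assumption.

```python
import math

def combine(summaries):
    """Merge (mean, population std, count) triples into global (mean, std, count)."""
    n_t = sum(n for _, _, n in summaries)
    mu_t = sum(n * mu for mu, _, n in summaries) / n_t
    var_t = sum(n * (sigma ** 2 + (mu - mu_t) ** 2)
                for mu, sigma, n in summaries) / n_t
    return mu_t, math.sqrt(var_t), n_t

# Local summaries of P1, P2 and P3 as listed in the example.
summaries = [
    (6.35, 9.763580286, 20),
    (1.7, 11.97539143, 10),
    (-2.0, 12.35043859, 15),
]
mu_t, sigma_t, n_t = combine(summaries)
print(round(mu_t, 4), round(sigma_t, 6), n_t)  # 2.5333 11.771151 45
```

The work depends only on the number of summaries, not on the 45 underlying values, and the result agrees with the exact value 11.77115118 reported for the incremental method.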

As can be seen from the calculation steps required by the various standard deviation calculation methods above, the distributed computing method of standard deviation of the present invention significantly reduces the amount of computation and has a clear advantage. Because the data held in each distributed store need not be read frequently, a large amount of query and access time is saved and practical computational efficiency is considerably improved.
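The totals stated in comparisons one to four can be tabulated side by side. The figures below are copied from the text; the dictionary layout is only an illustration. Note that the iterative and incremental counts assume the summary of P₁ is already known, and the distributed count assumes all three local summaries are known.

```python
# Operation counts for the 45-value example, as stated in comparisons one to four.
counts = {
    "traditional": {"mul": 45, "div": 2, "add": 90, "sub": 45, "sqrt": 1},
    "iterative": {"mul": 75, "div": 50, "add": 594, "sub": 75, "sqrt": 25},
    "incremental": {"mul": 125, "div": 50, "add": 75, "sub": 50, "sqrt": 25},
    "distributed": {"mul": 15, "div": 10, "add": 18, "sub": 9, "sqrt": 1},
}

totals = {name: sum(ops.values()) for name, ops in counts.items()}
for name, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name:<12} {total}")
# distributed  53
# traditional  183
# incremental  325
# iterative    819
```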

The following is an example in which the distributed computing method of standard deviation of this embodiment is used for stock market stability analysis.

The fluctuation of stock prices is the manifestation of stock market risk, so stock market risk analysis amounts to analysing the price fluctuations of the stock market. Volatility represents the uncertainty of future prices, and this uncertainty is generally characterised by the variance or the standard deviation. Table 1 gives stock statistical indicators for China and the United States over selected periods.

Table 1: Shanghai Stock Exchange Index and Standard & Poor's Index

The following can be obtained by calculation:

Expected value of the Shanghai Stock Exchange Index performance

= (1144.08 + 1686.75 + 4328.92 + 2912.42 + 2736.50 + 2795.42 + 2639.19 + 2211.11 + 2182.53 + 2279.74)/10 ≈ 2491.6660

Expected value of the Shanghai Stock Exchange Index volatility ≈ 0.3323

Expected value of the Standard & Poor's Index performance ≈ 1356.2570

Expected value of the Standard & Poor's Index volatility ≈ 0.17118

The standard deviations are then calculated according to formula (12) in the background art:

Standard deviation of the Shanghai Stock Exchange Index performance ≈ 800.5983

Standard deviation of the Shanghai Stock Exchange Index volatility ≈ 0.1032

Standard deviation of the Standard & Poor's Index performance ≈ 267.4948

Standard deviation of the Standard & Poor's Index volatility ≈ 0.0736

Because the standard deviation is an absolute measure, China and the United States cannot be compared directly by standard deviation, whereas coefficients of variation can be compared directly. It can be calculated that:

Coefficient of variation of the Shanghai Stock Exchange Index performance ≈ 800.5983/2491.6660 ≈ 0.3213

Coefficient of variation of the Shanghai Stock Exchange Index volatility ≈ 0.1032/0.3323 ≈ 0.3105

Coefficient of variation of the Standard & Poor's Index performance ≈ 267.4948/1356.2570 ≈ 0.1972

Coefficient of variation of the Standard & Poor's Index volatility ≈ 0.0736/0.17118 ≈ 0.4301
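The coefficient of variation is the standard deviation divided by the mean. As a small sketch, the script below recomputes it for the ten Shanghai index values that appear in the expected-value calculation earlier in this example, assuming the population standard deviation is intended; the variable names are illustrative. The other series of Table 1 are not reproduced in this text, so they are not recomputed.

```python
import statistics

# The ten index values listed in the expected-value calculation of this example.
sse_index = [1144.08, 1686.75, 4328.92, 2912.42, 2736.50,
             2795.42, 2639.19, 2211.11, 2182.53, 2279.74]

mean = statistics.fmean(sse_index)
std = statistics.pstdev(sse_index)   # population standard deviation (denominator n)
cv = std / mean                      # coefficient of variation, dimensionless

print(f"mean={mean:.4f} std={std:.4f} cv={cv:.4f}")
```

Because the coefficient of variation is dimensionless, it allows series with very different price levels to be compared.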

The comparison shows that the coefficient of variation of the Shanghai Stock Exchange Index performance (0.3213) is greater than that of the Standard & Poor's Index performance (0.1972), which indicates that, over the long term, the Chinese stock market is relatively less stable and is not yet a mature stock market.

Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently replaced without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims

1. A distributed computing method of standard deviation, characterized in that it comprises the following steps:

1) input each local population P_i;

2) calculate the mean μ_i, the standard deviation σ_i and the number of data n_i of each local population P_i;

3) calculate the mean of the whole input population according to the formula;

4) calculate the standard deviation of the whole input population using the formula.