CN106407161A - Distributed calculating method of standard deviation - Google Patents
- Publication number
- CN106407161A (application number CN201611032295.6A)
- Authority
- CN
- China
- Prior art keywords
- standard deviation
- overall
- data
- local
- calculate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/04—Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Finance (AREA)
- Mathematical Physics (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Operations Research (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Algebra (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- General Business, Economics & Management (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a distributed method for calculating the standard deviation. The method comprises the following steps: 1) inputting each local population Pi; 2) calculating the mean μi, the standard deviation STD.Pi, and the number of data values ni of each local population Pi; 3) calculating the overall mean of the input data according to a formula; and 4) calculating the overall standard deviation according to a formula. With this method, the overall standard deviation can be calculated from only the mean, standard deviation, and number of data values of each local population. The method markedly reduces the amount of computation, and because the separately stored local populations do not need to be read repeatedly, it saves a large amount of query and access time and greatly improves practical computing efficiency.
Description
Technical field
The present invention relates to the technical field of standard deviation calculation, and in particular to a distributed method for calculating the standard deviation.
Background art
The standard deviation is defined as the square root of the arithmetic mean of the squared deviations of each value in a population from the population mean. In statistics it is commonly used to measure how much a set of values varies and how widely it is dispersed: the larger the standard deviation, the more most values differ from their mean. In physics, for example, the standard deviation of a set of repeated measurements indicates the precision of those measurements. The prior art obtains the standard deviation mainly by the following methods:
First, the sampling calculation of the standard deviation: a sample is drawn from the overall data, and the sample standard deviation is calculated and used in place of the overall standard deviation.
However, sampling introduces sampling bias, and this bias becomes more pronounced in big-data settings.
Second, the traditional calculation of the population standard deviation:
By definition, the standard deviation is the square root of the mean of the squared differences between each data value and the mean, where
the mean $\mu$ is calculated as:
$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$ (1)
and the standard deviation $\sigma$ is calculated as:
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2}$ (2)
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu^2}$ (3)
Formula (3) can be derived from formula (2); the derivation is omitted.
In big-data settings, the traditional method requires a very large amount of computation and is impractical.
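The relationship between the two forms of the population standard deviation can be checked numerically. The sketch below is illustrative only: it assumes formula (2) is the definitional form (mean squared deviation) and formula (3) is the mean-of-squares-minus-squared-mean form, and the function names and the small `sample` list are hypothetical.

```python
import math

def pstd_definition(xs):
    """Population standard deviation from the mean squared deviation."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)

def pstd_shortcut(xs):
    """Population standard deviation from the mean of squares minus the squared mean."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum(x * x for x in xs) / n - mu * mu)

# Hypothetical data, chosen only so the two forms are easy to compare.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(pstd_definition(sample), pstd_shortcut(sample))
```

Both forms read every data value once, which is the cost the later methods try to avoid. The shortcut form can lose precision in floating point when the mean is large relative to the spread, so the definitional form is usually the safer reference.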
Third, the iterative calculation of the standard deviation:
When new data arrive, the traditional method must read all the original values again, together with the new values, to calculate the new standard deviation. To address this problem, an iterative calculation method has been proposed.
Assume a time series of data:
$x_1, x_2, x_3, x_4, \ldots, x_n, x_{n+1}, \ldots$
The value $x_n$ is obtained at time point $n$, and the value $x_{n+1}$ at time point $n+1$. Whenever a new value arrives, the standard deviation of the $n$ values in a time window of length $n$, including the new value, must be calculated.
The key steps are as follows. First, calculate the window sum $X_n = \sum_{i=1}^{n} x_i$ (formula (4)) and the initial sample standard deviation $\mathrm{STD.S}_n$ (formula (5)).
Then, when one value is added, calculate the window sum $X_{n+1}$ and the standard deviation $\mathrm{STD.S}_{n+1}$ iteratively:
$X_{n+1} = X_n + x_{n+1} - x_1$ (6)
Formula (6) iteratively calculates the sum of the data in a window of length $n$, and formula (7) iteratively calculates the standard deviation of the data in a window of length $n$.
Replacing the denominator $(n-1)$ with $n$ gives the iterative calculation of the population standard deviation, formula (8).
For streaming data, the iterative method handles a newly added value simply and conveniently, but when new data arrive, all the data must still be recalculated, which causes redundant computation.
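The window-sum update of formula (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the patented procedure: the function name `sliding_sums` is hypothetical, and the data are the first local population of the worked example later in this description followed by its 21st value, -3.

```python
from collections import deque

def sliding_sums(values, n):
    """Yield the sum of each length-n window, updating it as in formula (6)."""
    window = deque(values[:n])
    total = sum(window)               # initial window sum X_n
    yield total
    for x_new in values[n:]:
        x_old = window.popleft()      # value leaving the window
        window.append(x_new)
        total = total + x_new - x_old # one addition and one subtraction per step
        yield total

first_block = [4, 16, 14, 13, 16, -7, -3, 16, 10, -19,
               1, -6, 9, -4, 17, 12, 3, 8, 18, 9]
sums = list(sliding_sums(first_block + [-3], 20))
print(sums)
```

Each update costs one addition and one subtraction regardless of the window length, which is why the iterative method is cheap for a single new value.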
Fourth, the incremental calculation of the standard deviation:
To reduce the heavy computation of the traditional method, an incremental calculation method has also been proposed.
Starting from formula (1), the method first derives two relations, formulas (9) and (10), of which formula (9) is:
$x_n - \mu_{n-1} = n(\mu_n - \mu_{n-1})$ (9)
Since the sum of squared deviations $S_n$ satisfies:
$S_n = n\sigma_n^2$ (11)
combining formulas (9) and (10) gives:
$S_n = S_{n-1} + (x_n - \mu_{n-1})(x_n - \mu_n)$ (12)
and therefore:
$\sigma_n = \sqrt{S_n / n}$ (13)
The incremental method needs only the standard deviation and variance of the previous overall data and the single new value to calculate the standard deviation of the new overall data. However, for big data held in distributed storage, every value in each of the other local populations must be substituted one by one as a successive increment; the mean and standard deviation already available for each local population cannot be used directly, so the computational efficiency is still low.
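The incremental update can be sketched as follows. The sketch assumes the running state is the count, the mean, and the sum of squared deviations S, that the mean is updated by rearranging formula (9), and that the population standard deviation is recovered as the square root of S/n; the function name `incremental_update` is hypothetical. The starting values are those of the first local population in the worked example later in this description, with -3 as the new value.

```python
import math

def incremental_update(n_prev, mu_prev, s_prev, x_new):
    """Fold one new value into (count, mean, sum of squared deviations)."""
    n = n_prev + 1
    mu = mu_prev + (x_new - mu_prev) / n           # rearranged formula (9)
    s = s_prev + (x_new - mu_prev) * (x_new - mu)  # formula (12)
    sigma = math.sqrt(s / n)                       # population standard deviation
    return n, mu, s, sigma

# State of the first 20 values: n = 20, mean 6.35, S = n * sigma^2.
s20 = 20 * 9.763580286 ** 2
n21, mu21, s21, sigma21 = incremental_update(20, 6.35, s20, -3)
print(n21, mu21, s21, sigma21)
```

The update is exact, but it must run once per new value, so folding in a whole remote data block still costs one pass over that block.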
In summary, the method for the traditional calculations standard deviation according to standard deviation definition needs the deviation from average of each data value
Square calculate, computationally intensive when data volume is a lot, when have new data enter fashionable it is necessary to recalculate overall average and new
Sum of sguares of deviation from mean, therefore there is redundancy in its calculating.Though the incremental calculation method of existing standard deviation is all in the past without access
Input data thus make use of known condition, but if the data bulk inputting afterwards than larger when, will enter afterwards
Each data value carry out incremental computations one by one, then its amount of calculation nor substantially reduced.
Summary of the invention
In view of this, the object of the present invention is to provide a distributed method for calculating the standard deviation that obtains the overall standard deviation from only the mean, standard deviation, and number of data values of each local population, thereby solving the technical problem that existing standard deviation calculation methods require a large amount of computation.
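The distributed method for calculating the standard deviation of the present invention comprises the following steps:
1) input each local population $P_i$;
2) calculate the mean $\mu_i$, the standard deviation $\sigma_i$, and the number of data values $n_i$ of each local population $P_i$;
3) calculate the overall mean of the input according to the formula $\mu_t = \frac{\sum_{i=1}^{l} n_i \mu_i}{\sum_{i=1}^{l} n_i}$, where $l$ is the number of local populations;
4) calculate the overall standard deviation of the input from the $n_i$, $\mu_i$, $\sigma_i$, and the overall mean, using the distributed standard deviation formula.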
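Why the three summaries of each local population suffice can be seen from a short derivation. The sketch below assumes population (divide-by-$n$) standard deviations and writes $l$ for the number of local populations and $n_t$, $\mu_t$, $\sigma_t$ for the overall count, mean, and standard deviation. It gives one algebraically valid combination and is not claimed to be the exact expression used in step 4).

```latex
% Split each local sum of squared deviations about the overall mean \mu_t:
\sum_{x \in P_i} (x - \mu_t)^2
  = \sum_{x \in P_i} (x - \mu_i)^2 + n_i(\mu_i - \mu_t)^2
  = n_i\sigma_i^2 + n_i(\mu_i - \mu_t)^2 .
% The cross term 2(\mu_i - \mu_t)\sum_{x \in P_i}(x - \mu_i) vanishes.
% Summing over the l local populations and dividing by n_t = \sum_i n_i:
\mu_t = \frac{1}{n_t}\sum_{i=1}^{l} n_i\mu_i , \qquad
\sigma_t^2 = \frac{1}{n_t}\sum_{i=1}^{l} n_i\left[\sigma_i^2 + (\mu_i - \mu_t)^2\right].
```

The right-hand side involves only $n_i$, $\mu_i$, and $\sigma_i$, so no individual data value has to be read again.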
Beneficial effects of the present invention:
With the distributed method for calculating the standard deviation of the present invention, the standard deviation of the overall data can be calculated from only the mean, standard deviation, and number of data values of each local data set. The method markedly reduces the amount of computation, and because the separately stored data do not all need to be read repeatedly, it saves a large amount of query and access time and considerably improves practical computing efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the distributed method for calculating the standard deviation of the present invention;
Fig. 2 is a computation model diagram of the distributed method for calculating the standard deviation of the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and an embodiment.
The distributed method for calculating the standard deviation of this embodiment comprises the following steps:
1) input each local population $P_i$;
2) calculate the mean $\mu_i$, the standard deviation $\sigma_i$, and the number of data values $n_i$ of each local population $P_i$;
3) calculate the overall mean of the input according to the formula $\mu_t = \frac{\sum_{i=1}^{l} n_i \mu_i}{\sum_{i=1}^{l} n_i}$, where $l$ is the number of local populations;
4) calculate the overall standard deviation of the input from the $n_i$, $\mu_i$, $\sigma_i$, and the overall mean, using the distributed standard deviation formula.
The standard deviation $\sigma_i$ of each local population can be obtained with the traditional method or the incremental method described in the background art.
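Steps 1) to 4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each local standard deviation is a population (divide-by-n) value, and the overall variance is taken as the count-weighted average of each local variance plus the squared distance of the local mean from the overall mean, which is one algebraically valid combination and not necessarily the exact expression of step 4). The function name `combine_std` and the tuple layout are hypothetical; the three summaries are those of the local populations in the worked example of this embodiment.

```python
import math

def combine_std(parts):
    """Combine (n_i, mu_i, sigma_i) summaries into (n_t, mu_t, sigma_t).

    Each sigma_i is assumed to be a population (divide-by-n) standard deviation.
    """
    n_t = sum(n for n, _, _ in parts)
    mu_t = sum(n * mu for n, mu, _ in parts) / n_t
    # Spread inside each part plus the spread of the part means around mu_t.
    ss = sum(n * (sigma ** 2 + (mu - mu_t) ** 2) for n, mu, sigma in parts)
    return n_t, mu_t, math.sqrt(ss / n_t)

summaries = [
    (20, 6.35, 9.763580286),
    (10, 1.7, 11.97539143),
    (15, -2.0, 12.35043859),
]
n_t, mu_t, sigma_t = combine_std(summaries)
print(n_t, mu_t, sigma_t)
```

The loop runs once per local population, not once per data value, so the cost depends only on how many summaries are combined.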
The following concrete example compares the distributed method with the traditional, iterative, and incremental methods in terms of computational cost, in order to demonstrate the advantage of the distributed method of the present invention.
First input each local population:
Local population $P_1$: {4,16,14,13,16,-7,-3,16,10,-19,1,-6,9,-4,17,12,3,8,18,9}
Local population $P_2$: {-3,-12,3,4,7,13,-15,16,-15,19}
Local population $P_3$: {-18,-7,17,-18,-6,-13,-2,-18,-2,-12,10,0,10,9,20}
Calculate the mean $\mu_i$, standard deviation $\sigma_i$, and number of data values $n_i$ of each input local population $P_i$:
The means $\mu_i$ of the local populations are: $\mu_1 = 6.35$, $\mu_2 = 1.7$, $\mu_3 = -2$
The standard deviations $\sigma_i$ of the local populations are: $\sigma_1 = 9.763580286$, $\sigma_2 = 11.97539143$, $\sigma_3 = 12.35043859$
The numbers of data values $n_i$ of the local populations are: $n_1 = 20$, $n_2 = 10$, $n_3 = 15$.
Comparison one: calculating the standard deviation of the overall population formed by local populations $P_1$, $P_2$, and $P_3$ with the traditional method
The total number of data values is $n_t = n_1 + n_2 + n_3 = 45$; this step takes 2 additions.
Calculate the overall mean:
$\mu_t = \frac{1}{n_t}\sum_{j=1}^{45} x_j = \frac{114}{45} \approx 2.5333$
This step takes 44 additions and 1 division.
Calculate the overall standard deviation:
$\sigma = \sqrt{\frac{1}{n_t}\sum_{j=1}^{45}(x_j - \mu_t)^2} \approx 11.7712$
This step takes 45 multiplications (squarings), 1 division, 44 additions, 45 subtractions, and 1 square root.
In total, calculating the standard deviation with the traditional method takes 45 multiplications, 2 divisions, 90 additions, 45 subtractions, and 1 square root.
Comparison two: calculating the standard deviation of the overall population formed by local populations $P_1$, $P_2$, and $P_3$ with the iterative method
The standard deviation of local population $P_1$ is known, $\sigma_1 = 9.763580286$, and the data window length is set to 20, the length of the data block with the most data.
Calculate the sum of the first 20 values according to formula (4):
$X_{20} = \sum_{j=1}^{20} x_j = 127$
This step takes 19 additions.
According to formula (6), calculate the sum of the last $n$ values after the 21st value is added:
$X_{21} = X_{20} + x_{21} - x_1 = 127 + (-3) - 4 = 120$
This step takes 1 addition and 1 subtraction.
Calculate the overall standard deviation according to formula (8).
This step takes 3 multiplications (squarings), 2 divisions, 3 additions, 2 subtractions, and 1 square root in total.
When 1 value is added, the iterative method takes 3 multiplications, 2 divisions, 23 additions, 3 subtractions, and 1 square root in total.
Processing all the subsequent data in turn by the above steps gives an overall standard deviation of $\sigma = 13.37310734$.
Given the mean and standard deviation of local population $P_1$, calculating the overall standard deviation with the iterative method takes 75 multiplications, 50 divisions, 594 additions, 75 subtractions, and 25 square roots in total.
The time complexity of this algorithm depends on the amount of data in the data blocks and is $O(n-a)$, where $n$ is the number of data values in the overall population and the constant $a$ is the number of data values in the first data block. When only one value is added, this algorithm has an advantage over the traditional method, but when the amount of added data is large, its computation grows in proportion to $n$ and can even exceed that of the traditional method. In addition, the result of this method differs from the correct result, so it serves only as an approximate method.
Comparison three: calculating the standard deviation of the overall population formed by local populations $P_1$, $P_2$, and $P_3$ with the incremental method
The mean $\mu_1 = 6.35$, standard deviation $\sigma_1 = 9.763580286$, and number of data values $n_1 = 20$ of local population $P_1$ are known.
According to formula (11), calculate the sum of squared deviations of local population $P_1$:
$S_{20} = n_1\sigma_1^2 \approx 1906.55$
This step takes 2 multiplications.
Calculate the mean after the 21st value is added:
$\mu_{21} \approx 5.904762$
This step takes 2 multiplications, 1 division, and 2 additions.
According to formula (12), calculate the sum of squared deviations after the 21st value is added:
$S_{21} = S_{20} + ((-3) - \mu_1)((-3) - \mu_{21}) = 1989.809524$
This step takes 1 multiplication, 1 addition, and 2 subtractions.
According to formula (13), calculate:
$\sigma_{21} = \sqrt{S_{21}/21} \approx 9.7341$
This step takes 1 division and 1 square root.
When 1 value is added, the method takes 5 multiplications, 2 divisions, 3 additions, 2 subtractions, and 1 square root in total.
Substituting all the subsequent data into the above steps in turn gives an overall standard deviation of $\sigma = 11.77115118$.
Given the mean and standard deviation of local population $P_1$, calculating the standard deviation with the incremental method takes 125 multiplications, 50 divisions, 75 additions, 50 subtractions, and 25 square roots in total.
The result of this method agrees with the exact result without error. When a single new value arrives, the method makes full use of the known conditions and reduces redundant computation. However, as the amount of added data increases, the computation grows by the same multiple; it may exceed the computation of the traditional method, although it is less than that of the iterative method. The time complexity of the incremental algorithm depends on the amount of data in the overall population and is $O(n-a)$, where $n$ is the number of data values in the overall population and the constant $a$ is the number of data values in the first local population.
Comparison four: calculating the standard deviation of the overall population formed by local populations $P_1$, $P_2$, and $P_3$ with the distributed method
According to the formula:
$\mu_t = \frac{n_1\mu_1 + n_2\mu_2 + n_3\mu_3}{n_1 + n_2 + n_3} = \frac{20 \times 6.35 + 10 \times 1.7 + 15 \times (-2)}{45} \approx 2.5333$
calculate the overall mean $\mu_t$. This step takes 3 multiplications, 1 division, and 4 additions.
Using the distributed standard deviation formula, calculate the overall standard deviation.
This step takes 12 multiplications, 9 divisions, 14 additions, 9 subtractions, and 1 square root.
Applying the distributed method to the above data therefore takes 15 multiplications, 10 divisions, 18 additions, 9 subtractions, and 1 square root in total.
The result of this algorithm is exact. When the mean and standard deviation of each local population are known, the overall standard deviation can be calculated easily; the known conditions of each data block are fully used, so the computational efficiency is greatly improved. The computational cost of the method is independent of the number of data values and depends only on the number of local populations. The time complexity of this algorithm is $O(l)$, where the constant $l$ is the number of local populations.
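The exactness claim can be checked end to end with the three local populations of this example. The sketch below is illustrative and rests on assumptions: `summarize` and `pooled_pstdev` are hypothetical names, the standard library's `statistics.pstdev` stands in for the per-node calculation, and the combination used is the count-weighted form (local variance plus squared offset of the local mean), which may differ in layout from the patent's own expression.

```python
import math
import statistics

P1 = [4, 16, 14, 13, 16, -7, -3, 16, 10, -19, 1, -6, 9, -4, 17, 12, 3, 8, 18, 9]
P2 = [-3, -12, 3, 4, 7, 13, -15, 16, -15, 19]
P3 = [-18, -7, 17, -18, -6, -13, -2, -18, -2, -12, 10, 0, 10, 9, 20]

def summarize(data):
    """Per-node work: count, mean, and population standard deviation."""
    return len(data), statistics.fmean(data), statistics.pstdev(data)

def pooled_pstdev(summaries):
    """Coordinator work: reads one (n, mu, sigma) triple per local population."""
    n_t = sum(n for n, _, _ in summaries)
    mu_t = sum(n * mu for n, mu, _ in summaries) / n_t
    ss = sum(n * (s * s + (mu - mu_t) ** 2) for n, mu, s in summaries)
    return math.sqrt(ss / n_t)

distributed = pooled_pstdev([summarize(p) for p in (P1, P2, P3)])
direct = statistics.pstdev(P1 + P2 + P3)  # reads all 45 values
print(distributed, direct)
```

Only three triples cross from the storage nodes to the coordinator, while the direct calculation needs all 45 values in one place.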
As the calculation steps required by the above standard deviation methods show, the distributed method of the present invention markedly reduces the amount of computation and has a clear advantage. Because the separately stored data do not all need to be read repeatedly, it also saves a large amount of query and access time and considerably improves practical computing efficiency.
An example of applying the distributed method of this embodiment to stock market stability analysis is given below.
Fluctuation of stock prices is the manifestation of stock market risk, so analysing stock market risk means analysing stock price fluctuation. Volatility represents the uncertainty of future prices, and this uncertainty is usually described by the variance or the standard deviation. Table 1 gives stock index statistics for China and the United States over selected periods.
Table 1: SSE Composite Index and Standard & Poor's index
Calculation gives:
SSE Composite performance expected value
= (1144.08+1686.75+4328.92+2912.42+2736.50+2795.42+2639.19+2211.11+2182.53+2279.74)/10 ≈ 2491.6660
SSE Composite volatility expected value ≈ 0.3323
Standard & Poor's performance expected value ≈ 1356.2570
Standard & Poor's volatility expected value ≈ 0.17118
The standard deviations are then calculated according to formula (12) in the background art:
SSE Composite performance standard deviation ≈ 800.5983
SSE Composite volatility standard deviation ≈ 0.1032
Standard & Poor's performance standard deviation ≈ 267.4948
Standard & Poor's volatility standard deviation ≈ 0.0736
Because the standard deviation is an absolute quantity, the Chinese and US markets cannot be compared directly by standard deviation, whereas the coefficient of variation can be compared directly. Calculation gives:
SSE Composite performance coefficient of variation ≈ 800.5983/2491.6660 ≈ 0.3213
SSE Composite volatility coefficient of variation ≈ 0.1032/0.3323 ≈ 0.3105
Standard & Poor's performance coefficient of variation ≈ 267.4948/1356.2570 ≈ 0.1972
Standard & Poor's volatility coefficient of variation ≈ 0.0736/0.17118 ≈ 0.4301
The comparison shows that the SSE Composite volatility coefficient of variation is greater than the Standard & Poor's volatility coefficient of variation, which indicates that over the long term the Chinese stock market is relatively less stable, that is, it is not yet a mature stock market.
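The coefficient of variation used in this comparison is the standard deviation divided by the mean. The short sketch below recomputes the four ratios from the rounded figures quoted above; the function name and dictionary keys are hypothetical, and because the inputs are already rounded, the last digit of a ratio can differ slightly from the printed value.

```python
def coefficient_of_variation(std, mean):
    """Ratio of standard deviation to mean (dimensionless)."""
    return std / mean

# (standard deviation, expected value) pairs quoted in the text.
figures = {
    "SSE performance": (800.5983, 2491.6660),
    "SSE volatility": (0.1032, 0.3323),
    "S&P performance": (267.4948, 1356.2570),
    "S&P volatility": (0.0736, 0.17118),
}
cv = {name: coefficient_of_variation(s, m) for name, (s, m) in figures.items()}
for name, value in cv.items():
    print(f"{name}: {value:.4f}")
```

Because the ratio is dimensionless, index levels and volatilities measured on different scales can be set side by side.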
Finally, it should be noted that the above embodiment is intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.
Claims (1)
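1. A distributed method for calculating the standard deviation, characterized in that it comprises the following steps:
1) input each local population $P_i$;
2) calculate the mean $\mu_i$, the standard deviation $\sigma_i$, and the number of data values $n_i$ of each local population $P_i$;
3) calculate the overall mean of the input according to the formula $\mu_t = \frac{\sum_{i=1}^{l} n_i \mu_i}{\sum_{i=1}^{l} n_i}$, where $l$ is the number of local populations;
4) calculate the overall standard deviation of the input from the $n_i$, $\mu_i$, $\sigma_i$, and the overall mean, using the distributed standard deviation formula.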
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611032295.6A CN106407161A (en) | 2016-11-22 | 2016-11-22 | Distributed calculating method of standard deviation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611032295.6A CN106407161A (en) | 2016-11-22 | 2016-11-22 | Distributed calculating method of standard deviation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106407161A true CN106407161A (en) | 2017-02-15 |
Family
ID=58082769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611032295.6A Pending CN106407161A (en) | 2016-11-22 | 2016-11-22 | Distributed calculating method of standard deviation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106407161A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109100264A (en) * | 2018-10-22 | 2018-12-28 | 云南中烟工业有限责任公司 | A kind of method of quick predict ramuscule cigarette smoking uniformity |
CN109341544A (en) * | 2018-11-15 | 2019-02-15 | 上海航天精密机械研究所 | A kind of laser displacement sensor ranging numerical optimization |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103914870A (en) * | 2014-02-28 | 2014-07-09 | 天津工业大学 | High-universality automatic hologram reestablishing method based on new focus evaluation function |
CN104636318A (en) * | 2015-02-15 | 2015-05-20 | 杭州邦盛金融信息技术有限公司 | Distributed or increment calculation method of big data variance and standard deviation |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109100264A (en) * | 2018-10-22 | 2018-12-28 | 云南中烟工业有限责任公司 | A kind of method of quick predict ramuscule cigarette smoking uniformity |
CN109100264B (en) * | 2018-10-22 | 2020-11-17 | 云南中烟工业有限责任公司 | Method for rapidly predicting fine cigarette smoking uniformity |
CN109341544A (en) * | 2018-11-15 | 2019-02-15 | 上海航天精密机械研究所 | A kind of laser displacement sensor ranging numerical optimization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Laird | Missing data in longitudinal studies | |
Waugh | Inversion of the Leontief matrix by power series | |
Bresler et al. | Unsaturated flow in spatially variable fields: 2. Application of water flow models to various fields | |
Glass | A technique for fitting nonlinear models to biological data | |
Hunter | The computation of key properties of Markov chains via perturbations | |
CN106407161A (en) | Distributed calculating method of standard deviation | |
Sun et al. | Optimal portfolio strategy with cross-correlation matrix composed by DCCA coefficients: Evidence from the Chinese stock market | |
Pham-Gia | Exact distribution of the generalized Wilks’s statistic and applications | |
Hult et al. | On importance sampling with mixtures for random walks with heavy tails | |
CN111124489A (en) | Software function point estimation method based on BP neural network | |
Man et al. | Aggregation effect and forecasting temporal aggregates of long memory processes | |
Feng et al. | Geometric Brownian motion with affine drift and its time-integral | |
Wang | Dimension reduction in partly linear error-in-response models with validation data | |
Davidov et al. | Improving an estimator of Hsieh and Turnbull for the binormal ROC curve | |
Liou | More on the computation of higher-order derivatives of the elementary symmetric functions in the Rasch model | |
CN110019161A (en) | Abnormal data cleaning method based on information entropy theory | |
Ducey et al. | Accounting for bias and uncertainty in nonlinear stand density indices | |
Bapat et al. | On an inflated Unit-Lindley distribution | |
CN111914475A (en) | Bayesian inverse simulation method for accelerating depicting Gaussian hydrogeological parameter field | |
Lee et al. | Optimal weighting systems for direct age‐adjustment of vital rates | |
CN110659768B (en) | Academic influence evaluation and prediction method for data publications | |
Gai et al. | Statistical inference on partial linear additive models with distortion measurement errors | |
Lauder | Direct kernel assessment of diagnostic probabilities | |
Schucany et al. | Jackknifing R-estimators | |
CN115291528B (en) | Model uncertainty grade determination method, device and system and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20170215 |