A distributed or incremental method for computing the variance and standard deviation of big data
Technical field
The present invention relates to a method for computing the variance and standard deviation of very large data sets (too large to hold in memory, or too slow to process) in big-data real-time analysis fields such as real-time financial risk control, real-time credit reporting, and real-time marketing.
Background art
In finance and on the internet, segments such as real-time risk control, real-time credit reporting, and real-time marketing often need to control risk according to fluctuations in transaction amounts, or to determine a customer's credit limit. These scenarios generally require computing the variance and standard deviation of the relevant data dimensions. Traditional solutions based on SQL databases work well enough when the data volume is small: they filter the relevant raw data and then compute the variance and standard deviation on the CPU with the data held in memory. When the fluctuation has to be computed over a large data dimension, such as a particular service code or payment channel, these solutions become too slow or even unusable because of limited machine memory and the sheer volume of data.
Variance is the mean of the squared differences between each data value and the mean of the data, and is denoted by the letter D. In probability theory and mathematical statistics, variance measures how far a random variable deviates from its mathematical expectation (that is, its mean). In many practical problems, studying the deviation between a random variable and its mean is of great importance. For $N$ data values $x_1, x_2, \dots, x_N$ with mean $\bar{x}$, it is computed as follows:
$$D = \frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})^2, \qquad \bar{x} = \frac{1}{N}\sum_{k=1}^{N}x_k$$
Standard deviation is often also called mean square deviation in Chinese usage, but it differs from the mean squared error. The mean squared error is the mean of the squared distances between each data value and the true value, that is, the mean of the squared errors; its formula is close in form to that of variance, and its square root, the root-mean-square error, is close in form to the standard deviation. The standard deviation is the square root of the mean of the squared deviations from the mean and is denoted by σ; it is the arithmetic square root of the variance. It reflects the dispersion of a data set: two data sets with the same mean may have different standard deviations. For $N$ data values $x_1, x_2, \dots, x_N$ with mean $\bar{x}$, it is computed as follows:
$$\sigma = \sqrt{D} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})^2}$$
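As a point of reference, the two definitions above can be written directly in code when the whole data set fits in memory. The following is a minimal Python sketch; the function names `variance` and `std_dev` and the sample amounts are illustrative assumptions, and it uses the population form (dividing by the number of values).

```python
import math

def variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n

def std_dev(values):
    """Standard deviation: arithmetic square root of the variance."""
    return math.sqrt(variance(values))

amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample values
print(variance(amounts))  # 4.0
print(std_dev(amounts))   # 2.0
```

Both functions need the full list in memory, which is the limitation the rest of this document addresses.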
When memory is large enough, or the data volume is small, variance and standard deviation are simple to compute. When the data set is too large to hold in memory, however, the raw data set must be split into subsets, the variance or standard deviation of each subset computed separately, and the results merged at the end. The present invention is such a method.
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing a distributed or incremental method for computing the variance and standard deviation of big data.
This object is achieved through the following technical solution: a distributed or incremental method for computing the variance and standard deviation of big data, comprising the following steps:
(1) Collecting transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental fetching, m transaction records at a time, and the resulting set of transaction records is $X_i = \{x_1, x_2, \dots, x_m\}$, where i is the sequence number of the collection;
(2) Defining intermediate variables C, S, V, and D, where C is the number of data values in a set, S is their sum, V is their mean, and D is their variance. For any subset $X_i$, $C_i$ is $m$, $S_i$ is $\sum_{k=1}^{m} x_k$, $V_i$ is $S_i/m$, and $D_i$ is $\frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeating steps 1 and 2 to obtain the intermediate results of n transaction-record subsets $X_1, X_2, \dots, X_i, \dots, X_j, \dots, X_n$. Suppose the intermediate variables of set $X_i$ are $C_i, S_i, V_i, D_i$ and those of set $X_j$ are $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula, which gives the variance $D_{ij}$ of the merged set $X_{ij}$:
$$D_{ij} = \frac{C_i\left[D_i + (V_i - V_{ij})^2\right] + C_j\left[D_j + (V_j - V_{ij})^2\right]}{C_{ij}}$$
where $V_{ij} = (S_i + S_j)/C_{ij}$ is the mean of the data in the set $X_i + X_j$, and $C_{ij} = C_i + C_j$ is the number of data values in the set $X_i + X_j$;
From $D_{ij}$, the standard deviation of the set $X_{ij}$ is obtained as $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeating step 3 until all transaction-record subsets $X_1, X_2, \dots, X_n$ have been merged, which gives the overall variance D of the big data set $X = \{X_1, X_2, \dots, X_n\}$ and its corresponding standard deviation σ;
(5) Detecting transaction fluctuations based on the standard deviation σ: a standard-deviation threshold is set; when σ exceeds the threshold, the fluctuation is considered severe, and a risk-control strategy is invoked to control the risk.
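The merge in step (3) follows from the definition of variance alone. The derivation below is a sketch, assuming the population form of variance (dividing by the count) and the symbols C, S, V, D of step (2); the constant $a$ is introduced here only for the derivation. For any constant $a$, since $\sum_{x \in X_i}(x - V_i) = 0$,

$$\sum_{x \in X_i}(x - a)^2 = \sum_{x \in X_i}(x - V_i)^2 + C_i\,(V_i - a)^2 = C_i\left[D_i + (V_i - a)^2\right].$$

Taking $a = V_{ij}$, the mean of the merged set, and adding the same identity for $X_j$ gives

$$C_{ij}\,D_{ij} = \sum_{x \in X_i}(x - V_{ij})^2 + \sum_{x \in X_j}(x - V_{ij})^2 = C_i\left[D_i + (V_i - V_{ij})^2\right] + C_j\left[D_j + (V_j - V_{ij})^2\right],$$

with $C_{ij} = C_i + C_j$ and $V_{ij} = (S_i + S_j)/C_{ij}$. Dividing by $C_{ij}$ yields the variance of the merged set from the intermediate variables of the two subsets alone, without revisiting the raw data.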
The beneficial effects of the present invention are:
1. The method splits the variance computation for a large set into computing the variance, sum, count, and mean of each subset, and then merges these per-subset variables to obtain the final variance and standard deviation of the large set.
2. It can compute the variance and standard deviation of very large data sets that cannot be held in memory.
3. For a very large data set, the method allows the set to be split into several subsets that are distributed to different machines, each of which computes the above variables for its subsets; one of the machines then merges the results of all subsets. This achieves distributed computation and shortens the time needed to compute the variance and standard deviation of a very large data set.
4. The invention is particularly suited to massive data systems, in which many traditional methods cannot complete the variance and standard deviation computation for big data.
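The distributed use described in point 3 relies on the fact that the per-subset variables can be computed independently and merged afterwards. The sketch below shows one way this could look in Python, under stated assumptions: worker threads from `concurrent.futures` stand in for separate machines, and the names `summarize` and `merge` and the sample data are hypothetical, not part of the method as claimed.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import math

def summarize(chunk):
    """Compute (C, S, V, D) for one subset: count, sum, mean, variance."""
    c = len(chunk)
    s = sum(chunk)
    v = s / c
    d = sum((x - v) ** 2 for x in chunk) / c
    return (c, s, v, d)

def merge(a, b):
    """Merge the (C, S, V, D) tuples of two disjoint subsets."""
    c_i, s_i, v_i, d_i = a
    c_j, s_j, v_j, d_j = b
    c = c_i + c_j
    s = s_i + s_j
    v = s / c
    d = (c_i * (d_i + (v_i - v) ** 2) + c_j * (d_j + (v_j - v) ** 2)) / c
    return (c, s, v, d)

# Hypothetical data, split into subsets that would live on different machines.
data = [float(x % 17) for x in range(1000)]
chunks = [data[k:k + 100] for k in range(0, len(data), 100)]

# Each worker (a thread here, a machine in practice) summarizes its own subset.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize, chunks))

# One node merges all partial results into the overall statistics.
count, total, mean, var = reduce(merge, partials)
sigma = math.sqrt(var)

# Cross-check against a single pass over the full data set.
direct = summarize(data)
assert count == direct[0]
assert math.isclose(var, direct[3], rel_tol=1e-9)
```

Only the four numbers per subset travel between workers and the merging node, so the network cost does not depend on the subset size.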
Embodiment
Distributed or the incremental calculation method of a kind of large data variance standard deviation of the present invention, comprises the steps:
(1) Collecting transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental fetching, m transaction records at a time, and the resulting set of transaction records is $X_i = \{x_1, x_2, \dots, x_m\}$, where i is the sequence number of the collection. Incremental fetching is generally based on an incremental field (such as a primary key, creation time, or last-update time) or on a SQL cursor, a file cursor, or a similar mechanism.
(2) Defining intermediate variables C, S, V, and D, where C is the number of data values in a set, S is their sum, V is their mean, and D is their variance. For any subset $X_i$, $C_i$ is $m$, $S_i$ is $\sum_{k=1}^{m} x_k$, $V_i$ is $S_i/m$, and $D_i$ is $\frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$.
(3) Repeating steps 1 and 2 to obtain the intermediate results of n transaction-record subsets $X_1, X_2, \dots, X_i, \dots, X_j, \dots, X_n$. Suppose the intermediate variables of set $X_i$ are $C_i, S_i, V_i, D_i$ and those of set $X_j$ are $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula, which gives the variance $D_{ij}$ of the merged set $X_{ij}$:
$$D_{ij} = \frac{C_i\left[D_i + (V_i - V_{ij})^2\right] + C_j\left[D_j + (V_j - V_{ij})^2\right]}{C_{ij}}$$
where $V_{ij} = (S_i + S_j)/C_{ij}$ is the mean of the data in the set $X_i + X_j$, and $C_{ij} = C_i + C_j$ is the number of data values in the set $X_i + X_j$;
From $D_{ij}$, the standard deviation of the set $X_{ij}$ is obtained as $\sigma_{ij} = \sqrt{D_{ij}}$.
(4) Repeating step 3 until all transaction-record subsets $X_1, X_2, \dots, X_n$ have been merged, which gives the overall variance D of the big data set $X = \{X_1, X_2, \dots, X_n\}$ and its corresponding standard deviation σ.
(5) Detecting transaction fluctuations based on the standard deviation σ: a standard-deviation threshold is set; when σ exceeds the threshold, the fluctuation is considered severe, and a risk-control strategy is invoked to control the risk. In addition, in the marketing field, if a user's transactions in a certain category of goods fluctuate relatively little over a certain period (the standard deviation is below a certain threshold), it can be judged that the user is interested in that category of goods during that period.
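Steps (1) to (5) can be sketched end to end as follows. This is a minimal illustration in Python, assuming an in-memory list in place of a real SQL, NoSQL, or file source; the names `fetch_batches`, `batch_stats`, `merge`, `running_stats`, and `fluctuation_alert`, the batch size, and the threshold value are all hypothetical.

```python
import math
from itertools import islice

def fetch_batches(source, m):
    """Step 1: yield successive batches of at most m records from an iterable."""
    it = iter(source)
    while True:
        batch = list(islice(it, m))
        if not batch:
            return
        yield batch

def batch_stats(batch):
    """Step 2: intermediate variables C, S, V, D for one batch."""
    c = len(batch)  # equals m except possibly for the last batch
    s = sum(batch)
    v = s / c
    d = sum((x - v) ** 2 for x in batch) / c
    return (c, s, v, d)

def merge(a, b):
    """Step 3: merge the intermediate variables of two batches."""
    c_i, s_i, v_i, d_i = a
    c_j, s_j, v_j, d_j = b
    c = c_i + c_j
    s = s_i + s_j
    v = s / c
    d = (c_i * (d_i + (v_i - v) ** 2) + c_j * (d_j + (v_j - v) ** 2)) / c
    return (c, s, v, d)

def running_stats(source, m):
    """Steps 3 and 4: fold every batch into one running result."""
    acc = None
    for batch in fetch_batches(source, m):
        stats = batch_stats(batch)
        acc = stats if acc is None else merge(acc, stats)
    return acc

def fluctuation_alert(sigma, threshold):
    """Step 5: report whether the standard deviation exceeds the threshold."""
    return sigma > threshold

amounts = [100.0, 102.0, 98.0, 101.0, 99.0, 500.0, 100.0]  # made-up amounts
c, s, v, d = running_stats(amounts, m=3)
sigma = math.sqrt(d)
print(c, sigma, fluctuation_alert(sigma, threshold=50.0))
```

Only one batch and one four-value accumulator are held in memory at a time, so the same loop applies whether the source holds seven records or far more than memory could store.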