CN104636318A

CN104636318A - Distributed or incremental calculation method of big data variance and standard deviation

Info

Publication number: CN104636318A
Application number: CN201510083970.7A
Authority: CN
Inventors: 王新根; 黄滔; 胡时豪
Original assignee: Hangzhou Bangsun Financial Information Technology Ltd
Current assignee: Zhejiang Bangsheng Technology Co.,Ltd.
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2015-05-20
Anticipated expiration: 2035-02-15
Also published as: CN104636318B

Abstract

The invention discloses a distributed or incremental method for calculating the variance and standard deviation of big data. The method splits the variance calculation of a large set into calculating the variance, sum, count and mean of its subsets, and then combines these subset variables to obtain the final variance and standard deviation of the large set. The method has the following advantages: the variance and standard deviation of oversized data sets (which cannot be held in memory) can be processed; an oversized data set can be split into several subsets that are distributed to different machines to calculate the subset variables, with one machine completing the merge of all subsets, so that distributed calculation is achieved and the time needed to calculate the variance and standard deviation of the oversized data set is shortened; and the method is applicable to massive data systems in which most traditional methods cannot complete the variance and standard deviation calculation for big data.

Description

A distributed or incremental method for calculating the variance and standard deviation of big data

Technical field

The present invention relates to methods for calculating the variance and standard deviation of very large data sets (sets that cannot be held in memory or that are too slow to process) in big-data real-time analysis fields such as real-time financial risk control, real-time credit reporting and real-time marketing.

Background technology

In finance and on the internet, specialized fields such as real-time risk control, real-time credit reporting and real-time marketing often need to control risk or determine a customer's credit line according to, for example, fluctuations in turnover. These application scenarios generally involve calculating the variance and standard deviation of the relevant data dimensions. To meet these needs, the traditional technical solution is based on database SQL and works acceptably when the data volume is small: the relevant raw data are filtered out, and the variance and standard deviation are then calculated by the CPU in memory. When fluctuation must be calculated for a large data dimension, such as a particular operation code or payment channel, this solution becomes too slow or even unusable because of limited machine memory and the sheer volume of data.

Variance is the mean of the squared differences between each data value and the mean of the data, and is denoted by the letter D. In probability theory and mathematical statistics, variance measures how far a random variable deviates from its mathematical expectation (that is, its mean). In many practical problems it is important to study the degree of deviation between a random variable and its mean. It is calculated as follows:

D = \frac{1}{n} \sum_{i=1}^{n} (s_{i} - \bar{s})^{2}

Standard deviation is often also called "mean square deviation" in Chinese usage, but it differs from the mean squared error. The mean squared error is the mean of the squared distances between each data value and the true value, that is, the mean of the sum of squared errors; its formula is similar in form to that of variance, and its square root, the root-mean-square error, is similar in form to the standard deviation. The standard deviation is the square root of the mean of the squared deviations from the mean, and is denoted by σ. It is the arithmetic square root of the variance and reflects the degree of dispersion of a data set: two data sets with the same mean may have different standard deviations. It is calculated as follows:

\sigma(r) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_{i} - r)^{2}}

where r is the mean of the N values x_1, x_2, ..., x_N.

When memory is large enough or the data volume is small, variance and standard deviation are simple to calculate. When the data set is especially large (it cannot be held in memory), however, one must consider how to split the raw data set, calculate the variance or standard deviation of each subset separately, and finally merge the results. The present invention is such a processing method.
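For reference, the in-memory case described above can be sketched in a few lines of Python. This is only an illustrative baseline, not part of the claimed method; the function name `population_variance` and the sample amounts are hypothetical. It needs the whole data set in memory and makes two passes over it, which is exactly what becomes impractical for very large sets.

```python
import math
import statistics

def population_variance(values):
    """Two-pass population variance: D = (1/n) * sum((s_i - mean)^2)."""
    n = len(values)
    mean = sum(values) / n                       # first pass: the mean
    return sum((s - mean) ** 2 for s in values) / n  # second pass: squared deviations

# Hypothetical transaction amounts, small enough to sit in memory.
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

variance = population_variance(amounts)      # 4.0
std_dev = math.sqrt(variance)                # 2.0
reference = statistics.pvariance(amounts)    # standard-library cross-check
```

Both passes require every value to be available at once, so this approach only works while the set fits in memory.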

Summary of the invention

The object of the present invention is to address the deficiencies of the prior art by providing a distributed or incremental method for calculating the variance and standard deviation of big data.

The object of the present invention is achieved through the following technical solution: a distributed or incremental method for calculating the variance and standard deviation of big data, comprising the following steps:

(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL databases and file systems. The records are loaded into memory by incremental collection, m records at a time, and the resulting set of transaction records is X_i = {x_1, x_2, ..., x_m}, where i is the sequence number of the collection;

(2) Define intermediate variables C, S, V and D, where C is the number of data items in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset X_i, C_i is m, S_i is the sum x_1 + x_2 + ... + x_m, V_i is S_i/m, and D_i is given by

D_{i} = \frac{1}{m} \sum_{k=1}^{m} (x_{k} - V_{i})^{2};

(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction-record subsets X_1, X_2, ..., X_i, ..., X_j, ..., X_n. Let the intermediate variables of set X_i be C_i, S_i, V_i and D_i, and those of set X_j be C_j, S_j, V_j and D_j. Sets X_i and X_j are merged according to the following formula to obtain the variance D_ij of the merged set X_ij:

D_{ij} = \frac{C_{i} D_{i} + (V_{i} - V_{ij}) [2 S_{i} - C_{i} (V_{i} + V_{ij})] + C_{j} D_{j} + (V_{j} - V_{ij}) [2 S_{j} - C_{j} (V_{j} + V_{ij})]}{C_{ij}}

where V_ij = (S_i + S_j)/(C_i + C_j) is the mean of the data in the merged set X_i + X_j, and C_ij = C_i + C_j is the number of data items in it;

The standard deviation of the set X_ij is then obtained from D_ij:

\sigma_{ij} = \sqrt{D_{ij}}

(4) Continue performing step 3 until all transaction-record subsets X_1, X_2, ..., X_n have been merged, which yields the population variance D of the large data set X = {X_1, X_2, ..., X_n} and its corresponding standard deviation σ;

(5) Detect transaction fluctuation based on the standard deviation σ: set a standard-deviation threshold; when σ exceeds this threshold, the fluctuation is considered severe, and a risk-control strategy is applied to control the risk.
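Steps (2) to (5) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `Stats`, `subset_stats`, `merge` and `is_volatile`, the sample batches and the threshold values are all hypothetical. The merge uses the correction term (V_i - V_ij)[2 S_i - C_i (V_i + V_ij)] for each subset, which equals C_i (V_i - V_ij)^2 because S_i = C_i V_i, so that C_i D_i plus the correction is the sum of squared deviations of X_i about the merged mean.

```python
import math
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Stats:
    c: int    # C: number of data items
    s: float  # S: sum of the data
    v: float  # V: mean of the data
    d: float  # D: population variance of the data

def subset_stats(values):
    """Step (2): intermediate variables of one collected subset."""
    m = len(values)
    s = sum(values)
    v = s / m
    d = sum((x - v) ** 2 for x in values) / m
    return Stats(m, s, v, d)

def merge(a, b):
    """Step (3): merge two disjoint subsets using only C, S, V and D."""
    c = a.c + b.c
    s = a.s + b.s
    v = s / c  # mean of the merged set

    def squares_about_merged_mean(p):
        # C*D is the sum of squares about p's own mean; the second
        # term shifts it to the merged mean v.
        return p.c * p.d + (p.v - v) * (2 * p.s - p.c * (p.v + v))

    d = (squares_about_merged_mean(a) + squares_about_merged_mean(b)) / c
    return Stats(c, s, v, d)

def is_volatile(stats, threshold):
    """Step (5): flag fluctuation when the standard deviation exceeds a threshold."""
    return math.sqrt(stats.d) > threshold

# Step (4): fold the merge over all subsets (hypothetical batches).
batches = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
total = reduce(merge, map(subset_stats, batches))
```

Only the four variables of each subset are kept, so the raw records of a batch can be discarded once `subset_stats` has run. In floating-point arithmetic the algebraically equivalent form C_i (V_i - V_ij)^2 may be preferable, since it avoids subtracting two large, nearly equal quantities.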

The invention has the beneficial effects as follows:

1. The method splits the variance calculation of a large set into calculating the variance, sum, count and mean of its subsets, and obtains the final variance and standard deviation of the large set by merging these subset variables.

2. The variance and standard deviation of very large data sets (which cannot be held in memory) can be processed.

3. A very large data set can be split into several subsets that are distributed to different machines, each of which calculates the above variables for its subset; one of the machines then merges the results of all subsets. Distributed calculation is thus achieved, and the time needed to calculate the variance and standard deviation of the very large data set is shortened.

4. The invention is well suited to massive data systems, in which many traditional methods cannot complete the variance and standard deviation calculation for big data.
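The distributed arrangement in point 3 can be illustrated with the following Python sketch. It is an assumption-laden illustration only: a thread pool stands in for the separate machines, the names `shard_stats` and `combine` and the synthetic data are hypothetical, and the merge uses the simplified form C (D + (V - V_ij)^2) for each subset's contribution, which is algebraically the same sum of squares about the merged mean.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def shard_stats(values):
    """Worker side: return (C, S, V, D) for one subset."""
    c = len(values)
    s = sum(values)
    v = s / c
    d = sum((x - v) ** 2 for x in values) / c
    return c, s, v, d

def combine(a, b):
    """Coordinator side: merge two (C, S, V, D) tuples."""
    (ci, si, vi, di), (cj, sj, vj, dj) = a, b
    c, s = ci + cj, si + sj
    v = s / c
    # Each subset contributes C * (D + (V - v)^2) to the merged sum of squares.
    d = (ci * (di + (vi - v) ** 2) + cj * (dj + (vj - v) ** 2)) / c
    return c, s, v, d

# Synthetic data split into four shards; the threads play the role of machines.
data = [float(x % 17) for x in range(1000)]
shards = [data[k::4] for k in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(shard_stats, shards))

count, total, mean, variance = reduce(combine, partials)
reference = statistics.pvariance(data)  # single-machine cross-check
```

Each worker returns only four numbers, so the data exchanged with the merging machine is independent of the shard size.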

Embodiment

The distributed or incremental method of the present invention for calculating the variance and standard deviation of big data comprises the following steps:

(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL databases and file systems. The records are loaded into memory by incremental collection, m records at a time, and the resulting set of transaction records is X_i = {x_1, x_2, ..., x_m}, where i is the sequence number of the collection. Incremental collection is generally based on an incrementing field (such as a primary key, creation time or last-update time) or on mechanisms such as SQL cursors and file cursors.

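The variance of a subset X_i, with the intermediate variables defined as in step (2) of the summary, is

D_{i} = \frac{1}{m} \sum_{k=1}^{m} (x_{k} - V_{i})^{2};

and the variance of the merged set X_ij, as in step (3) of the summary, is

D_{ij} = \frac{C_{i} D_{i} + (V_{i} - V_{ij}) [2 S_{i} - C_{i} (V_{i} + V_{ij})] + C_{j} D_{j} + (V_{j} - V_{ij}) [2 S_{j} - C_{j} (V_{j} + V_{ij})]}{C_{ij}}

The standard deviation of the set X_ij is then obtained from D_ij:

\sigma_{ij} = \sqrt{D_{ij}}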

(5) Detect transaction fluctuation based on the standard deviation σ: set a standard-deviation threshold; when σ exceeds this threshold, the fluctuation is considered severe, and a risk-control strategy is applied to control the risk. In addition, in the marketing field, information such as a user's transactions in a certain category of goods fluctuating relatively steadily over a certain period (standard deviation below a certain threshold) can be used to judge that the user is interested in that category during the period.
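The incremental variant of the embodiment might look like the following Python sketch, which collects records by an incrementing primary key and folds each batch into a running result. Everything specific here is an assumption made for illustration: the in-memory SQLite database, the table `txn` with columns `id` and `amount`, the batch size of 3, the function names `fetch_increment` and `update`, and the steadiness threshold of 0.5.

```python
import math
import sqlite3

def fetch_increment(conn, last_id, m):
    """Collect up to m records whose primary key is greater than last_id."""
    return conn.execute(
        "SELECT id, amount FROM txn WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, m),
    ).fetchall()

def update(state, values):
    """Fold one newly collected batch into the running (C, S, V, D)."""
    c0, s0, v0, d0 = state
    m = len(values)
    sm = sum(values)
    vm = sm / m
    dm = sum((x - vm) ** 2 for x in values) / m
    c, s = c0 + m, s0 + sm
    v = s / c
    d = (c0 * (d0 + (v0 - v) ** 2) + m * (dm + (vm - v) ** 2)) / c
    return c, s, v, d

# Hypothetical data source: one user's amounts in one category of goods.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO txn (amount) VALUES (?)",
    [(a,) for a in [10.0, 10.5, 9.5, 10.0, 10.2, 9.8, 10.0]],
)

state, last_id = (0, 0.0, 0.0, 0.0), 0   # empty running result
while True:
    rows = fetch_increment(conn, last_id, 3)
    if not rows:
        break
    last_id = rows[-1][0]                 # remember the collection position
    state = update(state, [amount for _, amount in rows])

sigma = math.sqrt(state[3])
is_steady = sigma < 0.5                   # marketing-style "steady" check
```

Because only `state` and `last_id` persist between iterations, the loop can be resumed later when new records arrive, without re-reading earlier ones.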

Claims

1. A distributed or incremental method for calculating the variance and standard deviation of big data, characterized in that it comprises the following steps:

D_{i} = \frac{1}{m} \sum_{k=1}^{m} (x_{k} - V_{i})^{2};

D_{ij} = \frac{C_{i} D_{i} + (V_{i} - V_{ij}) [2 S_{i} - C_{i} (V_{i} + V_{ij})] + C_{j} D_{j} + (V_{j} - V_{ij}) [2 S_{j} - C_{j} (V_{j} + V_{ij})]}{C_{ij}}

The standard deviation of the set X_ij is then obtained from D_ij:

\sigma_{ij} = \sqrt{D_{ij}}