A distributed or incremental method for computing the variance and standard deviation of big data
Technical field
The present invention relates to a method for computing the variance and standard deviation of very large data sets (data sets that cannot be held in memory, or for which processing is too slow) in real-time big data analysis fields such as real-time financial risk control, real-time credit reporting, and real-time marketing.
Background art
In the financial and internet fields, sub-fields such as real-time risk control, real-time credit reporting, and real-time marketing often involve application scenarios such as controlling risk or determining a customer's credit line according to fluctuations in turnover. These scenarios generally require computing the variance and standard deviation of the relevant data dimensions. To meet these needs, the traditional database-based SQL approach works reasonably well for small data volumes: the relevant raw data is typically filtered out, and the variance and standard deviation are then obtained by an in-memory CPU computation. When the fluctuation must be computed over a large data dimension, such as a given business code or a given payment channel, however, this approach becomes too slow or even unusable because of limited machine memory and the excessively large data volume.
Variance is the mean of the squared differences between each data value and the mean of the data, and is denoted by the letter D. In probability theory and mathematical statistics, variance measures the degree of deviation between a random variable and its mathematical expectation (i.e., its mean). In many practical problems, studying the deviation between a random variable and its mean is important. It is computed as follows:

$$D = \frac{1}{N}\sum_{k=1}^{N}(x_k - \mu)^2, \qquad \mu = \frac{1}{N}\sum_{k=1}^{N} x_k$$

where N is the number of data values and $\mu$ is their mean.
The standard deviation is often also called the mean square deviation in Chinese usage, but it differs from the mean squared error. The mean squared error is the mean of the squared distances by which each data value deviates from the true value, i.e., the mean of the sum of squared errors. Its formula is close in form to that of the variance, and it is its square root, the root-mean-square error, that is close in form to the standard deviation. The standard deviation is the square root of the mean of the squared deviations from the mean, and is denoted by $\sigma$. It is the arithmetic square root of the variance. The standard deviation reflects the dispersion of a data set: two data sets with the same mean do not necessarily have the same standard deviation. It is computed as follows:

$$\sigma = \sqrt{D} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}(x_k - \mu)^2}$$

where N is the number of data values and $\mu$ is their mean.
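The two definitions above can be illustrated with a minimal in-memory sketch. It assumes the whole data set fits in a Python list, which is exactly the assumption that fails for very large data sets. The function names and the sample figures are hypothetical and chosen only for illustration.

```python
import math


def population_variance(values):
    """Population variance D: mean of the squared deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Standard deviation sigma: arithmetic square root of the variance."""
    return math.sqrt(population_variance(values))


# Hypothetical turnover figures, for illustration only.
amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(population_variance(amounts))  # 4.0
print(standard_deviation(amounts))   # 2.0
```

Both passes over `values` require the full data set to be resident in memory, so this baseline does not scale to the case discussed next.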
When memory is large enough, or the data volume is small, computing the variance and standard deviation is very simple. When the data set is very large (too large to hold in memory), however, one must consider how to split the original data set, compute the variance or standard deviation of each subset separately, and finally merge the results. The present invention is such a processing method.
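A small worked example may help show why per-subset results can be merged at all. It is only an illustration under assumed numbers: the set $\{1,\ldots,6\}$ and its split into two halves are hypothetical, $V_1, V_2$ denote the subset means and $D_1, D_2$ the subset variances. The merge step relies on the standard identity that the sum of squared deviations of a subset about the overall mean $\mu$ equals its size times $(D_i + (V_i - \mu)^2)$.

```latex
\begin{align*}
X   &= \{1,2,3,4,5,6\}, & \mu &= 3.5, & D   &= \tfrac{1}{6}\sum_{x \in X}(x - 3.5)^2 = \tfrac{17.5}{6} = \tfrac{35}{12} \\
X_1 &= \{1,2,3\},       & V_1 &= 2,   & D_1 &= \tfrac{2}{3} \\
X_2 &= \{4,5,6\},       & V_2 &= 5,   & D_2 &= \tfrac{2}{3}
\end{align*}
\[
\frac{3\left(D_1 + (V_1 - \mu)^2\right) + 3\left(D_2 + (V_2 - \mu)^2\right)}{6}
= \frac{3\left(\tfrac{2}{3} + \tfrac{9}{4}\right) + 3\left(\tfrac{2}{3} + \tfrac{9}{4}\right)}{6}
= \frac{35}{12}
\]
```

The merged value matches the variance computed directly over the whole set, even though only the count, mean, and variance of each half were used.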
Summary of the invention
In view of the above deficiencies in the prior art, the object of the present invention is to provide a distributed or incremental method for computing the variance and standard deviation of big data.
The object of the present invention is achieved by the following technical solution: a distributed or incremental method for computing the variance and standard deviation of big data, comprising the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL databases, and file systems. The transaction records are loaded into memory by incremental fetching, m records at a time, and the set of transaction records obtained in one fetch is $X_i = \{x_1, x_2, \ldots, x_m\}$, where i is the sequence number of the fetch;
(2) Define intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data values in the set, V is the mean of the data values in the set, and D is the variance of the data values in the set. Then, for any subset $X_i$, $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
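(3) Repeat steps (1) and (2) to obtain the intermediate results of n transaction record subsets $X_1, X_2, \ldots, X_i, \ldots, X_j, \ldots, X_n$. Suppose the intermediate variables of set $X_i$ are $C_i, S_i, V_i, D_i$ and the intermediate variables of set $X_j$ are $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula to obtain the variance $D_{ij}$ of the merged set $X_{ij}$:

$$D_{ij} = \frac{C_i\left(D_i + (V_i - V_{ij})^2\right) + C_j\left(D_j + (V_j - V_{ij})^2\right)}{C_{ij}}$$

where $V_{ij}$ denotes the mean of the data in set $X_i + X_j$, i.e., $V_{ij} = (S_i + S_j)/C_{ij}$, and $C_{ij}$ denotes the number of data values in set $X_i + X_j$, i.e., $C_{ij} = C_i + C_j$.

The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$ as $\sigma_{ij} = \sqrt{D_{ij}}$;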
(4) Continue to perform step (3) until all transaction record subsets $X_1, X_2, \ldots, X_n$ have been processed, obtaining the overall variance D of the big data set $X = \{X_1, X_2, \ldots, X_n\}$ and its corresponding standard deviation $\sigma$;
(5) detecting of fluctuation is traded based on standard deviation sigma:Established standardses difference limen value, fluctuation is then thought beyond the threshold value
Compare acutely, now introduce air control control strategy and carry out risk control.
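Steps (2) to (5) above can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not the definitive implementation: the function names, the tuple layout (C, S, V, D), the sample batches, and the threshold value are all hypothetical, and the merge uses the standard pairwise formula for population variance expressed with the intermediate variables defined in step (2).

```python
import math
from functools import reduce


def summarize(batch):
    """Step (2): return the intermediate variables (C, S, V, D) of one subset."""
    c = len(batch)
    if c == 0:
        raise ValueError("a subset must contain at least one record")
    s = sum(batch)
    v = s / c
    d = sum((x - v) ** 2 for x in batch) / c
    return (c, s, v, d)


def merge(a, b):
    """Step (3): merge the intermediate variables of two subsets."""
    c_i, s_i, v_i, d_i = a
    c_j, s_j, v_j, d_j = b
    c_ij = c_i + c_j
    s_ij = s_i + s_j
    v_ij = s_ij / c_ij
    d_ij = (c_i * (d_i + (v_i - v_ij) ** 2)
            + c_j * (d_j + (v_j - v_ij) ** 2)) / c_ij
    return (c_ij, s_ij, v_ij, d_ij)


def total_variance_and_sigma(batches):
    """Step (4): fold all subsets into the overall variance D and sigma."""
    _, _, _, d = reduce(merge, (summarize(b) for b in batches))
    return d, math.sqrt(d)


def fluctuation_is_violent(sigma, threshold):
    """Step (5): flag the data when sigma exceeds the configured threshold."""
    return sigma > threshold


# Hypothetical batches; together they hold the values 2, 4, 4, 4, 5, 5, 7, 9.
batches = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
d, sigma = total_variance_and_sigma(batches)
print(d, sigma)  # approximately 4.0 and 2.0
print(fluctuation_is_violent(sigma, threshold=1.5))
```

Only one batch and one four-value summary need to be in memory at any time, which is what makes the incremental mode possible. Because the merged result is itself a (C, S, V, D) tuple, it can be merged again with the next subset without revisiting earlier records.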
The beneficial effects of the invention are as follows:
1. The method of the present invention splits the computation of the variance of a large set into computing the variance, sum, count, and mean of each subset, and then merges these per-subset variables to obtain the final variance and standard deviation of the large set.
2. The variance and standard deviation of very large data sets (which cannot be stored in memory) can be computed.
3. For a very large data set, this method allows the set to be split into several subsets that are distributed to different machines, each of which computes the above variables for its subsets; one of the machines finally performs the merge computation over all subsets. This achieves distributed computation and shortens the time needed to compute the variance and standard deviation of a very large data set.
4. The present invention is better suited to massive data systems, in which many traditional methods cannot complete the variance and standard deviation computation for big data.
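The distributed mode described in effect 3 can be sketched as follows. This is an illustration under stated assumptions: a thread pool stands in for the separate machines, the function names and the `workers` parameter are hypothetical, and the data (the integers 0 to 999 split into four subsets) is synthetic. In a real deployment each worker would only return its small (C, S, V, D) summary to the coordinating machine.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def summarize(subset):
    """Worker side: reduce one subset to its (C, S, V, D) summary."""
    c = len(subset)
    s = sum(subset)
    v = s / c
    d = sum((x - v) ** 2 for x in subset) / c
    return (c, s, v, d)


def merge(a, b):
    """Coordinator side: merge two subset summaries into one."""
    c = a[0] + b[0]
    s = a[1] + b[1]
    v = s / c
    d = (a[0] * (a[3] + (a[2] - v) ** 2)
         + b[0] * (b[3] + (b[2] - v) ** 2)) / c
    return (c, s, v, d)


def distributed_variance(subsets, workers=4):
    """Summarize the subsets concurrently, then merge on the coordinator."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize, subsets))
    return reduce(merge, summaries)


# Synthetic data: 0..999 split into four subsets of 250 values each.
subsets = [list(range(k, k + 250)) for k in range(0, 1000, 250)]
c, s, v, d = distributed_variance(subsets)
print(c, v, d, math.sqrt(d))
```

Mathematically the merge does not depend on the order in which summaries arrive, so the coordinator can combine them as workers finish; in floating point the results may differ by rounding error only.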
Detailed description
The distributed or incremental method for computing the variance and standard deviation of big data according to the present invention comprises the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL databases, and file systems. The transaction records are loaded into memory by incremental fetching, m records at a time, and the set of transaction records obtained in one fetch is $X_i = \{x_1, x_2, \ldots, x_m\}$, where i is the sequence number of the fetch. Incremental fetching is usually based on an incremental field (such as a primary key, creation time, or last update time) or on mechanisms such as SQL cursors or file cursors.
(2) Define intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data values in the set, V is the mean of the data values in the set, and D is the variance of the data values in the set. Then, for any subset $X_i$, $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$.
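(3) Repeat steps (1) and (2) to obtain the intermediate results of n transaction record subsets $X_1, X_2, \ldots, X_i, \ldots, X_j, \ldots, X_n$. Suppose the intermediate variables of set $X_i$ are $C_i, S_i, V_i, D_i$ and the intermediate variables of set $X_j$ are $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula to obtain the variance $D_{ij}$ of the merged set $X_{ij}$:

$$D_{ij} = \frac{C_i\left(D_i + (V_i - V_{ij})^2\right) + C_j\left(D_j + (V_j - V_{ij})^2\right)}{C_{ij}}$$

where $V_{ij}$ denotes the mean of the data in set $X_i + X_j$, i.e., $V_{ij} = (S_i + S_j)/C_{ij}$, and $C_{ij}$ denotes the number of data values in set $X_i + X_j$, i.e., $C_{ij} = C_i + C_j$.

The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$ as $\sigma_{ij} = \sqrt{D_{ij}}$.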
(4) Continue to perform step (3) until all transaction record subsets $X_1, X_2, \ldots, X_n$ have been processed, obtaining the overall variance D of the big data set $X = \{X_1, X_2, \ldots, X_n\}$ and its corresponding standard deviation $\sigma$.
(5) detecting of fluctuation is traded based on standard deviation sigma:Established standardses difference limen value, fluctuation is then thought beyond the threshold value
Compare acutely, now introduce air control control strategy and carry out risk control.In addition, in marketing domain, certain in certain time can be based on
The information such as user's class commodity transaction fluctuation more steady (standard deviation is less than certain threshold value), judge in the user section time for
The commodity class is interested.
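The incremental mode of this embodiment, fetching by an incremental field and folding each fetch into a running result, can be sketched as follows. Everything specific here is an assumption made for illustration: the in-memory SQLite database, the table name `txn`, its columns `id` and `amount`, the batch size, and the value of `STABLE_THRESHOLD`. The primary key plays the role of the incremental field mentioned in step (1).

```python
import math
import sqlite3


def fetch_batches(conn, m):
    """Step (1): yield batches of up to m amounts, advancing a primary-key cursor."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM txn WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, m),
        ).fetchall()
        if not rows:
            return
        last_id = rows[-1][0]  # remember where this fetch stopped
        yield [amount for _, amount in rows]


def running_sigma(batches):
    """Steps (2)-(4): fold each batch into running C, S, D and return sigma."""
    c, s, d = 0, 0.0, 0.0
    for batch in batches:
        c_b = len(batch)
        s_b = sum(batch)
        v_b = s_b / c_b
        d_b = sum((x - v_b) ** 2 for x in batch) / c_b
        if c == 0:
            c, s, d = c_b, s_b, d_b
            continue
        v_old = s / c
        v_new = (s + s_b) / (c + c_b)
        d = (c * (d + (v_old - v_new) ** 2)
             + c_b * (d_b + (v_b - v_new) ** 2)) / (c + c_b)
        c, s = c + c_b, s + s_b
    if c == 0:
        raise ValueError("no records fetched")
    return math.sqrt(d)


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO txn (amount) VALUES (?)",
    [(a,) for a in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]],
)

sigma = running_sigma(fetch_batches(conn, m=3))
STABLE_THRESHOLD = 2.5  # hypothetical marketing threshold
print(sigma, sigma < STABLE_THRESHOLD)
```

The same running result serves both uses in step (5): comparing `sigma` against an upper threshold flags violent fluctuation for risk control, while comparing it against a lower one, as here, indicates steady purchasing behaviour.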