CN104636318B

CN104636318B - Distributed or incremental method for calculating the variance and standard deviation of big data

Info

Publication number: CN104636318B
Application number: CN201510083970.7A
Authority: CN
Inventors: 王新根; 黄滔; 胡时豪
Original assignee: Zhejiang Bang Sheng Technology Co Ltd
Current assignee: Zhejiang Bangsheng Technology Co.,Ltd.
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2017-07-14
Anticipated expiration: 2035-02-15
Also published as: CN104636318A

Abstract

The invention discloses a distributed or incremental method for calculating the variance and standard deviation of big data. The variance calculation for a large set is split into calculating the variance, sum, count, and mean of each subset, and these subset variables are then merged to obtain the final variance and standard deviation of the large set. The invention can handle the variance and standard deviation of very large data sets (sets that cannot be held in memory). For such a set, the method splits the large set into several subsets and distributes them to different machines, each of which calculates the above variables for its subset; one of the machines then merges the results of all subsets. This achieves distributed computation and shortens the time needed to calculate the variance and standard deviation of a very large data set. The invention is better suited to massive data systems, where many traditional methods cannot complete the variance and standard deviation calculation for big data.

Description

A distributed or incremental method for calculating the variance and standard deviation of big data

Technical field

The present invention relates to methods for calculating the variance and standard deviation of very large data sets (sets that cannot be held in memory, or whose processing is too slow) in real-time big data analysis fields such as real-time financial risk control, real-time credit reporting, and real-time marketing.

Background art

In finance and the internet industry, sub-fields such as real-time risk control, real-time credit reporting, and real-time marketing often include application scenarios such as controlling risk or judging a client's credit line from the fluctuation of their turnover. These scenarios generally require calculating the variance and standard deviation of the relevant data dimensions. For small data volumes, the traditional database and SQL approach works without much trouble: the relevant raw data is filtered out, and an in-memory CPU computation then yields the variance and standard deviation. When the fluctuation must be calculated over a large data dimension, such as a particular operation code or a particular payment channel, this approach becomes too slow or even unusable because of machine memory limits and the sheer data volume.

Variance is the average of the squared differences between each data value and the mean, and is denoted by the letter D. In probability theory and mathematical statistics, variance measures the degree of deviation between a random variable and its mathematical expectation (that is, its mean). In many practical problems, studying the deviation between a random variable and its mean is of great significance. For n data values x₁, x₂, …, x_n with mean \bar{x}, it is calculated as follows:

D = \frac{1}{n} \sum_{k=1}^{n} (x_{k} - \bar{x})^{2}

Standard deviation is often also called mean square deviation in Chinese usage, but it differs from mean squared error. Mean squared error is the average of the squared distances of each data value from the true value, that is, the average of the sum of squared errors; its formula is similar in form to variance, and its square root, the root-mean-square error, is similar in form to standard deviation. Standard deviation is the square root of the average of the squared deviations from the mean, and is denoted by σ. It is the arithmetic square root of the variance. Standard deviation reflects the dispersion of a data set: two data sets with the same mean do not necessarily have the same standard deviation. It is calculated as follows:

\sigma = \sqrt{D}
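As a point of reference, the in-memory calculation described above can be sketched in a few lines of Python. This is only an illustrative baseline, assuming the whole data set fits in a list; the function names `population_variance` and `population_std` are hypothetical and do not come from the patent.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean (the quantity D)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std(values):
    """Arithmetic square root of the variance (the quantity sigma)."""
    return math.sqrt(population_variance(values))


# Small sample data set, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = population_variance(data)
sigma = population_std(data)
```

This baseline needs every value in memory and makes two passes over the data, which is the limitation discussed next.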

When memory is large enough, or the data volume is small, calculating variance and standard deviation is simple. When the data set is very large (it cannot be held in memory), however, one must consider how to split the original data set, calculate the variance or standard deviation of each subset separately, and finally merge the results. The present invention is such a processing method.

Summary of the invention

In view of the above deficiencies in the prior art, the object of the present invention is to provide a distributed or incremental method for calculating the variance and standard deviation of big data.

The object of the present invention is achieved through the following technical solution: a distributed or incremental method for calculating the variance and standard deviation of big data, comprising the following steps:

(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental collection, m transaction records at a time, and the set of records obtained is X_i = {x₁, x₂, …, x_m}, where i is the sequence number of the collection;

(2) Define the intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset X_i, C_i = m, S_i = \sum_{k=1}^{m} x_{k}, V_i = S_i / m, and D_i = \frac{1}{m} \sum_{k=1}^{m} (x_{k} - V_{i})^{2};

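(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction record subsets X₁, X₂, …, X_i, …, X_j, …, X_n. Suppose the intermediate variables of set X_i are C_i, S_i, V_i, D_i and those of set X_j are C_j, S_j, V_j, D_j. Sets X_i and X_j are merged, and the variance D_ij of the merged set X_ij is obtained from the following formula:

D_{i j} = \frac{C_{i} D_{i} + (V_{i} - V_{i j}) [2 S_{i} - C_{i} (V_{i} + V_{i j})] + C_{j} D_{j} + (V_{j} - V_{i j}) [2 S_{j} - C_{j} (V_{j} + V_{i j})]}{C_{i j}}

where V_ij is the mean of the data in the set X_i + X_j, that is, V_ij = (S_i + S_j) / C_ij, and C_ij is the number of data values in X_i + X_j, that is, C_ij = C_i + C_j. Because S_i = C_i V_i, each bracketed term reduces to C_i (V_i - V_ij)², which is the correction for shifting the subset mean to the merged mean.

The standard deviation of set X_ij is obtained from D_ij:

\sigma_{i j} = \sqrt{D_{i j}}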

(4) Repeat step 3 until all transaction record subsets X₁, X₂, …, X_n have been merged, which gives the overall variance D of the big data set X = {X₁, X₂, …, X_n} and its corresponding standard deviation σ;

(5) Detection of transaction fluctuation based on the standard deviation σ: a standard deviation threshold is set; when σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is introduced to control the risk.
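Steps (2) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Stats` class and the names `subset_stats`, `merge`, and `exceeds_threshold` are hypothetical, and the merge step uses the term (V - V_ij)[2S - C(V + V_ij)], which is algebraically equal to C(V - V_ij)² because S = C·V. The merged sum S_ij = S_i + S_j is also kept so that merged results can be merged again.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Stats:
    c: int    # C: number of data values in the set
    s: float  # S: sum of the data
    v: float  # V: mean of the data
    d: float  # D: population variance of the data


def subset_stats(xs):
    """Step (2): intermediate variables of one in-memory subset."""
    c = len(xs)
    s = sum(xs)
    v = s / c
    d = sum((x - v) ** 2 for x in xs) / c
    return Stats(c, s, v, d)


def merge(a, b):
    """Step (3): intermediate variables of the union of two subsets."""
    c = a.c + b.c
    s = a.s + b.s
    v = s / c
    # Each shift term equals C * (V - V_ij)**2 because S = C * V.
    shift_a = (a.v - v) * (2 * a.s - a.c * (a.v + v))
    shift_b = (b.v - v) * (2 * b.s - b.c * (b.v + v))
    d = (a.c * a.d + shift_a + b.c * b.d + shift_b) / c
    return Stats(c, s, v, d)


def exceeds_threshold(stats, sigma_threshold):
    """Step (5): flag severe fluctuation when sigma exceeds the threshold."""
    return math.sqrt(stats.d) > sigma_threshold


# Step (4): fold the merge over all subsets (sample batches for illustration).
batches = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
total = reduce(merge, (subset_stats(b) for b in batches))
sigma = math.sqrt(total.d)
```

Only the four intermediate variables of each subset need to be retained, so the raw records of a batch can be discarded as soon as its `Stats` value has been computed.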

The beneficial effects of the invention are as follows:

1. The calculation method of the present invention splits the variance calculation for a large set into calculating the variance, sum, count, and mean of each subset, and merges these subset variables to obtain the final variance and standard deviation of the large set.

2. It can handle the variance and standard deviation of very large data sets (sets that cannot be held in memory).

3. For a very large data set, the method splits the large set into several subsets and distributes them to different machines, each of which calculates the above variables for its subset; one of the machines then merges the results of all subsets. This achieves distributed computation and shortens the time needed to calculate the variance and standard deviation of a very large data set.

4. The present invention is better suited to massive data systems, where many traditional methods cannot complete the variance and standard deviation calculation for big data.
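The distributed arrangement described in effect 3 can be illustrated with a small Python sketch. It is an assumption-laden stand-in rather than the patent's deployment: worker threads from `concurrent.futures.ThreadPoolExecutor` play the role of the separate machines, the names `split`, `subset_stats`, `merge`, and `distributed_variance` are hypothetical, and the merge uses the shift term C(V - V_ij)², which follows from S = C·V.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def subset_stats(xs):
    """Per-worker task: return (C, S, V, D) for one subset."""
    c = len(xs)
    s = sum(xs)
    v = s / c
    d = sum((x - v) ** 2 for x in xs) / c
    return (c, s, v, d)


def merge(a, b):
    """Coordinator task: combine the (C, S, V, D) tuples of two subsets."""
    ca, sa, va, da = a
    cb, sb, vb, db = b
    c = ca + cb
    s = sa + sb
    v = s / c
    d = (ca * da + ca * (va - v) ** 2 + cb * db + cb * (vb - v) ** 2) / c
    return (c, s, v, d)


def split(data, size):
    """Split the large set into consecutive subsets of at most `size` values."""
    return [data[k:k + size] for k in range(0, len(data), size)]


def distributed_variance(data, subset_size, workers=4):
    """Compute subset statistics in parallel, then merge them in one place."""
    subsets = split(data, subset_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(subset_stats, subsets))
    _, _, _, d = reduce(merge, partials)
    return d, math.sqrt(d)


# Illustrative data: the integers 1..100, whose population variance is 833.25.
data = [float(x) for x in range(1, 101)]
variance, sigma = distributed_variance(data, subset_size=16)
```

Because `merge` only needs the four values per subset, the data exchanged between workers and the coordinator stays constant in size regardless of how large each subset is.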

Embodiment

The distributed or incremental method of the present invention for calculating the variance and standard deviation of big data comprises the following steps:

(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental collection, m transaction records at a time, and the set of records obtained is X_i = {x₁, x₂, …, x_m}, where i is the sequence number of the collection. Incremental collection is generally based on an incrementing field (such as a primary key, creation time, or last update time) or on mechanisms such as SQL cursors and file cursors.

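Steps (2) to (4) are carried out as set out in the summary of the invention above, with the standard deviation of each merged set X_ij obtained from D_ij:

\sigma_{i j} = \sqrt{D_{i j}}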

(5) Detection of transaction fluctuation based on the standard deviation σ: a standard deviation threshold is set; when σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is introduced to control the risk. In addition, in the marketing field, information such as a user's transactions in a certain category of goods fluctuating steadily over a certain period (standard deviation below a certain threshold) can be used to judge that the user is interested in that category of goods during that period.
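The incremental mode of this embodiment, collecting m records at a time on an incrementing field and folding each batch into a running result, might look like the sketch below. Everything concrete here is an assumption made for illustration: an in-memory SQLite database stands in for the data source, and the table `transactions`, its columns `id` and `amount`, the threshold value, and all function names are hypothetical.

```python
import math
import sqlite3


def subset_stats(xs):
    """Return (C, S, V, D) for one collected batch."""
    c = len(xs)
    s = sum(xs)
    v = s / c
    return (c, s, v, sum((x - v) ** 2 for x in xs) / c)


def merge(a, b):
    """Combine two (C, S, V, D) tuples into the tuple of their union."""
    ca, sa, va, da = a
    cb, sb, vb, db = b
    c, s = ca + cb, sa + sb
    v = s / c
    d = (ca * da + ca * (va - v) ** 2 + cb * db + cb * (vb - v) ** 2) / c
    return (c, s, v, d)


def fetch_batch(conn, last_id, m):
    """Collect at most m rows whose incrementing key is greater than last_id."""
    return conn.execute(
        "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, m),
    ).fetchall()


def incremental_stats(conn, m):
    """Fold successive batches into one running (C, S, V, D) tuple."""
    running, last_id = None, 0
    while True:
        rows = fetch_batch(conn, last_id, m)
        if not rows:
            return running
        last_id = rows[-1][0]  # advance the cursor on the incrementing field
        batch = subset_stats([amount for _, amount in rows])
        running = batch if running is None else merge(running, batch)


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO transactions (amount) VALUES (?)",
    [(a,) for a in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]],
)
count, total, mean, variance = incremental_stats(conn, m=3)
sigma = math.sqrt(variance)
is_stable = sigma < 2.5  # marketing-style check against an assumed threshold
```

Only one batch of m rows is held in memory at any time, and new records can be incorporated later by resuming from the stored `last_id` and the running tuple.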

Claims

1. A distributed or incremental method for calculating the variance and standard deviation of big data, characterized in that it comprises the following steps:

(1) Collection of transaction records: the transaction records come from multiple data sources, the data sources being SQL databases, NoSQL stores, and file systems; the records are loaded into memory by incremental collection, m transaction records at a time, and the set of records obtained is X_i = {x₁, x₂, …, x_m}, where i is the sequence number of the collection;

(2) Define the intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set; for any subset X_i, C_i = m, S_i = \sum_{k=1}^{m} x_{k}, V_i = S_i / m, and D_i = \frac{1}{m} \sum_{k=1}^{m} (x_{k} - V_{i})^{2};

(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction record subsets X₁, X₂, …, X_i, …, X_j, …, X_n; suppose the intermediate variables of set X_i are C_i, S_i, V_i, D_i and those of set X_j are C_j, S_j, V_j, D_j; sets X_i and X_j are merged, and the variance D_ij of the merged set X_ij is obtained from the following formula:

D_{i j} = \frac{C_{i} D_{i} + (V_{i} - V_{i j}) [2 S_{i} - C_{i} (V_{i} + V_{i j})] + C_{j} D_{j} + (V_{j} - V_{i j}) [2 S_{j} - C_{j} (V_{j} + V_{i j})]}{C_{i j}}

where V_ij is the mean of the data in the set X_i + X_j and C_ij is the number of data values in the set X_i + X_j;

the standard deviation of set X_ij is obtained from D_ij:

\sigma_{i j} = \sqrt{D_{i j}}

(4) Repeat step 3 until all transaction record subsets X₁, X₂, …, X_n have been merged, which gives the overall variance D of the big data set X = {X₁, X₂, …, X_n} and its corresponding standard deviation σ;

(5) Detection of transaction fluctuation based on the standard deviation σ: a standard deviation threshold is set, and when σ exceeds the threshold, a risk control strategy is introduced to control the risk.