CN104636318A - Distributed or increment calculation method of big data variance and standard deviation - Google Patents

Publication number
CN104636318A
Authority
CN
China
Prior art keywords
data
variance
standard deviation
transaction journal
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510083970.7A
Other languages
Chinese (zh)
Other versions
CN104636318B (en)
Inventor
王新根
黄滔
胡时豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Bangsheng Technology Co.,Ltd.
Original Assignee
Hangzhou Bangsun Financial Information Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Bangsun Financial Information Technology Ltd
Priority to CN201510083970.7A
Publication of CN104636318A
Application granted
Publication of CN104636318B
Legal status: Active
Anticipated expiration

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a distributed or incremental method for calculating the variance and standard deviation of big data. The method splits the variance calculation of a large set into calculating the variance, sum, count, and mean of its subsets, and computes the final variance and standard deviation of the large set by merging the variables obtained from the subsets. The method has the following advantages: it can compute the variance and standard deviation of oversized data sets (which cannot be stored in memory); an oversized data set can be split into a number of subsets and distributed to different machines that calculate the subset variables, with the merge calculation over all subsets completed by a single machine, so that distributed calculation is achieved and the time needed to compute the variance and standard deviation of the oversized data set is shortened; and the method is applicable to massive data systems in which most traditional methods cannot complete the big data variance and standard deviation calculation.

Description

A distributed or incremental calculation method for the variance and standard deviation of big data
Technical field
The present invention relates to a method for calculating the variance and standard deviation of very large data sets (sets that cannot be held in memory or would be processed too slowly) in real-time big data analysis fields such as real-time financial risk control, real-time credit investigation, and real-time marketing.
Background art
In finance and on the internet, sub-fields such as real-time risk control, real-time credit investigation, and real-time marketing often need to control risk or determine a customer's credit limit according to, for example, fluctuations in transaction turnover. These application scenarios generally require computing the variance and standard deviation of the relevant data dimensions. Traditional solutions based on SQL databases cope well when the data volume is small: they typically filter out the relevant raw data and then compute the variance and standard deviation in memory on the CPU. When the fluctuation must be computed over a large data dimension, such as a particular operation code or a particular payment channel, these solutions become too slow or even unusable because of machine memory limits and the sheer data volume.
The variance is the mean of the squared differences between each data value and the mean of the data, and is denoted by the letter D. In probability theory and mathematical statistics, the variance measures how far a random variable deviates from its mathematical expectation (that is, its mean). In many practical problems it is important to study the degree of deviation between a random variable and its mean. It is computed as follows:
$$D = \frac{1}{n}\sum_{i=1}^{n}\left(s_i - \bar{s}\right)^2$$
The standard deviation, which in Chinese usage is often also called the mean square deviation, is distinct from the mean squared error. The mean squared error is the mean of the squared distances between each data value and the true value, that is, the mean of the sum of squared errors; its formula is similar in form to that of the variance, and its square root, the root-mean-square error, is similar in form to the standard deviation. The standard deviation is the square root of the mean of the squared deviations from the mean, and is denoted by σ. It is the arithmetic square root of the variance and reflects the dispersion of a data set: two data sets with the same mean may have different standard deviations. It is computed as follows:
$$\sigma(r) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - r\right)^2}$$
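For a data set that fits in memory, the two definitions above translate directly into code. The following is a minimal sketch in Python; the function names `population_variance` and `population_std` and the sample values are illustrative assumptions and do not come from the patent.

```python
import math
from typing import Sequence


def population_variance(values: Sequence[float]) -> float:
    """Mean of the squared deviations from the mean (divides by n, not n - 1)."""
    n = len(values)
    if n == 0:
        raise ValueError("variance of an empty data set is undefined")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def population_std(values: Sequence[float]) -> float:
    """Arithmetic square root of the population variance."""
    return math.sqrt(population_variance(values))


# Illustrative values only: mean 5, squared deviations sum to 32, n = 8.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(population_variance(sample))  # 4.0
print(population_std(sample))       # 2.0
```

Both functions need the whole data set in memory and read it twice (once for the mean, once for the deviations), which is exactly the limitation that motivates splitting the data into subsets.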
When memory is large enough or the data volume is small, computing the variance and standard deviation is straightforward. When the data set is so large that it cannot be held in memory, however, the raw data set must be split, the variance or standard deviation of each subset computed separately, and the results finally merged. The present invention is such a processing method.
Summary of the invention
The object of the present invention is to address the deficiencies of the prior art by providing a distributed or incremental calculation method for the variance and standard deviation of big data.
This object is achieved by the following technical solution: a distributed or incremental calculation method for the variance and standard deviation of big data, comprising the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental collection, m records at a time, and each collected set of transaction records is $X_i = \{x_1, x_2, \ldots, x_m\}$, where i is the sequence number of the collection pass;
(2) Define the intermediate variables C, S, V, and D, where C is the number of data items in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset $X_i$, $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps (1) and (2) to obtain the intermediate results for n transaction-record subsets $X_1, X_2, \ldots, X_i, \ldots, X_j, \ldots, X_n$. Let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula, which gives the variance $D_{ij}$ of the merged set $X_{ij}$:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2 S_i - C_i (V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2 S_j - C_j (V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the merged set $X_i + X_j$ and $C_{ij}$ is the number of data items in it, that is, $C_{ij} = C_i + C_j$ and $V_{ij} = (S_i + S_j)/C_{ij}$;
The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$ as $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step (3) until all transaction-record subsets $X_1, X_2, \ldots, X_n$ have been merged, obtaining the population variance D of the large data set $X = \{X_1, X_2, \ldots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold; if σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is introduced to control the risk.
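The merge in step (3) needs only the four intermediate variables of each subset, never the raw records. The short derivation below, offered as a sketch, shows why. It uses $C_{ij} = C_i + C_j$ and $V_{ij} = (S_i + S_j)/C_{ij}$; the subsets $\{1, 3\}$ and $\{5\}$ in the numeric check are made-up illustrative values.

```latex
\begin{aligned}
\sum_{x \in X_i} (x - V_{ij})^2
  &= \sum_{x \in X_i} \bigl[(x - V_i) + (V_i - V_{ij})\bigr]^2 \\
  &= C_i D_i + 2 (V_i - V_{ij}) \sum_{x \in X_i} (x - V_i) + C_i (V_i - V_{ij})^2 \\
  &= C_i D_i + C_i (V_i - V_{ij})^2
     && \text{since } \textstyle\sum_{x \in X_i} (x - V_i) = 0 \\
  &= C_i D_i + (V_i - V_{ij})\bigl[2 S_i - C_i (V_i + V_{ij})\bigr]
     && \text{since } S_i = C_i V_i \\[6pt]
D_{ij}
  &= \frac{1}{C_{ij}} \Bigl[ \sum_{x \in X_i} (x - V_{ij})^2 + \sum_{x \in X_j} (x - V_{ij})^2 \Bigr] \\
  &= \frac{C_i \bigl[D_i + (V_i - V_{ij})^2\bigr] + C_j \bigl[D_j + (V_j - V_{ij})^2\bigr]}{C_{ij}}
\end{aligned}
```

Numeric check: for $X_i = \{1, 3\}$ we have $C_i = 2$, $S_i = 4$, $V_i = 2$, $D_i = 1$, and for $X_j = \{5\}$ we have $C_j = 1$, $S_j = 5$, $V_j = 5$, $D_j = 0$. Then $C_{ij} = 3$, $V_{ij} = 3$, and $D_{ij} = [2(1 + 1) + 1(0 + 4)]/3 = 8/3$, which matches the variance of $\{1, 3, 5\}$ computed directly: $(4 + 0 + 4)/3 = 8/3$.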
The beneficial effects of the present invention are:
1. The calculation method of the present invention splits the variance calculation of a large set into calculating the variance, sum, count, and mean of its subsets, and then computes the final variance and standard deviation of the large set by merging these subset variables.
2. It can compute the variance and standard deviation of very large data sets (which cannot be stored in memory).
3. A very large data set can be split by this method into several subsets, which are distributed to different machines that compute the above variables for each subset; one of the machines then performs the merge calculation over all subsets. This achieves distributed computation and shortens the time needed to compute the variance and standard deviation of a very large data set.
4. The present invention is particularly suited to massive data systems, in which many traditional methods cannot complete the variance and standard deviation calculation for big data.
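The split-compute-merge pattern described in point 3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: worker threads stand in for separate machines, the names `Stats`, `subset_stats`, `merge`, and `distributed_variance` are hypothetical, every subset is assumed non-empty, and the merge uses the algebraically equivalent form $C_i[D_i + (V_i - V_{ij})^2]$ for each subset's contribution.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, NamedTuple, Sequence


class Stats(NamedTuple):
    count: int       # C: number of data items
    total: float     # S: sum of the data
    mean: float      # V: mean of the data
    variance: float  # D: population variance of the data


def subset_stats(values: Sequence[float]) -> Stats:
    """Compute C, S, V, D for one non-empty subset (the per-worker job)."""
    count = len(values)
    total = sum(values)
    mean = total / count
    variance = sum((x - mean) ** 2 for x in values) / count
    return Stats(count, total, mean, variance)


def merge(a: Stats, b: Stats) -> Stats:
    """Combine the statistics of two subsets without revisiting raw data."""
    count = a.count + b.count
    total = a.total + b.total
    mean = total / count
    squared = (a.count * (a.variance + (a.mean - mean) ** 2)
               + b.count * (b.variance + (b.mean - mean) ** 2))
    return Stats(count, total, mean, squared / count)


def distributed_variance(subsets: List[Sequence[float]], workers: int = 4) -> Stats:
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(subset_stats, subsets))
    # A single coordinator merges all partial results pairwise.
    return reduce(merge, partials)


# Illustrative data: the integers 1..100 split into four subsets of 25.
data = [float(x) for x in range(1, 101)]
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
result = distributed_variance(chunks)
print(result.count, result.mean, result.variance, math.sqrt(result.variance))
```

Only the four numbers in `Stats` travel from each worker to the coordinator, so the network and memory cost of the merge is independent of the subset size.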
Embodiment
The distributed or incremental calculation method for the variance and standard deviation of big data according to the present invention comprises the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental collection, m records at a time, and each collected set of transaction records is $X_i = \{x_1, x_2, \ldots, x_m\}$, where i is the sequence number of the collection pass. Incremental collection is generally based on an incremental field (such as a primary key, creation time, or last update time) or on mechanisms such as SQL cursors and file cursors.
(2) Define the intermediate variables C, S, V, and D, where C is the number of data items in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset $X_i$, $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps (1) and (2) to obtain the intermediate results for n transaction-record subsets $X_1, X_2, \ldots, X_i, \ldots, X_j, \ldots, X_n$. Let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula, which gives the variance $D_{ij}$ of the merged set $X_{ij}$:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2 S_i - C_i (V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2 S_j - C_j (V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the merged set $X_i + X_j$ and $C_{ij}$ is the number of data items in it, that is, $C_{ij} = C_i + C_j$ and $V_{ij} = (S_i + S_j)/C_{ij}$;
The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$ as $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step (3) until all transaction-record subsets $X_1, X_2, \ldots, X_n$ have been merged, obtaining the population variance D of the large data set $X = \{X_1, X_2, \ldots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold; if σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is introduced to control the risk. In addition, in the marketing field, the fact that a user's transactions in a certain commodity category fluctuate relatively little over a certain period (standard deviation below a certain threshold) can be used to judge that the user is interested in that commodity category during that period.
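Steps (1) to (5) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the data source is an in-memory SQLite database, the `transactions` table with its `id` and `amount` columns is a hypothetical schema, incremental collection is done by resuming after the last primary key seen, and the helper names and the threshold value are invented for the example.

```python
import math
import sqlite3
from typing import Iterator, List, Optional, Tuple

State = Tuple[int, float, float, float]  # (C, S, V, D) of the data seen so far


def make_demo_db(amounts: List[float]) -> sqlite3.Connection:
    """Build a hypothetical transaction table; `id` acts as the incremental field."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO transactions (amount) VALUES (?)",
                     [(a,) for a in amounts])
    return conn


def collect_batches(conn: sqlite3.Connection, m: int) -> Iterator[List[float]]:
    """Step 1: load at most m records per pass, resuming after the last id seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM transactions WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, m),
        ).fetchall()
        if not rows:
            return
        last_id = rows[-1][0]
        yield [amount for _, amount in rows]


def fold_batch(state: Optional[State], batch: List[float]) -> State:
    """Steps 2 to 4: compute the batch's C, S, V, D and merge into the running state."""
    c = len(batch)
    s = sum(batch)
    v = s / c
    d = sum((x - v) ** 2 for x in batch) / c
    if state is None:
        return (c, s, v, d)
    c0, s0, v0, d0 = state
    c_ij, s_ij = c0 + c, s0 + s
    v_ij = s_ij / c_ij
    d_ij = (c0 * (d0 + (v0 - v_ij) ** 2) + c * (d + (v - v_ij) ** 2)) / c_ij
    return (c_ij, s_ij, v_ij, d_ij)


def incremental_std(conn: sqlite3.Connection, m: int) -> float:
    state: Optional[State] = None
    for batch in collect_batches(conn, m):
        state = fold_batch(state, batch)
    if state is None:
        raise ValueError("no transaction records found")
    return math.sqrt(state[3])


STD_THRESHOLD = 3.0  # assumed value; a real threshold is a business decision

conn = make_demo_db([10.0, 12.0, 9.0, 11.0, 30.0, 8.0, 10.0])  # made-up amounts
sigma = incremental_std(conn, m=3)
# Step 5: compare the standard deviation with the threshold.
print("severe fluctuation" if sigma > STD_THRESHOLD else "steady")
```

At any moment only one batch of at most m records and the four-number running state are held in memory, which is what allows the method to handle data sets larger than memory.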

Claims (1)

1. A distributed or incremental calculation method for the variance and standard deviation of big data, characterized in that it comprises the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. The records are loaded into memory by incremental collection, m records at a time, and each collected set of transaction records is $X_i = \{x_1, x_2, \ldots, x_m\}$, where i is the sequence number of the collection pass;
(2) Define the intermediate variables C, S, V, and D, where C is the number of data items in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset $X_i$, $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps (1) and (2) to obtain the intermediate results for n transaction-record subsets $X_1, X_2, \ldots, X_i, \ldots, X_j, \ldots, X_n$. Let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged according to the following formula, which gives the variance $D_{ij}$ of the merged set $X_{ij}$:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2 S_i - C_i (V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2 S_j - C_j (V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the merged set $X_i + X_j$ and $C_{ij}$ is the number of data items in it;
The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$ as $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step (3) until all transaction-record subsets $X_1, X_2, \ldots, X_n$ have been merged, obtaining the population variance D of the large data set $X = \{X_1, X_2, \ldots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold; if σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is introduced to control the risk.
CN201510083970.7A 2015-02-15 2015-02-15 Distributed or incremental calculation method for big data variance and standard deviation Active CN104636318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510083970.7A CN104636318B (en) 2015-02-15 2015-02-15 Distributed or incremental calculation method for big data variance and standard deviation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510083970.7A CN104636318B (en) 2015-02-15 2015-02-15 Distributed or incremental calculation method for big data variance and standard deviation

Publications (2)

Publication Number Publication Date
CN104636318A (en) 2015-05-20
CN104636318B (en) 2017-07-14

Family

ID=53215091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510083970.7A Active CN104636318B (en) 2015-02-15 2015-02-15 Distributed or incremental calculation method for big data variance and standard deviation

Country Status (1)

Country Link
CN (1) CN104636318B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407161A (en) * 2016-11-22 2017-02-15 重庆邮电大学 Distributed calculating method of standard deviation
CN107040608A (en) * 2017-05-19 2017-08-11 宁波绮耘软件股份有限公司 A kind of data processing method and system
WO2018058633A1 (en) * 2016-09-30 2018-04-05 深圳市华傲数据技术有限公司 Data processing method and apparatus based on increment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11224366A (en) * 1998-02-05 1999-08-17 Toshiba Corp Operation supporting system for automatic transaction device
US20130290162A1 (en) * 2009-10-14 2013-10-31 Chicago Mercantile Exchange Inc. Leg Pricer
US20130317963A1 (en) * 2012-05-22 2013-11-28 Applied Academics Llc Methods and systems for creating a government bond volatility index and trading derivative products thereon
CN103577681A (en) * 2013-06-26 2014-02-12 长沙理工大学 Factor analysis-based quantitative evaluation method on of boiler efficiency influence indexes
CN104123668A (en) * 2014-03-30 2014-10-29 广州天策软件科技有限公司 Standard quantization parameter based mass data dynamic screening method and application thereof in financial security field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
宋明顺,方兴华,黄佳,张俊亮: "校准和检测中微小样本测量不确定度评定方法研究", 《仪器仪表学报》 *
黄志剑,王积福,向伟: "表象训练对技能学习绩效影响的元分析", 《体育科学》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058633A1 (en) * 2016-09-30 2018-04-05 深圳市华傲数据技术有限公司 Data processing method and apparatus based on increment
CN110050268A (en) * 2016-09-30 2019-07-23 深圳市华傲数据技术有限公司 Data processing method and device based on increment
CN106407161A (en) * 2016-11-22 2017-02-15 重庆邮电大学 Distributed calculating method of standard deviation
CN107040608A (en) * 2017-05-19 2017-08-11 宁波绮耘软件股份有限公司 A kind of data processing method and system

Also Published As

Publication number Publication date
CN104636318B (en) 2017-07-14

Similar Documents

Publication Publication Date Title
Ball et al. Convergence of productivity: an analysis of the catch-up hypothesis within a panel of states
CN108763277B (en) Data analysis method, computer readable storage medium and terminal device
US7925560B2 (en) Systems and methods for valuing a derivative involving a multiplicative index
Li et al. Research and application of random forest model in mining automobile insurance fraud
CN112215398A (en) Power consumer load prediction model establishing method, device, equipment and storage medium
CN104636318A (en) Distributed or increment calculation method of big data variance and standard deviation
Mbaye Determinants of domestic private investments in Kenya
Nurlybayeva et al. Algorithmic scoring models
CN112884363A (en) Nuclear power project economic evaluation probability risk analysis method
CN116843483A (en) Vehicle insurance claim settlement method, device, computer equipment and storage medium
Omar et al. Forecasting Inflation in Egypt (2019-2022) by using AutoRegressive Integrated Moving Average (ARIMA) Models
Levy et al. A note on portfolio selection and investors' wealth
Elbahlawan et al. The economic impact of immigration on host countries: the case of saudi arabia
Xiao et al. Parameter identification for drift fractional brownian motions with application to the chinese stock markets
CN114493035A (en) Enterprise default probability prediction method and device
Viedienieiev et al. Forecasting the selling price of the agricultural products in ukraine using deep learning algorithms
Dong et al. Network evolution analysis of nickel futures and the spot price linkage effect based on a distributed lag model
Britto et al. Optimal investment in energy efficiency as a problem of growth rate maximisation
Bender et al. Entropy-Regularized Mean-Variance Portfolio Optimization with Jumps
Ramasubramanian et al. Sampling and resampling techniques
CN101853474A (en) Value-at-risk metering method in maritime industry
CN117540212A (en) Initialization parameter determining method and device and electronic equipment
CN116051289A (en) Method and system for researching relevance of internal yield of fan
Loi et al. Network interlinkages between artificial intelligence and green energy dynamics during the War in a Pandemic: An application of Bayesian vector heterogeneous autoregressions
Song et al. The Measurement of Operational Risk of China's Commercial Banks under the Background of Internet of Things Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 310012 Zhejiang, Xihu District, World Trade Center, the city of Victoria and the center of the C District, room 607, room 609,

Applicant after: Zhejiang Bang Sheng Technology Co., Ltd.

Address before: 310012 Zhejiang, Xihu District, World Trade Center, the city of Victoria and the center of the C District, room 607, room 609,

Applicant before: HANGZHOU BANGSUN FINANCIAL INFORMATION TECHNOLOGY LTD.

GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder
CP01 Change in the name or title of a patent holder

Address after: 310012 rooms 607 and 609, Zone C, European and American Center, world trade Regent City, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Bangsheng Technology Co.,Ltd.

Address before: 310012 rooms 607 and 609, Zone C, European and American Center, world trade Regent City, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: ZHEJIANG BANGSUN TECHNOLOGY Co.,Ltd.