CN104636318B - A distributed or incremental calculation method for big data variance and standard deviation - Google Patents

A distributed or incremental calculation method for big data variance and standard deviation

Info

Publication number
CN104636318B
Authority
CN
China
Prior art keywords
data
variance
standard deviation
collection
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510083970.7A
Other languages
Chinese (zh)
Other versions
CN104636318A (en)
Inventor
王新根
黄滔
胡时豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Bangsheng Technology Co.,Ltd.
Original Assignee
Zhejiang Bang Sheng Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Bang Sheng Technology Co Ltd filed Critical Zhejiang Bang Sheng Technology Co Ltd
Priority to CN201510083970.7A priority Critical patent/CN104636318B/en
Publication of CN104636318A publication Critical patent/CN104636318A/en
Application granted granted Critical
Publication of CN104636318B publication Critical patent/CN104636318B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a distributed or incremental calculation method for big data variance and standard deviation. The variance calculation for a large set is split into calculating the variance, sum, count, and mean of each subset, and these per-subset variables are then merged to obtain the final variance and standard deviation of the large set. The invention can compute the variance and standard deviation of very large data sets that cannot be held in memory. Such a set can be split into several subsets that are distributed to different machines, each of which calculates the above variables for its subset; one of the machines then merges the results of all subsets. This achieves distributed computation and shortens the time needed to calculate the variance and standard deviation of a very large data set. The invention is particularly suited to massive data systems, in which many traditional methods cannot complete the variance and standard deviation calculation for big data.

Description

A distributed or incremental calculation method for big data variance and standard deviation
Technical Field
The present invention relates to methods for calculating the variance and standard deviation of very large data sets (sets that cannot be held in memory or that are too slow to process) in real-time big data analysis fields such as real-time financial risk control, real-time credit reporting, and real-time marketing.
Background
In finance and on the internet, sub-fields such as real-time risk control, real-time credit reporting, and real-time marketing often involve application scenarios such as controlling risk or setting a customer's credit limit according to fluctuations in transaction volume. These scenarios generally require calculating the variance and standard deviation of the relevant data dimensions. Traditional database-based SQL solutions handle small data volumes without much trouble: the relevant raw data is filtered, and an in-memory CPU calculation then yields the variance and standard deviation. When the fluctuation must be calculated over a large data dimension, such as a given operation code or payment channel, these solutions become too slow or even unusable because of machine memory limits and the sheer volume of data.
Variance is the average of the squared differences between each data value and the mean, and is denoted by the letter D. In probability theory and mathematical statistics, variance measures the degree of deviation between a random variable and its mathematical expectation (the mean). In many practical problems it is important to study the deviation between a random variable and its mean. For $n$ data values $x_1, \dots, x_n$ with mean $\bar{x}$, it is calculated as follows:
$$D = \frac{1}{n}\sum_{k=1}^{n}\left(x_k - \bar{x}\right)^2$$
Standard deviation is often also called "mean square deviation" in Chinese-language usage, but it is different from mean squared error. Mean squared error is the average of the squared distances between each data value and the true value, that is, the mean of the sum of squared errors; its formula is similar in form to that of variance, and its square root, the root-mean-square error, is similar in form to the standard deviation. Standard deviation is the square root of the average of the squared deviations from the mean, and is denoted by σ. It is the arithmetic square root of the variance. Standard deviation reflects the dispersion of a data set: two data sets with the same mean do not necessarily have the same standard deviation. It is calculated as follows:
$$\sigma = \sqrt{D}$$
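As a point of reference, the in-memory calculation described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the population form of variance (division by the count); the function name `variance_and_std` and the sample values are hypothetical.

```python
import math

def variance_and_std(values):
    """Two-pass population variance D and standard deviation sigma.

    Assumes all values fit in memory, which is exactly the limitation
    the background section describes.
    """
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n                          # first pass: the mean
    d = sum((x - mean) ** 2 for x in values) / n    # second pass: mean squared deviation
    return d, math.sqrt(d)

# Hypothetical sample data with mean 5.
d, sigma = variance_and_std([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(d, sigma)
```

Both passes need the full data set, so this approach stops working once the data no longer fits in memory.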
When memory is sufficient or the data volume is small, calculating the variance and standard deviation is very simple. When the data set is too large to be held in memory, however, one must consider how to split the original data set, calculate the variance or standard deviation of the subsets separately, and finally merge the results. The present invention is such a processing method.
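The splitting step can be illustrated with a short Python sketch. It is only an assumption about how the subsets might be formed: the names `batches`, `summarize`, and `Summary`, and the batch size `m=4`, are hypothetical and do not come from the patent. Each subset is reduced to its count, sum, mean, and population variance, so the raw values can be discarded afterwards.

```python
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple

class Summary(NamedTuple):
    c: int      # C: number of values in the subset
    s: float    # S: sum of the values
    v: float    # V: mean of the values
    d: float    # D: population variance of the values

def batches(source: Iterable[float], m: int) -> Iterator[List[float]]:
    """Yield consecutive subsets of at most m values without loading everything."""
    it = iter(source)
    while True:
        chunk = list(islice(it, m))
        if not chunk:
            return
        yield chunk

def summarize(chunk: List[float]) -> Summary:
    """Reduce one in-memory subset to its (C, S, V, D) summary."""
    c = len(chunk)
    s = sum(chunk)
    v = s / c
    d = sum((x - v) ** 2 for x in chunk) / c
    return Summary(c, s, v, d)

# Hypothetical data stream: the integers 1..10 read four at a time.
summaries = [summarize(b) for b in batches(range(1, 11), m=4)]
for summary in summaries:
    print(summary)
```

Only the four numbers per subset have to be kept, which is what makes a later merge across subsets or machines possible.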
Summary of the Invention
In view of the above deficiencies of the prior art, the object of the present invention is to provide a distributed or incremental calculation method for big data variance and standard deviation.
This object is achieved by the following technical solution: a distributed or incremental calculation method for big data variance and standard deviation, comprising the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. Transaction records are loaded into memory by incremental collection, m records at a time; the set of transaction records obtained is $X_i = \{x_1, x_2, \dots, x_m\}$, where i is the sequence number of the collection;
(2) Define intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset $X_i$: $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction record subsets $X_1, X_2, \dots, X_i, \dots, X_j, \dots, X_n$. Let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged, and the variance $D_{ij}$ of the merged set $X_{ij}$ is obtained from the following formula:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2S_i - C_i(V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2S_j - C_j(V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the set $X_i + X_j$ and $C_{ij}$ is the number of data values in the set $X_i + X_j$;
The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$: $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step 3 until all transaction record subsets $X_1, X_2, \dots, X_n$ have been merged, obtaining the overall variance D of the big data set $X = \{X_1, X_2, \dots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold; if σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is then applied.
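The merge in step (3) can be checked with a short derivation. The following is a sketch under stated assumptions: D denotes the population variance (division by the count), $C_{ij} = C_i + C_j$, and $V_{ij} = (S_i + S_j)/C_{ij}$. It expands the squared deviations of the members of $X_i$ about the merged mean $V_{ij}$ using a difference of squares.

```latex
\begin{align*}
\sum_{x \in X_i} (x - V_{ij})^2
  &= \sum_{x \in X_i} (x - V_i)^2
   + \sum_{x \in X_i} \left[(x - V_{ij})^2 - (x - V_i)^2\right] \\
  &= C_i D_i + \sum_{x \in X_i} (V_i - V_{ij})\,(2x - V_i - V_{ij}) \\
  &= C_i D_i + (V_i - V_{ij})\left[2 S_i - C_i (V_i + V_{ij})\right] \\
  &= C_i D_i + C_i (V_i - V_{ij})^2
     \qquad \text{(since } S_i = C_i V_i\text{)}.
\end{align*}
```

Adding the same expression for $X_j$ and dividing by $C_{ij}$ gives the variance of the merged set. The last line is the form usually preferred in floating-point code, because it avoids subtracting two large products.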
The beneficial effects of the invention are as follows:
1st, computational methods of the present invention the variance of big collection is calculated the variance that is split as calculating subset and, count and flat Average, and the above-mentioned variable joint account calculated by subset goes out the final variance and standard deviation of big collection.
2nd, the variance and standard deviation of super large data acquisition system (can not store with internal memory) can be handled.
3rd, for super large data acquisition system, big collection can be split as by some subsets by this method, and be published to difference Machine on calculate the above-mentioned variable of subset, the joint account function of all subsets is finally completed by one of machine;So as to The purpose of Distributed Calculation can be reached, shortens the variance criterion poor calculating time of super large data acquisition system.
4th, the present invention is more suitable for the data system of magnanimity, in mass data system, and many traditional methods can not be complete Variance criterion difference into big data is calculated.
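The distributed use described in the list above can be sketched as follows. This is an illustration under assumptions, not the patented implementation: worker threads stand in for separate machines, the partition contents are hypothetical, and `merge` uses the algebraically equivalent textbook form $C_i(V_i - V_{ij})^2$ for the correction term, with D taken as the population variance.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def summarize(chunk):
    """Per-machine work: reduce a subset to (C, S, V, D)."""
    c = len(chunk)
    s = sum(chunk)
    v = s / c
    d = sum((x - v) ** 2 for x in chunk) / c
    return c, s, v, d

def merge(a, b):
    """Coordinator work: combine two (C, S, V, D) summaries into one."""
    ci, si, vi, di = a
    cj, sj, vj, dj = b
    cij = ci + cj
    sij = si + sj
    vij = sij / cij
    dij = (ci * di + ci * (vi - vij) ** 2
           + cj * dj + cj * (vj - vij) ** 2) / cij
    return cij, sij, vij, dij

# Hypothetical partitions, one per "machine".
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]

# Threads are a stand-in for remote workers in this sketch.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(summarize, partitions))

count, total, mean, variance = reduce(merge, partials)
sigma = math.sqrt(variance)
print(count, mean, variance, sigma)
```

Because each summary is only four numbers, the data sent to the merging machine is tiny compared with the raw subsets, and the pairwise merge can be applied in any order.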
Detailed Description
The distributed or incremental calculation method for big data variance and standard deviation of the present invention comprises the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, such as SQL databases, NoSQL stores, and file systems. Transaction records are loaded into memory by incremental collection, m records at a time; the set of transaction records obtained is $X_i = \{x_1, x_2, \dots, x_m\}$, where i is the sequence number of the collection. Incremental collection is generally based on an incrementing field (such as a primary key, creation time, or last update time) or on mechanisms such as SQL cursors or file cursors.
(2) Define intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set. For any subset $X_i$: $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction record subsets $X_1, X_2, \dots, X_i, \dots, X_j, \dots, X_n$. Let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$. Sets $X_i$ and $X_j$ are merged, and the variance $D_{ij}$ of the merged set $X_{ij}$ is obtained from the following formula:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2S_i - C_i(V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2S_j - C_j(V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the set $X_i + X_j$ and $C_{ij}$ is the number of data values in the set $X_i + X_j$;
The standard deviation of set $X_{ij}$ is obtained from $D_{ij}$: $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step 3 until all transaction record subsets $X_1, X_2, \dots, X_n$ have been merged, obtaining the overall variance D of the big data set $X = \{X_1, X_2, \dots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold; if σ exceeds the threshold, the fluctuation is considered severe, and a risk control strategy is then applied. In addition, in the marketing field, information such as a user's transactions in a given category of goods fluctuating only mildly over a period of time (standard deviation below a certain threshold) can be used to judge that the user is interested in that category during that period.
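The incremental mode of steps (1) to (5) can be sketched as a running accumulator that absorbs one batch at a time and checks the threshold after each merge. Everything specific here is a hypothetical assumption for illustration: the `RunningStats` class, the `fetch_after` helper that imitates collection by an incrementing primary key, the record layout `(id, amount)`, the batch size, and the threshold value of 50.

```python
import math

class RunningStats:
    """Keeps only (C, S, V, D) for all data seen so far."""

    def __init__(self):
        self.c, self.s, self.v, self.d = 0, 0.0, 0.0, 0.0

    def add_batch(self, batch):
        """Merge a newly collected subset into the running summary."""
        if not batch:
            return
        cj = len(batch)
        sj = sum(batch)
        vj = sj / cj
        dj = sum((x - vj) ** 2 for x in batch) / cj
        cij = self.c + cj
        sij = self.s + sj
        vij = sij / cij
        dij = (self.c * self.d + self.c * (self.v - vij) ** 2
               + cj * dj + cj * (vj - vij) ** 2) / cij
        self.c, self.s, self.v, self.d = cij, sij, vij, dij

    @property
    def sigma(self):
        return math.sqrt(self.d)

def fetch_after(records, cursor, m):
    """Hypothetical incremental read: up to m records with id > cursor."""
    return [r for r in records if r[0] > cursor][:m]

# Hypothetical transaction records: (incrementing id, amount).
records = [(1, 100.0), (2, 102.0), (3, 98.0),
           (4, 101.0), (5, 99.0), (6, 400.0)]
THRESHOLD = 50.0   # assumed standard deviation threshold

stats = RunningStats()
cursor = 0
alerts = []
while True:
    batch = fetch_after(records, cursor, m=2)
    if not batch:
        break
    cursor = batch[-1][0]                      # advance the increment cursor
    stats.add_batch([amount for _, amount in batch])
    if stats.sigma > THRESHOLD:                # fluctuation considered severe
        alerts.append(cursor)

print(stats.c, stats.v, stats.sigma, alerts)
```

The same comparison with the inequality reversed would cover the marketing case mentioned above, where a standard deviation below a threshold signals steady interest.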

Claims (1)

1. A distributed or incremental calculation method for big data variance and standard deviation, characterized in that it comprises the following steps:
(1) Collection of transaction records: the transaction records come from multiple data sources, the data sources being SQL databases, NoSQL stores, and file systems; transaction records are loaded into memory by incremental collection, m records at a time, and the set of transaction records obtained is $X_i = \{x_1, x_2, \dots, x_m\}$, where i is the sequence number of the collection;
(2) Define intermediate variables C, S, V, and D, where C is the number of data values in a set, S is the sum of the data in the set, V is the mean of the data in the set, and D is the variance of the data in the set; for any subset $X_i$: $C_i = m$, $S_i = \sum_{k=1}^{m} x_k$, $V_i = S_i / m$, and $D_i = \frac{1}{m}\sum_{k=1}^{m}(x_k - V_i)^2$;
(3) Repeat steps 1 and 2 to obtain the intermediate results of n transaction record subsets $X_1, X_2, \dots, X_i, \dots, X_j, \dots, X_n$; let the intermediate variables of set $X_i$ be $C_i, S_i, V_i, D_i$ and those of set $X_j$ be $C_j, S_j, V_j, D_j$; sets $X_i$ and $X_j$ are merged, and the variance $D_{ij}$ of the merged set $X_{ij}$ is obtained from the following formula:
$$D_{ij} = \frac{C_i D_i + (V_i - V_{ij})\left[2S_i - C_i(V_i + V_{ij})\right] + C_j D_j + (V_j - V_{ij})\left[2S_j - C_j(V_j + V_{ij})\right]}{C_{ij}}$$
where $V_{ij}$ is the mean of the data in the set $X_i + X_j$ and $C_{ij}$ is the number of data values in the set $X_i + X_j$;
the standard deviation of set $X_{ij}$ is obtained from $D_{ij}$: $\sigma_{ij} = \sqrt{D_{ij}}$;
(4) Repeat step 3 until all transaction record subsets $X_1, X_2, \dots, X_n$ have been merged, obtaining the overall variance D of the big data set $X = \{X_1, X_2, \dots, X_n\}$ and its corresponding standard deviation σ;
(5) Detect transaction fluctuation based on the standard deviation σ: set a standard deviation threshold, and apply a risk control strategy when σ exceeds the threshold.
CN201510083970.7A 2015-02-15 2015-02-15 A distributed or incremental calculation method for big data variance and standard deviation Active CN104636318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510083970.7A CN104636318B (en) 2015-02-15 2015-02-15 A distributed or incremental calculation method for big data variance and standard deviation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510083970.7A CN104636318B (en) 2015-02-15 2015-02-15 A distributed or incremental calculation method for big data variance and standard deviation

Publications (2)

Publication Number Publication Date
CN104636318A CN104636318A (en) 2015-05-20
CN104636318B CN104636318B (en) 2017-07-14

Family

ID=53215091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510083970.7A Active CN104636318B (en) 2015-02-15 2015-02-15 A distributed or incremental calculation method for big data variance and standard deviation

Country Status (1)

Country Link
CN (1) CN104636318B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058633A1 (en) * 2016-09-30 2018-04-05 深圳市华傲数据技术有限公司 Data processing method and apparatus based on increment
CN106407161A (en) * 2016-11-22 2017-02-15 重庆邮电大学 Distributed calculating method of standard deviation
CN107040608A (en) * 2017-05-19 2017-08-11 宁波绮耘软件股份有限公司 A kind of data processing method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577681A (en) * 2013-06-26 2014-02-12 长沙理工大学 Factor analysis-based quantitative evaluation method on of boiler efficiency influence indexes
CN104123668A (en) * 2014-03-30 2014-10-29 广州天策软件科技有限公司 Standard quantization parameter based mass data dynamic screening method and application thereof in financial security field

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11224366A (en) * 1998-02-05 1999-08-17 Toshiba Corp Operation supporting system for automatic transaction device
US8229838B2 (en) * 2009-10-14 2012-07-24 Chicago Mercantile Exchange, Inc. Leg pricer
US20130317963A1 (en) * 2012-05-22 2013-11-28 Applied Academics Llc Methods and systems for creating a government bond volatility index and trading derivative products thereon

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
校准和检测中微小样本测量不确定度评定方法研究;宋明顺,方兴华,黄佳,张俊亮;《仪器仪表学报》;20140228;第35卷(第2期);419-426 *
表象训练对技能学习绩效影响的元分析;黄志剑,王积福,向伟;《体育科学》;20131231;第33卷(第5期);25-30 *

Also Published As

Publication number Publication date
CN104636318A (en) 2015-05-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 310012 Rooms 607 and 609, Zone C, European and American Center, World Trade Regent City, Xihu District, Zhejiang

Applicant after: Zhejiang Bang Sheng Technology Co., Ltd.

Address before: 310012 Rooms 607 and 609, Zone C, European and American Center, World Trade Regent City, Xihu District, Zhejiang

Applicant before: HANGZHOU BANGSUN FINANCIAL INFORMATION TECHNOLOGY LTD.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310012 rooms 607 and 609, Zone C, European and American Center, world trade Regent City, Xihu District, Hangzhou City, Zhejiang Province

Patentee after: Zhejiang Bangsheng Technology Co.,Ltd.

Address before: 310012 rooms 607 and 609, Zone C, European and American Center, world trade Regent City, Xihu District, Hangzhou City, Zhejiang Province

Patentee before: ZHEJIANG BANGSUN TECHNOLOGY Co.,Ltd.