CN104199942A - Hadoop platform time series data incremental computation method and system - Google Patents
- Publication number
- CN104199942A
- Authority
- CN
- China
- Prior art keywords
- series data
- time series
- computing
- sub
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and system for incremental computation over time series data on the Hadoop platform. The method includes: when an incremental computation task for time series data is started, obtaining the historical computation state of that time series data from a cache server; and, based on the historical computation state, performing incremental computation using a segmented time series data incremental computation method comprising SubCp and ReduceCP sub-operations. The SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result. The ReduceCP sub-operation is the merge stage, which merges the computed results of the segmented time series data according to a user-defined operation. The computation states of both the SubCp and ReduceCP sub-operations are maintained by the cache server. With the disclosed method and system, incremental computation avoids a large amount of unnecessary repeated computation and thereby improves data processing efficiency.
Description
Technical Field
The invention relates to the field of computer technology, and in particular to a method for incremental computation over time series data on the Hadoop platform.
Background
With the rapid development of Internet technology and the widespread use of information collection technology, massive amounts of data in time series form have been generated and accumulated in many scientific and industrial fields, such as telecommunications, meteorology, geology, electric power, and finance. Traditional time series processing generally relies on mathematical computing tools such as Matlab, but as the problem size grows, the computation time often becomes intolerable.
As big data processing has gained attention, a number of companies and research institutions have begun research in this area, with most of the work concentrated on Hadoop, the open source distributed computing platform. As a distributed framework, Hadoop can operate on large amounts of data in a distributed manner and offers many advantages for processing massive data, such as high fault tolerance, high scalability, and high reliability.
At present, the Hadoop platform does not provide good support for time series data processing, and there is relatively little research on incremental computation over time series data. As a result, computations must be repeated whenever new time series data is added, which reduces data processing efficiency.
Summary of the Invention
The purpose of the invention is to provide a method and system for incremental computation over time series data on the Hadoop platform, in which incremental computation avoids a large amount of unnecessary repeated computation and thereby improves data processing efficiency.
The purpose of the invention is achieved through the following technical solutions:
A method for incremental computation over time series data on the Hadoop platform, the method comprising:
when an incremental computation task for time series data is started, obtaining the historical computation state of that time series data from a cache server;
based on the historical computation state, performing incremental computation using a segmented time series data incremental computation method comprising SubCp and ReduceCP sub-operations;
wherein the SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result; the ReduceCP sub-operation is the merge stage, which merges the computed results of the segmented time series data according to a user-defined operation; and the computation states of both the SubCp and ReduceCP sub-operations are maintained by the cache server.
A system for incremental computation over time series data on the Hadoop platform, the system comprising:
a time series data incremental processing module (TSI), configured to, when an incremental computation task for time series data is started, obtain the historical computation state of that time series data from a cache server, and, based on the historical computation state, perform incremental computation using a segmented time series data incremental computation method comprising SubCp and ReduceCP sub-operations; wherein the SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result; the ReduceCP sub-operation is the merge stage, which merges the computed results of the segmented time series data according to a user-defined operation; and the computation states of both the SubCp and ReduceCP sub-operations are maintained by the cache server;
a cache server, configured to store the historical computation state of the time series data.
As can be seen from the technical solution above, the cache server caches the historical computation state of the time series data. When incremental computation is started, the computation on the incremental data is performed directly on the basis of the retrieved historical computation state, and the historical results are then quickly reused. This avoids unnecessary repeated computation and thereby improves data processing efficiency.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a flow chart of a method for incremental computation over time series data on the Hadoop platform, provided by Embodiment 1 of the invention;
Figure 2 is a schematic diagram of a time series data segmentation mechanism provided by Embodiment 1 of the invention;
Figure 3 is a schematic diagram of a segmented time series data incremental computation method provided by Embodiment 1 of the invention;
Figure 4 is a schematic diagram of a stateful sliding-window incremental computation method with a fixed window width, provided by Embodiment 1 of the invention;
Figure 5 is a schematic diagram of a stateful incremental computation method with a monotonically growing window and a fixed starting point, provided by Embodiment 1 of the invention;
Figure 6 is a schematic diagram of a system for incremental computation over time series data on the Hadoop platform, provided by Embodiment 2 of the invention;
Figure 7 is a schematic diagram of the integration of the existing Hadoop platform with the incremental computation system, provided by Embodiment 2 of the invention.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of the invention.
Embodiment 1
Figure 1 is a flow chart of a method for incremental computation over time series data on the Hadoop platform, provided by Embodiment 1 of the invention. As shown in Figure 1, the method mainly includes the following steps:
Step 11. When an incremental computation task for time series data is started, obtain the historical computation state of that time series data from the cache server.
Here, the continuous time series data is divided into multiple segments, using a certain time period as the unit, so that the computation over the time series data in each unit time period is one sub-operation. The segmented time series data must satisfy the monoid property.
An incremental computation task for time series data indicates that new segmented time series data has been added.
Step 12. Based on the historical computation state, perform incremental computation using the segmented time series data incremental computation method comprising SubCp and ReduceCP sub-operations.
Here, the SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result; the ReduceCP sub-operation is the merge stage, which merges the computed results of the segmented time series data according to a user-defined operation; and the computation states of both the SubCp and ReduceCP sub-operations are maintained by the cache server.
Further, the segmented time series data incremental computation method includes:
A stateful sliding-window incremental computation method with a fixed window width: the state is the historical computation state of the time series data maintained by the cache server, and a fixed window width means that the window contains a fixed number of time periods. Let the window width be fixed at n, and let the time series data of time periods 1 through n have already been computed and stored in the cache server. When the (n+1)-th new segment of time series data arrives, then, based on the historical computation state in the cache server, the SubCp sub-operation computes only the (n+1)-th new segment, and the ReduceCP sub-operation merges the result for the (n+1)-th segment with the results in the historical computation state and subtracts the time series data of the 1st time period;
A stateful incremental computation method with a monotonically growing window and a fixed starting point: the state is the historical computation state of the time series data maintained by the cache server; the window's starting time point is fixed, and the window size grows over time. Let the window start at the time series data of the 1st time period, and let the time series data of time periods 1 through n have already been computed and stored in the cache server. When the (n+1)-th new segment of time series data arrives, then, based on the historical computation state in the cache server, the SubCp sub-operation computes only the (n+1)-th new segment, and the ReduceCP sub-operation merges the result for the (n+1)-th segment with the results in the historical computation state.
To aid understanding, the invention is described further below with reference to Figures 2 to 5.
Figure 2 is a schematic diagram of the time series data segmentation mechanism provided by the invention. As shown in Figure 2, continuous time series data can be divided into multiple segments, using a certain time period as the unit, so that the computation over the time series data in each unit time period is one sub-operation. The sub-operations produced by this division must satisfy the monoid property, so that the corresponding sub-operations can be merged.
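The monoid property is invoked here without being spelled out. In standard algebraic terms it can be written as follows; the symbols (M for the set of partial results, ⊕ for the merge operation, e for the identity, s_i for the i-th segment, f for the per-segment computation, F for the result over all segments) are notation assumed for this illustration and do not appear in the source.

```latex
% Notation assumed for illustration; not taken from the patent text.
\begin{aligned}
&(a \oplus b) \oplus c = a \oplus (b \oplus c) && \text{for all } a, b, c \in M \quad \text{(associativity)} \\
&e \oplus a = a \oplus e = a && \text{for all } a \in M \quad \text{(identity element)} \\
&F(s_1, \dots, s_{n+1}) = \bigl( f(s_1) \oplus \dots \oplus f(s_n) \bigr) \oplus f(s_{n+1})
\end{aligned}
```

Associativity is what allows the parenthesized term, already computed for segments 1 through n, to be reused unchanged when segment n+1 arrives.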
Figure 3 is a flow chart of incremental computation over segmented time series data. The process uses the segmentation mechanism of Figure 2 and comprises two sub-operations, SubCp and ReduceCP. The SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result; for example, with the data segmented by day, it counts the visits to a given page of a website within each time period. ReduceCP is the merge stage, which merges the computed results of the segments according to a user-defined operation; for example, with the data segmented by day, it merges the per-day results into the total visits to that page over the most recent n days. The states of the SubCp and ReduceCP sub-operations are maintained by the cache server (Cache Server).
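The page-visit example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: a plain dictionary stands in for the cache server, and the function names `sub_cp`, `reduce_cp`, and `count_home_hits`, the key layout, and the sample log data are all hypothetical.

```python
from functools import reduce


def sub_cp(segment_id, records, cache, compute):
    """SubCp: compute one segment's partial result unless it is already cached."""
    key = ("subcp", segment_id)  # hypothetical key layout
    if key not in cache:
        cache[key] = compute(records)
    return cache[key]


def reduce_cp(segment_ids, cache, merge, identity):
    """ReduceCP: merge the cached partial results of the given segments."""
    partials = (cache[("subcp", s)] for s in segment_ids)
    return reduce(merge, partials, identity)


def count_home_hits(records):
    """User-defined sub-operation: visits to one page within a segment."""
    return sum(1 for page in records if page == "/home")


# Hypothetical access log, already segmented by day.
daily_logs = {
    "2014-09-07": ["/home", "/about", "/home"],
    "2014-09-08": ["/home"],
    "2014-09-09": ["/docs", "/home", "/home"],
}

cache = {}  # stand-in for the cache server
for day, records in daily_logs.items():
    sub_cp(day, records, cache, count_home_hits)

# Addition with identity 0 is the user-defined merge operation here.
total_hits = reduce_cp(sorted(daily_logs), cache, lambda a, b: a + b, 0)
```

Calling `sub_cp` again for a day that is already cached returns the stored value without touching the records, which is the reuse the method relies on.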
Through incremental computation, the embodiments of the invention avoid a large amount of unnecessary repeated computation and thereby improve data processing efficiency. Combining the segmented incremental computation method with the properties of segmented time series data, the embodiments propose two stateful sliding-window incremental computation methods: a fixed-width window, which contains a fixed number of time periods; and a monotonically growing window, whose starting time point is fixed and whose size increases over time. The details are as follows:
Figure 4 shows stateful sliding-window incremental computation with a fixed window width, where the state is the computation state maintained by the Cache Server. Building on the properties of segmented time series data and the incremental computation method of Figures 2 and 3, assume the window width is fixed at n. When the (n+1)-th new segment of time series data arrives, the historical computation state in the Cache Server shows that the data on the left (segments 1 through n) has already been computed. Only the incremental data (the (n+1)-th segment) needs to be computed and merged with the relevant historical results to obtain the required result. Because the window width is fixed at n, the 1st segment must also be subtracted after the merge. Combining the new time series data with the historical results yields the same result as a computation over the full data. This avoids a large amount of unnecessary repeated computation and thereby improves data processing efficiency.
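A minimal sketch of the fixed-width case, assuming the per-segment computation is a sum. The class name `FixedWindowSum` and the sample data are hypothetical, and the state is held in process memory rather than in a cache server. Note that the subtraction step needs a merge operation with an inverse; addition has one, whereas a monoid in general (for example, maximum) does not.

```python
from collections import deque


class FixedWindowSum:
    """Running sum over the most recent `width` segments (illustrative only)."""

    def __init__(self, width):
        self.width = width
        self.partials = deque()  # per-segment results currently in the window
        self.total = 0           # merged result for the current window

    def add_segment(self, records):
        partial = sum(records)         # SubCp on the new segment only
        self.partials.append(partial)
        self.total += partial          # ReduceCP: merge with the history
        if len(self.partials) > self.width:
            # The window is full: subtract the oldest segment's result.
            self.total -= self.partials.popleft()
        return self.total


window = FixedWindowSum(width=3)
segments = [[1, 2], [3], [4, 5], [6], [7, 8]]
results = [window.add_segment(seg) for seg in segments]
```

Each arrival costs one segment computation plus one addition and at most one subtraction, regardless of the window width.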
Figure 5 shows stateful incremental computation with a monotonically growing window and a fixed starting point, where the state is the computation state maintained by the Cache Server. Building on the properties of segmented time series data and the incremental computation method of Figures 2 and 3, assume the window starts at segment 1. When the (n+1)-th new segment of time series data arrives, the historical computation state in the Cache Server shows that the data on the left (segments 1 through n) has already been computed. Only the incremental data (the (n+1)-th segment) needs to be computed and merged with the relevant historical results to obtain the required result. Combining the new data with the historical results yields the same result as a computation over the full data. This avoids a large amount of unnecessary repeated computation and thereby improves data processing efficiency.
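The growing-window case needs only a merge, so it also works for operations without an inverse. The sketch below tracks a running maximum; the class name `GrowingWindow`, its parameters, and the sample data are assumptions made for illustration, and the state is a local attribute standing in for the cache server.

```python
class GrowingWindow:
    """Merged result over all segments since a fixed start (illustrative only)."""

    def __init__(self, compute, merge, identity):
        self.compute = compute   # user-defined SubCp function
        self.merge = merge       # user-defined ReduceCP function
        self.state = identity    # historical computation state
        self.segments_seen = 0

    def add_segment(self, records):
        partial = self.compute(records)               # only the new segment
        self.state = self.merge(self.state, partial)  # merge with the history
        self.segments_seen += 1
        return self.state


# Maximum with identity -infinity forms a monoid but has no inverse.
peak = GrowingWindow(compute=max, merge=max, identity=float("-inf"))
running_peaks = [peak.add_segment(seg) for seg in ([3, 1], [2], [7, 4], [5])]
```

Because nothing is ever removed from the window, a single merged value is enough state; the per-segment results do not need to be retained for this case.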
In addition, the cache server in the embodiments of the invention may apply an expiry timer to inserted data: after a certain period it identifies and clears stale data that is no longer needed, so that the in-memory database does not grow without bound.
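One way such a timing mechanism could look is a per-entry time-to-live. The sketch below is an assumption about the behavior, not a description of the actual cache server: `ExpiringCache`, its methods, and the injectable `clock` parameter (used here so the example does not depend on real elapsed time) are all hypothetical.

```python
import time


class ExpiringCache:
    """In-memory cache whose entries expire after `ttl_seconds` (illustrative)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, insertion time)

    def put(self, key, value):
        self._store[key] = (value, self.clock())

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, inserted_at = entry
        if self.clock() - inserted_at >= self.ttl:
            del self._store[key]  # expired: drop lazily on access
            return default
        return value

    def purge_expired(self):
        """Remove every expired entry and return how many were removed."""
        now = self.clock()
        stale = [k for k, (_, t) in self._store.items() if now - t >= self.ttl]
        for key in stale:
            del self._store[key]
        return len(stale)


# Demonstration with a fake clock that is advanced by hand.
fake_now = [0.0]
demo_cache = ExpiringCache(ttl_seconds=10, clock=lambda: fake_now[0])
demo_cache.put("segment-1", 42)
fake_now[0] = 5.0
demo_cache.put("segment-2", 7)
fake_now[0] = 12.0
removed = demo_cache.purge_expired()  # segment-1 is now 12 s old
```

For a fixed-width window, a time-to-live somewhat longer than the window span would keep every segment result that may still need to be subtracted.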
Time series computation algorithms can also be combined with the incremental computation method provided by the invention. These include the following commonly used algorithms: time series forecasting algorithms, such as the simple average, moving average, and weighted moving average methods; and time series similarity measures, such as ED (Euclidean distance), DTW (dynamic time warping), and FastDTW.
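As an illustration of combining one of these algorithms with the incremental method, the sketch below computes a moving average over the most recent segments. A mean cannot be merged directly, so each segment's partial result is kept as a (sum, count) pair, which can be added and subtracted component-wise. The function name `moving_average` and the sample data are hypothetical, and the sketch assumes every segment is non-empty.

```python
from collections import deque


def moving_average(segments, width):
    """Average over the most recent `width` segments, updated incrementally."""
    window = deque()       # (sum, count) pair for each segment in the window
    total, count = 0.0, 0
    averages = []
    for records in segments:
        partial = (sum(records), len(records))  # per-segment computation
        window.append(partial)
        total += partial[0]                     # merge the new segment
        count += partial[1]
        if len(window) > width:
            old_sum, old_count = window.popleft()
            total -= old_sum                    # drop the oldest segment
            count -= old_count
        averages.append(total / count)          # assumes count > 0
    return averages


averages = moving_average([[2, 4], [6], [8, 10, 12]], width=2)
```

The same pattern extends to a weighted moving average only if the weights can be folded into the per-segment partial results; weights that depend on a segment's position in the window would change as the window slides and would need a different formulation.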
Compared with the prior art, the technical solution provided by the embodiments of the invention has the following beneficial effects:
1) It is built on the Hadoop platform and does not change Hadoop's underlying architecture, which makes it convenient for programmers to write programs;
2) It supports the processing of time series data on top of the Hadoop platform;
3) It supports incremental computation over time series data on the Hadoop platform, reducing unnecessary repeated computation and improving the efficiency of incremental data computation.
Embodiment 2
Figure 6 is a schematic diagram of a system for incremental computation over time series data on the Hadoop platform, provided by Embodiment 2 of the invention. As shown in Figure 6, the system mainly includes:
a time series data incremental processing module TSI 11, configured to, when an incremental computation task for time series data is started, obtain the historical computation state of that time series data from the cache server, and, based on the historical computation state, perform incremental computation using the segmented time series data incremental computation method comprising SubCp and ReduceCP sub-operations; wherein the SubCp sub-operation applies a user-defined sub-operation to each segment of the time series data and saves the intermediate result; the ReduceCP sub-operation is the merge stage, which merges the computed results of the segmented time series data according to a user-defined operation; and the computation states of both the SubCp and ReduceCP sub-operations are maintained by the cache server;
a cache server 12, configured to store the historical computation state of the time series data.
Further, the segmented time series data incremental computation method includes:
A stateful sliding-window incremental computation method with a fixed window width: the state is the historical computation state of the time series data maintained by the cache server, and a fixed window width means that the window contains a fixed number of time periods. Let the window width be fixed at n, and let the time series data of time periods 1 through n have already been computed and stored in the cache server. When the (n+1)-th new segment of time series data arrives, then, based on the historical computation state in the cache server, the SubCp sub-operation computes only the (n+1)-th new segment, and the ReduceCP sub-operation merges the result for the (n+1)-th segment with the results in the historical computation state and subtracts the time series data of the 1st time period;
A stateful incremental computation method with a monotonically growing window and a fixed starting point: the state is the historical computation state of the time series data maintained by the cache server; the window's starting time point is fixed, and the window size grows over time. Let the window start at the time series data of the 1st time period, and let the time series data of time periods 1 through n have already been computed and stored in the cache server. When the (n+1)-th new segment of time series data arrives, then, based on the historical computation state in the cache server, the SubCp sub-operation computes only the (n+1)-th new segment, and the ReduceCP sub-operation merges the result for the (n+1)-th segment with the results in the historical computation state.
Further, the continuous time series data is divided into multiple segments, using a certain time period as the unit, so that the computation over the time series data in each unit time period is one sub-operation; the segmented time series data satisfies the monoid property.
Because the system can be implemented on the Hadoop platform, the modules above can, for ease of understanding, be considered in combination with the existing Hadoop platform. As shown in Figure 7, the Hadoop platform is extended with the cache server (Cache Server) and the time series data incremental processing module (TSI). The cache server is a cache database module that caches the necessary computation state and results, and it offers richer data structure representation than the caching service provided by Hadoop itself. The TSI module is mainly used for incremental computation over time series data.
It should be noted that the specific implementation of the functions of each functional module in the system above has been described in detail in the preceding embodiments and is not repeated here.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional modules above is only an example. In practice, the functions can be assigned to different functional modules as needed; that is, the internal structure of the system can be divided into different functional modules to perform all or part of the functions described above.
From the description of the embodiments above, a person skilled in the art will clearly understand that the embodiments can be implemented in software, or in software together with a necessary general-purpose hardware platform. On this understanding, the technical solutions of the embodiments can be embodied as a software product, which can be stored on a non-volatile storage medium (such as a CD-ROM, USB flash drive, or portable hard disk) and includes a number of instructions that cause a computer device (such as a personal computer, server, or network device) to execute the methods described in the embodiments of the invention.
The above are only preferred embodiments of the invention, and the scope of protection of the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention falls within the scope of protection of the invention. The scope of protection of the invention is therefore defined by the claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410456262.9A CN104199942B (en) | 2014-09-09 | 2014-09-09 | A kind of Hadoop platform time series data incremental calculation method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410456262.9A CN104199942B (en) | 2014-09-09 | 2014-09-09 | A kind of Hadoop platform time series data incremental calculation method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104199942A true CN104199942A (en) | 2014-12-10 |
CN104199942B CN104199942B (en) | 2017-11-07 |
Family
ID=52085235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410456262.9A Expired - Fee Related CN104199942B (en) | 2014-09-09 | 2014-09-09 | A kind of Hadoop platform time series data incremental calculation method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104199942B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105843891A (en) * | 2016-03-22 | 2016-08-10 | 浙江大学 | Incremental online characteristic extraction and analysis method and system |
WO2017113865A1 (en) * | 2015-12-31 | 2017-07-06 | 华为技术有限公司 | Method and device for big data increment calculation |
CN108846636A (en) * | 2018-06-01 | 2018-11-20 | 北京字节跳动网络技术有限公司 | Data dispatching method, device, computer readable storage medium |
CN109948007A (en) * | 2019-03-21 | 2019-06-28 | 浙江邦盛科技有限公司 | A kind of clock synchronization ordinal number maximum processing method for being increased continuously number and number of increments according to statistics |
CN110008544A (en) * | 2019-03-21 | 2019-07-12 | 浙江邦盛科技有限公司 | A kind of processing method of clock synchronization ordinal number number of increments and reduced degree according to statistics |
CN110019367A (en) * | 2017-12-28 | 2019-07-16 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus of statistical data feature |
CN112488412A (en) * | 2020-12-11 | 2021-03-12 | 北京字跳网络技术有限公司 | Duration information determination method and device, electronic equipment and computer storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049556A (en) * | 2012-12-28 | 2013-04-17 | 中国科学院深圳先进技术研究院 | Fast statistical query method for mass medical data |
CN103676645A (en) * | 2013-12-11 | 2014-03-26 | 广东电网公司电力科学研究院 | Mining method for association rules in time series data flows |
US20140214372A1 (en) * | 2013-01-25 | 2014-07-31 | International Business Machines Corporation | Interpolation techniques used for time alignment of multiple simulation models |
2014-09-09: CN application CN201410456262.9A, patent CN104199942B (en); status: not active, Expired - Fee Related
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049556A (en) * | 2012-12-28 | 2013-04-17 | 中国科学院深圳先进技术研究院 | Fast statistical query method for mass medical data |
US20140214372A1 (en) * | 2013-01-25 | 2014-07-31 | International Business Machines Corporation | Interpolation techniques used for time alignment of multiple simulation models |
CN103676645A (en) * | 2013-12-11 | 2014-03-26 | 广东电网公司电力科学研究院 | Mining method for association rules in time series data flows |
Non-Patent Citations (2)
Title |
---|
刘学军等: "基于滑动窗口的在线数据流增量聚集查询", 《计算机工程》 * |
王文胜: "基于集群计算的网络信息采集系统的设计与实现", 《中国优秀硕士学位论文全文数据库》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017113865A1 (en) * | 2015-12-31 | 2017-07-06 | 华为技术有限公司 | Method and device for big data increment calculation |
CN106933882A (en) * | 2015-12-31 | 2017-07-07 | 华为技术有限公司 | A kind of big data incremental calculation method and device |
CN106933882B (en) * | 2015-12-31 | 2020-09-29 | 华为技术有限公司 | Big data increment calculation method and device |
CN105843891A (en) * | 2016-03-22 | 2016-08-10 | 浙江大学 | Incremental online characteristic extraction and analysis method and system |
CN110019367A (en) * | 2017-12-28 | 2019-07-16 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus of statistical data feature |
CN110019367B (en) * | 2017-12-28 | 2022-04-12 | 北京京东尚科信息技术有限公司 | Method and device for counting data characteristics |
CN108846636A (en) * | 2018-06-01 | 2018-11-20 | 北京字节跳动网络技术有限公司 | Data dispatching method, device, computer readable storage medium |
CN109948007A (en) * | 2019-03-21 | 2019-06-28 | 浙江邦盛科技有限公司 | A kind of clock synchronization ordinal number maximum processing method for being increased continuously number and number of increments according to statistics |
CN110008544A (en) * | 2019-03-21 | 2019-07-12 | 浙江邦盛科技有限公司 | A kind of processing method of clock synchronization ordinal number number of increments and reduced degree according to statistics |
CN109948007B (en) * | 2019-03-21 | 2020-07-14 | 浙江邦盛科技有限公司 | Processing method for inquiring maximum continuous increasing times and decreasing times of time sequence data statistics |
CN112488412A (en) * | 2020-12-11 | 2021-03-12 | 北京字跳网络技术有限公司 | Duration information determination method and device, electronic equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN104199942B (en) | 2017-11-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104199942B (en) | A Hadoop platform time series data incremental calculation method and system | |
Damaskinos et al. | Fleet: Online federated learning via staleness awareness and performance prediction | |
Patel et al. | A hybrid CNN-LSTM model for predicting server load in cloud computing | |
Hadian et al. | High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs | |
CN107820611B (en) | Event processing system paging | |
US20180270158A1 (en) | Decremental autocorrelation calculation for big data using components | |
US10235415B1 (en) | Iterative variance and/or standard deviation calculation for big data using components | |
US20200327436A1 (en) | Estimating utilization of network resources using time series data | |
Kumar et al. | Noise reduction using modified wiener filter in digital hearing aid for speech signal enhancement | |
Puzis et al. | Topology manipulations for speeding betweenness centrality computation | |
US20230069347A1 (en) | Device, method, and system for concept drift detection | |
WO2017204819A1 (en) | Similarity analyses in analytics workflows | |
Chen et al. | Xgboost: Reliable large-scale tree boosting system | |
Bahl et al. | Parallel simulations for analysing portfolios of catastrophic event risk | |
Adámek et al. | GPU fast convolution via the overlap-and-save method in shared memory | |
Kaim et al. | Ensemble cnn attention-based bilstm deep learning architecture for multivariate cloud workload prediction | |
Gong et al. | Automatic mapping of the best-suited dnn pruning schemes for real-time mobile acceleration | |
Xenopoulos et al. | Big data analytics on HPC architectures: Performance and cost | |
HewaNadungodage et al. | GPU-accelerated outlier detection for continuous data streams | |
CN104869105B (en) | A kind of abnormality online recognition method | |
Roy et al. | Queues with resetting: a perspective | |
Kumar et al. | A Study on In-Time-Frequency Algorithm | |
Ketu et al. | Performance enhancement of distributed K-Means clustering for big Data analytics through in-memory computation | |
Li et al. | Two-level incremental checkpoint recovery scheme for reducing system total overheads | |
Huang et al. | Improving speculative execution performance with coworker for cloud computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171107 |