TW202009733A - Method of timely processing and scheduling big data - Google Patents

Method of timely processing and scheduling big data

Info

Publication number
TW202009733A
Authority
TW
Taiwan
Prior art keywords
data
huge
data sources
processing
scheduling
Prior art date
Application number
TW107127974A
Other languages
Chinese (zh)
Other versions
TWI676109B (en)
Inventor
王文彥
Original Assignee
崑山科技大學
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 崑山科技大學
Priority to TW107127974A
Application granted
Publication of TWI676109B
Publication of TW202009733A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method of timely processing and scheduling big data. The method comprises the following steps: a processing module calculates the data growth rate of each of a plurality of big data sources to determine whether the source reaches a data growth threshold; the big data sources that meet the data growth threshold are scheduled according to their estimated processing time; and the scheduled big data sources are placed into memories with low memory usage for analysis. After screening, the invention therefore processes and analyzes only the big data sources that meet the threshold and skips sources that merely repeat earlier data, which improves the efficiency of processing the plurality of big data sources.

Description

Method for timely processing and scheduling of big data

The present invention relates to a method for timely processing and scheduling of big data, and in particular to a method that excludes big data sources that do not need to be processed. The decision is based on the data growth rate, which indicates whether new data has been added since the point in time at which the big data source was last processed.

Big data refers to large, complex, and unstructured data. A single big data set ranges in size from several terabytes (TB) to tens of petabytes (PB), and big data keeps growing in both volume and complexity. Big data analysis has nevertheless become a trend in technological development, and it means different things to different industries: big science, RFID, sensor networks, astronomy, atmospheric science, transportation, genomics, biology, and large-scale social data analysis can all make use of big data processing and analysis. Because of the sheer size of the data, a data warehouse system for big data must handle its diversity and complexity, and it needs both very large capacity and timely processing so that consumers and analysts can view the data in time.

Although front ends and mobile devices can display the results of big data analysis, most current big data processing uses scheduled batch jobs, in which the system processes and analyzes the data at preset times. This approach cannot report the current situation to the user in time, so the analyzed data often loses its timeliness, and the approach also consumes considerable system resources and time. Some conventional approaches use real-time processing instead; they respond immediately but still consume considerable system resources, and they do not consider the relationship between scheduling and processing. Republic of China (Taiwan) Patent Publication No. TW I522827, "Method for real-time storage and real-time reading of big data in a non-relational database", provides a method for storing and reading big data in real time. It offers a distributed, multiplexed data access mechanism sized to the real-time processing performance required: depending on how many data receiving interfaces are currently in use, the real-time reading program module lets multiple client programs read at the same time, each reading non-overlapping data segments, to make data reading more immediate.

However, current big data processing does not screen the data first. It analyzes all of the data, which places a heavy performance burden on the system and lengthens processing and analysis time. How to provide a method that processes big data in time and takes the relationship between scheduling and processing into account, so as to reduce the processing and analysis load on the system, is therefore the problem the inventor set out to address.

In view of the shortcomings that existing methods for timely processing and scheduling of big data still show in practice, the inventor, drawing on professional knowledge and years of practical experience, worked to improve on them and developed the present invention accordingly.

The main object of the present invention is to provide a method for timely processing and scheduling of big data that uses the data growth rate to determine whether a big data source needs to be processed and analyzed. A big data source that contains only old data repeated from before does not need to be processed; the other, new big data sources are analyzed in time, together with the relationships among their corresponding analyses. This reduces the analysis load on the processing module and improves the efficiency of big data source processing.

To achieve the above object, the method for timely processing and scheduling of big data of the present invention comprises: a processing module calculating whether the data growth rate of each of a plurality of big data sources reaches a data growth threshold; importing the big data sources that reach the data growth threshold into the processing module; scheduling the big data sources that meet the data growth threshold in ascending order of estimated processing time; the processing module placing the big data sources into memories with low memory usage and analyzing them; and either releasing the memory occupied by the analyzed big data sources or retaining it for use by other big data sources that have not yet been analyzed.

In one embodiment of the present invention, each big data source has at least one corresponding analysis result.

In one embodiment of the present invention, the relationship between the big data sources and the at least one analysis result takes one of three data structures: one-to-one, one-to-many, or many-to-one.

In one embodiment of the present invention, the data growth rate is the new data amount of a big data source divided by its total data amount, where the total data amount is the sum of the new data amount and the old data amount.

In one embodiment of the present invention, the big data sources that meet the data growth threshold overwrite the old big data sources previously held in the processing module.

In one embodiment of the present invention, the estimated processing time of a big data source is the time needed to load that big data source into the memories plus the time the processing module needs to process it.

In one embodiment of the present invention, if the memory space required by a big data source exceeds the available space of every memory, the big data source is split so that the processing module analyzes it in batches.

In one embodiment of the present invention, a big data source that is split is divided into 2 to the power n parts.

The objects of the present invention and the advantages of its structure and function are described below with reference to the drawings and specific embodiments, so that the examiner may gain a deeper and more concrete understanding of the invention.

Referring to Figure 1, the method for timely processing and scheduling of big data of the present invention proceeds as follows. A processing module calculates whether the data growth rate of each of a plurality of big data sources reaches a data growth threshold. The data growth rate is the new data amount of a big data source divided by its total data amount, where the total data amount is the sum of the new data amount and the old data amount of that source. The big data sources that reach the data growth threshold are imported into the processing module. Each such source has at least one corresponding analysis result, and the correspondence takes one of three data structures: one-to-one, one-to-many, or many-to-one.

The big data sources that meet the data growth threshold may be all of the big data sources or only some of them. They overwrite the old big data sources previously held in the processing module and are scheduled in ascending order of estimated processing time, where the estimated processing time of a source is the time needed to load it into the memories plus the time the processing module needs to process it. The processing module places the big data sources into memories with low memory usage and analyzes them. If the memory space required by a big data source exceeds the available space of every memory, the source is split into 2 to the power n parts so that the processing module analyzes it in batches. The memory occupied by analyzed big data sources is either released or retained for use by other big data sources that have not yet been analyzed.

In this way, the method imports only the screened big data sources into the processing module for processing and analysis and excludes the sources that do not need to be analyzed in the current round. It also schedules the sources according to the processing effort they require. The invention therefore reduces the load on the processing module and lets the big data sources be processed and analyzed promptly and in order, so that the data the user sees reflects the situation at that time.

The following specific embodiment further demonstrates the practical scope of application of the present invention, but it is not intended to limit the scope of the invention in any way.

Continuing with Figure 1, the method mainly selects, from a plurality of big data sources, those that should be processed and analyzed. In this embodiment there are, for example, five big data sources: S1, S2, S3, S4, and S5. When the processing module processes and analyzes big data sources, it produces at least one analysis result. The correspondence takes three basic forms: one big data source to one analysis result, one big data source to multiple analysis results, and multiple big data sources to one analysis result. In this embodiment, big data source S5 yields one analysis result T4; big data sources S1 and S2 together yield one analysis result T1; big data sources S3 and S4 together yield one analysis result T3; and big data source S2 alone yields two analysis results, T1 and T2, as shown in Figure 2. Such an association between big data sources and analysis results is called a "related" item. The four related items above correspond to four thresholds, C1, C2, C3, and C4, which are the data growth thresholds. A related item is a structure for organizing, arranging, and storing data in computer memory.
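The related items of this embodiment can be written down as a small data structure. The sketch below is illustrative only: the tuple layout and the names `RELATED_ITEMS`, `classify`, and `results_by_source` are assumptions and do not appear in the patent.

```python
from collections import defaultdict

# Related items from the embodiment, as (sources, results) pairs.
# The tuple layout is an assumption made for illustration.
RELATED_ITEMS = [
    (("S5",), ("T4",)),        # one source -> one result
    (("S1", "S2"), ("T1",)),   # several sources -> one result
    (("S3", "S4"), ("T3",)),   # several sources -> one result
    (("S2",), ("T1", "T2")),   # one source -> several results
]


def classify(sources, results):
    """Name the relationship between the sources and results of one item."""
    if len(sources) == 1 and len(results) == 1:
        return "one-to-one"
    if len(sources) == 1:
        return "one-to-many"
    if len(results) == 1:
        return "many-to-one"
    return "many-to-many"  # not one of the three structures named in the text


def results_by_source(items):
    """Index every analysis result that a given source contributes to."""
    index = defaultdict(set)
    for sources, results in items:
        for source in sources:
            index[source].update(results)
    return {source: sorted(found) for source, found in index.items()}
```

Looking up S2 in the index returns both T1 and T2, which matches the description of S2 feeding two analysis results.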

The processing module therefore calculates whether the data growth rate i of S1, S2, S3, S4, and S5 reaches the corresponding data growth threshold C1, C2, C3, or C4. The thresholds differ from one another and are set by the user based on the past growth of the big data sources and accumulated experience. For example, the module checks whether the data growth rate i of big data source S5 reaches its data growth threshold C4. This growth rate i is the new data amount of S5 divided by its total data amount, where the total data amount is the sum of the new and old data amounts, and the new data amount is the data added since S5 was last processed and analyzed. If the data growth rate of a big data source is greater than or equal to its data growth threshold, the source is selected; if the growth rate is below the threshold, the source does not need to be processed and analyzed. Calculating whether a big data source reaches its threshold takes very little time and places only a small performance burden on the processing module. Without this screening step, every big data source would be analyzed, which would consume a large share of the processing module's capacity and resources.
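The screening rule can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, and treating an empty source (total data amount of zero) as having a growth rate of zero is an assumption the patent does not address.

```python
def growth_rate(new_amount, old_amount):
    """Data growth rate i = new / (new + old), as defined in the description."""
    total = new_amount + old_amount
    if total == 0:
        # Assumption: a source with no data at all has nothing to analyze.
        return 0.0
    return new_amount / total


def needs_processing(new_amount, old_amount, threshold):
    """A source is selected when its growth rate is >= its threshold."""
    return growth_rate(new_amount, old_amount) >= threshold
```

For instance, a source holding 80 units of old data that has gained 20 units since its last analysis has a growth rate of 20 / 100 = 0.2, so it is selected under a threshold of 0.2 and skipped under a threshold of 0.25.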

Next, the big data sources that reach the data growth threshold are imported into the processing module. Every data source belonging to an analysis item related to such a source is imported as well. For example, when big data source S2 reaches its data growth threshold, big data sources S1 and S2 are imported into the processing module together; when big data source S3 reaches its data growth threshold, big data sources S3 and S4 are imported together. The imported big data sources overwrite the old big data sources and wait in a queue for processing and analysis.
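The import rule, under which a source that reaches its threshold pulls in every source sharing a related item with it, can be sketched as follows. The item layout and the function name `sources_to_import` are assumptions made for illustration.

```python
# Source groups of the four related items in the embodiment (assumed layout).
ITEM_SOURCES = [
    ("S5",),
    ("S1", "S2"),
    ("S3", "S4"),
    ("S2",),
]


def sources_to_import(item_sources, triggered):
    """Return all sources of every related item touched by a triggered source.

    `triggered` is the set of sources whose growth rate reached the threshold.
    """
    selected = set()
    for sources in item_sources:
        if any(source in triggered for source in sources):
            selected.update(sources)
    return sorted(selected)
```

With only S2 triggered, the function returns S1 and S2; with only S3 triggered, it returns S3 and S4, as in the example given in the description.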

At this point the processing module calculates the estimated processing time of each big data source: estimated processing time = (time needed to load the big data source into memory) + (time the processing module needs to process it). Once the estimated processing times of all big data sources have been calculated, sources with shorter estimated processing times are placed earlier in the schedule and sources with longer estimated processing times are placed later. Scheduling must also take memory usage into account. Processing big data sources usually involves several memories, and to make processing and analysis more efficient, a big data source is preferentially assigned to a memory with low memory usage, where memory usage = (space already occupied in the memory + memory space required by the big data source) / total space of the memory. The calculated memory usage is normally less than 1, so the big data source with the shortest estimated processing time can be assigned to that memory first for processing and analysis. Each scheduled big data source retains the data structure of its related item.
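A sketch of this scheduling step, assuming a simple greedy assignment: sources are sorted by estimated processing time, and each one goes to the memory whose usage would be lowest after adding it. The class and function names are hypothetical, and the patent does not say how ties or repeated assignments to the same memory are handled.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    load_time: float     # time to load the source into memory
    process_time: float  # time for the processing module to process it
    size: float          # memory space the source requires

    @property
    def estimated_time(self):
        return self.load_time + self.process_time


@dataclass
class Memory:
    name: str
    total: float
    used: float = 0.0

    def usage_with(self, size):
        """Memory usage = (occupied space + required space) / total space."""
        return (self.used + size) / self.total


def schedule(sources, memories):
    """Assign sources, shortest estimated time first, to the least-used memory."""
    plan = []
    for src in sorted(sources, key=lambda s: s.estimated_time):
        best = min(memories, key=lambda m: m.usage_with(src.size))
        if best.usage_with(src.size) > 1:
            # Usage above 1 for every memory: the source must be split first.
            plan.append((src.name, None))
            continue
        best.used += src.size
        plan.append((src.name, best.name))
    return plan
```

A source that fits nowhere is reported with `None` rather than forced into a memory, which leaves the splitting decision to a separate step.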

However, if the memory usage calculated for an analysis item of a big data source is greater than 1 for every memory, the memory space the source requires exceeds the space currently available in each memory. In that case the big data source can be split into 2 to the power n parts. If n is 1, the big data source is divided into a first half and a second half that are stored in memory in two batches; the processing module first processes and analyzes the first half and then the second half. If n is 2, the big data source is divided into 4 parts that are stored in memory in 4 batches, and the processing module likewise processes them batch by batch in order.
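The split into 2^n parts can be sketched as below. The patent does not state how n is chosen; picking the smallest n for which one part fits into the largest free space is an assumption, as are the function names and the `max_n` safety limit.

```python
def smallest_split_exponent(required, free_spaces, max_n=20):
    """Smallest n such that required / 2**n fits the largest free space.

    Assumption: the patent only says the source is split into 2**n parts.
    """
    largest_free = max(free_spaces)
    if largest_free <= 0:
        raise ValueError("no free memory available")
    for n in range(max_n + 1):
        if required / 2**n <= largest_free:
            return n
    raise ValueError("source does not fit even after 2**max_n splits")


def split_into_parts(records, n):
    """Split a list of records into 2**n consecutive batches."""
    parts = 2**n
    size = -(-len(records) // parts)  # ceiling division
    return [records[i * size:(i + 1) * size] for i in range(parts)]
```

With n = 1 the records fall into a first half and a second half, and with n = 2 into four batches, processed in order as the description states.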

Furthermore, when the processing and analysis of certain big data sources has finished, those sources can be removed from the memory they occupy. Alternatively, they can be kept in memory for use by other analysis items that have not yet been analyzed, which saves the time of repeatedly loading the same big data source from the processing module's hard disk into memory.
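The choice between releasing and retaining memory resembles a simple cache. The sketch below is one possible reading, not the patent's implementation: the class `MemoryCache`, its methods, and the `still_needed` flag are all hypothetical.

```python
class MemoryCache:
    """Keeps loaded sources in memory while pending items still need them."""

    def __init__(self):
        self._loaded = {}
        self.disk_loads = 0  # counts how often a source was read from disk

    def get(self, name, load_from_disk):
        """Return the source, loading it from disk only if it is not resident."""
        if name not in self._loaded:
            self._loaded[name] = load_from_disk(name)
            self.disk_loads += 1
        return self._loaded[name]

    def release(self, name, still_needed):
        """Drop an analyzed source unless an unanalyzed item still uses it."""
        if not still_needed:
            self._loaded.pop(name, None)
```

In the embodiment, S2 feeds both the item that produces T1 together with S1 and the item that produces T1 and T2 on its own, so retaining S2 after the first analysis would avoid a second load from disk.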

As the above description shows, the present invention has the following advantages over the prior art:

1. The method uses the data growth rate of each big data source to determine whether the source has new data, and excludes big data sources that do not need to be analyzed. This reduces the load on the processing module and makes the processing and analysis of big data sources more efficient.

2. The method schedules big data sources in ascending order of estimated processing time and places them into memories with low memory usage. This speeds up the analysis of big data sources so that the results reflect the situation at that time.

3. The method uses the data structure of "related" items to link big data source processing with queued scheduling, which integrates the whole big data processing system. This avoids the extra performance demand on the processing module that would arise from using different data structures at different stages, and avoids an additional space burden on memory.

In summary, the method for timely processing and scheduling of big data of the present invention achieves the intended effect through the embodiments disclosed above. The invention was not disclosed before the filing of this application and fully meets the provisions and requirements of the Patent Act. An application for an invention patent is therefore filed in accordance with the law, and examination and grant of the patent are respectfully requested.

The drawings and description disclosed above are only preferred embodiments of the present invention and do not limit its scope of protection. Any equivalent change or modification made by a person skilled in the art in accordance with the features of the present invention shall be regarded as falling within the design scope of the invention.


Figure 1: Flowchart of the preferred embodiment of the present invention.

Figure 2: Schematic diagram of the data structure of the preferred embodiment of the present invention.

Claims (8)

1. A method for timely processing and scheduling of big data, comprising: a processing module calculating whether the data growth rate of each of a plurality of big data sources reaches a data growth threshold; importing the plurality of big data sources that reach the data growth threshold into the processing module; scheduling the plurality of big data sources that meet the data growth threshold in ascending order of estimated processing time; the processing module placing the plurality of big data sources into a plurality of memories with low memory usage and analyzing them; and releasing the plurality of memories occupied by the analyzed big data sources, or retaining the plurality of memories occupied by the analyzed big data sources for use by other big data sources that have not yet been analyzed.

2. The method for timely processing and scheduling of big data of claim 1, wherein each of the plurality of big data sources has at least one corresponding analysis result.

3. The method for timely processing and scheduling of big data of claim 2, wherein the relationship between the plurality of big data sources and the at least one analysis result takes one of three data structures: one-to-one, one-to-many, or many-to-one.

4. The method for timely processing and scheduling of big data of claim 1, wherein the data growth rate is a new data amount of each of the plurality of big data sources divided by a total data amount, the total data amount being the sum of the new data amount and an old data amount.

5. The method for timely processing and scheduling of big data of claim 1, wherein the plurality of big data sources that meet the data growth threshold overwrite the old big data sources previously held in the processing module.

6. The method for timely processing and scheduling of big data of claim 1, wherein the estimated processing time of the plurality of big data sources is the time needed to load one of the plurality of big data sources into the plurality of memories, plus the time the processing module needs to process that big data source, plus [...].

7. The method for timely processing and scheduling of big data of claim 1, wherein if the memory space required by one of the plurality of big data sources exceeds the available space of each of the plurality of memories, that big data source is split so that the processing module analyzes it in batches.

8. The method for timely processing and scheduling of big data of claim 7, wherein the big data source that is split is divided into 2 to the power n parts.
TW107127974A 2018-08-10 2018-08-10 Method of timely processing and scheduling big data TWI676109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107127974A TWI676109B (en) 2018-08-10 2018-08-10 Method of timely processing and scheduling big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW107127974A TWI676109B (en) 2018-08-10 2018-08-10 Method of timely processing and scheduling big data

Publications (2)

Publication Number Publication Date
TWI676109B (en) 2019-11-01
TW202009733A (en) 2020-03-01

Family

ID=69189024

Family Applications (1)

Application Number Title Priority Date Filing Date
TW107127974A TWI676109B (en) 2018-08-10 2018-08-10 Method of timely processing and scheduling big data

Country Status (1)

Country Link
TW (1) TWI676109B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867219B (en) * 2012-09-27 2016-04-06 乐华建科技(北京)有限公司 A kind of business automatic arrangement program system and method
TWI503742B (en) * 2014-04-21 2015-10-11 Nat Univ Tsing Hua Multiprocessors systems and processes scheduling methods thereof
TWI534704B (en) * 2014-11-21 2016-05-21 財團法人資訊工業策進會 Processing method for time series and system thereof

Also Published As

Publication number Publication date
TWI676109B (en) 2019-11-01

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees