TWI626548B - Data collection and storage system and method thereof - Google Patents

Data collection and storage system and method thereof

Info

Publication number
TWI626548B
Authority
TW
Taiwan
Prior art keywords
data
module
message
subsystem
area
Prior art date
Application number
TW106110943A
Other languages
Chinese (zh)
Other versions
TW201837742A (en)
Inventor
粘仲仁
Original Assignee
東森信息科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 東森信息科技股份有限公司
Priority to TW106110943A
Application granted
Publication of TWI626548B
Publication of TW201837742A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data collection and storage system for collecting and storing a plurality of message data from at least one user device. The system mainly comprises a data collection subsystem, which includes at least one data broker module, at least one event processing module, and at least one event loading module. The data broker module includes a first data area and a second data area, and the message data are stored in the first data area. The event processing module reads the message data from the first data area of the data broker module, obtains at least one data segment according to the message data and a preset condition, and writes the data segment to the second data area of the data broker module. The event loading module periodically reads the data segment from the second data area of the data broker module.

Description

Data collection and storage system and method thereof

The present invention relates to a data collection and storage system and a method thereof, and more particularly to a data collection and storage system and method applied to big data.

In recent years, as big data applications have become increasingly widespread, technologies and computing platforms related to big data processing, such as Hadoop, have flourished. Hadoop is an open source project under the Apache Software Foundation that provides a distributed computing environment and storage space for large amounts of data.

Referring to FIG. 1, a conventional personalized precision recommendation platform 1 is, for example, a precision recommendation and consumer behavior analysis platform (Etu Recommender) built on the Hadoop platform. The platform 1 receives message data from a plurality of user devices 2 and provides at least one recommendation list based on the message data. The platform 1 includes a load balancer 11, a data collection system 12, a Hadoop Distributed File System (HDFS) 13, a data conversion processing unit 14, and a recommendation algorithm processing unit 15. The data collection system 12 includes a plurality of server devices 121, each of which includes a web server 122 and a local file system 123; the web server 122 is implemented using NGINX.

The message data from the user devices 2 are transmitted by the GET method via the load balancer 11 to the server devices 121 of the data collection system 12, and include a plurality of key/value pairs. The web server 122 of each server device 121 receives the message data and writes them to the local file system 123. The data collection system 12 periodically (for example, hourly) writes the collected message data to the HDFS 13. The data conversion processing unit 14 performs Hadoop extract-transform-load (ETL) processing on the content of the HDFS 13 to convert the data into a format that can be imported into a database. The recommendation algorithm processing unit 15 performs computational analysis (for example, data correlation analysis and data similarity analysis) on the content of the HDFS 13 to produce the recommendation list, which includes product recommendation information, content recommendation information, on-site advertisement recommendation information, and so on.

The data collection system 12 of the conventional personalized precision recommendation platform 1 merely collects the message data and then periodically writes the collected message data in batches to the HDFS 13. It does not convert or process the message data in any way, so subsequent use of the message data is limited and inconvenient.

Therefore, an object of the present invention is to provide a data collection and storage system suitable for collecting and storing a plurality of message data from at least one user device. The data collection and storage system mainly includes a data collection subsystem, which includes at least one data broker module, at least one event processing module, and at least one event loading module. The data broker module includes a first data area, a second data area, and a third data area; the message data are stored in the first data area, and each piece of message data corresponds to an event occurrence time. The event processing module is electrically connected to the data broker module; it reads the message data from the first data area of the data broker module, obtains at least one data segment according to the message data and a preset condition, and writes the data segment to the second data area of the data broker module. The event processing module includes at least one preprocessing unit, which performs data preprocessing on the message data to obtain a corresponding plurality of usable data, and the event processing module also writes the usable data to the third data area of the data broker module. The event loading module is electrically connected to the data broker module and periodically reads the data segment from the second data area of the data broker module.

Another object of the present invention is to provide a data collection and storage method suitable for collecting and storing a plurality of message data from at least one user device. The method is executed by a data collection and storage system that includes a data collection subsystem and a distributed file subsystem. The data collection subsystem includes a data broker module, which includes a first data area, a second data area, and a third data area. The method comprises the following steps: (a) receiving the message data and storing the message data in the first data area of the data broker module; (b) reading the message data from the first data area of the data broker module; (c) obtaining at least one data segment according to the message data and a preset condition, wherein step (c) includes the sub-steps of (c-1) performing data preprocessing on the message data to obtain a corresponding plurality of usable data, and (c-2) sorting the usable data corresponding to the message data by event occurrence time and aggregating them into the data segment in units of a time interval; (d) writing the data segment to the second data area of the data broker module; and (e) periodically reading the data segment from the second data area of the data broker module and writing the data segment to the distributed file subsystem.

The effect of the present invention is that the event processing module obtains the data segment according to the message data and the preset condition, and the event loading module then periodically writes the data segment to the distributed file subsystem, which broadens the range of subsequent applications of the message data and makes them more convenient to use.

3‧‧‧Data collection and storage system

4‧‧‧User device

5‧‧‧Data collection subsystem

51‧‧‧Format check module

52‧‧‧Data broker module

521‧‧‧First data area

522‧‧‧Second data area

523‧‧‧Third data area

524‧‧‧Data segment

53‧‧‧Event processing module

531‧‧‧Preprocessing unit

54‧‧‧Event loading module

55‧‧‧Control unit

6‧‧‧Distributed file subsystem

7‧‧‧Data analysis application subsystem

71‧‧‧Application unit

S1~S9‧‧‧Steps

Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram illustrating a conventional personalized precision recommendation platform; FIG. 2 is a block diagram illustrating a first preferred embodiment of the data collection and storage system of the present invention; FIG. 3 is a flow chart illustrating a data collection and storage method corresponding to the first preferred embodiment; FIG. 4 is a block diagram illustrating a second preferred embodiment of the data collection and storage system of the present invention; and FIG. 5 is a block diagram illustrating a third preferred embodiment of the data collection and storage system of the present invention.

The foregoing and other technical contents, features, and effects of the present invention will be clearly presented in the following detailed description of three preferred embodiments with reference to the drawings.

Before the present invention is described in detail, it should be noted that in the following description, similar or equivalent elements are denoted by the same reference numerals.

Referring to FIG. 2, a first preferred embodiment of the data collection and storage system 3 of the present invention is suitable for collecting and storing a plurality of message data from at least one user device 4, and is electrically connected to the user device 4. The data collection and storage system 3 includes a data collection subsystem 5, a distributed file subsystem 6 electrically connected to the data collection subsystem 5, and a data analysis application subsystem 7 electrically connected to the data collection subsystem 5 and the distributed file subsystem 6.

In the first preferred embodiment, the distributed file subsystem 6 is a Hadoop distributed file subsystem. The data analysis application subsystem 7 includes at least one application unit 71. The data analysis application subsystem 7 may be a server with various specific applications or functions, and the application unit 71 may be implemented in software, as an application installed in the data analysis application subsystem 7. The application unit 71 may, however, also be implemented in firmware or hardware, and is not limited to what is disclosed in the first preferred embodiment.

The data collection subsystem 5 includes a format check module 51, a data broker module 52 electrically connected to the format check module 51, an event processing module 53 electrically connected to the data broker module 52, and an event loading module 54 electrically connected to the data broker module 52. The data broker module 52 includes a first data area 521, a second data area 522, and a third data area 523.

The format check module 51 of the data collection subsystem 5 receives the message data from the user device 4 and performs a format check on them. Message data in the correct format are passed to the data broker module 52 of the data collection subsystem 5 and stored in the first data area 521. In the first preferred embodiment, the message data are transmitted from the user device 4 to the data collection subsystem 5 by the POST method; more specifically, the message data are in JSON format and are carried in the message body.
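The format check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not define a JSON schema, so the required field names (`event_time`, `ip`) and the function name `check_format` are assumptions.

```python
import json

# Hypothetical required fields; the patent does not define the JSON schema.
REQUIRED_FIELDS = ("event_time", "ip")


def check_format(body: str):
    """Return the parsed message if the body is a JSON object that contains
    every required field, otherwise return None."""
    try:
        message = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(message, dict):
        return None
    if any(field not in message for field in REQUIRED_FIELDS):
        return None
    return message


first_data_area = []  # stands in for the first data area 521
for raw in ('{"event_time": 1490950800, "ip": "203.0.113.7"}', "not json"):
    parsed = check_format(raw)
    if parsed is not None:
        first_data_area.append(parsed)
```

Only the well-formed body reaches the stand-in for the first data area; the malformed one is dropped before any later stage sees it.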

The event processing module 53 reads the message data from the first data area 521 of the data broker module 52. Each piece of message data corresponds to an event occurrence time. The event processing module 53 obtains at least one data segment 524 according to the message data and a preset condition, and writes the data segment 524 to the second data area 522 of the data broker module 52.

In the first preferred embodiment, the event processing module 53 includes at least one preprocessing unit 531, which performs data preprocessing on the message data to obtain a corresponding plurality of usable data. For example, the preprocessing unit 531 can derive usable data such as the country and source of an IP from the IP address in the message data. The preprocessing unit 531 can be developed according to the needs of the back-end data analysis application subsystem 7, or extended by connecting through a physical interface. Furthermore, in the first preferred embodiment the preset condition relates to the event occurrence times. In other words, the event processing module 53 sorts the usable data corresponding to the message data by event occurrence time, aggregates them into data segments 524 in units of a time interval (for example, one hour), and writes the data segments to the second data area 522 of the data broker module 52. In addition, the event processing module 53 can write the usable data that have not been sorted by event occurrence time directly to the third data area 523 of the data broker module 52. The event loading module 54 periodically (for example, hourly) reads the data segment 524 from the second data area 522 of the data broker module 52 and writes the data segment 524 to the distributed file subsystem 6.
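The preprocessing and segmentation performed by the event processing module 53 might look like the sketch below. The lookup table `COUNTRY_BY_PREFIX`, the field names, and the function names are all hypothetical; a real system would presumably use a proper IP geolocation source, which the patent does not name.

```python
from collections import defaultdict

SEGMENT_SECONDS = 3600  # one-hour time interval, as in the example above

# Hypothetical lookup table standing in for a real IP-to-country service.
COUNTRY_BY_PREFIX = {"203.0.113.": "TW", "198.51.100.": "US"}


def preprocess(message: dict) -> dict:
    """Derive usable data from a message (here: add a country field)."""
    prefix = message["ip"].rsplit(".", 1)[0] + "."
    return {**message, "country": COUNTRY_BY_PREFIX.get(prefix, "unknown")}


def build_segments(usable: list) -> dict:
    """Sort usable data by event time and group them by hourly interval."""
    segments = defaultdict(list)
    for record in sorted(usable, key=lambda r: r["event_time"]):
        start = record["event_time"] // SEGMENT_SECONDS * SEGMENT_SECONDS
        segments[start].append(record)
    return dict(segments)


messages = [
    {"event_time": 7300, "ip": "198.51.100.4"},
    {"event_time": 100, "ip": "203.0.113.7"},
    {"event_time": 3599, "ip": "192.0.2.1"},
]
usable_data = [preprocess(m) for m in messages]  # unsorted: third data area 523
data_segments = build_segments(usable_data)      # segmented: second data area 522
```

The unsorted list corresponds to what is written to the third data area, while the dictionary keyed by interval start corresponds to the data segments written to the second data area.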

The data analysis application subsystem 7 performs post-processing of data according to the data segment 524 in the distributed file subsystem 6 or the usable data in the third data area 523 of the data broker module 52. The post-processing includes statistics, analysis, summarization, and similar processing performed by the application unit 71. In the first preferred embodiment, when the application unit 71 of the data analysis application subsystem 7 needs to post-process data based on usable data that are updated in real time, the data analysis application subsystem 7 can read directly from the third data area 523 of the data broker module 52. When the application unit 71 needs to perform statistics or analysis based on the data segment 524 aggregated for each time interval, the data analysis application subsystem 7 can read the required data segment 524 from the distributed file subsystem 6. Furthermore, in the first preferred embodiment the distributed file subsystem 6 is a Hadoop distributed file subsystem, which does not support random access and supports only batch access. In addition, after each piece of message data is generated (at its event occurrence time), a temporary loss of connection may occur. If the usable data were not aggregated into the data segments 524, there would be no guarantee that the usable data in the data broker module 52 are stored in order of their event occurrence times, and when the application unit 71 needs the usable data within a certain time interval, it could not know how much offset of data to read consecutively from the distributed file subsystem 6. Aggregating the usable data into the data segments 524 therefore makes reading data more convenient for the application unit 71.
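The offset problem described above can be illustrated with a small sketch. The segment naming scheme (`segment-YYYYMMDD-HH`, in UTC) is an assumption made for this example only; the patent does not say how segments are named or addressed.

```python
from datetime import datetime, timezone


def segment_name(event_time: int) -> str:
    """Name of the hourly segment that holds the given event time (UTC)."""
    hour = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return hour.strftime("segment-%Y%m%d-%H")


# Arrival order differs from event order because one message was delayed.
arrival_log = [
    {"event_time": 1490950800},  # 2017-03-31 09:00:00 UTC
    {"event_time": 1490954400},  # 2017-03-31 10:00:00 UTC
    {"event_time": 1490952000},  # 2017-03-31 09:20:00 UTC, arrived late
]

# Without segments, the records for the 09:00 hour are not contiguous by offset.
offsets_for_nine = [
    offset
    for offset, record in enumerate(arrival_log)
    if segment_name(record["event_time"]) == "segment-20170331-09"
]
```

Here the 09:00 records sit at offsets 0 and 2, so a reader cannot fetch them with one consecutive read; grouping records into a named segment per interval removes that guesswork.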

In addition, in the first preferred embodiment, all data transmitted over the network between the components are compressed to reduce the overall network traffic load.
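As a rough illustration of compressing data before sending it between components, the sketch below uses gzip over a JSON payload. The choice of gzip and the record layout are assumptions; the patent does not name a compression algorithm.

```python
import gzip
import json

# Hypothetical batch of usable data to send from one component to another.
records = [
    {"event_time": 100 + i, "ip": "203.0.113.7", "country": "TW"}
    for i in range(200)
]

payload = json.dumps(records).encode("utf-8")   # bytes before compression
compressed = gzip.compress(payload)             # bytes sent over the network
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
```

Repetitive event records like these tend to compress well, which is consistent with the stated goal of lowering network traffic.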

Referring to FIG. 2 and FIG. 3, a data collection method corresponding to the first preferred embodiment includes the following steps:
Step S1: receive the message data from the user device 4 and perform a format check on the message data.
Step S2: store the message data in the correct format in the first data area 521 of the data broker module 52.
Step S3: read the message data from the first data area 521 of the data broker module 52.
Step S4: obtain the corresponding usable data from the message data read in step S3.
Step S5: write the usable data to the third data area 523 of the data broker module 52.
Step S6: aggregate the usable data into the data segment 524 according to the preset condition.
Step S7: write the data segment 524 to the second data area 522 of the data broker module 52.
Step S8: periodically read the data segment 524 from the second data area 522 of the data broker module 52 and write the data segment 524 to the distributed file subsystem 6.
Step S9: perform post-processing of data according to the data segment 524 in the distributed file subsystem 6 or the usable data in the third data area 523 of the data broker module 52.
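Steps S1 to S9 can be traced end to end in a small in-memory sketch. Everything here is a stand-in: the `DataBroker` class, the `run_pipeline` function, the `checked` enrichment field, and the dictionary used in place of the distributed file subsystem are hypothetical names chosen for illustration.

```python
import json
from collections import defaultdict


class DataBroker:
    """In-memory stand-in for the data broker module 52."""

    def __init__(self):
        self.first_area = []   # 521: validated raw messages
        self.second_area = {}  # 522: data segments keyed by interval start
        self.third_area = []   # 523: usable data, not sorted


def run_pipeline(raw_bodies, broker, distributed_store, interval=3600):
    # S1-S2: format check, then store valid messages in the first data area.
    for body in raw_bodies:
        try:
            message = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(message, dict) and "event_time" in message:
            broker.first_area.append(message)
    # S3-S4: read the messages and derive usable data (hypothetical enrichment).
    usable = [{**m, "checked": True} for m in broker.first_area]
    # S5: write the usable data to the third data area.
    broker.third_area.extend(usable)
    # S6-S7: sort by event time, group per interval, write to the second area.
    segments = defaultdict(list)
    for record in sorted(usable, key=lambda r: r["event_time"]):
        segments[record["event_time"] // interval * interval].append(record)
    broker.second_area.update(segments)
    # S8: load the segments into the distributed file subsystem (a dict here).
    distributed_store.update(broker.second_area)


broker, store = DataBroker(), {}
run_pipeline(['{"event_time": 4000}', "bad", '{"event_time": 10}'], broker, store)
# S9: post-processing, for example counting events per segment.
counts = {start: len(records) for start, records in store.items()}
```

In a deployment, step S8 would run on a timer rather than inline, but the data flow between the three areas is the same.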

Referring to FIG. 4, a second preferred embodiment of the data collection and storage system 3 of the present invention is electrically connected to a load balancer 8, and the load balancer 8 is electrically connected to at least one user device 4. The data collection and storage system 3 includes a data collection subsystem 5, a distributed file subsystem 6, and a data analysis application subsystem 7. The data collection subsystem 5 includes a plurality of format check modules 51, a plurality of data broker modules 52 electrically connected to the format check modules 51, a plurality of event processing modules 53 electrically connected to the data broker modules 52, and a plurality of event loading modules 54 electrically connected to the data broker modules 52.

Because the format check modules 51, the data broker modules 52, the event processing modules 53, and the event loading modules 54 operate in a manner similar to the first preferred embodiment, the implementation details of these elements are not repeated below, and only the differences in configuration are described.

The load balancer 8 receives the plurality of message data from the user device 4 and distributes the message data to the format check modules 51 for format checking. Preferably, each data broker module 52 is integrated with one event loading module 54 in the same device to further reduce the overall network traffic load. In the second preferred embodiment, the data broker modules 52 and the event loading modules 54 are implemented with Kafka queues, and each data broker module 52 is a Kafka broker.
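How the load balancer 8 might spread message data across several format check modules 51 is sketched below. Round-robin distribution is an assumption made for illustration; the patent does not state which balancing policy is used, and the function name `distribute` is hypothetical.

```python
from itertools import cycle


def distribute(messages, module_count):
    """Assign messages to format check modules in round-robin order."""
    queues = [[] for _ in range(module_count)]
    targets = cycle(range(module_count))
    for message in messages:
        queues[next(targets)].append(message)
    return queues


# Five messages spread over two format check modules.
queues = distribute(["m0", "m1", "m2", "m3", "m4"], 2)
```

Each inner list represents the work handed to one format check module, which would then pass valid messages on to its data broker module.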

Referring to FIG. 5, a third preferred embodiment of the data collection and storage system 3 of the present invention is electrically connected to at least one user device 4 and receives a plurality of message data from the user device 4. The data collection and storage system 3 includes a data collection subsystem 5, a distributed file subsystem 6, and a data analysis application subsystem 7. The data collection subsystem 5 includes a format check module 51, a data broker module 52 electrically connected to the format check module 51, a control unit 55 electrically connected to the data broker module 52, an event processing module 53 electrically connected to the control unit 55 and the data broker module 52, and an event loading module 54 electrically connected to the data broker module 52. The data broker module 52 includes a first data area 521, a second data area 522, and a third data area 523. Aspects similar to the first and second preferred embodiments are not repeated, and only the differences are described below.

In the third preferred embodiment, the control unit 55 allows conditions to be set and provides the set conditions to the event processing module 53 for the related processing. For example, a filter condition can be set in the control unit 55 so that the event processing module 53 filters the message data from the first data area 521 of the data broker module 52. A preset condition can also be set in the control unit 55 so that the event processing module 53 obtains at least one data segment 524 according to the message data and the preset condition and writes the data segment 524 to the second data area 522 of the data broker module 52. Broadly speaking, by processing according to the conditions set in the control unit 55, the event processing module 53 can obtain data segments 524 grouped in different ways, so that the message data can be applied more widely and conveniently.
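The role of the control unit 55 can be sketched as an object that holds a filter condition and a grouping rule, both supplied to the event processing step. The class and function names, and the `action` and `country` fields, are hypothetical; the patent leaves the form of the conditions open.

```python
from collections import defaultdict


class ControlUnit:
    """Holds the conditions that the event processing step applies."""

    def __init__(self, filter_condition, group_key):
        self.filter_condition = filter_condition  # keeps or drops a message
        self.group_key = group_key                # decides the segment of a message


def process(messages, control):
    """Filter messages, then group the survivors into data segments."""
    segments = defaultdict(list)
    for message in messages:
        if control.filter_condition(message):
            segments[control.group_key(message)].append(message)
    return dict(segments)


control = ControlUnit(
    filter_condition=lambda m: m.get("action") == "purchase",
    group_key=lambda m: m["country"],
)
messages = [
    {"action": "purchase", "country": "TW"},
    {"action": "view", "country": "TW"},
    {"action": "purchase", "country": "US"},
    {"action": "purchase", "country": "TW"},
]
segments = process(messages, control)
```

Swapping in a different `group_key`, such as one based on event time, yields segments grouped in a different way without changing the processing code.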

In summary, referring to FIG. 2, FIG. 4, and FIG. 5, the data collection and storage system 3 of the present invention optimizes the collection and storage of big data, with the following effects. First, because the message data are transmitted by the POST method, there is no need to worry about limits on data size, the content of the message data can be richer and more varied, and data transmitted by the POST method are also more secure. Second, the event processing module(s) 53 preprocess the message data to obtain the usable data and aggregate the usable data into the data segment 524 according to the preset condition, after which the event loading module(s) 54 periodically write the data segment to the distributed file subsystem 6. The data analysis application subsystem 7 can then, depending on the needs of its application unit 71, read the usable data directly from the third data area 523 of the data broker module 52 or read the data segment 524 in the distributed file subsystem 6, which greatly broadens the range of applications of the message data and makes them more convenient to use. Third, all data transmitted over the network between the components are compressed, which reduces the overall network traffic load. The object of the present invention is therefore achieved.

The above, however, are merely embodiments of the present invention and do not limit the scope in which the invention may be practiced. All simple equivalent changes and modifications made according to the claims and the specification of the present invention remain within the scope covered by this patent.

Claims (15)

一種資料收集與儲存系統,適用於收集並儲存來自於至少一使用者裝置之複數訊息資料,該資料收集與儲存系統包含:一資料收集子系統,包括:至少一資料中介模組,包括一第一資料區、一第二資料區,及一第三資料區,該等訊息資料係儲存於該第一資料區,其中,每一筆訊息資料對應一事件發生時間;至少一事件處理模組,電性連接於該資料中介模組,該事件處理模組自該資料中介模組之該第一資料區讀取該等訊息資料,再根據該等訊息資料及一預設條件得到至少一資料段,並將該資料段寫入該資料中介模組之該第二資料區,其中該事件處理模組包括至少一預處理單元,該預處理單元用以根據該等訊息資料進行資料預處理以得到對應之複數可用資料,該事件處理模組還將該等可用資料寫入該資料中介模組之該第三資料區;及至少一事件載入模組,電性連接於該資料中介模組,該事件載入模組定時地自該資料中介模組之第二資料區讀取該資料段。 A data collection and storage system for collecting and storing plural message data from at least one user device, the data collection and storage system comprising: a data collection subsystem, comprising: at least one data mediation module, including a first a data area, a second data area, and a third data area, wherein the message data is stored in the first data area, wherein each message data corresponds to an event occurrence time; at least one event processing module, electricity The event processing module reads the message data from the first data area of the data mediation module, and obtains at least one data segment according to the message data and a predetermined condition. And writing the data segment to the second data area of the data mediation module, where the event processing module includes at least one preprocessing unit, and the preprocessing unit is configured to perform data preprocessing according to the message data to obtain a corresponding The plurality of available data, the event processing module also writing the available data to the third data area of the data broker module; and at least one event loading module Electrically connected to the intermediary data module, the module load event from time to time to the second data area of the information intermediary module reads the data segment. 
如請求項1所述之資料收集與儲存系統,其中該資料收集子系統更包括電性連接於該資料中介模組之至少一格式檢查模組,該格式檢查模組用以對該等訊息資料進行格式檢查,並將格式正確的該等訊息資料儲存於該資料中介模組之該第一資料區。 The data collection and storage system of claim 1 , wherein the data collection subsystem further comprises at least one format check module electrically connected to the data broker module, wherein the format check module is configured to use the information module Performing a format check and storing the correctly formatted message data in the first data area of the data mediation module. 如請求項1所述之資料收集與儲存系統,其中該資料收集子系統更包括電性連接於該資料中介模組及該事件處理模組之一控制單元,用以設定該預設條件。 The data collection and storage system of claim 1, wherein the data collection subsystem further comprises a control unit electrically connected to the data broker module and the event processing module for setting the preset condition. 如請求項1所述之資料收集與儲存系統,更包括一分散式檔案子系統,該分散式檔案子系統係電性連接該資料收集子系統之事件載入模組,用以接收並儲存自該事件載入模組所傳送之資料段。 The data collection and storage system of claim 1, further comprising a distributed file subsystem, wherein the distributed file subsystem is electrically connected to the event loading module of the data collection subsystem for receiving and storing This event loads the data segment transmitted by the module. 如請求項4所述之資料收集與儲存系統,更包括電性連接於該資料收集子系統及該分散式檔案子系統之一資料分析應用子系統,該資料分析應用子系統用以根據該分散式檔案子系統之該資料段,或該資料中介模組之該第三資料區中之可用資料進行資料之後處理。 The data collection and storage system of claim 4, further comprising a data analysis application subsystem electrically connected to the data collection subsystem and the distributed file subsystem, wherein the data analysis application subsystem is configured to be based on the dispersion The data segment of the file subsystem or the available data in the third data area of the data mediation module is processed and processed. 如請求項5所述之資料收集與儲存系統,其中該資料分析應用子系統包括至少一應用單元。 The data collection and storage system of claim 5, wherein the data analysis application subsystem comprises at least one application unit. 
7. The data collection and storage system of claim 1, wherein the data broker module and the event loading module are integrated in the same device.
8. The data collection and storage system of claim 1, wherein each data broker module is a Kafka server.
9. A data collection and storage method, adapted to collect and store a plurality of message data from at least one user device, the data collection and storage method being performed by a data collection and storage system, the data collection and storage system comprising a distributed file subsystem and a data collection subsystem, the data collection subsystem including a data broker module, the data broker module including a first data area, a second data area, and a third data area, the data collection and storage method comprising the following steps: (a) receiving the message data and storing the message data in the first data area of the data broker module; (b) reading the message data from the first data area of the data broker module; (c) obtaining at least one data segment according to the message data and a preset condition, wherein step (c) includes the following sub-steps: (c-1) performing data preprocessing on the message data to obtain a corresponding plurality of usable data; and (c-2) sorting the usable data corresponding to the message data according to the event occurrence times, and aggregating the sorted usable data into the data segment in units of a time interval; (d) writing the data segment into the second data area of the data broker module; and (e) periodically reading the data segment from the second data area of the data broker module and writing the data segment into the distributed file subsystem.
10. The data collection and storage method of claim 9, further comprising the following steps before step (a): (f) performing a format check on the message data; and (g) storing the correctly formatted message data in the first data area of the data broker module.
11. The data collection and storage method of claim 9, wherein the preset condition of step (c) is related to a plurality of event occurrence times corresponding to the message data.
12. The data collection and storage method of claim 9, further comprising the following step after sub-step (c-1): (h) writing the usable data into the third data area of the data broker module.
13. The data collection and storage method of claim 12, further comprising the following step after step (e): (i) performing post-processing of data according to the data segment in the distributed file subsystem or the usable data in the third data area of the data broker module.
14. The data collection and storage method of claim 9, wherein the message data of step (a) is transmitted from the user device to the data processing subsystem by the POST method.
15. The data collection and storage method of claim 9, wherein the message data of step (a) is in JSON format and is transmitted to the data collection subsystem in a message body (message-body).
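The method steps recited above (format check, storage in the first data area, preprocessing, sorting by event occurrence time, aggregation into time-interval segments, and periodic loading into the distributed file subsystem) can be illustrated with a minimal Python sketch. It is not the patented implementation: the three data areas and the distributed file subsystem are modeled as plain in-memory containers rather than Kafka topics or a real file system, and the names `receive`, `preprocess`, `process_events`, `load_segments`, the `event_time`/`user`/`action` fields, and the one-hour `INTERVAL_SECONDS` value are all assumptions made for illustration.

```python
import json
from collections import defaultdict

# Hypothetical in-memory stand-ins for the three data areas of the data
# broker module and for the distributed file subsystem (assumptions).
first_area, second_area, third_area = [], [], []
distributed_store = {}

# Assumed "preset condition": one data segment per hour of event time.
INTERVAL_SECONDS = 3600


def receive(raw_body: str) -> bool:
    """Format-check a JSON message body and store it in the first data area."""
    try:
        msg = json.loads(raw_body)
    except json.JSONDecodeError:
        return False
    if not isinstance(msg, dict):
        return False
    t = msg.get("event_time")
    # Require a numeric event occurrence time (bool is rejected explicitly).
    if isinstance(t, bool) or not isinstance(t, (int, float)):
        return False
    first_area.append(msg)
    return True


def preprocess(msg: dict) -> dict:
    """Assumed preprocessing: keep a fixed set of fields with defaults."""
    return {
        "event_time": msg["event_time"],
        "user": msg.get("user", "unknown"),
        "action": msg.get("action", ""),
    }


def process_events() -> None:
    """Read the first area, fill the third area, and build time-based segments."""
    messages = list(first_area)
    first_area.clear()
    usable = [preprocess(m) for m in messages]
    third_area.extend(usable)                    # preprocessed data -> third area
    usable.sort(key=lambda u: u["event_time"])   # order by event occurrence time
    buckets = defaultdict(list)
    for u in usable:
        start = int(u["event_time"] // INTERVAL_SECONDS) * INTERVAL_SECONDS
        buckets[start].append(u)
    for start in sorted(buckets):
        second_area.append({"start": start, "records": buckets[start]})


def load_segments() -> None:
    """Intended to run on a timer: move segments into the distributed store."""
    while second_area:
        seg = second_area.pop(0)
        key = f"segment-{seg['start']}"
        distributed_store.setdefault(key, []).extend(seg["records"])
```

In this sketch a caller would invoke `receive` once per incoming message body, then `process_events`, and finally `load_segments` on whatever schedule the deployment chooses. Keeping the preprocessed ("usable") records in a separate area from the aggregated segments mirrors the claims, where downstream analysis may read either source.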
TW106110943A 2017-03-31 2017-03-31 Data collection and storage system and method thereof TWI626548B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW106110943A TWI626548B (en) 2017-03-31 2017-03-31 Data collection and storage system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW106110943A TWI626548B (en) 2017-03-31 2017-03-31 Data collection and storage system and method thereof

Publications (2)

Publication Number Publication Date
TWI626548B (en) 2018-06-11
TW201837742A (en) 2018-10-16

Family

ID=63255770

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106110943A TWI626548B (en) 2017-03-31 2017-03-31 Data collection and storage system and method thereof

Country Status (1)

Country Link
TW (1) TWI626548B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI754293B (en) * 2020-06-03 2022-02-01 鴻海精密工業股份有限公司 Document processing method, computer device and readable storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI800743B (en) * 2020-07-17 2023-05-01 開曼群島商粉迷科技股份有限公司 Recommendation method for personalized content, graphical user interface and system thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6675133B2 (en) * 2001-03-05 2004-01-06 Ncs Pearsons, Inc. Pre-data-collection applications test processing system
TWI333380B (en) * 2002-07-01 2010-11-11 Microsoft Corp A system and method for providing user control over repeating objects embedded in a stream
US9047717B2 (en) * 2006-09-25 2015-06-02 Appareo Systems, Llc Fleet operations quality management system and automatic multi-generational data caching and recovery
TW201640352A (en) * 2015-05-14 2016-11-16 Alibaba Group Services Ltd Stream computing system and method
TW201643707A (en) * 2014-12-31 2016-12-16 英特爾股份有限公司 Methods, apparatus, instructions and logic to provide vector packed tuple cross-comparison functionality

Also Published As

Publication number Publication date
TW201837742A (en) 2018-10-16

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees