CN103699660A - Large-scale network streaming data cache-write method - Google Patents

Large-scale network streaming data cache-write method

Info

Publication number
CN103699660A
CN103699660A (application CN201310741116.6A)
Authority
CN
China
Prior art keywords
data
distributed
loaded
file system
different
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310741116.6A
Other languages
Chinese (zh)
Other versions
CN103699660B (en)
Inventor
汪东升
王丽婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CERTUSNET CORP
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201310741116.6A priority Critical patent/CN103699660B/en
Publication of CN103699660A publication Critical patent/CN103699660A/en
Application granted granted Critical
Publication of CN103699660B publication Critical patent/CN103699660B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 Querying
    • G06F 16/245 Query processing
    • G06F 16/2453 Query optimisation
    • G06F 16/24534 Query rewriting; Transformation
    • G06F 16/24539 Query rewriting; Transformation using cached or materialised query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 Integrating or interfacing systems involving database management systems
    • G06F 16/256 Integrating or interfacing systems involving database management systems in federated or virtual databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a large-scale network streaming data cache-write method in which a client packages the collected data and sends it to a server; the server parses the received data to determine its source and unifies the data format; the uniformly formatted data is written into different in-memory cache regions according to its source; different cache policies are defined for the data written into the different cache regions, and data meeting a policy's trigger condition is written from memory to a local file system; data written to the local file system is loaded into a distributed file system or distributed database according to upper-layer application demand; and for data loaded into the distributed file system or distributed database, small data blocks are periodically merged into large data blocks. By means of this multi-level cache mechanism, the method can handle writes of large-scale network data arriving from different sources at different inflow rates.

Description

A method for cache-based writing of large-scale network streaming data
Technical field
The present invention relates to the field of network technology, and in particular to a method for cache-based writing of large-scale network streaming data.
Background technology
In recent years, with the rapid development of the Internet, the rapid growth of data has become both an opportunity and a challenge for many industries. In the current network environment, massive data sources are continuous and real-time, and users likewise expect real-time response. Such data is collected, processed, and queried in streaming form. A network anomaly detection system, for example, collects network packets, network logs, and other data, analyzes them, and must return analysis results within a bounded time to guarantee high network availability. Such systems are characterized by massive network data of all kinds flowing in continuously, at different inflow rates and with complex, varied structures (binary files, text, compressed files, etc.); network anomaly detection is one such application. For this class of applications, the underlying storage system must store incoming data in a uniform format, provide a unified interface to upper-layer applications for convenient retrieval, and satisfy certain real-time requirements.
Driven by the current big-data trend, a number of big-data processing platforms have emerged. Widely adopted among them is the Hadoop distributed framework based on the MapReduce parallel processing model, an open-source framework whose sub-projects include HDFS (Hadoop Distributed File System), HBase (Hadoop Database), and Hive (a data warehouse tool). HDFS is designed to store massive data (typically petabytes or more); it assumes applications read files in a streaming manner and is heavily optimized for streaming-read performance. HBase is a distributed non-relational database system built on HDFS: HDFS provides HBase with highly reliable distributed underlying storage, and Hadoop MapReduce provides HBase with a high-performance distributed computing engine. Hive is a data warehouse tool on top of Hadoop that maps structured data files to database tables and offers a high-level SQL query interface, making it well suited to statistical analysis of massive structured data. In addition, widely used distributed stream-processing platforms include Yahoo's S4 and Twitter's Storm, both of which are free, open-source, distributed, highly fault-tolerant real-time computation systems.
However, the batch processing mode of the Hadoop framework cannot meet real-time computation requirements; system processing slows down, and it is unsuited to direct data inflow. HBase and Hive are distributed databases, but data entering the system from different network sources arrives in different formats, and without preprocessing it cannot be handed directly to a distributed database. Yahoo's S4 and Twitter's Storm specifically target distributed stream processing: they mainly provide computation over streaming data, where all arriving data enters memory directly for processing; the incoming data is not persistently stored, so they cannot meet the application's demands.
Summary of the invention
(1) Technical problem to be solved
In view of the deficiencies of the prior art, the present invention provides a method for cache-based writing of large-scale network streaming data, which uses a multi-level cache mechanism to handle writes of large-scale network data from different sources arriving at different inflow rates.
(2) technical scheme
To achieve the above object, the present invention is realized through the following technical solutions:
A method for cache-based writing of large-scale network streaming data, the method comprising:
the client packages the collected data and sends it to the server;
the server parses the received data, identifies the source of the data, and unifies the data format;
the uniformly formatted data is written into different in-memory cache regions according to its source;
different cache policies are defined for the data written into the different cache regions, and data that meets a policy's trigger condition is written from memory to the local file system;
data written to the local file system is loaded into a distributed file system or distributed database according to upper-layer application requirements;
for data loaded into the distributed file system or distributed database, small data blocks are periodically merged into large data blocks.
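The claimed pipeline (package, parse/unify, per-source memory cache, policy-triggered flush to local disk) can be sketched as a minimal in-memory model. This is an illustrative sketch only; all class and field names are hypothetical, and the patent specifies no implementation:

```python
import time
from collections import defaultdict

class StreamCacheWriter:
    """Minimal sketch of the cache-write pipeline: per-source memory
    buffers, per-source flush policies, then a flush to a stand-in
    for the local file system."""

    def __init__(self, policies):
        # policies: source -> dict with optional 'max_bytes' / 'max_age_s'
        self.policies = policies
        self.buffers = defaultdict(list)    # source -> buffered records
        self.sizes = defaultdict(int)       # source -> buffered bytes
        self.started = {}                   # source -> time of first write
        self.local_fs = defaultdict(list)   # stand-in for the local file system

    def write(self, source, record):
        """Route a uniformly formatted record to its source's cache region."""
        if source not in self.started:
            self.started[source] = time.time()
        self.buffers[source].append(record)
        self.sizes[source] += len(record)
        if self._triggered(source):
            self.flush(source)

    def _triggered(self, source):
        """Check the source's cache-policy trigger condition."""
        p = self.policies.get(source, {})
        if 'max_bytes' in p and self.sizes[source] >= p['max_bytes']:
            return True
        if 'max_age_s' in p and time.time() - self.started[source] >= p['max_age_s']:
            return True
        return False

    def flush(self, source):
        """Persist the buffered data from memory to the local file system."""
        self.local_fs[source].append(b''.join(self.buffers[source]))
        self.buffers[source].clear()
        self.sizes[source] = 0
        self.started.pop(source, None)

writer = StreamCacheWriter({'netflow': {'max_bytes': 8}})
writer.write('netflow', b'pkt1')
writer.write('netflow', b'pkt2')      # 8 buffered bytes reached -> flushed
print(writer.local_fs['netflow'])     # [b'pkt1pkt2']
```

A real implementation would also need a background timer for the `max_age_s` trigger; here the condition is only checked on each write.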
Here, the unified data format should be a Key-Value format or a relational data format.
Preferably, the method further comprises: for data to be loaded into a distributed database, further determining whether it is relational data; if so, it is loaded into a distributed relational database, and if not, into a distributed No-SQL database.
Preferably, the method also comprises: for data loaded into a distributed database, converting the storage format of the merged large data blocks into a columnar storage format specially optimized for the distributed database.
A system for cache-based writing of large-scale network streaming data, the system comprising:
a data transmission module, for packaging the collected data and sending it to the server;
a data formatting module, for parsing the received data, identifying the source of the data, and unifying the data format;
a data caching module, for writing the uniformly formatted data into different in-memory cache regions according to its source;
a data persistence module, for defining different cache policies for data written into different cache regions, and writing data that meets a policy's trigger condition from memory to the local file system;
a data loading module, for loading data written to the local file system into a distributed file system or distributed database according to upper-layer application requirements;
a data consolidation module, for periodically merging small data loaded into the distributed file system or distributed database into large data blocks.
Here, the unified data format produced by the data formatting module should be a Key-Value format or a relational data format.
Preferably, the data loading module further determines whether data to be loaded into a distributed database is relational data; if so, it is loaded into a distributed relational database, and if not, into a distributed No-SQL database.
Preferably, the data consolidation module further converts the storage format of the merged large data blocks in the distributed database into a columnar storage format specially optimized for the distributed database.
(3) Beneficial effects
The present invention has at least the following beneficial effects:
The method provided by the invention caches data through a multi-level cache mechanism: after the server receives data, it unifies the data format and defines different cache policies according to the source of the data, so it can handle writes of large-scale network data from different sources at different inflow rates. Scattered small data is merged into large data blocks, which improves data processing speed, reduces storage space, lowers data management cost, and meets big-data processing demands.
In the method provided by the invention, data formats are unified into a Key-Value format or a relational data format, which is convenient for subsequent computation to use directly.
In addition to merging the small data loaded into the distributed database into large data blocks, the method also converts the data storage format into a columnar storage format specially optimized for the distributed database, saving storage space while improving retrieval efficiency.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and persons of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for cache-based writing of large-scale network streaming data provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for cache-based writing of large-scale streaming data provided by a preferred embodiment of the present invention;
Fig. 3 is an illustration of cache policies.
Embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, an embodiment of the present invention provides a method for cache-based writing of large-scale network streaming data, whose steps are as follows:
Step 101: the client packages the collected data and sends it to the server;
Step 102: the server parses the received data, identifies the source of the data, and unifies the data format;
Step 103: the uniformly formatted data is written into different in-memory cache regions according to its source;
Step 104: different cache policies are defined for the data written into the different cache regions, and data that meets a policy's trigger condition is written from memory to the local file system;
Step 105: data written to the local file system is loaded into a distributed file system or distributed database according to upper-layer application requirements;
Step 106: for data loaded into the distributed file system or distributed database, small data is merged into large data blocks according to the configured policy.
The method provided by this embodiment caches data through a multi-level cache mechanism: after the server receives data, it unifies the data format and defines different cache policies according to the source of the data, so it can handle writes of large-scale network data from different sources at different inflow rates. Scattered small data is merged into large data blocks, improving data processing speed, reducing storage space, lowering data management cost, and meeting big-data processing demands.
The implementation of a preferred embodiment of the present invention is described below through a more concrete example. Referring to Fig. 2, the steps of the method are as follows:
Step 201: the client packages the collected data and sends it to the server.
In this step, the client is only responsible for simply packaging the data collected at each moment, tagging it with its data source, and sending it to the server via the HTTP protocol using POST.
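A client of this kind might look like the following sketch, which packages a batch, tags it with its source, and builds (without sending) the HTTP POST request. The URL, field names, and JSON envelope are assumptions; the patent only states that data is tagged with its source and sent via HTTP POST:

```python
import json
import urllib.request

def package(source, records):
    """Simple packaging: wrap a batch of records with its data-source tag
    so the server can route it to the right cache region."""
    return json.dumps({'source': source, 'records': records}).encode('utf-8')

def build_post(url, payload):
    """Build the HTTP POST request carrying the packaged data.
    (Constructed but deliberately not sent here.)"""
    return urllib.request.Request(
        url, data=payload,
        headers={'Content-Type': 'application/json'},
        method='POST')

payload = package('network_log', ['line1', 'line2'])
req = build_post('http://collector.example/ingest', payload)  # hypothetical URL
print(req.get_method(), len(payload))
```

Sending would be a single `urllib.request.urlopen(req)` call against a real collector endpoint.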
Step 202: the server parses the received data, identifies the source of the data, and unifies the data format.
In this step, the data format is unified into a Key-Value format or a relational data format, which is convenient for subsequent computation to use directly.
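Format unification could be sketched as a dispatch over known sources that emits a uniform Key-Value record. The per-source parsers and field names below are illustrative assumptions, not from the patent:

```python
import json

def normalize(source, raw):
    """Parse raw data from a known source into a uniform Key-Value record.
    The two per-source parsers are illustrative stand-ins."""
    if source == 'syslog':
        # e.g. "1388000000 host1 link down" -> keyed by timestamp
        ts, host, msg = raw.split(' ', 2)
        return {'key': ts, 'value': {'host': host, 'msg': msg}}
    if source == 'netflow_json':
        rec = json.loads(raw)
        return {'key': str(rec['flow_id']), 'value': rec}
    raise ValueError(f'unknown source: {source}')

rec = normalize('syslog', '1388000000 host1 link down')
print(rec['key'], rec['value']['host'])   # 1388000000 host1
```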
Step 203: the uniformly formatted data is written into different in-memory cache regions according to its source, and different cache policies are defined for the data written into the different cache regions.
In this step, for different types of data, different cache policies are defined according to their inflow rate, data volume, and the upper-layer application's freshness requirements. Fig. 3 illustrates concrete cache policies. When the upper-layer application is real-time monitoring, the data is time-sensitive and its freshness must be guaranteed, yet a large volume can accumulate within a very short interval; as in the first row of Fig. 3, 400 MB of data can accumulate within one minute, so combining these conditions the cache policy is set to persist once every minute. For another kind of data, as in the second row of Fig. 3, only 50 MB accumulates in 10 minutes and, according to the application's needs, the real-time requirement is low, so the cache policy can be set to persist once for every 128 MB of accumulated data.
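The two example policies above (time-triggered for fast, time-sensitive streams; volume-triggered for slow, insensitive ones) can be expressed as a small selection rule. The function and its defaults are hypothetical; the 1-minute and 128 MB values are taken from the text:

```python
def make_policy(time_sensitive, flush_interval_s=60, flush_bytes=128 * 2**20):
    """Pick a flush trigger, following the two examples in the text:
    time-sensitive streams persist on a timer, others on accumulated volume."""
    if time_sensitive:
        return {'max_age_s': flush_interval_s}   # e.g. the 400 MB/min monitoring stream
    return {'max_bytes': flush_bytes}            # e.g. the 50 MB / 10 min stream

print(make_policy(True))    # {'max_age_s': 60}
print(make_policy(False))   # {'max_bytes': 134217728}
```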
Step 204: determine whether the data meets the threshold condition of its cache policy; if so, go to step 205; if not, go to step 203.
Step 205: the data is written from memory to the local file system.
Step 206: according to upper-layer application requirements, determine whether the data is to be loaded into a distributed database; if so, go to step 207; if not, go to step 210.
Step 207: for data to be loaded into a distributed database, determine whether it is relational data; if so, go to step 208; if not, go to step 209.
Step 208: the data is loaded into a distributed relational database.
Step 209: the data is loaded into a distributed No-SQL database.
Step 210: the data is loaded into the distributed file system.
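The routing decisions of steps 206-210 can be sketched as follows. The `is_relational` heuristic (a flat record matching a known table schema) is an illustrative assumption; the patent does not say how relational data is recognized:

```python
def is_relational(record, schema):
    """A record counts as relational here if it exactly matches a known
    flat table schema (illustrative heuristic, not from the patent)."""
    return isinstance(record, dict) and set(record) == set(schema) and \
        not any(isinstance(v, (dict, list)) for v in record.values())

def route(record, needs_database, schema=('ts', 'host', 'msg')):
    """Steps 206-210: pick the destination store for persisted data."""
    if not needs_database:
        return 'distributed_file_system'
    return 'relational_db' if is_relational(record, schema) else 'nosql_db'

print(route({'ts': 1, 'host': 'h1', 'msg': 'up'}, True))   # relational_db
print(route({'key': 'x', 'value': {'n': 1}}, True))        # nosql_db
print(route({'anything': 1}, False))                       # distributed_file_system
```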
Step 211: determine whether the data loaded into the distributed file system meets the condition of the configured policy; if so, go to step 214; if not, go to step 210.
In this step, the policy condition to be met is defined according to factors such as the data's inflow rate. For data that flows in slowly, for example data that accumulates only 64 MB per day, small data blocks can be merged periodically to meet the distributed file system's preference for large-block processing. The HDFS file system stores files with a default block size of 64 MB, so data of less than 64 MB still occupies a 64 MB block; files smaller than 64 MB waste space and produce excessive metadata, slowing system operation. This method periodically merges small data blocks into multiples of 64 MB, reducing disk fragmentation, improving storage space utilization, and raising retrieval efficiency.
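The periodic small-block merge can be sketched as a greedy packing of file sizes toward the 64 MB block size mentioned in the text. The packing strategy itself is an assumption; the patent only requires merging small blocks toward multiples of the block size:

```python
BLOCK = 64 * 2**20   # HDFS default block size as stated in the text

def plan_merges(file_sizes, block=BLOCK):
    """Greedily group small files so each merged file reaches at least one
    block, reducing the number of sub-block fragments on disk."""
    groups, current, total = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        current.append(size)
        total += size
        if total >= block:          # group is big enough -> close it
            groups.append(current)
            current, total = [], 0
    if current:
        groups.append(current)      # leftover, still smaller than one block
    return groups

MB = 2**20
groups = plan_merges([10*MB, 30*MB, 40*MB, 50*MB, 8*MB])
print([sum(g)//MB for g in groups])   # [90, 48]
```

After planning, the actual merge would concatenate each group into one file and delete the originals.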
Step 212: determine whether the data loaded into the distributed No-SQL database meets the condition of the configured policy; if so, go to step 215; if not, go to step 210.
Step 213: determine whether the data loaded into the distributed relational database meets the condition of the configured policy; if so, go to step 216; if not, go to step 210.
Step 214: the qualifying small data in the distributed file system is merged into large data blocks.
Step 215: the qualifying small data in the distributed No-SQL database is merged into large data blocks.
In this step, besides merging small data into large data blocks, the data storage format is also converted into a columnar storage format specially optimized for the distributed database, such as RCFile, saving storage space while improving retrieval efficiency.
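The row-to-column conversion underlying formats like RCFile can be illustrated with a toy transposition; a real RCFile additionally splits data into row groups and compresses each column, which this sketch omits:

```python
def rows_to_columns(rows):
    """Convert a row-oriented block into column-oriented storage, the idea
    behind formats like RCFile: values of one column sit together, so a
    query touching few columns reads less data and compresses better."""
    if not rows:
        return {}
    return {col: [row[col] for row in rows] for col in rows[0]}

rows = [{'ts': 1, 'host': 'h1'}, {'ts': 2, 'host': 'h2'}]
print(rows_to_columns(rows))   # {'ts': [1, 2], 'host': ['h1', 'h2']}
```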
Step 216: the qualifying small data blocks in the distributed relational database are merged into large data blocks.
In this step, as in step 215, besides merging small data into large data blocks, the data storage format is also converted into a columnar storage format specially optimized for the distributed database, such as RCFile, saving storage space while improving retrieval efficiency.
Step 217: the cache-based data write ends.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for cache-based writing of large-scale network streaming data, characterized in that the method comprises:
the client packaging the collected data and sending it to the server;
the server parsing the received data, identifying the source of the data, and unifying the data format;
writing the uniformly formatted data into different in-memory cache regions according to its source;
defining different cache policies for the data written into the different cache regions, and writing data that meets a policy's trigger condition from memory to the local file system;
loading data written to the local file system into a distributed file system or distributed database according to upper-layer application requirements;
for data loaded into the distributed file system or distributed database, periodically merging small data into large data blocks.
2. The method according to claim 1, characterized in that the unified data format is a Key-Value format or a relational data format.
3. The method according to claim 1, characterized in that the method further comprises:
for data to be loaded into a distributed database, further determining whether it is relational data; if so, loading it into a distributed relational database, and if not, loading it into a distributed No-SQL database.
4. The method according to claim 1, characterized in that the method further comprises:
for data loaded into a distributed database, converting the storage format of the merged large data blocks into a columnar storage format specially optimized for the distributed database.
5. A system for cache-based writing of large-scale network streaming data, characterized in that the system comprises:
a data transmission module, configured to package the collected data and send it to the server;
a data formatting module, configured to parse the received data, identify the source of the data, and unify the data format;
a data caching module, configured to write the uniformly formatted data into different in-memory cache regions according to its source;
a data persistence module, configured to define different cache policies for data written into different cache regions, and to write data that meets a policy's trigger condition from memory to the local file system;
a data loading module, configured to load data written to the local file system into a distributed file system or distributed database according to upper-layer application requirements;
a data consolidation module, configured to periodically merge small data loaded into the distributed file system or distributed database into large data blocks.
6. The system according to claim 5, characterized in that the unified data format produced by the data formatting module is a Key-Value format or a relational data format.
7. The system according to claim 5, characterized in that the data loading module further determines whether the data to be loaded into a distributed database is relational data; if so, it is loaded into a distributed relational database, and if not, into a distributed No-SQL database.
8. The system according to claim 5, characterized in that the data consolidation module further converts the storage format of the merged large data blocks in the distributed database into a columnar storage format specially optimized for the distributed database.
CN201310741116.6A 2013-12-26 2013-12-26 A method for cache-based writing of large-scale network streaming data Active CN103699660B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310741116.6A CN103699660B (en) 2013-12-26 2013-12-26 A method for cache-based writing of large-scale network streaming data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310741116.6A CN103699660B (en) 2013-12-26 2013-12-26 A method for cache-based writing of large-scale network streaming data

Publications (2)

Publication Number Publication Date
CN103699660A true CN103699660A (en) 2014-04-02
CN103699660B CN103699660B (en) 2016-10-12

Family

ID=50361188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310741116.6A Active CN103699660B (en) 2013-12-26 2013-12-26 A method for cache-based writing of large-scale network streaming data

Country Status (1)

Country Link
CN (1) CN103699660B (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536699A (en) * 2014-12-11 2015-04-22 中国科学院声学研究所 Streamed data write-in method based on embedded file system
CN104598563A (en) * 2015-01-08 2015-05-06 北京京东尚科信息技术有限公司 High concurrency data storage method and device
CN104657502A (en) * 2015-03-12 2015-05-27 浪潮集团有限公司 System and method for carrying out real-time statistics on mass data based on Hadoop
WO2015180070A1 (en) * 2014-05-28 2015-12-03 北京大学深圳研究生院 Data caching method and device for distributed storage system
CN105205084A (en) * 2014-06-30 2015-12-30 清华大学 Method, device and system for processing data
CN105512168A (en) * 2015-11-16 2016-04-20 天津南大通用数据技术股份有限公司 Cluster database composite data loading method and apparatus
CN105975521A (en) * 2016-04-28 2016-09-28 乐视控股(北京)有限公司 Stream data uploading method and device
CN106484329A (en) * 2016-09-26 2017-03-08 浪潮电子信息产业股份有限公司 Big data transmission integrity protection mechanism based on multi-level storage
CN106909641A (en) * 2017-02-16 2017-06-30 青岛高校信息产业股份有限公司 A kind of real-time data memory device
CN107180082A (en) * 2017-05-03 2017-09-19 珠海格力电器股份有限公司 Data updating system and method based on multi-level cache mechanism
CN107943802A (en) * 2016-10-12 2018-04-20 北京京东尚科信息技术有限公司 A kind of log analysis method and system
CN111367979A (en) * 2020-03-05 2020-07-03 广州快决测信息科技有限公司 Data collection method and system
CN114430351A (en) * 2022-04-06 2022-05-03 北京快立方科技有限公司 Distributed database node secure communication method and system
CN117827979A (en) * 2024-03-05 2024-04-05 数翊科技(北京)有限公司武汉分公司 Data batch import method and device, electronic equipment and storage medium
US11960497B2 (en) 2020-03-05 2024-04-16 Guangzhou Quick Decision Information Technology Co., Ltd. Method and system for automatically generating data determining result

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050235108A1 (en) * 2004-04-20 2005-10-20 Hitachi Global Storage Technologies Disk device and control method for cache
CN103279429A (en) * 2013-05-24 2013-09-04 浪潮电子信息产业股份有限公司 Application-aware distributed global shared cache partition method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050235108A1 (en) * 2004-04-20 2005-10-20 Hitachi Global Storage Technologies Disk device and control method for cache
CN103279429A (en) * 2013-05-24 2013-09-04 浪潮电子信息产业股份有限公司 Application-aware distributed global shared cache partition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Qin Xiulei et al., "Current status and challenges of distributed cache technology in cloud computing environments" (云计算环境下分布式缓存技术的现状与挑战), Journal of Software (《软件学报》) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015180070A1 (en) * 2014-05-28 2015-12-03 北京大学深圳研究生院 Data caching method and device for distributed storage system
CN105205084A (en) * 2014-06-30 2015-12-30 清华大学 Method, device and system for processing data
CN105205084B (en) * 2014-06-30 2018-10-16 清华大学 A kind of data processing method, apparatus and system
CN104536699B (en) * 2014-12-11 2017-10-17 中国科学院声学研究所 A kind of stream data wiring method based on embedded file system
CN104536699A (en) * 2014-12-11 2015-04-22 中国科学院声学研究所 Streamed data write-in method based on embedded file system
CN104598563A (en) * 2015-01-08 2015-05-06 北京京东尚科信息技术有限公司 High concurrency data storage method and device
CN104598563B (en) * 2015-01-08 2018-09-04 北京京东尚科信息技术有限公司 High concurrent date storage method and device
CN104657502A (en) * 2015-03-12 2015-05-27 浪潮集团有限公司 System and method for carrying out real-time statistics on mass data based on Hadoop
CN105512168A (en) * 2015-11-16 2016-04-20 天津南大通用数据技术股份有限公司 Cluster database composite data loading method and apparatus
CN105975521A (en) * 2016-04-28 2016-09-28 乐视控股(北京)有限公司 Stream data uploading method and device
CN106484329A (en) * 2016-09-26 2017-03-08 浪潮电子信息产业股份有限公司 Big data transmission integrity protection mechanism based on multi-level storage
CN106484329B (en) * 2016-09-26 2019-01-08 浪潮电子信息产业股份有限公司 Big data transmission integrity protection method based on multi-level storage
CN107943802A (en) * 2016-10-12 2018-04-20 北京京东尚科信息技术有限公司 A kind of log analysis method and system
CN106909641A (en) * 2017-02-16 2017-06-30 青岛高校信息产业股份有限公司 A kind of real-time data memory device
CN106909641B (en) * 2017-02-16 2020-09-29 青岛高校信息产业股份有限公司 Real-time data memory
CN107180082B (en) * 2017-05-03 2020-12-18 珠海格力电器股份有限公司 Data updating system and method based on multi-level cache mechanism
CN107180082A (en) * 2017-05-03 2017-09-19 珠海格力电器股份有限公司 Data updating system and method based on multi-level cache mechanism
CN111367979A (en) * 2020-03-05 2020-07-03 广州快决测信息科技有限公司 Data collection method and system
US11960497B2 (en) 2020-03-05 2024-04-16 Guangzhou Quick Decision Information Technology Co., Ltd. Method and system for automatically generating data determining result
CN114430351A (en) * 2022-04-06 2022-05-03 北京快立方科技有限公司 Distributed database node secure communication method and system
CN117827979A (en) * 2024-03-05 2024-04-05 数翊科技(北京)有限公司武汉分公司 Data batch import method and device, electronic equipment and storage medium
CN117827979B (en) * 2024-03-05 2024-05-17 数翊科技(北京)有限公司武汉分公司 Data batch import method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103699660B (en) 2016-10-12

Similar Documents

Publication Publication Date Title
CN103699660A (en) Large-scale network streaming data cache-write method
CN104820670B (en) A kind of acquisition of power information big data and storage method
Ji et al. Big data processing in cloud computing environments
US8819335B1 (en) System and method for executing map-reduce tasks in a storage device
US10013440B1 (en) Incremental out-of-place updates for index structures
US20170155707A1 (en) Multi-level data staging for low latency data access
Ji et al. Big data processing: Big challenges and opportunities
US20140214752A1 (en) Data stream splitting for low-latency data access
Lai et al. Towards a framework for large-scale multimedia data storage and processing on Hadoop platform
CN107766402A (en) A kind of building dictionary cloud source of houses big data platform
CN105139281A (en) Method and system for processing big data of electric power marketing
CN103268336A (en) Fast data and big data combined data processing method and system
Devakunchari Analysis on big data over the years
CN112000703B (en) Data warehousing processing method and device, computer equipment and storage medium
Caldarola et al. Big data: A survey-the new paradigms, methodologies and tools
CN105630810A (en) Method for uploading mass small files in distributed storage system
Ji et al. Ibdp: An industrial big data ingestion and analysis platform and case studies
CN105095255A (en) Data index creating method and device
CN108763562A (en) A kind of construction method based on big data skill upgrading data exchange efficiency
Shen et al. Meteorological sensor data storage mechanism based on timescaledb and kafka
Anusha et al. Big data techniques for efficient storage and processing of weather data
CN109542960B (en) Data analysis domain system
Dong et al. Research on Architecture of Power Big Data High-Speed Storage System for Energy Interconnection
Chen et al. Analysis of plant breeding on hadoop and spark
CN107766452B (en) Indexing system suitable for high-speed access of power dispatching data and indexing method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180111

Address after: 210042 No. 699-22, Building 18, Xuanwu District, Nanjing, Jiangsu

Patentee after: CERTUSNET CORP.

Address before: 100084 Beijing Haidian District Tsinghua Yuan 100084-82 mailbox

Patentee before: Tsinghua University

TR01 Transfer of patent right