CN102841944A - Method achieving real-time processing of big data

Info

Publication number: CN102841944A
Application number: CN2012103064203A
Authority: CN
Inventors: 张真; 王磊; 陈伟; 王胤然; 杨震宇; 周亮亮
Original assignee: NANJING INNOVATIVE CLOUD STORAGE TECHNOLOGY Co Ltd
Current assignee: NANJING INNOVATIVE CLOUD STORAGE TECHNOLOGY Co Ltd
Priority date: 2012-08-27
Filing date: 2012-08-27
Publication date: 2012-12-26

Abstract

The invention discloses a method for real-time processing of big data and relates to the field of computer application systems. In the method, ingestion of data into storage, querying, and transmission are all concurrent and real-time. Index filtering is performed while tasks are being created. As indexes are filtered, the selected index files are distributed to the data nodes, which simultaneously query their local files and return data to the client. As soon as any data node finishes its query, its results are returned to the user. Because every stage of the method executes concurrently, the computer hardware is used to the fullest extent. The efficient B+ tree structure and the parallel execution of queries allow queries to complete in real time and greatly improve query efficiency, so the user obtains query results while the query operation is still executing.

Description

A method for real-time processing of big data

Technical field

The present invention relates to the field of computer application systems, and in particular to a method for processing massive data in real time.

Background

With the development of informatization, the data that enterprises must process has grown explosively, with volumes reaching the TB and even PB level, which brings a series of problems. As data volume increases, system load rises and the performance of data ingestion and queries declines accordingly. How to extract the maximum performance from a system and achieve the fastest ingestion and query speeds without increasing hardware cost is a problem many enterprises face.

The emergence of cloud computing offers an effective approach to processing massive data. In common cloud computing solutions, HDFS (the distributed file system of Hadoop, a distributed system architecture) makes it easy to store massive data while effectively preventing single points of failure and avoiding unnecessary loss. However, the usual way to retrieve data on HDFS is to launch a global search with MapReduce (a model for large-scale parallel computation), which has to scan all of the data stored on HDFS. In cloud computing, and especially with massive data, this wastes a great deal of system resources and time, and is clearly not suitable for a real production environment.

Summary of the invention

The object of the invention is to overcome the shortcomings of the data processing methods commonly used in existing cloud computing solutions, namely wasted system resources and long processing times, and to provide an effective method for real-time processing of massive data.

In the big data real-time processing method of the invention, the ingestion, query, and transmission of data are all concurrent and real-time:

(1) Real-time ingestion: building on an existing HDFS deployment, multiple threads are started on each datanode (data node) to create index files in parallel; the indexes are generated as B+ tree structures;

(2) Real-time query: using a distributed computing system, a job (task) is created and submitted on the server side to perform the query, which is divided into three steps:

A. Index filtering is performed on the namenode (control node). Because index files are named according to their creation time, the time in the query conditions is matched against the index file names to select the index files that satisfy the conditions;

B. Tasks are distributed to each datanode; using the selected index files and the query conditions, a B+ tree lookup yields the positions of the data that satisfy the conditions;

C. Tasks are distributed again; each machine reads the data at the positions obtained in the previous step and returns the query results;

(3) Real-time result transmission: jetty (an open-source servlet container) is used as the web container. While the data query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result file is read and returned to the client, and the client then sends a continue (continuation) request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query returns data to the client immediately, without waiting for all datanodes to finish.

In the big data real-time processing method of the invention, the ingestion, query, and transmission of data are all concurrent and real-time. Index filtering is performed while tasks are being created; as indexes are filtered, the selected index files are distributed to the datanodes, which simultaneously query their local files and return data to the client. As soon as any datanode finishes its query, its results are returned to the user. Every stage of the method executes concurrently, so the computer hardware is used to the fullest extent. The efficient B+ tree structure and the parallel execution of queries allow queries to complete in real time and greatly improve query efficiency, so the user obtains query results while the query operation is still executing.

Brief description of the drawings

Fig. 1 is a schematic diagram of field indexing based on a B+ tree;

Fig. 2 is a flow chart of query handling in jetty;

Fig. 3 is a flow chart of data query and transmission.

Detailed description

The technical solution of the invention is described in detail below with reference to the accompanying drawings:

All processing in the invention executes concurrently, which uses the computer hardware to the fullest extent and greatly improves processing efficiency, so that the user obtains query results while the query operation is still executing.

The invention comprises real-time ingestion of data, real-time query, and real-time result transmission; ingestion, query, and transmission are all concurrent and real-time.

1. Real-time ingestion

Building on an existing HDFS deployment, multiple threads are started on each datanode to create index files in parallel. The indexes are created as shown in Fig. 1:

Indexes are built on a number of significant fields and generated as B+ tree structures, so each new record only needs to be inserted into the B+ tree. Insertion into a B+ tree takes place only at leaf nodes. After every (key, pointer) index entry is inserted, the number of subtrees in the node is checked against the allowed range. When the number of subtrees in a node exceeds m (the order of the B+ tree) after an insertion, the leaf node is split into two nodes, and their parent node must then hold the maximum key and the node address of each of the two nodes. From that point on, the problem becomes an insertion into a non-leaf node. Inserting a key into a non-leaf node is similar to inserting into a leaf node: the number of subtrees in a non-leaf node is also limited to m, and a node that exceeds this limit must likewise be split. When the root node is split, it has no parent node, so a new parent node must be created to serve as the new root of the tree.
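The insertion procedure described above can be sketched as follows. This is a minimal in-memory illustration, not the patented implementation: the class names `Node` and `BPlusTree`, the default order of 4, and the choice to split a node at its midpoint are all assumptions. It follows the variant described in the text, in which a parent stores the maximum key and the address of each child.

```python
from bisect import bisect_left


class Node:
    """A B+ tree node (hypothetical layout for illustration)."""

    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []  # leaf: record keys; non-leaf: maximum key of each child
        self.vals = []  # leaf: record pointers; non-leaf: child nodes

    def max_key(self):
        return self.keys[-1]


class BPlusTree:
    def __init__(self, order=4):
        self.m = order  # a node may hold at most m entries
        self.root = Node(leaf=True)

    def insert(self, key, pointer):
        split = self._insert(self.root, key, pointer)
        if split is not None:
            # The root has no parent, so a new root is created above the halves.
            left, right = split
            new_root = Node(leaf=False)
            new_root.keys = [left.max_key(), right.max_key()]
            new_root.vals = [left, right]
            self.root = new_root

    def _insert(self, node, key, pointer):
        if node.leaf:
            i = bisect_left(node.keys, key)
            node.keys.insert(i, key)
            node.vals.insert(i, pointer)
        else:
            # Descend into the first child whose maximum key is >= key,
            # or into the last child when key exceeds every maximum.
            i = min(bisect_left(node.keys, key), len(node.keys) - 1)
            split = self._insert(node.vals[i], key, pointer)
            if split is None:
                node.keys[i] = node.vals[i].max_key()
            else:
                # The child split: the parent records both halves.
                left, right = split
                node.keys[i:i + 1] = [left.max_key(), right.max_key()]
                node.vals[i:i + 1] = [left, right]
        if len(node.keys) > self.m:
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        right = Node(node.leaf)
        right.keys, right.vals = node.keys[mid:], node.vals[mid:]
        node.keys, node.vals = node.keys[:mid], node.vals[:mid]
        return node, right

    def search(self, key):
        node = self.root
        while not node.leaf:
            i = bisect_left(node.keys, key)
            if i == len(node.keys):
                return None  # key is larger than every key in the tree
            node = node.vals[i]
        i = bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.vals[i]
        return None
```

Inserting ten keys into a tree of order 4 forces at least one leaf split and a root split, after which `search` still finds every pointer. Linking the leaves for range scans, which a production index would normally need, is left out to keep the sketch short.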

2. Real-time query

Using a distributed computing system, the client first sends a GET request to the server; the server creates and submits a job to perform the query, which is divided into three steps:

1) The namenode performs index filtering. Because index files are named according to their creation time, the time in the query conditions (the query conditions must include a time condition, consisting of a start time and an end time) is matched against the index file names to select the index files that satisfy the conditions.
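A minimal sketch of this filtering step is given below. The source does not specify the file naming scheme, so the pattern used here (an hourly stamp such as `2012082710` followed by an extension) and the function name `filter_index_files` are assumptions made purely for illustration.

```python
from datetime import datetime


def filter_index_files(file_names, start, end, fmt="%Y%m%d%H"):
    """Keep the index files whose name-encoded time lies in [start, end].

    Assumes (hypothetically) that each index file is named after its
    creation time, e.g. '2012082710.idx' for 2012-08-27 10:00.
    """
    selected = []
    for name in file_names:
        stem = name.rsplit(".", 1)[0]
        try:
            stamp = datetime.strptime(stem, fmt)
        except ValueError:
            continue  # not an index file name in the assumed format
        if start <= stamp <= end:
            selected.append(name)
    return selected
```

Because only file names are inspected, this step never opens an index file, which is why it can run on the namenode before any task is distributed.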

2) Tasks are distributed to each datanode; using the selected index files and the query conditions, a B+ tree lookup yields the positions of the data that satisfy the conditions.

A data position records the IP address of the machine (datanode) that stores the data and the offset of the data. The machine is located by its IP address, and the corresponding data is then found by the offset.

3) Tasks are distributed again; each machine reads the data at the positions obtained in the previous step, returns the query results, and writes the result set to a local file.

The efficient B+ tree structure and the parallel execution of queries allow queries to complete in real time.
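Steps 2) and 3) hinge on the (IP, offset) position records. The sketch below shows one way such positions could be grouped per machine and then used to read records locally. The record format is an assumption (one newline-terminated record per offset), and the function names `group_by_node` and `read_records` are hypothetical.

```python
from collections import defaultdict


def group_by_node(positions):
    """Group (ip, offset) positions so each datanode gets its own offsets."""
    by_node = defaultdict(list)
    for ip, offset in positions:
        by_node[ip].append(offset)
    # Sorted offsets let each machine read its file in one forward pass.
    return {ip: sorted(offsets) for ip, offsets in by_node.items()}


def read_records(data_file, offsets):
    """Read one record at each byte offset of a binary file object.

    Assumes each record ends with a newline; the source does not state
    how record boundaries are determined.
    """
    records = []
    for offset in offsets:
        data_file.seek(offset)
        records.append(data_file.readline().rstrip(b"\n"))
    return records
```

For example, `group_by_node([("10.0.0.2", 11), ("10.0.0.1", 0), ("10.0.0.2", 6)])` assigns offsets 6 and 11 to the second machine, which can then call `read_records` on its local data file.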
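The effect of parallel execution, in which results become available as soon as any one node finishes, can be illustrated with a small simulation. Threads stand in for datanodes here, and `query_node`, its artificial delay, and `stream_results` are hypothetical names, not part of Hadoop or of the described system.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_node(node, delay):
    """Simulate one datanode's local query; the delay models its workload."""
    time.sleep(delay)
    return node, [f"{node}-row"]


def stream_results(nodes):
    """Yield each node's result as soon as that node finishes.

    `nodes` maps a node name to its simulated query time in seconds.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(query_node, n, d) for n, d in nodes.items()]
        for future in as_completed(futures):
            yield future.result()
```

A caller iterating over `stream_results({"dn1": 0.05, "dn2": 0.0, "dn3": 0.02})` receives the fastest node's rows first rather than waiting for the slowest node.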

3. Real-time result transmission

Jetty is used as the web container. While the data query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result file is read and returned to the client, and the client then sends a continue request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query returns data to the client immediately, without waiting for all datanodes to finish.

Fig. 2 is a flow chart of query handling in jetty, and Fig. 3 shows the complete flow of data query and transmission. With jetty as the web container, the client first sends a GET request to the server. The server parses the json (a lightweight data interchange format) string, instantiates a job object from the query conditions in the json string, submits the job to perform the distributed query, and finally returns the results. While the query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the file is read and returned to the client; the client then sends a continue request to the server, and the server starts multiple threads to read the query results and returns the data read to the client. If the query results are empty and the query on HDFS has finished, an empty response is returned and the flow ends.
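The polling and continue-request loop can be sketched as a local simulation. It does not use jetty or HDFS: a temporary directory stands in for the query result directory, threads stand in for datanodes, and a direct method call stands in for each HTTP request. The names `ResultServer`, `datanode`, `client_fetch`, and `run_demo` are hypothetical, and result files are read sequentially rather than with multiple threads to keep the sketch short.

```python
import os
import tempfile
import threading
import time


class ResultServer:
    """Stand-in for the server-side handler that polls the result directory."""

    def __init__(self, result_dir):
        self.result_dir = result_dir
        self.sent = set()                      # result files already returned
        self.query_done = threading.Event()    # set when the query has finished

    def poll(self, interval=0.01):
        """Return unread result data, or [] once the query is done and drained."""
        while True:
            done = self.query_done.is_set()    # read before listing to avoid a race
            new = sorted(f for f in os.listdir(self.result_dir)
                         if f not in self.sent and not f.endswith(".tmp"))
            if new:
                chunks = []
                for name in new:
                    with open(os.path.join(self.result_dir, name)) as fh:
                        chunks.append(fh.read())
                    self.sent.add(name)
                return chunks
            if done:
                return []
            time.sleep(interval)


def datanode(result_dir, name, rows, delay):
    """Simulated datanode: finish after `delay`, then publish a result file."""
    time.sleep(delay)
    tmp = os.path.join(result_dir, name + ".tmp")
    with open(tmp, "w") as fh:
        fh.write(rows)
    os.replace(tmp, os.path.join(result_dir, name))  # publish atomically


def client_fetch(server):
    """First call models the GET request; later calls model continue requests."""
    received = []
    while True:
        chunk = server.poll()
        if not chunk:          # empty response: the flow ends
            return received
        received.extend(chunk)


def run_demo():
    with tempfile.TemporaryDirectory() as result_dir:
        server = ResultServer(result_dir)
        workers = [threading.Thread(target=datanode,
                                    args=(result_dir, f"node{i}",
                                          f"rows-from-node{i}", 0.02 * i))
                   for i in range(3)]
        for w in workers:
            w.start()

        def finish():
            for w in workers:
                w.join()
            server.query_done.set()

        threading.Thread(target=finish).start()
        return client_fetch(server)
```

The server returns an empty response only when the result directory holds nothing new and the query has finished, which matches the termination condition stated above. Writing each result to a temporary name and renaming it is a design choice of this sketch so that the poller never reads a half-written file.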

Claims

1. A method for real-time processing of big data, characterized in that the ingestion, query, and transmission of data are all concurrent and real-time:

(1) Real-time ingestion: building on an existing HDFS deployment, multiple threads are started on each datanode to create index files in parallel; the indexes are generated as B+ tree structures;

(2) Real-time query: using a distributed computing system, a job is created and submitted on the server side to perform the query, which is divided into three steps:

A. Index filtering is performed on the namenode. Because index files are named according to their creation time, the time in the query conditions is matched against the index file names to select the index files that satisfy the conditions;

(3) Real-time result transmission: jetty is used as the web container. While the data query runs on HDFS, jetty polls the query result directory. If the directory is not empty, the query result file is read and returned to the client, and the client then sends a continue request to the server. The server starts multiple threads to read the query results and returns the data read to the client. If the returned data is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, any datanode that completes its query returns data to the client immediately, without waiting for all datanodes to finish.