CN106649847A - A Hadoop-based big data real-time processing system

Info

Publication number: CN106649847A
Application number: CN201611255956.1A
Authority: CN
Inventors: 陈嵩荣; 郑志伟; 张木辉; 蔡剑齐; 王晓强
Original assignee: Linewell Software Co Ltd
Current assignee: Linewell Software Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2017-05-10

Abstract

The invention discloses a Hadoop-based big data real-time processing system. In embodiments of the system, index filtering is performed while the query task is being created, and the filtered index files are distributed to the datanodes while filtering proceeds; at the same time, each datanode queries its local files and returns the query result to the client. As soon as any datanode completes its query, the query result can be returned quickly to the client through the periodic polling mechanism of the real-time transmission middleware. The data query processing in HDFS is executed concurrently, which makes maximum use of the computer hardware, allows queries to complete in real time, and greatly improves query efficiency. Users obtain query results as soon as they perform a query operation, which improves the efficiency of data query and allows client query requests to be answered quickly.

Description

A Hadoop-based big data real-time processing system

Technical field

The present invention relates to the field of computer technology, and in particular to a Hadoop-based big data real-time processing system.

Background art

With the development of informatization, the data that enterprises must process is growing explosively, and data volumes have reached a very large scale (for example, from the TB level to the PB level), which brings a series of problems. As data volume grows, system load increases, and the performance of data ingestion (loading into storage) and query declines accordingly. How to get the maximum performance out of the system and make ingestion and query as fast as possible without increasing hardware cost is a difficult problem that many enterprises face.

The emergence of cloud computing provides an effective approach to massive data processing. Common cloud computing solutions include framework designs based on Hadoop. Hadoop includes the distributed file system (Hadoop Distributed File System, HDFS) and MapReduce. HDFS provides storage for massive data, and MapReduce provides computation over massive data. Massive data storage can easily be achieved with Hadoop's HDFS, which also effectively prevents single points of failure and avoids unnecessary loss. However, the conventional way to retrieve data on HDFS is to start a global MapReduce search, which requires large-scale concurrent computation and a full pass over all data stored on HDFS. In cloud computing, especially with massive data, performing a global search on HDFS with MapReduce as in the prior art wastes a great deal of system resources and takes a great deal of time.

Summary of the invention

An object of the present invention is to provide a Hadoop-based big data real-time processing system for improving the efficiency of data query and responding quickly to client query requests.

To achieve the above object, the present invention adopts the following technical solution:

The present invention provides a Hadoop-based big data real-time processing system, comprising: a client, a real-time transmission middleware, and a distributed file system HDFS, wherein,

The HDFS includes: a control node (namenode) and a plurality of data nodes (datanode);

The control node is configured to start multiple threads on the plurality of data nodes, create in real time the indexes respectively corresponding to the multiple pieces of data to be ingested, and store the indexes in a plurality of index files according to creation time;

The client is configured to send a data acquisition (get) request to the HDFS through the real-time transmission middleware;

The real-time transmission middleware is configured to forward the data acquisition request sent by the client to the control node;

The control node is configured to create a query task according to the data acquisition request sent by the client, the query task including the query condition that the target data satisfies, and the query condition including a query time condition; to match the query time condition in the query condition against the plurality of index files and filter out the index files that satisfy the query time condition; to distribute the query task to the plurality of data nodes and query the plurality of data nodes according to the filtered index files and the query condition, thereby obtaining the locations of the data that satisfy the query condition; to distribute the query task to the plurality of data nodes again and read data on the plurality of data nodes according to the locations of the data that satisfy the query condition; and to return a query result as soon as any one of the plurality of data nodes completes its query successfully;

The real-time transmission middleware is configured to poll the query result directory at a preset polling period and, if the query result directory is not empty, read the query result files in the query result directory and return them to the client;

The client is configured to obtain the query result files in real time through the real-time transmission middleware.

With the above technical solution, the present invention has the following advantages:

The Hadoop-based big data real-time processing system provided by the embodiments of the present invention processes big data in real time: ingestion, query, and transmission of data are all concurrent and all real-time. In the embodiments of the present invention, index filtering is performed while the query task is being created, and the filtered index files are distributed to the datanodes while filtering proceeds; at the same time, each datanode queries its local files and returns the query result to the client. As soon as any datanode completes its query, the query result can be returned quickly to the client through the periodic polling mechanism of the real-time transmission middleware. The data query processing in HDFS is executed concurrently throughout, which makes maximum use of the computer hardware, allows queries to complete in real time, and greatly improves query efficiency. Users obtain query results as soon as they perform a query operation, which improves the efficiency of data query and allows client query requests to be answered quickly.

Description of the drawings

Fig. 1 is a schematic structural diagram of a Hadoop-based big data real-time processing system provided by an embodiment of the present invention;

Fig. 2 is a schematic diagram of the jetty-based query flow provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention provide a Hadoop-based big data real-time processing system for improving the efficiency of data query and responding quickly to client query requests.

To make the objects, features, and advantages of the present invention more apparent and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the embodiments described below are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.

The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above drawings are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.

Each is described in detail below.

An embodiment of the Hadoop-based big data real-time processing system of the present invention achieves fast real-time data query in a distributed system architecture. The embodiment overcomes the shortcomings of the data processing methods commonly used in prior-art cloud computing solutions, namely wasted system resources and long data processing times, and provides an effective method for processing massive data in real time. In the embodiment, ingestion, query, and transmission of data are all concurrent and real-time. Referring to Fig. 1, the Hadoop-based big data real-time processing system provided by the present invention includes: a client, a real-time transmission middleware, and a distributed file system (Hadoop Distributed File System, HDFS), wherein,

HDFS includes: a control node (namenode) and a plurality of data nodes (datanode);

The control node is configured to start multiple threads on the plurality of data nodes, create in real time the indexes respectively corresponding to the multiple pieces of data to be ingested, and store the indexes in a plurality of index files according to creation time;

The client is configured to send a data acquisition (get) request to HDFS through the real-time transmission middleware;

The real-time transmission middleware is configured to forward the data acquisition request sent by the client to the control node;

The control node is configured to create a query task according to the data acquisition request sent by the client, the query task including the query condition that the target data satisfies, and the query condition including a query time condition; to match the query time condition in the query condition against the plurality of index files and filter out the index files that satisfy the query time condition; to distribute the query task to the plurality of data nodes and query the plurality of data nodes according to the filtered index files and the query condition, thereby obtaining the locations of the data that satisfy the query condition; to distribute the query task to the plurality of data nodes again and read data on the plurality of data nodes according to the locations of the data that satisfy the query condition; and to return a query result as soon as any one of the plurality of data nodes completes its query successfully;

The real-time transmission middleware is configured to poll the query result directory at a preset polling period and, if the query result directory is not empty, read the query result files in the query result directory and return them to the client;

The client is configured to obtain the query result files in real time through the real-time transmission middleware.

In the embodiment of the present invention, HDFS is implemented on Hadoop. HDFS is highly fault-tolerant and provides high-throughput access to application data, which suits applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access).

The real-time transmission middleware is arranged between the client and HDFS, and all interaction between the client and HDFS passes through it, for example the forwarding of query requests and the forwarding of query results. In some embodiments of the present invention, the real-time transmission middleware specifically uses jetty as the network (web) container. Jetty is an open-source servlet container. It is a Java-based web container that provides a runtime environment for, for example, JSP and servlets. A servlet (server applet), in full Java Servlet, is a server-side program written in Java whose main function is to interactively browse and modify data and to generate dynamic web content.

In the embodiments of the present invention, HDFS includes a control node and a plurality of data nodes. A data node is a data storage unit: data transmitted through the underlying interface can be stored on the data nodes, and when the client needs to read data from HDFS, the data can be read from the data nodes in HDFS. The control node in the embodiments of the present invention may be divided into a primary control node and a standby control node, so that the primary/standby arrangement ensures that HDFS can respond to client requests in time. In the embodiments of the present invention, the data nodes in HDFS execute the query task independently, and each data node also reports its query result independently. As soon as any data node completes its query, the query result can be fed back to the client through the real-time transmission middleware, without waiting for all data nodes to finish before feeding back a unified result, so data query is highly efficient.
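The "return as soon as any data node finishes" behavior described above can be illustrated with a minimal Python sketch. It is not the patented implementation: `query_datanode` is a hypothetical stand-in for a datanode's local query, and the node names, delays, and rows are invented for the example.

```python
import concurrent.futures
import time

def query_datanode(node, delay, rows):
    """Hypothetical stand-in for one datanode's local query (assumption)."""
    time.sleep(delay)  # simulate nodes finishing at different times
    return node, rows

def stream_results(nodes):
    """Yield (node, rows) as each datanode finishes, without waiting for the rest."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(query_datanode, n, d, r) for n, d, r in nodes]
        # as_completed hands back futures in completion order, not submission order
        for fut in concurrent.futures.as_completed(futures):
            yield fut.result()

# Invented example data: dn2 is the fastest node.
nodes = [("dn1", 0.3, ["a"]), ("dn2", 0.01, ["b"]), ("dn3", 0.6, ["c"])]
order = [node for node, _ in stream_results(nodes)]
```

Each result becomes available to the caller at the moment its node finishes, which is the property the embodiment relies on to avoid a unified, end-of-query response.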

In some embodiments of the present invention, the control node is specifically configured to create in real time, according to the structure of a B+ tree, the indexes respectively corresponding to the multiple pieces of data to be ingested; and,

The control node is specifically configured to query the plurality of data nodes through the B+ tree according to the filtered index files and the query condition, thereby obtaining the locations of the data that satisfy the query condition.

The Hadoop-based big data real-time processing system in the embodiments of the present invention can ingest data in real time. Based on the existing HDFS, multiple threads are started on every datanode to create indexes, so index files are created in parallel. Indexes are built on certain significant fields and generated in the structure of a B+ tree, so each new record only needs to be inserted into the B+ tree. Insertion into a B+ tree takes place only at leaf nodes. After each (key, pointer) index entry is inserted, the number of subtrees in the node is checked against the allowed range. When the number of subtrees in the node after insertion exceeds m (the order of the B+ tree), the leaf node is split into two nodes, and the parent node must contain the maximum key and the node address of both. The problem then reduces to insertion into a non-leaf node. Inserting a key into a non-leaf node is similar to inserting into a leaf node: the upper limit on the number of subtrees in a non-leaf node is m, and a node that exceeds this limit must also be split. When the root node is split, it has no parent, so a new parent node must be created as the new root of the tree.
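The insertion procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's code: the class and method names are hypothetical, each non-leaf entry stores the maximum key of its child together with the child pointer (matching the "maximum key and node address" wording), the split point at the middle of the node is an assumption, and duplicate keys are not treated specially.

```python
import bisect

class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []  # leaf: record keys; non-leaf: max key of each child
        self.vals = []  # leaf: record locations; non-leaf: child nodes

class BPlusTree:
    def __init__(self, m=4):
        self.m = m  # order: maximum number of entries (subtrees) per node
        self.root = Node(leaf=True)

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split is not None:
            # The root has no parent, so a new root is created above both halves.
            left, right = split
            self.root = Node(leaf=False)
            self.root.keys = [left.keys[-1], right.keys[-1]]
            self.root.vals = [left, right]

    def _insert(self, node, key, value):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            node.keys.insert(i, key)
            node.vals.insert(i, value)
        else:
            # Descend into the first child whose max key >= key, else the last child.
            i = min(bisect.bisect_left(node.keys, key), len(node.keys) - 1)
            split = self._insert(node.vals[i], key, value)
            if split is None:
                node.keys[i] = node.vals[i].keys[-1]  # refresh the child's max key
            else:
                left, right = split
                # The parent holds the max key and address of both new nodes.
                node.keys[i:i + 1] = [left.keys[-1], right.keys[-1]]
                node.vals[i:i + 1] = [left, right]
        if len(node.keys) > self.m:
            return self._split(node)
        return None

    def _split(self, node):
        mid = len(node.keys) // 2
        right = Node(node.leaf)
        right.keys, node.keys = node.keys[mid:], node.keys[:mid]
        right.vals, node.vals = node.vals[mid:], node.vals[:mid]
        return node, right

    def search(self, key):
        node = self.root
        while not node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i == len(node.keys):
                return None  # key is larger than every key in the tree
            node = node.vals[i]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.vals[i]
        return None

# Invented example: keys are record timestamps, values are (IP, offset) locations.
tree = BPlusTree(m=4)
for ts in range(20):
    tree.insert(ts, ("10.0.0.%d" % (ts % 3), ts * 128))
```

Storing the child's maximum key in the parent keeps the descent a single binary search per level, and the same overflow check and split routine serves leaf nodes, non-leaf nodes, and the root.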

In some embodiments of the present invention, the client is further configured to continue by sending a data acquisition continuation request to HDFS;

HDFS is further configured to respond to the data acquisition continuation request through the control node, and to start multiple threads on the plurality of data nodes to obtain the query result corresponding to the data acquisition continuation request;

The real-time transmission middleware is further configured to read, at the preset polling period, the query result corresponding to the data acquisition continuation request and return it to the client.

In the embodiments of the present invention, the real-time transmission middleware may use jetty as the web container. While the data query runs on HDFS, jetty polls the query result directory and, if it is not empty, reads the query result files and returns them to the client. The client then continues by sending a data acquisition continuation (continue) request to the HDFS side, the control node starts multiple threads to read the query results, and the query results read are returned to the client through jetty. If the data returned is empty, the flow ends; if it is not empty, the client sends another continue request. During the query, as soon as any datanode completes its query successfully, data is returned to the client, without waiting for all datanodes to complete their queries.
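The polling and continue-request loop described above can be sketched as follows. This is a simplified single-process illustration, not jetty code: `poll_results` and `fetch_all` are hypothetical names, a local temporary directory stands in for the query result directory, the `query_done` callback is an assumed way of knowing that the HDFS query has finished, and removing a result file after reading it is also an assumption.

```python
import os
import tempfile
import time

def poll_results(result_dir, query_done, poll_interval=0.01):
    """One get/continue round trip: poll until result files appear or the query has ended."""
    while True:
        done = query_done()  # sample before listing so no late file is missed
        names = sorted(os.listdir(result_dir))
        if names:
            batch = []
            for name in names:
                path = os.path.join(result_dir, name)
                with open(path, encoding="utf-8") as f:
                    batch.append(f.read())
                os.remove(path)  # assumption: consumed result files are removed
            return batch
        if done:
            return []  # an empty reply tells the client the flow is over
        time.sleep(poll_interval)

def fetch_all(result_dir, query_done):
    """Client side: one get, then continue requests until an empty reply."""
    received = []
    while True:
        batch = poll_results(result_dir, query_done)
        if not batch:
            return received
        received.extend(batch)

# Invented example: two datanodes have already written their result files.
with tempfile.TemporaryDirectory() as d:
    for i, rows in enumerate(["row-a", "row-b"]):
        with open(os.path.join(d, f"part-{i}"), "w", encoding="utf-8") as f:
            f.write(rows)
    received = fetch_all(d, query_done=lambda: True)
```

Checking `query_done` before listing the directory means an empty reply is only sent when the directory was empty after the query had already finished, so a file written just before completion is still delivered.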

Fig. 2 shows the jetty query flow in the embodiment of the invention, with jetty used as the web container. First, the client sends a get request to the HDFS side. The control node parses the json string (json is a lightweight data interchange format), instantiates a job object from the query conditions in the json string, submits the job for distributed query, and finally returns the result. While the query runs on HDFS, jetty polls the query result directory; if it is not empty, jetty reads the files and returns them to the client. The client then continues to send continue requests to the HDFS side, and the control node starts multiple threads to read the query results and returns the data read to the client. If the query result is empty and the query on HDFS has finished, an empty result is returned and the flow ends. If any step fails, the flow returns immediately, for example on an exception, a request error, or a missing index folder.
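The step of parsing the json string and instantiating a job object can be sketched as follows. The patent does not give the request schema, so the field names `start_time`, `end_time`, and `filters`, the `QueryJob` class, and the error messages are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field

@dataclass
class QueryJob:
    """Hypothetical job object built from the query conditions in the json string."""
    start_time: str
    end_time: str
    filters: dict = field(default_factory=dict)

def parse_get_request(body):
    """Parse the json body of a get request, failing fast on a malformed request."""
    try:
        cond = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError(f"request error: {exc}") from exc
    # The query condition must contain a query time condition.
    if not isinstance(cond, dict) or "start_time" not in cond or "end_time" not in cond:
        raise ValueError("request error: query time condition is required")
    return QueryJob(cond["start_time"], cond["end_time"], cond.get("filters", {}))

# Invented example request.
job = parse_get_request(
    '{"start_time": "2016-12-30 09:00", "end_time": "2016-12-30 10:00",'
    ' "filters": {"level": "ERROR"}}'
)
```

Rejecting the request before any job is submitted corresponds to the "return on failure" branch of the flow: no distributed work starts for a request that cannot be turned into a valid job.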

In some embodiments of the present invention, the client is further configured to send an end request to the real-time transmission middleware;

The real-time transmission middleware is further configured to stop polling the query result directory after receiving the end request sent by the client.

If the client sends an end request to the real-time transmission middleware, the middleware no longer returns query results to the client. The client's request is thereby answered promptly, the occupation of transmission resources is reduced, and resource utilization is improved.

In some embodiments of the present invention, the control node is specifically configured, after obtaining the locations of the data that satisfy the query condition, to determine from those locations the Internet Protocol (IP) address of the data node where the target data resides and the offset, to find the data node storing the target data according to the IP address, and then to find the target data on that data node according to the offset.

The Hadoop-based big data real-time processing system in the embodiments of the present invention can query in real time. Using a distributed computing system, a query task (job) is created and submitted at the control node, and the query proceeds in the following stages. First, index filtering is performed on the control node: because index files are named according to their creation time, the query time condition in the query condition is matched against the index file names, and the index files that satisfy the condition are selected. The query task is then distributed to every datanode, and the filtered index files and the query condition are used to query through the B+ tree, which yields the locations of the data that satisfy the condition. The task is then distributed a second time: data is read on every machine according to the locations obtained in the previous step, and the query results are returned. The efficient B+ tree structure and the parallel execution of the query allow the query to complete in real time. The location of a piece of data records the IP address of the machine (datanode) storing it and the offset; the machine is found by the IP address, and the corresponding data is then found by the offset.
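The index filtering stage and the use of (IP, offset) locations can be sketched as follows. The patent only states that index files are named according to creation time, so the naming scheme used here (one file per hour, named `YYYYMMDDHH.idx`) is an assumption, as are the function names and the sample data.

```python
from datetime import datetime, timedelta

FMT = "%Y%m%d%H"              # assumption: file name encodes the creation hour
BUCKET = timedelta(hours=1)   # assumption: each index file covers one hour

def filter_index_files(names, start, end):
    """Keep index files whose time bucket overlaps the query window [start, end]."""
    selected = []
    for name in names:
        created = datetime.strptime(name.split(".")[0], FMT)
        if created <= end and created + BUCKET > start:
            selected.append(name)
    return sorted(selected)

def group_by_datanode(locations):
    """Group (ip, offset) locations so each datanode reads only its own offsets."""
    plan = {}
    for ip, offset in locations:
        plan.setdefault(ip, []).append(offset)
    return plan

# Invented example: four hourly index files and a 09:30 to 10:15 query window.
names = ["2016123008.idx", "2016123009.idx", "2016123010.idx", "2016123011.idx"]
selected = filter_index_files(
    names, datetime(2016, 12, 30, 9, 30), datetime(2016, 12, 30, 10, 15)
)
plan = group_by_datanode([("10.0.0.1", 0), ("10.0.0.2", 64), ("10.0.0.1", 128)])
```

Filtering by file name alone means the control node never opens an index file outside the query window, and grouping locations by IP address gives the second task distribution a per-datanode read plan.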

In some embodiments of the present invention, the query time condition includes a query start time and a query end time. The query condition must contain a query time condition, so that the query start time and query end time set by the client can be obtained. The query task can then be executed according to the client's data acquisition request: it is started according to the query start time and ended according to the query end time.

All processing in the embodiments of the present invention is executed concurrently, which makes maximum use of the computer hardware and greatly improves processing efficiency, so that users obtain query results as soon as they perform a query operation. The present invention covers real-time ingestion of data, real-time query, and real-time transmission of results; in the big data real-time processing system provided by the embodiments of the present invention, ingestion, query, and transmission of data are all concurrent and real-time. Index filtering is performed while the task is being created, and the filtered index files are distributed to the datanodes while filtering proceeds; at the same time, each datanode queries its local files and returns data to the client. As soon as any datanode completes its query, the query result is returned to the user. With concurrent processing throughout, the efficient B+ tree structure, and parallel execution of the query, queries complete in real time and query efficiency is greatly improved.

It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software plus the necessary general-purpose hardware, and of course may also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function may take many forms, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is in most cases the preferred embodiment. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, portable hard drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), magnetic disk, or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

In summary, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A Hadoop-based big data real-time processing system, characterized in that the Hadoop-based big data real-time processing system comprises: a client, a real-time transmission middleware, and a distributed file system HDFS, wherein,

The HDFS includes: a control node (namenode) and a plurality of data nodes (datanode);

The control node is configured to start multiple threads on the plurality of data nodes, create in real time the indexes respectively corresponding to the multiple pieces of data to be ingested, and store the indexes in a plurality of index files according to creation time;

The real-time transmission middleware is configured to forward the data acquisition request sent by the client to the control node;

The control node is configured to create a query task according to the data acquisition request sent by the client, the query task including the query condition that the target data satisfies, and the query condition including a query time condition; to match the query time condition in the query condition against the plurality of index files and filter out the index files that satisfy the query time condition; to distribute the query task to the plurality of data nodes and query the plurality of data nodes according to the filtered index files and the query condition, thereby obtaining the locations of the data that satisfy the query condition; to distribute the query task to the plurality of data nodes again and read data on the plurality of data nodes according to the locations of the data that satisfy the query condition; and to return a query result as soon as any one of the plurality of data nodes completes its query successfully;

The real-time transmission middleware is configured to poll the query result directory at a preset polling period and, if the query result directory is not empty, read the query result files in the query result directory and return them to the client;

2. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the control node is specifically configured to create in real time, according to the structure of a B+ tree, the indexes respectively corresponding to the multiple pieces of data to be ingested; and,

The control node is specifically configured to query the plurality of data nodes through the B+ tree according to the filtered index files and the query condition, thereby obtaining the locations of the data that satisfy the query condition.

3. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the client is further configured to continue by sending a data acquisition continuation request to the HDFS;

The HDFS is further configured to respond to the data acquisition continuation request through the control node, and to start multiple threads on the plurality of data nodes to obtain the query result corresponding to the data acquisition continuation request;

The real-time transmission middleware is further configured to read, at the preset polling period, the query result corresponding to the data acquisition continuation request and return it to the client.

4. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the client is further configured to send an end request to the real-time transmission middleware;

The real-time transmission middleware is further configured to stop polling the query result directory after receiving the end request sent by the client.

5. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the control node is specifically configured, after obtaining the locations of the data that satisfy the query condition, to determine from those locations the Internet Protocol (IP) address of the data node where the target data resides and the offset, to find the data node storing the target data according to the IP address, and then to find the target data on that data node according to the offset.

6. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the query time condition includes: a query start time and a query end time.

7. The Hadoop-based big data real-time processing system according to claim 1, characterized in that the real-time transmission middleware specifically uses jetty as the network (web) container.