CN202634489U

CN202634489U - Real-time analysis processing system of mass data based on Hadoop

Info

Publication number: CN202634489U
Application number: CN 201220257946
Authority: CN
Inventors: 包丽霞
Original assignee: Individual
Current assignee: Beijing Yonghong Tech Co ltd
Priority date: 2012-06-04
Filing date: 2012-06-04
Publication date: 2012-12-26
Anticipated expiration: 2022-06-04

Abstract

The utility model relates to a Hadoop-based system for real-time analysis and processing of mass data. The system comprises multiple servers that are networked and deployed to a cloud platform, the cloud platform comprising at least a Client server, a Naming server, a Map server and a Reduce server. The Map server holds the deployed original data, receives a Map Task and executes it. The Client server, on receiving an analysis request from a user, first obtains the current Map-Reduce state from the Naming server in order to formulate a Job; after the Job is generated, it sends the Reduce Task to the Reduce server and the Map Task to the Map server. The Naming server, on receiving a plan request from the Client server, generates a schedule from the current states of the Map server and the Reduce server and returns it to the Client server, which begins generating the Job once it has received the schedule. The Reduce server receives the Reduce Task and executes it; after receiving a result returned by a Map Task, it reads the Client Key in that result, generates the corresponding Reduce Key, records on a whiteboard that the Map Task has completed, and returns the final result to the Client server, which presents it to the end user in visual form.

Description

A Hadoop-based system for real-time analysis and processing of mass data

Technical field

The utility model relates to a system, in the field of cloud computing, that meets real-time processing requirements for mass data on the basis of the Hadoop framework. More specifically, it concerns the real-time processing of mass data in data applications and is applied to the data analysis and data processing of automated processing systems.

Background technology

Cloud computing is regarded as the new trend in the IT industry. It can be roughly defined as using scalable computing resources provided by a service outside one's own environment and paying according to usage. Any resource in the "cloud" can be accessed over the Internet without worrying about computing capacity, bandwidth, storage, security or reliability.

From an enterprise's point of view, ever-growing information has become difficult to store in standard relational databases or even data warehouses. For example, how does one query a table with one billion rows, or run a query across all the logs on every server in a data center? To complicate matters further, much of the data is unstructured or semi-structured, which makes it even harder to query.

Hadoop is a framework that can process mass data in a distributed manner. It offers many advantages for mass data processing:

1. High fault tolerance: HDFS is designed on the assumption that any server node may go down or that the network may become partitioned, either of which can make some machines unavailable. Hadoop achieves high fault tolerance by the following means:

1.1 heartbeat detection and file replication;

1.2 data integrity checking;

1.3 multi-source metadata backup and a log mechanism;

1.4 cluster balancing.

2. High scalability: a Hadoop cluster can scale from a single machine to thousands of machines, giving it a strong ability to cope with changes in business load. Such changes may be a swing in traffic between peak and trough within a few hours, or medium- to long-term traffic growth or change.

3. High maturity: many traditional IT giants in the industry work on Hadoop and have polished the system until it is quite mature and stable, so applications built on Hadoop need not worry about the stability of Hadoop itself. There are also many related suites based on Hadoop; for example, HBase, Hive and ZooKeeper can be used on top of, or in combination with, Hadoop.

However, the Hadoop architecture also has certain disadvantages when processing mass data. The Map Reduce framework of Hadoop is designed to support high-throughput access and largely ignores the latency of task processing. The following are some typical aspects of its implementation that hinder real-time processing:

1. The task distribution center of Hadoop's Map Reduce framework does not push information to the servers; instead, each server requests tasks through its heartbeat. The heartbeat interval is typically 3 seconds and grows as the number of servers increases. For real-time processing requirements, this is quite time-consuming.

2. Hadoop is a general-purpose framework. For the sake of generality, the code file set of a Map Reduce job is itself transferred through HDFS (the file system) to the servers, unpacked there, and then loaded and run by starting a new JVM process, which is quite time-consuming. During the run of a single Job, such JVM processes are started and stopped more than five or six times, which cannot meet the demands of real-time processing.

3. Also for the sake of generality, the results of Map Reduce are written to HDFS, and the user can obtain them only by accessing HDFS again, which needlessly costs yet more time.

From the above characteristics of the Hadoop Map Reduce architecture, it can be seen that Hadoop Map Reduce is suited to accessing mass data in batch mode but cannot meet the demand for real-time processing of mass data. The main goal of building real-time business intelligence is to support real-time decision-making, which places higher demands on the immediacy, speed and stability of mass data processing.

Summary of the invention

The main purpose of the utility model is, in view of the characteristics and disadvantages of the Hadoop Map-Reduce framework with respect to real-time processing of mass data, to construct a Map-Reduce framework that serves BO itself and greatly improves the ability of the Hadoop platform to run Jobs in real time. It achieves efficient information exchange and reduces the time spent on real-time transmission and deployment, so that the whole business intelligence system gains a substantial improvement in its ability to process mass data in real time.

More specifically, the utility model relates to a Hadoop-based system for real-time analysis and processing of mass data. The system comprises multiple servers that are networked and deployed to a cloud platform, the cloud platform comprising at least a Client server, a Naming server, a Map server and a Reduce server. The Map server is used to deploy original data, to receive a Map Task and to execute it. The Client server is used, on receiving a user-initiated analysis request, first to obtain the current Map-Reduce state from the Naming server in order to formulate a Job and, after the Job is generated, to send the Reduce Task to the Reduce server and the Map Task to the Map server. The Naming server is used, on receiving a plan request initiated by the Client server, to produce a schedule from the currently obtained states of the Map server and the Reduce server and to communicate with the Client server so that the Client server begins to generate the Job after receiving the schedule. The Reduce server is used to receive the Reduce Task and execute it and, after receiving a result returned by a Map Task, to read the Client Key in the returned result, generate the corresponding Reduce Key, record on the whiteboard that the Map Task has completed, and return the final result to the Client server so that the Client server presents the result to the end user in visual form.

Description of drawings

Figure 1 is a structural block diagram of the Hadoop-based system for real-time analysis and processing of mass data according to the utility model.

Embodiment

To solve the above technical problems, the utility model provides a Hadoop-based system for real-time analysis and processing of mass data, which adopts the following technical scheme:

1. Multiple servers are networked and deployed to a cloud platform. One is configured as the Client server, one as the Naming server, some as Map servers, and some as Reduce servers.

The Client server is responsible for receiving client requests, decomposing the analysis and processing requirements, and handing them to the Map-Reduce framework for processing. The Client server formulates the Job (task) itself and notifies the Reduce server and the Map servers to execute Tasks (subtasks).

The Naming server is responsible for naming. It knows how many Map servers and Reduce servers currently exist and the configuration state of each. The Map servers and Reduce servers periodically send it their configuration, workload, CPU, memory and similar information.
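The bookkeeping done by the Naming server can be illustrated with a minimal Python sketch. The class names, field names and the 10-second freshness window are assumptions made for illustration; the text only states that servers report their configuration, workload, CPU and memory at regular intervals.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class ServerStatus:
    """One periodic report from a Map or Reduce server (field names are illustrative)."""
    host: str
    role: str            # "map" or "reduce"
    cpu_cores: int
    memory_mb: int
    workload: int        # number of tasks currently running
    reported_at: float = field(default_factory=time.time)


class NamingRegistry:
    """Keeps the latest report per server and lists those seen recently."""

    def __init__(self, stale_after_s: float = 10.0):
        self.stale_after_s = stale_after_s   # assumed freshness window
        self._latest: dict[str, ServerStatus] = {}

    def report(self, status: ServerStatus) -> None:
        # A newer report from the same host replaces the older one.
        self._latest[status.host] = status

    def live_servers(self, role: str, now: float | None = None) -> list[ServerStatus]:
        now = time.time() if now is None else now
        return [s for s in self._latest.values()
                if s.role == role and now - s.reported_at <= self.stale_after_s]
```

Treating a server whose last report is too old as unavailable is one possible policy; the source does not say how the Naming server decides that a server has dropped out.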

The Map server is responsible for processing Map Tasks. The client's original data and the code file set of the Map Task are deployed on it in advance, so that when it receives a Map Task from the Client server it can execute the task directly.

The Reduce server is responsible for processing Reduce Tasks. The code file set of the Reduce Task is deployed on it in advance, so it can execute the task directly.

Every server keeps a daemon process running at all times. Because the code file set to be executed is deployed in advance, there is no need to start or stop a process when a request is received. The daemon process manages itself and executes tasks in a thread pool.
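A long-running worker of this kind might look like the following Python sketch. `TaskDaemon`, `deploy` and `submit` are hypothetical names; registering a handler stands in for pre-deploying the code file set, and the thread pool replaces per-request process start-up.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class TaskDaemon:
    """Long-lived worker: handlers are registered once, tasks run on pooled threads."""

    def __init__(self, max_workers: int = 4):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._handlers: Dict[str, Callable[[dict], object]] = {}

    def deploy(self, name: str, handler: Callable[[dict], object]) -> None:
        # Stands in for pre-deploying the task's code file set.
        self._handlers[name] = handler

    def submit(self, name: str, payload: dict) -> Future:
        # No process start/stop per request: a thread from the pool is reused.
        return self._pool.submit(self._handlers[name], payload)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

For example, after `daemon.deploy("sum", lambda p: sum(p["values"]))`, a call to `daemon.submit("sum", {"values": [1, 2, 3]})` returns a future whose result is available without any new process having been launched.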

Communication between the servers uses an independently developed communication mechanism whose characteristics are that it is multi-channel, multiplexed and asynchronous. This mechanism improves the stability of mass data transfer and exchange, reduces CPU and memory overhead, and improves transmission efficiency between nodes.

2. When the Client server receives an analysis request initiated by a client, it first obtains the current Map-Reduce state from the Naming server in order to formulate a Job (task).

3. When the Naming server receives the plan request initiated by the Client server, it produces a schedule from the currently known states of the Map servers and Reduce servers. The schedule includes the configuration (CPU, memory) of each machine, its task load, the information on the files pre-deployed on the Map servers, and so on. The Naming server returns the schedule to the Client server.

4. After receiving the schedule, the Client server begins to generate the Job (task). The Job comprises one Client Key (ticket), multiple Map Tasks and one Reduce Task.

The Client Key is a ticket used by the messaging mechanism. The Client server has a Message Board (message whiteboard), which is used to monitor the completion status of a task; the ticket is the unique identifier under which a task is registered. When a task gives no feedback for a long time, the Client server is automatically notified so that it can report an error. When the task has finished executing, the ticket must be unregistered from the whiteboard.
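The ticket mechanism of the Message Board can be sketched in Python as follows. The 30-second default timeout and the method names are assumptions; the text gives no concrete timeout and does not describe how tickets are generated, so a random UUID is used here purely for illustration.

```python
from __future__ import annotations

import time
import uuid


class MessageBoard:
    """Tracks outstanding tickets and reports those that have waited too long."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s            # assumed; the source gives no value
        self._registered_at: dict[str, float] = {}

    def register(self, now: float | None = None) -> str:
        ticket = uuid.uuid4().hex             # unique identifier for one task
        self._registered_at[ticket] = time.monotonic() if now is None else now
        return ticket

    def unregister(self, ticket: str) -> None:
        # Called when the task has finished executing.
        self._registered_at.pop(ticket, None)

    def expired(self, now: float | None = None) -> list[str]:
        # Tickets with no feedback for longer than the timeout.
        now = time.monotonic() if now is None else now
        return [t for t, t0 in self._registered_at.items()
                if now - t0 > self.timeout_s]
```

A caller would poll `expired()` periodically and report an error for each ticket it returns; the optional `now` argument exists only to make the behaviour easy to check deterministically.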

A Map Task is the subtask given to a Map server. It contains the address of the Reduce server, the name of the storage file holding the data to be processed, the instruction for the statistical task to be executed, the Client Key (ticket), and so on. A Reduce Task is the subtask given to the Reduce server. It contains the Client Key (ticket), the instruction for the statistical task to be executed, the number of Map Tasks, and so on.
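The contents of a Job listed above map naturally onto a few record types. The following Python sketch uses hypothetical field names and represents the statistical instruction as a plain string. For brevity it takes the Reduce server address as an argument, whereas in the described flow the address is filled in only after the Reduce Task has been delivered.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class MapTask:
    client_key: str        # ticket shared by the whole Job
    reduce_address: str    # where the partial result is sent
    file_name: str         # storage file holding the data to process
    instruction: str       # statistical instruction to execute


@dataclass(frozen=True)
class ReduceTask:
    client_key: str
    instruction: str
    map_task_count: int    # how many Map results to wait for


@dataclass(frozen=True)
class Job:
    client_key: str
    map_tasks: tuple[MapTask, ...]
    reduce_task: ReduceTask


def make_job(files: list[str], instruction: str, reduce_address: str) -> Job:
    """Build one Job: one ticket, one Map Task per file, one Reduce Task."""
    key = uuid.uuid4().hex
    maps = tuple(MapTask(key, reduce_address, f, instruction) for f in files)
    return Job(key, maps, ReduceTask(key, instruction, len(maps)))
```

Calling `make_job(["Table1", "Table2", "Table3"], "sum(sales) by product", "reduce-1:9000")` yields three Map Tasks and a Reduce Task that expects three results, all carrying the same Client Key.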

5. After generating the Job (task), the Client server sends the Reduce Task to a Reduce server and the Map Tasks to Map servers. The Client server first sends the Reduce Task to an available Reduce server. If the task is not delivered successfully, that Reduce server is taken to be unavailable and the next Reduce server is tried, until an available one is found; if none is available, an error message is returned. If the Reduce server is available, that is, it has not gone down, the Client server adds the address of that Reduce server to the Map Tasks and sends them to the Map servers. The Client server has its own fault-tolerance mechanism. For example, if the schedule shows five available Map servers and the Client server has only one Map Task, it first connects to the Map server with the highest priority to send the task; if that Map server has gone down, it sends the task to the Map server with the next-highest priority.
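The retry behaviour in step 5 amounts to walking a priority-ordered list until one server accepts the task. A minimal Python sketch follows; `send` is a hypothetical transport callback, since the patent's own communication mechanism is not specified in enough detail to reproduce.

```python
from typing import Callable, Optional, Sequence


def send_with_failover(candidates: Sequence[str],
                       send: Callable[[str], bool]) -> Optional[str]:
    """Try each server in priority order; return the first host that accepts.

    `send` is a hypothetical transport callback returning True on success.
    Returns None when every candidate is unavailable, in which case the
    caller would return an error message to the user.
    """
    for host in candidates:
        try:
            if send(host):
                return host
        except OSError:
            # Treat a connection failure as "server is down" and move on.
            continue
    return None
```

The same helper covers both cases in the text: finding an available Reduce server for the Reduce Task, and falling back to the next-priority Map server when the preferred one has gone down.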

6. When a Map server receives a Map Task, it executes the task. It takes the name of the storage file holding the data to be processed, locates the file through the file system, and reads the data. It takes the instruction for the statistical task to be executed, for example summing sales volume grouped by product, and executes the instruction to obtain a result. It then reads the address of the Reduce server and sends the result together with the Client Key (ticket) to the Reduce server.
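The example instruction in step 6, summing sales volume grouped by product, can be sketched in Python as below. The function name and the row shape are assumptions; reading the fragment from its storage file and sending the result to the Reduce server are omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def run_map_task(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum sales per product over one data fragment.

    `rows` yields (product_name, sales) pairs already read from the
    fragment's storage file; file access is omitted here.
    """
    totals: Dict[str, float] = defaultdict(float)
    for product, sales in rows:
        totals[product] += sales
    return dict(totals)
```

Each Map server would produce one such partial dictionary for its own fragment and send it, together with the Client Key, to the Reduce server.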

7. When the Reduce server receives the Reduce Task, it executes the Task. It reads the Client Key and generates the corresponding Reduce Key. The Reduce server also maintains a Message Board (message whiteboard), which is used to monitor the completion status of the Map Tasks, with the ticket serving as the unique identifier. It reads the number of Map Tasks, registers the Reduce Key on the whiteboard, and passes in the number of Map Tasks. Suppose three Map Tasks need to complete but the results of only two are returned: once the wait becomes too long, the Reduce server automatically notifies the Client server to report an error and removes the Reduce Task. If the results of all three Map Tasks are received within the normal time, the ticket is unregistered from the whiteboard and the Reduce Task begins to execute.
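The Reduce-side Message Board in step 7 is essentially a counter per Reduce Key. The following Python sketch shows only the counting; the timeout path that notifies the Client server is left out, and all names are hypothetical.

```python
from __future__ import annotations


class ReduceBoard:
    """Counts Map results per Reduce Key until the expected number has arrived."""

    def __init__(self) -> None:
        self._expected: dict[str, int] = {}
        self._results: dict[str, list] = {}

    def register(self, reduce_key: str, map_task_count: int) -> None:
        # Register the key together with the number of Map Tasks to wait for.
        self._expected[reduce_key] = map_task_count
        self._results[reduce_key] = []

    def record(self, reduce_key: str, partial) -> bool:
        """Store one Map result; return True once all results are in."""
        self._results[reduce_key].append(partial)
        return len(self._results[reduce_key]) == self._expected[reduce_key]

    def complete(self, reduce_key: str) -> list:
        """Unregister the key and hand back the collected partial results."""
        del self._expected[reduce_key]
        return self._results.pop(reduce_key)
```

With three Map Tasks registered, `record` returns `False` for the first two results and `True` for the third, at which point the key can be unregistered and the Reduce Task can run.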

8. When the Reduce server receives a result returned by a Map Task, it reads the Client Key in the returned result, generates the corresponding Reduce Key, and records on the whiteboard that this Map Task has completed. When the results of all Map Tasks have been sent to the Reduce server, it reads all the intermediate results, takes the instruction for the statistical task from the Reduce Task, for example summing sales volume grouped by product, and executes the instruction to obtain the final result.

The Reduce server returns the final result to the Client server, which then presents the result to the end user, for example in visual form.
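For the grouped-sum example, the final computation in step 8 is a key-wise merge of the partial results from the Map servers. A minimal Python sketch, assuming each partial result is a dictionary from product name to summed sales:

```python
from collections import Counter
from typing import Dict, Iterable


def run_reduce_task(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-fragment sums into a final per-product total."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds values key by key
    return dict(total)
```

This works because a sum is associative: summing per fragment and then summing the partial sums gives the same totals as summing over the whole table. Other statistics (an average, for instance) would need the Map side to return more than a single number per group.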

The utility model is further described below with reference to Figure 1 and a specific implementation.

As shown in Figure 1, six machines are networked and the business intelligence platform software is installed on them. One is configured as the Client server, one as the Naming server, three as Map servers and one as the Reduce server.

1. The original data is deployed to the Map servers. The database management interface on the master server connects to the database and reads the original data list, which is then split into three sub-lists that are stored on the Map servers. For example, a sales data list of a certain sales department is named Table and contains the fields product name, sales volume and date. Through deployment it is split into Table1, Table2 and Table3, which are stored on the three Map servers.

Here the backup parameter is set to 2, meaning that each data fragment is kept on two Map servers. As shown in Figure 1, Table1 is stored on Map servers A and B, Table2 on Map servers B and C, and Table3 on Map servers A and C. These mapping relations, together with the server configurations and current workloads, are reported periodically to the Naming server.
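The source does not say how fragments are assigned to servers. As one possible illustration, a simple wrap-around placement in Python reproduces the layout of Figure 1 for three fragments, three servers and a backup parameter of 2; the function name and the placement rule are assumptions.

```python
from typing import Dict, List


def place_fragments(fragments: List[str], servers: List[str],
                    replicas: int = 2) -> Dict[str, List[str]]:
    """Assign each fragment to `replicas` consecutive servers, wrapping around."""
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replica count")
    n = len(servers)
    return {frag: [servers[(i + k) % n] for k in range(replicas)]
            for i, frag in enumerate(fragments)}
```

With fragments `Table1`, `Table2`, `Table3` and servers `A`, `B`, `C`, this places Table1 on A and B, Table2 on B and C, and Table3 on C and A, so the loss of any single Map server still leaves one copy of every fragment.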

2. The user sends a request to the Client server, for example a query for the total sales volume of each product in each quarter. The Client server converts the business model into a concrete mathematical computation model and begins to formulate the Job (task).

1) The Client server requests a task schedule from the Naming server and informs it of the name of the data list to be processed, Table.

2) The Naming server assembles the task schedule from the information it has collected. By lookup it knows that Table has been split into three tables, Table1, Table2 and Table3. From the mapping table it finds that Table1 is stored on Map servers A and B, Table2 on Map servers B and C, and Table3 on Map servers A and C. The state table reflects the workload, CPU and memory state of each Map server. This information is added to the task schedule, which is sent to the Client server.

3) The task schedule is returned to the Client server. The Client server begins to formulate the Job (task): it first generates the Client Key, then generates the Reduce Task and Map Task1, Map Task2 and Map Task3.

4) The Client server sends the Reduce Task to the Reduce server. If sending fails, an error message is returned to the end user. If sending succeeds, the host address of the responding Reduce server is added to the Map Tasks.

5) The Client server draws up a priority order from the mapping relations and the state table in the task schedule. For example, Table1 is stored on Map servers A and B, but the CPU and memory configuration of Map server B is poorer, so Map Task1 is sent to Map server A first; if that server has gone down, the task is sent to Map server B instead. Map Task2 and Map Task3 are likewise sent to suitable Map servers.

6) After receiving a Map Task, each Map server executes it and sends the result to the Reduce server.

7) When the Reduce server has received the statistical results sent by the three Map servers, it begins to execute the Reduce Task and computes the final result.

8) The Reduce server sends the final result to the Client server.

The Client server receives the result and presents it to the user in visual form.
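The priority ordering in step 5) of the example can be sketched in Python as a sort over the servers that hold a given fragment. The ranking rule used here (lower workload first, then more CPU cores, then more memory) is an assumption; the text only says that a server with poorer CPU and memory configuration is tried later. The status values in the usage note are made up.

```python
from typing import Dict, List, Tuple

# host -> (cpu_cores, memory_mb, workload); a hypothetical shape for the state table
Status = Dict[str, Tuple[int, int, int]]


def rank_candidates(fragment: str, mapping: Dict[str, List[str]],
                    status: Status) -> List[str]:
    """Order the servers holding `fragment`, best candidate first.

    Ranking rule (an assumption): lower workload first, then more CPU
    cores, then more memory.
    """
    def key(host: str) -> Tuple[int, int, int]:
        cpu, mem, load = status[host]
        return (load, -cpu, -mem)

    return sorted(mapping[fragment], key=key)
```

Given the mapping from the example and a state table in which server B has fewer cores and less memory than server A, `rank_candidates("Table1", mapping, status)` puts A ahead of B, so Map Task1 goes to A first and falls back to B only if A is down.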

The accompanying drawing and the foregoing description give an embodiment of the utility model. It will be appreciated, however, that those skilled in the art can readily combine one or more of these components into a single functional component. Alternatively, a particular component can be divided into multiple functional components, or vice versa. The scope of the utility model is not limited by these specific examples. Many variations are possible, such as differences in structure, whether or not they are explicitly given in the specification. The scope of the utility model is at least as broad as that given by the appended claims.

Claims

1. A Hadoop-based system for real-time analysis and processing of mass data, the system comprising:

multiple servers, said servers being networked and deployed to a cloud platform, the cloud platform comprising at least a Client server, a Naming server, a Map server and a Reduce server; wherein

the Map server is used to deploy original data and, on receiving a Map Task, to execute the task;

the Client server is used, on receiving a user-initiated analysis request, first to obtain the current Map-Reduce state from the Naming server in order to formulate a Job and, after the Job is generated, to send the Reduce Task to the Reduce server and the Map Task to the Map server;

the Naming server is used, on receiving a plan request initiated by the Client server, to produce a schedule from the currently obtained states of the Map server and the Reduce server and to communicate with the Client server so that the Client server begins to generate the Job after receiving the schedule;

the Reduce server is used, on receiving the Reduce Task, to execute the Task and, after receiving a result returned by a Map Task, to read the Client Key in the returned result, generate the corresponding Reduce Key, record on the whiteboard that the Map Task has completed, and return the final result to the Client server so that the Client server presents the result to the end user in visual form.

2. The Hadoop-based system for real-time analysis and processing of mass data according to claim 1, wherein there are six said machines, configured respectively as one Client server, one Naming server, three Map servers and one Reduce server.