CN102880503B

CN102880503B - Data analysis system and data analysis method

Info

Publication number: CN102880503B
Application number: CN201210307198.9A
Authority: CN
Inventors: 王�锋; 漆兴; 赵国贤; 王志强
Original assignee: Sina Technology China Co Ltd
Current assignee: Sina Technology China Co Ltd
Priority date: 2012-08-24
Filing date: 2012-08-24
Publication date: 2015-04-15
Anticipated expiration: 2032-08-24
Also published as: CN102880503A

Abstract

The invention discloses a data analysis system and a data analysis method. The data analysis system comprises a to-be-scheduled task generation module, a to-be-scheduled task storage module, a task scheduling module, and a task processing module. The to-be-scheduled task generation module generates to-be-scheduled tasks from collected data according to predefined task parameters. The to-be-scheduled task storage module stores the generated tasks. The task scheduling module loads the to-be-scheduled tasks and calls the corresponding task processing module according to task type. The task processing module generates a corresponding Hive SQL (SQL-like structured query language) statement according to the analysis requirement in the task, sends the statement to a Hadoop-based data warehouse server, and completes the data analysis for the task after receiving the data returned by the server. Because data analysis is performed by a Hadoop system at the bottom layer of the data analysis system and tasks are managed comprehensively by the task scheduling module at the top layer, the Hadoop system simplifies the data analysis flow and the system makes tasks convenient to schedule and manage.

Description

Data analysis system and data analysis method

Technical field

The present invention relates to the technical field of data analysis, and in particular to a data analysis system and a data analysis method.

Background art

In recent years, as data on the Internet has kept growing, every company has faced the need to process massive amounts of data. Within a department, data analysis mainly serves the company's operations and maintenance. It is based mainly on logs produced by each department's servers, such as Apache logs and nginx logs. The goal is to derive clear, quantified figures for user access and data traffic along dimensions such as time, product line, and domain name, and thereby to inform server operations management, traffic allocation, estimation, and so on.

For example, the nginx logs of a certain product line are collected. After collection, the logs must be cleaned: for instance, they are first merged into 5-minute intervals, and all required fields are aggregated, which may include click counts and downloaded bytes. The statistics must be computed along several dimensions at once, for example by product line and domain name, or by product line and server IP, so the volume of data involved at this stage is very large.
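
The 5-minute merge described above can be illustrated with a minimal Java sketch. It is not the prior-art implementation: the class `FiveMinuteRollupSketch`, the `LogLine` fields, and the string key format are assumptions made only for this example.

```java
import java.util.Map;
import java.util.TreeMap;

class FiveMinuteRollupSketch {
    static final long BUCKET_SECONDS = 5 * 60;

    /** One cleaned log line; the field names are illustrative, not taken from the source. */
    static final class LogLine {
        final long epochSeconds;
        final String productLine;
        final String domain;
        final long bytes;

        LogLine(long epochSeconds, String productLine, String domain, long bytes) {
            this.epochSeconds = epochSeconds;
            this.productLine = productLine;
            this.domain = domain;
            this.bytes = bytes;
        }
    }

    /** Running totals for one (bucket, product line, domain) combination. */
    static final class Totals {
        long clicks;
        long bytes;
    }

    /** Start (in epoch seconds) of the 5-minute bucket that contains the timestamp. */
    static long bucketStart(long epochSeconds) {
        return epochSeconds - Math.floorMod(epochSeconds, BUCKET_SECONDS);
    }

    /** Merges log lines into totals keyed by "bucketStart|productLine|domain". */
    static Map<String, Totals> rollup(Iterable<LogLine> lines) {
        Map<String, Totals> totalsByKey = new TreeMap<>();
        for (LogLine line : lines) {
            String key = bucketStart(line.epochSeconds) + "|" + line.productLine + "|" + line.domain;
            Totals totals = totalsByKey.computeIfAbsent(key, k -> new Totals());
            totals.clicks += 1;          // every log line counts as one click
            totals.bytes += line.bytes;  // downloaded bytes are summed per bucket
        }
        return totalsByKey;
    }
}
```

Each hour yields twelve such buckets per dimension combination, which hints at why the data volume grows quickly once several dimensions are combined.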

From these data, an end user can obtain a detailed traffic curve for a given day, or the clicks or bandwidth for a given day or hour. Download-speed data along different dimensions can also be obtained.

In the prior art, data analysis based on a relational database generally includes the following stages: log collection, log data acquisition and preliminary processing, data splitting and loading into the database, merging of sharded-table data, aggregation outside the database, and data presentation.

Specifically, after receiving the data to be analyzed from the logs, the data analysis server first uses a data-checking shell script to check and format the data, and then performs preliminary processing (such as 5-minute cleaning, in which each log line is merged into the 5-minute point it falls in; each hour contains twelve 5-minute analysis points). On that basis, other heterogeneous shell scripts perform further intermediate analysis. This intermediate stage involves sharding, splitting, and merging tables in the relational database, and must also account for the processing speed of a relational database at data volumes in the millions of records, so a balancing scheme across a database cluster is adopted, and data pulled from databases on different servers are aggregated by yet other purpose-built shell scripts. Only after this complex multi-stage processing are the final statistics produced and displayed.

However, as website scale and user numbers keep growing, data volume increases sharply. The prior-art process derives its analysis results through complicated table sharding, splitting, and merging; the implementation flow is complex and requires dedicated maintenance staff. Moreover, each new business analysis requirement needs a new analysis script, which makes the approach hard to extend.

In summary, the prior-art data analysis method requires complicated sharding, splitting, and merging operations on a relational database, so its implementation flow is complex and hard to maintain; it is also ill-suited to accommodating new business analysis requirements.

Summary of the invention

Embodiments of the present invention provide a data analysis system and a data analysis method that do not rely on a relational database, thereby simplifying the data analysis flow and easing maintenance.

According to one aspect of the present invention, a data analysis system is provided, comprising:

a to-be-scheduled task generation module, configured to generate to-be-scheduled tasks from collected data according to predefined task parameters;

a to-be-scheduled task storage module, configured to store the to-be-scheduled tasks generated by the to-be-scheduled task generation module;

a task scheduling module and a task processing module, wherein the task scheduling module loads to-be-scheduled tasks from the to-be-scheduled task storage module and calls the corresponding task processing module according to the task type of each loaded task;

and the task processing module generates a corresponding Hive SQL (SQL-like structured query language) statement according to the analysis requirement in the task, sends it to a data warehouse server based on Hadoop (distributed computing), and completes the data analysis for the task after receiving the data returned by the server.

Wherein the task scheduling module specifically comprises a master task scheduling component and a slave task scheduling component;

the master task scheduling component is configured to load to-be-scheduled tasks from the to-be-scheduled task storage module and to call the corresponding task processing module according to the task type of each loaded task;

the slave task scheduling component is configured to, after the master task scheduling component has stopped running or can no longer run normally, load to-be-scheduled tasks from the to-be-scheduled task storage module and call the corresponding task processing module according to the task type of each loaded task.

Further, the system also comprises a master priority queue unit and a slave priority queue unit; and

the master task scheduling component is further configured to, after loading to-be-scheduled tasks into a scheduling stack, encapsulate each task in the scheduling stack whose execution time has arrived into a priority object and send the priority object to the master priority queue unit;

the master priority queue unit is configured to, after receiving a priority object, compare the priority of that priority object with the priorities of the other priority objects in the master priority queue unit and sort the priority object according to the comparison result;

the slave priority queue unit is configured to periodically keep its data consistent with the data in the master priority queue unit.

Preferably, the slave task scheduling component is further configured to, after the master task scheduling component has stopped running or can no longer run normally, load to-be-scheduled tasks into a scheduling stack, encapsulate each task in the scheduling stack whose execution time has arrived into a priority object, and send the priority object to the slave priority queue unit;

the slave priority queue unit is further configured to, after receiving a priority object, compare the priority of that priority object with the priorities of the other priority objects in the slave priority queue unit and sort the priority object according to the comparison result.

Preferably, the system also comprises a task management module, configured to receive defined task parameters and send the task parameters to the to-be-scheduled task generation module.

According to another aspect of the present invention, a data analysis method is also provided, comprising:

a to-be-scheduled task generation module generating to-be-scheduled tasks from collected data according to predefined task parameters and storing them in a to-be-scheduled task storage module;

a task scheduling module loading to-be-scheduled tasks from the to-be-scheduled task storage module and calling the corresponding task processing module according to the task type of each loaded task;

Preferably, before the corresponding task processing module is called according to the task type of the loaded task, the method also comprises:

the task scheduling module encapsulating the loaded task in a task processing thread; and

calling the corresponding task processing module is specifically: the task processing thread, while running, calls the corresponding task processing module according to the task type of the task.

Wherein the task scheduling module loading to-be-scheduled tasks from the to-be-scheduled task storage module is specifically:

the task scheduling module loading the to-be-scheduled tasks into a scheduling stack; and

before the task scheduling module encapsulates the loaded task in a task processing thread, the method also comprises:

the task scheduling module monitoring the execution time of each to-be-scheduled task in the scheduling stack, and taking out of the scheduling stack the tasks whose execution time has arrived.

Preferably, after the tasks whose execution time has arrived are taken out of the scheduling stack, the method also comprises:

the task scheduling module converting the task taken from the scheduling stack into a task instance and encapsulating the converted task instance into a priority object, the priority of which is determined according to the task attributes of the task;

the task scheduling module sending the priority object to a priority queue module;

the priority queue module, after receiving the priority object, comparing its priority with the priorities of the other priority objects in the priority queue module and automatically sorting the priority object according to the comparison result;

the task scheduling module obtaining the priority object with the highest priority from the priority queue module, initializing a task processing thread, and passing the task instance in the obtained priority object into that task processing thread; and

calling the corresponding task processing module is specifically:

the task processing thread calling the task processing module that corresponds to the task type of the task in the task instance.

Further, after the task processing thread calls the task processing module that corresponds to the task type of the task in the task instance, the method also comprises:

if the task scheduling module determines that the task type of the task in the task instance is the serial task type, obtaining the other tasks that follow this task from the to-be-scheduled task storage module;

for each of the other tasks that follow this task, the task scheduling module initializing a task processing thread to carry it, with each task processing thread calling the corresponding task processing module.

Wherein the task scheduling module initializing, for each of the other tasks that follow this task, a task processing thread to carry it, with each task processing thread calling the corresponding task processing module, specifically comprises:

for each parallel task among the other tasks, the task scheduling module initializing in parallel a task processing thread that carries the parallel task, with each such thread calling the corresponding task processing module.

Alternatively, the task scheduling module initializing, for each of the other tasks that follow this task, a task processing thread to carry it, with each task processing thread calling the corresponding task processing module, specifically comprises:

for a lower-level task among the other tasks that follow this task, the task scheduling module initializing, after this task has been processed, a task processing thread that carries the lower-level task, with that thread calling the corresponding task processing module.

Embodiments of the present invention use a Hadoop system (for example, the Hadoop-based Hive component installed in the Hive Server) at the bottom layer of the data analysis system to perform data analysis, and use the task scheduling module at the upper layer to manage tasks as a whole, while providing a more user-friendly and convenient interactive maintenance mode. Data analysis with the Hadoop system thus avoids the tedious sharding, splitting, and merging of relational database tables and simplifies the data analysis flow, and the system additionally makes tasks more convenient to schedule and manage.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of the data analysis system of an embodiment of the present invention;

Fig. 2 is a flowchart of the method by which the data analysis system of an embodiment of the present invention performs data analysis;

Fig. 3 is a flowchart of the processing carried out by the task processing module while the JobProcessor thread runs, according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of a specific internal structure of the task scheduling module and the priority queue module of an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. Note, however, that many of the details given in the specification serve only to give the reader a thorough understanding of one or more aspects of the invention; these aspects can also be practiced without those specific details.

Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as, but not limited to, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a module may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. For instance, both an application running on a computing device and the computing device itself can be modules. One or more modules may reside within a process and/or thread of execution, and a module may be localized on one computer and/or distributed between two or more computers.

The inventors considered using a Hadoop (distributed computing) system for data analysis. Hadoop is currently the most mature and most popular system for processing massive data, and offers high stability, scalability, and robustness. Its HDFS (Hadoop Distributed File System) component provides redundant data storage and is easy to scale out. Its MapReduce component uses easily scaled TaskTrackers to increase the capacity for analyzing massive data; in theory this capacity can be scaled without limit, which makes it particularly suitable for offline data analysis. Together they support the storage and analysis of massive data. Hive lets analysis requirements be submitted as Hive SQL (an SQL-like language; SQL stands for Structured Query Language); internally it converts Hive SQL into one or more MR (short for MapReduce) tasks and submits them to the JobTracker component in a certain order. The SQL-like approach facilitates data integration, ad hoc queries, and large-scale analysis of data stored in Hadoop files, so analysts can meet analysis requirements without writing complex MR tasks.

However, developing tasks directly against the client API (Application Programming Interface) provided by Hadoop requires complex development work and makes task submission inconvenient. Hadoop also provides no unified task scheduling, is not readily extensible to new requirements, and likewise lacks support for task-related monitoring, processing, and unified handling of result data.

Based on the above analysis, the technical solution provided by the embodiments of the present invention builds a data analysis system on top of Hadoop components (HDFS, MapReduce, Hive) at the bottom layer. As shown in Fig. 1, the data analysis system of this embodiment comprises: a to-be-scheduled task generation module 101, a to-be-scheduled task storage module 102, a task scheduling module 103, and a task processing module 104.

The method by which the data analysis system of this embodiment performs data analysis is shown in Fig. 2 and comprises the following steps:

S201: The to-be-scheduled task generation module 101 generates to-be-scheduled tasks according to user-defined task parameters and stores the generated tasks in the to-be-scheduled task storage module 102.

Specifically, each server sends the data it has collected to the to-be-scheduled task generation module 101, which generates to-be-scheduled tasks from the collected data according to the task parameters predefined by the user. The user-defined task parameters include a storage path, and the to-be-scheduled task generation module 101 stores the generated tasks in the to-be-scheduled task storage module 102 according to that storage path.

The to-be-scheduled task storage module 102 may specifically be a database that stores the to-be-scheduled tasks generated by the to-be-scheduled task generation module 101; this database may reside on a server dedicated to storing to-be-scheduled tasks. In general, the tasks stored in the to-be-scheduled task storage module 102 are all of the planned-task type, that is, they are all planned tasks.

S202: The task scheduling module 103 loads to-be-scheduled tasks from the to-be-scheduled task storage module 102.

The task scheduling module 103 obtains the to-be-scheduled tasks from the to-be-scheduled task storage module 102.

Preferably, the task scheduling module 103 loads to-be-scheduled tasks from the to-be-scheduled task storage module 102 into a scheduling stack, periodically monitors the execution time of each task in the scheduling stack, and takes out of the stack the tasks whose execution time has arrived for scheduling and processing. That is, once the task scheduling module 103 determines that the execution time of a task in the scheduling stack has arrived, it takes that task out of the stack and schedules and processes it. The detailed process is as follows. Using a task converter, the task scheduling module 103 converts the task into a task instance (jobtrace) that carries timing characteristics. The task scheduling module 103 also determines the priority of the task from its task attributes, such as task type, name, run time, and owner. Priorities are assigned to the different task attributes in advance; in other words, the mapping between task attributes and priorities is preconfigured. The task scheduling module 103 determines the priority of the task from this preconfigured mapping and, according to that priority, encapsulates the jobtrace into a priority object. A priority object is an object whose priority can be compared: one of its attributes is a priority attribute, through which priority objects can be compared with one another. The task scheduling module 103 also generates a unique object identifier (uuid) for the priority object. In Java, the priority object implements the Comparable interface, wraps the jobtrace object, and has a uuid identifier and a priority attribute.
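
One periodic scan of the scheduling stack could look roughly like the sketch below. The source does not give the data layout, so `PendingTask`, `DueTask`, and the choice to key the preconfigured priority mapping by task type alone are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class DueTaskScanSketch {
    /** A to-be-scheduled task waiting in the scheduling stack (illustrative fields). */
    static final class PendingTask {
        final String name;
        final String type;
        final long runAtMillis;

        PendingTask(String name, String type, long runAtMillis) {
            this.name = name;
            this.type = type;
            this.runAtMillis = runAtMillis;
        }
    }

    /** A task whose execution time has arrived, paired with its looked-up priority. */
    static final class DueTask {
        final PendingTask task;
        final int priority;

        DueTask(PendingTask task, int priority) {
            this.task = task;
            this.priority = priority;
        }
    }

    /**
     * Removes every task whose execution time has arrived from the scheduling stack
     * and assigns it a priority from the preconfigured mapping (keyed by task type here).
     */
    static List<DueTask> takeDue(List<PendingTask> schedulingStack, long nowMillis,
                                 Map<String, Integer> priorityByType, int defaultPriority) {
        List<DueTask> due = new ArrayList<>();
        Iterator<PendingTask> it = schedulingStack.iterator();
        while (it.hasNext()) {
            PendingTask task = it.next();
            if (task.runAtMillis <= nowMillis) {
                it.remove();  // the task leaves the scheduling stack once it is due
                due.add(new DueTask(task, priorityByType.getOrDefault(task.type, defaultPriority)));
            }
        }
        return due;
    }
}
```

A real scheduler would call something like `takeDue` from a timer and then wrap each result in a priority object; passing `nowMillis` in as a parameter simply keeps the sketch deterministic.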

Further, the data analysis system of this embodiment may also include a priority queue module 105. The task scheduling module 103 sends the priority object to the priority queue module 105. After receiving a priority object, the priority queue module 105 compares its priority with the priorities of the other priority objects in the module and, based on the comparison, automatically sorts the object into priority order. Specifically, the priority queue module 105 can rely on the following method of the object:

public int compareTo(Object o){}

to order the priority objects by priority.
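
As a hedged illustration of such a priority object, the sketch below uses the generic `Comparable<PriorityObject>` form rather than the raw `compareTo(Object o)` signature quoted above, and uses the JDK's `PriorityBlockingQueue` as a stand-in for the priority queue module. The convention that a larger number means higher priority, and the use of a `String` in place of the jobtrace object, are assumptions.

```java
import java.util.UUID;
import java.util.concurrent.PriorityBlockingQueue;

class PriorityQueueSketch {
    /** Comparable wrapper around a task instance, carrying a uuid and a priority attribute. */
    static final class PriorityObject implements Comparable<PriorityObject> {
        final String uuid = UUID.randomUUID().toString();
        final String jobtrace;  // stands in for the wrapped jobtrace object
        final int priority;     // assumption: a larger value means a higher priority

        PriorityObject(String jobtrace, int priority) {
            this.jobtrace = jobtrace;
            this.priority = priority;
        }

        @Override
        public int compareTo(PriorityObject other) {
            // Reverse numeric order so the head of the queue is the highest priority.
            return Integer.compare(other.priority, this.priority);
        }
    }

    /** Creates a thread-safe queue that orders elements with compareTo. */
    static PriorityBlockingQueue<PriorityObject> newQueue() {
        return new PriorityBlockingQueue<>();
    }
}
```

With this ordering, `poll()` or `take()` on the queue hands back the highest-priority object first; the relative order of objects with equal priority is not guaranteed.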

S203: The task scheduling module 103 encapsulates the obtained task in a task processing (JobProcessor) thread.

Preferably, the task scheduling module 103 obtains the priority object with the highest priority from the priority queue module 105 and calls the corresponding task processing module 104 to perform data analysis according to the task type of the task in that priority object. Specifically, a task-taking (JobTaker) thread in the task scheduling module 103 obtains the highest-priority priority object from the priority queue module 105, initializes a JobProcessor thread, and passes the task instance jobtrace in the obtained priority object into that JobProcessor thread, which then calls the task processing module 104. The task is thereby carried by the JobProcessor thread.

Preferably, there may be multiple task processing modules 104, which can process multiple tasks in parallel. Further, different task processing modules 104 may correspond to different task types, with each task processing module corresponding directly to a different business model. In the data analysis platform, task types may include a load (data loading) type, a transform (data transformation) type, an acquire (data acquisition) type, and so on. Each model is responsible for one type, and users can meet analysis requirements by flexibly combining several task models. During the task processing stage, the system automatically sends each task to the task processing module corresponding to its task type, which ensures that the task is processed correctly.

While running, the JobProcessor thread determines the corresponding task processing module 104 from the task type of the task it encapsulates (carries) and calls that module to complete the processing of the to-be-scheduled task.
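
The type-based dispatch described above might be sketched as a simple registry lookup. The `TaskProcessingModule` interface, its `process` method, and the registry class are hypothetical; only the three task type names come from the text.

```java
import java.util.EnumMap;
import java.util.Map;

class TaskDispatchSketch {
    /** Task types named in the text: load, transform, acquire. */
    enum TaskType { LOAD, TRANSFORM, ACQUIRE }

    /** Hypothetical contract for a task processing module. */
    interface TaskProcessingModule {
        String process(String taskName);
    }

    private final Map<TaskType, TaskProcessingModule> modules = new EnumMap<>(TaskType.class);

    /** Associates one processing module with one task type. */
    void register(TaskType type, TaskProcessingModule module) {
        modules.put(type, module);
    }

    /** What a JobProcessor-style thread would do: pick the module by task type and call it. */
    String dispatch(TaskType type, String taskName) {
        TaskProcessingModule module = modules.get(type);
        if (module == null) {
            throw new IllegalStateException("no task processing module registered for " + type);
        }
        return module.process(taskName);
    }
}
```

Keeping the mapping in a registry means a new task type only needs a new module registration, which fits the extensibility goal stated earlier in the document.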

In addition, after determining that the task type of the task in the JobProcessor thread is the serial task type, the task scheduling module 103 may obtain the other tasks that follow this task. Specifically, it can obtain them directly from the to-be-scheduled task storage module 102. For each of the other tasks that follow this task, the task scheduling module 103 initializes a JobProcessor thread to carry it, and each JobProcessor thread calls the corresponding task processing module to perform data analysis and task processing.

Specifically, for those following tasks that are at the same level as this task (referred to here as parallel tasks), the task scheduling module 103 adopts a parallel execution strategy: it initializes several JobProcessor threads in parallel, each of which carries one parallel task and calls the corresponding task processing module 104 to perform data analysis and task processing.

For those following tasks that are at the next level below this task (referred to here as lower-level tasks), the task scheduling module 103 adopts a level-by-level sequential execution strategy: after the task processing module 104 has finished processing this task, the task scheduling module 103 initializes a JobProcessor thread to carry the lower-level task, and that JobProcessor thread calls the corresponding task processing module 104 to perform data analysis and task processing.
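
The two strategies can be sketched with plain `Runnable`s standing in for JobProcessor threads. The source does not say how the parallel group and the lower-level tasks are ordered relative to each other, so this sketch assumes the simplest arrangement: the serial task runs first, the parallel tasks then run concurrently, and the lower-level tasks run one by one afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SerialTaskSketch {
    /**
     * Runs the serial task, then its same-level followers in parallel,
     * then its lower-level followers sequentially (ordering is an assumption).
     */
    static void run(Runnable serialTask, List<Runnable> parallelTasks, List<Runnable> lowerLevelTasks) {
        serialTask.run();
        if (!parallelTasks.isEmpty()) {
            // One worker per parallel task, mirroring "initialize several threads in parallel".
            ExecutorService pool = Executors.newFixedThreadPool(parallelTasks.size());
            try {
                List<Callable<Object>> calls = new ArrayList<>();
                for (Runnable task : parallelTasks) {
                    calls.add(Executors.callable(task));
                }
                pool.invokeAll(calls);  // blocks until every parallel task has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for parallel tasks", e);
            } finally {
                pool.shutdown();
            }
        }
        // Lower-level tasks start only after the serial task has been processed.
        for (Runnable task : lowerLevelTasks) {
            task.run();
        }
    }
}
```

Exceptions thrown inside a parallel task are captured in the futures returned by `invokeAll` and are ignored here; a production scheduler would need to inspect them.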

S204: The task processing module 104 processes the task in the JobProcessor thread and performs data analysis for it.

Specifically, after receiving the JobProcessor thread sent by the task scheduling module 103, the task processing module 104 runs it. The JobProcessor thread sends the corresponding Hive SQL request to the Hive Server, and the task processing module 104 obtains the data returned by the Hive Server, thereby completing the data analysis for the task. The task processing module 104 then processes the returned data, for example by recording, aggregating, or displaying it. The Hive Server is a Hadoop-based data warehouse server.
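
To give a feel for the kind of Hive SQL request involved, here is a minimal sketch that turns a simple analysis requirement into a statement. The table name, column names, the `dt` partition column, and the `total_` alias prefix are all hypothetical; the source does not specify how requirements are represented. The sketch assumes at least one dimension and trusted inputs (no escaping is done).

```java
import java.util.List;

class HiveSqlSketch {
    /**
     * Builds a GROUP BY query that sums each metric column per dimension combination
     * for one day. Assumes at least one dimension and trusted identifier strings.
     */
    static String build(String table, List<String> dimensions, List<String> metrics, String day) {
        StringBuilder select = new StringBuilder(String.join(", ", dimensions));
        for (String metric : metrics) {
            select.append(", SUM(").append(metric).append(") AS total_").append(metric);
        }
        return "SELECT " + select
            + " FROM " + table
            + " WHERE dt = '" + day + "'"
            + " GROUP BY " + String.join(", ", dimensions);
    }
}
```

For instance, the dimensions `product_line` and `domain` with the metrics `hits` and `bytes` yield a single statement that replaces the per-dimension scripts of the relational approach.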

The specific processing carried out by the task processing module 104 while the JobProcessor thread runs is shown in Fig. 3 and comprises the following steps:

S301: The JobProcessor thread generates the corresponding Hive SQL statement according to the analysis requirement in the task it carries.

S302: The JobProcessor thread calls the task execution (JobExecutor) thread in the task processing module 104 and sends the Hive SQL statement to the Job Executor thread.

The task processing module 104 contains a task execution (Job Executor) thread, a connection (Connector) thread, and a result processing (Result Processor) thread.

S303: The Job Executor thread sends a connection request to the Connector thread.

S304: After receiving the connection request from the Job Executor thread, the Connector thread establishes a connection to the Hive Server using JDBC (Java Database Connectivity).

S305: Once the connection is established, the Connector thread returns the available connection to the Job Executor thread.

S306: Using the returned connection, the Job Executor thread sends the Hive SQL request to the Hive Server.

S307: If the Job Executor thread receives the data returned by the Hive Server, it sends the returned data to the Result Processor thread for processing.

After receiving the request to execute Hive SQL from the Job Executor thread, the Hive Server performs the corresponding data analysis: for example, it converts the Hive SQL into one or more MR tasks, submits them to MapReduce in a certain order, receives the data returned by MapReduce, and returns those data to the Job Executor thread. How the Hive Server performs data analysis and returns result data for a received Hive SQL request is well known to those skilled in the art and is not described further here.

The Hive Server has the Hadoop-based Hive component installed. Hive lets analysis requirements be submitted in an SQL-like form and internally converts Hive SQL into one or more MR tasks that are submitted to MapReduce in a certain order. The SQL-like approach lets analysts meet analysis requirements without writing complex MR tasks, while MapReduce performs the data analysis and returns the result data. The data analysis process therefore avoids the tedious sharding, splitting, and merging of relational database tables, which simplifies the data analysis flow and makes the system easier for developers to maintain.

The Result Processor thread processes the data returned by the Hive Server, for example by recording the data, displaying it, or offering it for download. The data returned by the Hive Server are the final results of each task; currently, to support redundant storage and good scalability, task data are stored directly on HDFS. Users can download the data by task Id through the download interface provided by the system.

S308: If the Job Executor thread does not receive data from the Hive Server within a set time period, it returns an execution failure result to the Connector thread, and the Connector thread returns a different available connection to the Job Executor thread.

If the Job Executor thread does not receive data from the Hive Server within the set time period after sending the Hive SQL request, the task execution has failed, and the Job Executor thread returns an execution failure result to the Connector thread. On receiving this result, the Connector thread connects to another Hive Server and, once that connection succeeds, returns it to the Job Executor thread; this is an available connection different from the previous one.
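
Steps S303 to S308 can be sketched as a loop that tries one Hive Server connection after another until one returns data within the time limit. To keep the example self-contained, `HiveConnection` is a hypothetical stand-in for a JDBC connection to one Hive Server, and the list of connections plays the role of the Connector thread handing out a different connection after each failure.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class HiveFailoverSketch {
    /** Hypothetical stand-in for a connection to one Hive Server (a real one would wrap JDBC). */
    interface HiveConnection {
        String execute(String hiveSql) throws Exception;
    }

    /** Tries each connection in turn until one returns data within the timeout. */
    static String executeWithFailover(List<HiveConnection> connections, String hiveSql, long timeoutMillis) {
        ExecutorService worker = Executors.newCachedThreadPool();
        try {
            for (HiveConnection connection : connections) {
                Callable<String> request = () -> connection.execute(hiveSql);
                Future<String> pending = worker.submit(request);
                try {
                    return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    pending.cancel(true);  // execution failed: fall through to a different connection
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for the Hive Server", e);
                }
            }
            throw new IllegalStateException("no Hive Server returned data within the time limit");
        } finally {
            worker.shutdownNow();
        }
    }
}
```

Running each request on a worker thread is one way to enforce the "set time period"; whether the original system used a thread-level timeout or a driver-level one is not stated in the source.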

Further, the data analysis system provided by this embodiment also includes a server monitoring and recovery module (not shown in the figures).

If the Connector thread fails to connect while establishing a connection to the Hive Server, it sends a connection failure result to the server monitoring and recovery module.

If the Job Executor thread does not receive data from the Hive Server within the set time after sending the Hive SQL request, the task execution has failed, and the JobExecutor thread sends a task failure result to the server monitoring and recovery module.

The server monitoring and recovery module assesses the severity of each failure result it receives, such as a connection failure result or a task failure result. If the failure is judged serious, the failure count of the Hive Server associated with that failure result is incremented by 1. When the failure count of a Hive Server reaches the configured maximum count (MaxCount), that Hive Server is restarted. In addition, the Job Executor thread asks the Connector thread again for a different available connection.
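
The counting rule of the server monitoring and recovery module can be sketched as follows. The severity judgement is passed in as a boolean because the source does not describe how it is made; the restart hook and the decision to reset the count after a restart are likewise assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

class ServerRecoverySketch {
    private final int maxCount;
    private final Consumer<String> restartHook;  // hypothetical hook that restarts a Hive Server
    private final Map<String, Integer> failureCounts = new HashMap<>();

    ServerRecoverySketch(int maxCount, Consumer<String> restartHook) {
        this.maxCount = maxCount;
        this.restartHook = restartHook;
    }

    /**
     * Records one failure result for a server.
     * Returns true if this report pushed the count to MaxCount and triggered a restart.
     */
    synchronized boolean report(String server, boolean serious) {
        if (!serious) {
            return false;  // failures not judged serious are not counted
        }
        int count = failureCounts.merge(server, 1, Integer::sum);
        if (count < maxCount) {
            return false;
        }
        failureCounts.remove(server);  // assumption: counting starts afresh after a restart
        restartHook.accept(server);
        return true;
    }
}
```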

For the plan target treated in scheduler task memory module 102, the step of above-mentioned S201-S204 can be adopted to process, and be the task of unplanned task type for task type, as clicked the task of class, after user clicks this task, namely task scheduling modules 103 dispatches this tasks carrying.The step that task scheduling modules 103 dispatches this tasks carrying comprises: this task is encapsulated in JobProcessor thread by task scheduling modules 103, is sent to by JobProcessor thread corresponding task processing module 104 to carry out task process.

Preferably, after the user clicks the task, if the task scheduling module 103 determines that the total number of currently running tasks (that is, the total number of tasks sent to the task processing module 104 by JobProcessor threads) is below the limit, the task is executed normally: it is encapsulated in a JobProcessor thread, and the JobProcessor thread sends it to the corresponding task processing module 104 for task processing.

Otherwise, the task scheduling module 103 prompts the user to try again later.
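The admission check for click-type tasks can be sketched as below. The class `OnDemandScheduler`, its method names and the returned messages are hypothetical; the sketch only shows the comparison of the running-task count against a limit.

```python
class OnDemandScheduler:
    """Hypothetical admission control for unplanned (click-type) tasks."""

    def __init__(self, limit):
        self.limit = limit      # maximum number of tasks allowed to run at once
        self.running = 0        # tasks currently handed to a processing module

    def submit(self, task):
        if self.running < self.limit:
            self.running += 1   # task is wrapped and dispatched for processing
            return f"dispatched {task}"
        return "busy, please try again later"

    def finish(self):
        # Called when a dispatched task completes.
        self.running -= 1


scheduler = OnDemandScheduler(limit=2)
first = scheduler.submit("query-1")
second = scheduler.submit("query-2")
third = scheduler.submit("query-3")    # rejected: limit reached
scheduler.finish()
fourth = scheduler.submit("query-3")   # accepted after a slot frees up
```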

To further ensure the reliability of task scheduling, preferably, as shown in Figure 4, the task scheduling module 103 may comprise a master task scheduling component 401 and a slave task scheduling component 402.

Under normal circumstances, the master task scheduling component 401 loads the tasks to be scheduled from the to-be-scheduled task storage module and, according to the task type of each loaded task, calls the corresponding task processing module to perform data analysis and task processing.

If the master task scheduling component 401 stops running unexpectedly or cannot run normally, the slave task scheduling component 402 takes over the work performed by the master task scheduling component. That is, when the master task scheduling component has stopped or cannot run normally, the slave task scheduling component 402 loads the tasks to be scheduled from the to-be-scheduled task storage module and, according to the task type of each loaded task, calls the corresponding task processing module to perform data analysis and task processing.

Further, the priority queue module 105 may comprise a master priority queue unit 403 and a slave priority queue unit 404.

Under normal circumstances, after loading the tasks to be scheduled from the to-be-scheduled task storage module into the scheduling stack, the master task scheduling component 401 encapsulates each task in the scheduling stack whose execution time has arrived in a priority object and sends the priority object to the master priority queue unit 403. On receiving a priority object, the master priority queue unit 403 compares its priority with the priorities of the other priority objects in the unit and automatically orders the priority object by priority according to the comparison result. The slave priority queue unit periodically synchronizes its data with that of the master priority queue unit to keep the two consistent. The master task scheduling component 401 obtains the highest-priority priority object from the master priority queue unit 403 and, according to the task type of the task in that priority object, calls the corresponding task processing module to perform data analysis: the master task scheduling component 401 initializes a JobProcessor thread, loads the obtained priority object onto this JobProcessor thread, and the JobProcessor thread calls the corresponding task processing module to perform data analysis and task processing.
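A priority queue unit of this kind, together with a slave copy that is synchronized from the master, can be sketched with a binary heap. The class `PriorityQueueUnit`, its methods and the convention that a larger number means a higher priority are assumptions for illustration; the source does not specify the data structure or the priority scale.

```python
import heapq


class PriorityQueueUnit:
    """Hypothetical queue unit that keeps priority objects ordered by priority."""

    def __init__(self):
        self._heap = []
        self._seq = 0   # insertion counter: ties are served first in, first out

    def put(self, priority, task):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._seq, task))
        self._seq += 1

    def take_highest(self):
        return heapq.heappop(self._heap)[2]

    def sync_from(self, other):
        # Periodic synchronization: copy the other unit's current contents.
        self._heap = list(other._heap)
        self._seq = other._seq


master = PriorityQueueUnit()
slave = PriorityQueueUnit()
master.put(1, "daily report")
master.put(5, "urgent query")
master.put(3, "weekly rollup")
slave.sync_from(master)

master_first = master.take_highest()
slave_order = [slave.take_highest() for _ in range(3)]
```

Because the slave holds a copy of the master's contents, a scheduler that switches to the slave unit obtains tasks in the same priority order.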

If the master task scheduling component 401 stops running unexpectedly or cannot run normally, the slave task scheduling component 402 loads the tasks to be scheduled from the to-be-scheduled task storage module into the scheduling stack, encapsulates each task in the scheduling stack whose execution time has arrived in a priority object, and sends the priority object to the slave priority queue unit 404. On receiving a priority object, the slave priority queue unit 404 compares its priority with the priorities of the other priority objects in the slave priority queue unit 404 and automatically orders the priority object by priority according to the comparison result. The slave task scheduling component 402 obtains the highest-priority priority object from the slave priority queue unit 404, initializes a JobProcessor thread, and loads the obtained priority object onto this JobProcessor thread, which calls the corresponding task processing module to perform data analysis and task processing.

That is, under normal circumstances, the master task scheduling component carries out the operations of steps S202-S204 above that hand tasks to the task processing module 104; they are not repeated here. In step S202, the master task scheduling component sends the priority object to the priority queue module 105 specifically as follows: the master task scheduling component sends the priority object to the master priority queue unit in the priority queue module 105; on receiving the priority object, the master priority queue unit compares its priority with the priorities of the other priority objects in the unit and automatically sorts the priority object by priority within the master priority queue according to the comparison result. The slave priority queue unit periodically synchronizes its data with that of the master priority queue unit to keep the two consistent. In step S203, the master task scheduling component obtains the highest-priority priority object from the priority queue module 105 specifically as follows: the master task scheduling component obtains the highest-priority priority object from the master priority queue unit of the priority queue module 105.

If the master task scheduling component stops running unexpectedly or cannot run normally, the slave task scheduling component takes over the work performed by the master task scheduling component. That is, when the master task scheduling component has stopped or cannot run normally, the slave task scheduling component loads the tasks to be scheduled from the to-be-scheduled task storage module and, according to the task type of each loaded task, calls the corresponding task processing module to perform data analysis and task processing; in other words, the slave task scheduling component carries out the operations of steps S202-S204 above that hand tasks to the task processing module 104.

More preferably, the task scheduling module 103 may also comprise a common task scheduling component 405. If both the master task scheduling component and the slave task scheduling component stop running unexpectedly or cannot run normally, the common task scheduling component 405 takes over the work performed by the master task scheduling component, that is, it carries out the operations of steps S202-S204 above that hand tasks to the task processing module 104.
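The order in which the three scheduling components take over can be sketched as a simple selection over health flags. The function `pick_scheduler` and the way health is represented are hypothetical; how the system actually detects that a component has stopped is not described in the source.

```python
def pick_scheduler(components):
    """Return the first healthy component in master, slave, common order.

    components: list of (name, healthy) pairs, ordered by precedence.
    Returns None if no component is able to run.
    """
    for name, healthy in components:
        if healthy:
            return name
    return None


normal = pick_scheduler([("master", True), ("slave", True), ("common", True)])
master_down = pick_scheduler([("master", False), ("slave", True), ("common", True)])
both_down = pick_scheduler([("master", False), ("slave", False), ("common", True)])
```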

If the master priority queue unit stops running unexpectedly or cannot run normally, then when the master task scheduling component, the slave task scheduling component, or the common task scheduling component sends a priority object to the priority queue module 105, it specifically sends the priority object to the slave priority queue unit in the priority queue module 105. On receiving the priority object, the slave priority queue unit compares its priority with the priorities of the other priority objects in the unit and automatically sorts the priority object by priority within the slave priority queue according to the comparison result. Likewise, when the master task scheduling component, the slave task scheduling component, or the common task scheduling component obtains the highest-priority priority object from the priority queue module 105, it specifically obtains it from the slave priority queue unit of the priority queue module 105.

In practice, the master task scheduling component and the master priority queue unit may be installed on one server, and the slave task scheduling component and the slave priority queue unit on another server. Alternatively, the master task scheduling component, the master priority queue unit, the slave task scheduling component, and the slave priority queue unit may all be installed on a single server. The specific arrangement depends on load conditions.

Further, the data analysis system provided by the embodiment of the present invention may also comprise: a task management module, a template management module, a task monitoring module, a report service module, a Category management module, a basic analysis service management module, a function library management module, a RESTful API module, an account management module, and a task test and audit module.

The task management module provides management functions for planned and unplanned tasks; through it, users can create, modify, and delete task plans. In other words, the task management module receives the user-defined task parameters. It is the most important module of the system front end and covers operations such as task creation, schema binding, task start, and task stop. The plans and tasks that users set up through this module are saved to the database. For a planned task, the task is associated with a plan selected by the user. After the user starts a planned task, the task management module submits the user-defined task parameters to the to-be-scheduled task generation module 101 through RPC (Remote Procedure Call); the to-be-scheduled task generation module 101 can then use the task and its bound schema to generate tasks to be scheduled according to the user-defined task parameters and store them.

The template management module provides users with custom task templates and template management functions.

The task monitoring module monitors task execution progress and performance in the Hadoop MapReduce cluster and in each task processing module 104, and displays them in real time.

The report service module compiles statistics on the tasks run by the Hadoop MapReduce cluster. For example, taking every 10 minutes as one data point, it counts the number of tasks run by the cluster, the number of map tasks, the number of reduce tasks, the IO resources consumed by the running tasks, and so on.
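The 10-minute aggregation can be illustrated with a short sketch that counts tasks per time bucket. The function `tasks_per_bucket` and the representation of start times as seconds since an arbitrary origin are assumptions; the same bucketing would apply to map counts, reduce counts, or IO figures.

```python
from collections import Counter


def tasks_per_bucket(start_times, bucket_seconds=600):
    """Count tasks per fixed-width time bucket (600 s = 10 minutes).

    start_times: task start times in seconds (hypothetical input format).
    Returns a dict mapping each bucket's start second to its task count.
    """
    counts = Counter()
    for t in start_times:
        counts[t - t % bucket_seconds] += 1   # round down to the bucket start
    return dict(sorted(counts.items()))


sample = [0, 30, 599, 600, 1250]
buckets = tasks_per_bucket(sample)
# buckets == {0: 3, 600: 1, 1200: 1}
```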

The Category management module lets users view the data stored on the Hadoop HDFS cluster and provides data size statistics and data download services. To support permission management over user data, the data analysis platform divides data into different business domains, each of which is called a category, and users can operate only on the categories for which they have permission.

The basic analysis service management module stores common data analysis requirements as individual basic services and supports flexible extension, so that users can conveniently reuse them when creating new task plans.

The function library management module, currently aimed mainly at Hive, provides the function libraries needed for Hive analysis, such as UDFs (User-Defined Functions) and UDAFs (User-Defined Aggregation Functions), for users to use in their analyses.

The REST (REpresentational State Transfer) API module exposes service interfaces to users, so that users can request analysis services from the system through the REST API.

The account management module manages the accounts and permission verification of users of the data analysis system.

The task test and audit module sends pending tasks to the data analysis system for a test run and provides a test report based on the test results.

The modules of the data analysis system may all be installed on the same server or on different servers; the specific installation scheme depends on load conditions.

A person of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments can be carried out by hardware under the instruction of a program, and that the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A data analysis system, comprising:

A task scheduling module and a task processing module, wherein said task scheduling module loads tasks to be scheduled from said to-be-scheduled task storage module, encapsulates each acquired task in a task processing thread, and calls the corresponding task processing module according to the task type of the loaded task;

Said task processing module generates a corresponding SQL-like Hive SQL statement according to the analysis requirement in said task, calls the task execution thread in said task processing module, and sends the generated Hive SQL statement to said task execution thread; said task execution thread sends a connection request to the connection thread in said task processing module; said connection thread, according to the received connection request, uses Java Database Connectivity (JDBC) to establish a connection with the Hive Server, a data warehouse server based on the distributed computing framework Hadoop; after the connection is established, it returns an available connection to said task execution thread; said task execution thread sends a Hive SQL request to the Hive Server over the returned available connection, and sends the data returned by the Hive Server to the result processing thread in said task processing module for processing, thereby completing the data analysis of said task.

2. The system as claimed in claim 1, characterized in that said task scheduling module specifically comprises: a master task scheduling component and a slave task scheduling component;

3. The system as claimed in claim 2, characterized by further comprising a priority queue module, wherein:

Said task scheduling module is further configured to, after loading the tasks to be scheduled into a scheduling stack, encapsulate each task in said scheduling stack whose execution time has arrived in a priority object, and send said priority object to the priority queue module;

Said priority queue module is configured to, after receiving a priority object, compare the priority of this priority object with the priorities of the other priority objects in said priority queue module, and sort this priority object according to the comparison result;

Said task scheduling module is further configured to obtain the highest-priority priority object from said priority queue module, and to call the corresponding task processing module, according to the task type of the task in the obtained priority object, to perform data analysis.

4. The system as claimed in claim 3, characterized in that said priority queue module specifically comprises: a master priority queue unit and a slave priority queue unit; and

Said master task scheduling component is further configured to obtain the highest-priority priority object from said master priority queue unit, and to call the corresponding task processing module, according to the task type of the task in the obtained priority object, to perform data analysis;

5. The system as claimed in claim 4, characterized in that,

Said slave task scheduling component is further configured to, after said master task scheduling component has stopped running or cannot run normally, load the tasks to be scheduled into the scheduling stack, encapsulate each task in said scheduling stack whose execution time has arrived in a priority object, and send said priority object to said slave priority queue unit; and to obtain the highest-priority priority object from said slave priority queue unit, and call the corresponding task processing module, according to the task type of the task in the obtained priority object, to perform data analysis;

6. The system as claimed in any one of claims 1-5, characterized by further comprising:

A task management module, configured to receive defined task parameters and to send said task parameters to said to-be-scheduled task generation module.

7. A data analysis method, comprising:

A task scheduling module loads tasks to be scheduled from said to-be-scheduled task storage module, encapsulates each loaded task in a task processing thread, and runs said task processing thread;

Said task processing thread, while running, calls the corresponding task processing module according to the task type of said task:

According to the analysis requirement in the task it carries, a corresponding SQL-like Hive SQL statement is generated, the task execution thread in the task processing module is called, and the generated Hive SQL statement is sent to said task execution thread;

Said task execution thread sends a connection request to the connection thread in said task processing module;

Said connection thread, according to the received connection request, uses JDBC to establish a connection with the Hadoop-based Hive Server; after the connection is established, it returns an available connection to said task execution thread;

Said task execution thread sends a Hive SQL request to the Hive Server over the returned available connection, and sends the data returned by the Hive Server to the result processing thread in said task processing module for processing, thereby completing the data analysis of said task.

8. The method as claimed in claim 7, characterized in that said task scheduling module loading tasks to be scheduled from said to-be-scheduled task storage module is specifically:

9. The method as claimed in claim 8, characterized in that, after said taking out from the scheduling stack of the task whose execution time has arrived, the method further comprises:

Said calling of the corresponding task processing module is specifically:

10. The method as claimed in claim 9, characterized in that, after said task processing thread calls the task processing module corresponding to the task type of the task in said task instance, the method further comprises:

11. The method as claimed in claim 10, characterized in that said task scheduling module initializing, for each of the other tasks following this task, a task processing thread to carry it, and each task processing thread calling the corresponding task processing module, specifically comprises:

12. The method as claimed in claim 10, characterized in that said task scheduling module initializing, for each of the other tasks following this task, a task processing thread to carry it, and each task processing thread calling the corresponding task processing module, specifically comprises: