CN104331421A

CN104331421A - High-efficiency processing method and system for big data

Info

Publication number: CN104331421A
Application number: CN201410540392.0A
Authority: CN
Inventors: 王佐成; 任子晖; 马韵洁; 张凯
Original assignee: Anhui Sun Create Electronic Co Ltd
Current assignee: Anhui Sun Create Electronic Co Ltd
Priority date: 2014-10-14
Filing date: 2014-10-14
Publication date: 2015-02-04

Abstract

The invention relates to a high-efficiency processing method for big data, comprising the following steps: a data node receives data to be stored; the data node stores the data while an index is created according to the business scenario and kept in memory, and the index is gradually persisted to disk; a user submits a task request, and an SQL (Structured Query Language) engine rapidly retrieves the data using the created index and outputs the data to a computing node; a task processing module of a management node performs task scheduling, requests resources from a resource management module, and determines an idle computing node, which processes the data; the final processed data are presented to the user. The invention also discloses a high-efficiency processing system for big data. According to the invention, all processing is executed concurrently, the computer's hardware is utilized to the greatest extent, and processing efficiency is greatly improved, so that the user obtains results more quickly when executing a task.

Description

High-efficiency processing method and system for big data

Technical field

The present invention relates to the field of computer big data application processing, and in particular to a high-efficiency processing method and system for big data.

Background art

With large-scale projects such as Safe City and Smart City being rolled out widely, data collection and data fusion have developed further, and the volume of data to be processed has reached the TB and PB level. Processing such volumes raises a series of practical problems: faced with data at this scale, the technical architecture, processing capacity, and processing model of traditional relational databases are increasingly unable to meet user needs.

The development of cloud computing and big data technology offers a good path to processing massive data. In particular, the Hadoop framework stores and computes over large data volumes by means of parallel computation (MapReduce) and distributed storage (HDFS). However, because HDFS does not support direct processing with Structured Query Language (SQL), data stored in HDFS are difficult to process directly, and every computing task must ultimately be converted to run on the MapReduce framework, whose management node (JobTracker) is heavily loaded, inefficient, and prone to becoming a single point of failure. How to process massive data quickly and conveniently, and how to increase system availability while improving task processing efficiency, are problems in urgent need of a solution.

Summary of the invention

The primary object of the present invention is to provide a high-efficiency processing method for big data that processes big data quickly and efficiently during storage, retrieval, and computation.

To achieve the above object, the present invention adopts the following technical solution: a high-efficiency processing method for big data, comprising the following steps in order:

(1) a data node receives data to be stored;

(2) the data node stores the data and, at the same time, creates an index according to the business scenario, keeps the index in memory, and gradually persists the index to disk;

(3) a user submits a task request, and an SQL engine rapidly retrieves the data using the created index and outputs the data to a computing node;

(4) the task processing module of a management node performs task scheduling, requests resources from a resource management module, and determines an idle computing node, which processes the data;

(5) the final processed data are presented to the user.

The data types received by the data node include structured, semi-structured, and unstructured data.

During data storage and index creation, index rules are first created according to the business scenario, and the received data are then stored on hard disk. At the same time, an index is built on top of the distributed file system using the Blur + Lucene components; index facets are established for the business application scenario, and the index data that were formed more recently and are used more heavily are selected and kept in the memory module.
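The idea of deriving an index from a business-specific rule can be sketched as follows. This is a minimal illustration only: it does not use the Blur or Lucene APIs, and the rule `INDEX_RULE`, the function `build_index`, and the record fields (`plate`, `camera_id`, `ts`) are hypothetical names chosen for a traffic-monitoring scenario.

```python
from collections import defaultdict

# Hypothetical index rule for a traffic-monitoring scenario:
# the business only ever searches by plate and by camera.
INDEX_RULE = ("plate", "camera_id")


def build_index(records, rule=INDEX_RULE):
    """Return {field: {value: [record positions]}} for the fields named in the rule."""
    index = {field: defaultdict(list) for field in rule}
    for pos, record in enumerate(records):
        for field in rule:
            if field in record:
                index[field][record[field]].append(pos)
    return index


records = [
    {"plate": "A12345", "camera_id": "C1", "ts": 1},
    {"plate": "B67890", "camera_id": "C1", "ts": 2},
    {"plate": "A12345", "camera_id": "C2", "ts": 3},
]
index = build_index(records)
print(index["plate"]["A12345"])  # positions of the records with this plate
```

Only the fields named in the rule are indexed, so the index stays small relative to the stored data and can be held in memory.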

During retrieval, a query request is submitted and the control module analyzes the input query information. The control module uses the SQL engine to first perform automatic semantic recognition on the query conditions and then searches for the target in the index held in the memory module; the raw data are fetched from disk using the index entry found, returned, and presented to the user. If the target is not found, the search continues in the disk index storage area.
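The lookup order described above (memory index first, disk index as fallback, then fetching the raw data by position) can be sketched as below. The function `lookup` and the dictionaries standing in for the memory index, the disk index, and the raw data store are assumptions made for illustration; SQL parsing and semantic recognition are omitted.

```python
def lookup(key, memory_index, disk_index, storage):
    """Search the in-memory index first, then the persisted disk index.

    Returns (tier, rows), where tier is "memory", "disk", or None if not found.
    """
    positions = memory_index.get(key)
    tier = "memory"
    if positions is None:
        positions = disk_index.get(key)
        tier = "disk"
    if positions is None:
        return None, []
    # The index only holds positions; the raw rows are read from storage.
    return tier, [storage[p] for p in positions]


storage = ["row0", "row1", "row2"]      # stand-in for raw data on disk
memory_index = {"A12345": [0, 2]}       # hot index entries
disk_index = {"B67890": [1]}            # persisted index entries

print(lookup("A12345", memory_index, disk_index, storage))
print(lookup("B67890", memory_index, disk_index, storage))
print(lookup("ZZZ", memory_index, disk_index, storage))
```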

The task processing module requests resources from the resource management module according to the priority and complexity of the task; the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module, which dispatches the task to the corresponding computing node.
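The source does not specify the scheduling algorithm. The sketch below assumes one simple possibility: tasks are served in priority order, and each is placed on the node with the most free capacity that can satisfy its complexity estimate. `Task`, `ResourceManager`, `schedule`, and the CPU-unit cost model are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # lower value = more urgent (assumption)
    name: str = field(compare=False)
    cost: int = field(compare=False)   # complexity estimate, in CPU units


class ResourceManager:
    """Tracks free capacity per computing node and grants allocations."""

    def __init__(self, free_cpu):
        self.free_cpu = dict(free_cpu)  # node name -> free CPU units

    def allocate(self, cost):
        # Assumed policy: pick the node with the most free capacity that fits.
        candidates = [(free, node) for node, free in self.free_cpu.items() if free >= cost]
        if not candidates:
            return None
        _, node = max(candidates)
        self.free_cpu[node] -= cost
        return node


def schedule(tasks, resource_manager):
    """Task-processing side: request resources for tasks in priority order."""
    queue = list(tasks)
    heapq.heapify(queue)
    assignments = {}
    while queue:
        task = heapq.heappop(queue)
        assignments[task.name] = resource_manager.allocate(task.cost)
    return assignments


rm = ResourceManager({"n1": 4, "n2": 8})
assignments = schedule([Task(2, "report", 6), Task(1, "plate-match", 5)], rm)
print(assignments)
```

In this run the more urgent `plate-match` task is placed first; `report` then finds no node with six free units and is left unassigned (`None`), which a real scheduler would queue for retry.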

The index is first stored in the memory module. An in-memory working mechanism persists index files that exceed the memory capacity to disk, where they are stored as distributed files with multiple replicas. Persistence to disk is driven by the size of the memory storage area, the order in which indexes were formed, and an index usage parameter: the indexes that were formed earliest and are used least are persisted to disk first, and index files persisted to disk are stored in a distributed manner.
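A minimal sketch of this persistence policy follows. The source names three inputs (memory area size, formation order, usage) but not how they are combined; the ordering used here, least used first with ties broken by earliest formation, is an assumption. `IndexBuffer` is a hypothetical class, and a plain dictionary stands in for the distributed, multi-replica disk store.

```python
class IndexBuffer:
    """In-memory index area that spills entries to 'disk' when over capacity."""

    def __init__(self, capacity):
        self.capacity = capacity   # max number of entries kept in memory
        self.memory = {}           # key -> [created_seq, use_count, postings]
        self.disk = {}             # stand-in for distributed multi-replica files
        self._seq = 0

    def put(self, key, postings):
        self.memory[key] = [self._seq, 0, postings]
        self._seq += 1
        while len(self.memory) > self.capacity:
            self._persist_one()

    def get(self, key):
        entry = self.memory.get(key)
        if entry is not None:
            entry[1] += 1          # record usage for the persistence decision
            return entry[2]
        return self.disk.get(key)

    def _persist_one(self):
        # Assumed ordering: least used first, then earliest formed.
        victim = min(self.memory, key=lambda k: (self.memory[k][1], self.memory[k][0]))
        self.disk[victim] = self.memory.pop(victim)[2]


buf = IndexBuffer(capacity=2)
buf.put("a", [1])
buf.put("b", [2])
buf.get("a")        # "a" is now more heavily used than "b"
buf.put("c", [3])   # over capacity: "b" is persisted to disk
print(sorted(buf.memory), buf.disk)
```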

Another object of the present invention is to provide a high-efficiency processing system for big data, comprising:

a storage and index creation module, in which the data node stores the received data and, at the same time, creates an index according to the business scenario, first keeping the index file in the memory module and then gradually persisting it to disk;

a retrieval module, in which the SQL engine rapidly retrieves data using the created index and outputs the data to a computing node;

a processing module, which performs task scheduling, requests and manages resources, and is also responsible for splitting tasks, processing them, merging results, and restarting failed tasks, until task execution is complete.

The processing module comprises:

a resource management module, which manages the resources of the computing module; through clients on the computing nodes it senses the resource usage of the computing nodes in a timely manner and stands ready to allocate resources to tasks dynamically at any time;

a task processing module, which receives tasks and requests resources from the resource management module according to task priority and complexity; the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module, which passes the task to the designated computing module and is also responsible for splitting tasks, processing them, merging results, and restarting failed tasks, until task execution is complete;

a computing module, being the physical or virtual resource nodes that actually execute tasks.

As the above technical solution shows, the present invention uses multithreading to create indexes on each data node. Each data node is provided with a memory buffer that stores the created indexes; when the indexes reach a certain volume, historical index data and infrequently used index records are persisted to disk by a dump mechanism and stored in a distributed manner to ensure high data availability. An SQL engine provides real-time, fast queries against the index. The management node separates the resource management module from the task processing module: resource management handles the management and scheduling of resources in the cluster, while the task processing module handles resource requests, task splitting, result merging, task status maintenance, and result output for all tasks. All processing in the present invention is executed concurrently, making maximum use of the computer's hardware and greatly improving processing efficiency, so that users obtain results faster when executing tasks.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a flow chart of data storage and index creation in the present invention.

Fig. 3 is a flow chart of retrieval in the present invention.

Fig. 4 is a flow chart of task processing in the present invention.

Detailed description

A high-efficiency processing method for big data comprises the following. First, a data node receives data to be stored. Second, the data node stores the data and, at the same time, creates an index according to the business scenario, keeps it in memory, and gradually persists it to disk. Third, a user submits a task request, and an SQL engine rapidly retrieves the data using the created index and outputs the data to a computing node. Next, the task processing module of a management node performs task scheduling, requests resources from a resource management module, and determines an idle computing node, which processes the data. Finally, the final processed data are presented to the user. The data types received by the data node include structured, semi-structured, and unstructured data, as shown in Fig. 1.

As shown in Fig. 1, the data node stores the data to be stored and, at the same time, builds an index on top of HDFS using the Blur + Lucene components; index facets are established for the business application scenario, and valuable data are selected and indexed in chronological order. Once the index has been created, it can be searched; the SQuirreL SQL component is used to perform SQL operations and present the data in structured form. The processing module handles tasks quickly and efficiently. The management node separates its main functions, resource management and task processing, into a resource management module and a task processing module: the resource management module performs resource allocation, resource status monitoring, and resource reclamation, while the task processing module requests and uses resources. This solves the problem of the former management node being heavily loaded, inefficient, and prone to crashing.

As shown in Fig. 2, during data storage and index creation, index rules are first created according to the business scenario, and the received data are then stored on hard disk. At the same time, an index is built on top of the distributed file system using the Blur + Lucene components; index facets are established for the business application scenario, and the index data that were formed more recently and are used more heavily are selected and kept in the memory module. The index is first stored in the memory module. An in-memory working mechanism persists index files that exceed the memory capacity to disk, where they are stored as distributed files with multiple replicas. Persistence to disk is driven by the size of the memory storage area, the order in which indexes were formed, and an index usage parameter: the indexes that were formed earliest and are used least are persisted to disk first, and index files persisted to disk are stored in a distributed manner. In this way, the indexes of the most heavily used business data always remain in memory, ready for fast use.

As shown in Fig. 2, indexes are built according to business rules: each index is created for a specific business and serves the business application directly. While the data node stores data, it builds the index using the existing rules; the data are stored on disk, and the generated metadata are stored on the management node. On top of the distributed file system HDFS, the data node creates processes that build the data index using Blur + Lucene. The index is first stored in the memory module, which maintains a certain memory capacity; to ensure high data availability, the index is also persisted to hard disk and stored as distributed files.

As shown in Fig. 3, during retrieval a query request is submitted, for example partial (fuzzy) vehicle license plate information. The control module analyzes the input query information and uses the SQL engine to first perform automatic semantic recognition on the query conditions. It then searches for the target, such as the vehicle's license plate information, in the index held in the memory module; the raw data are fetched from disk using the index entry found, returned, and presented to the user. If the target is not found, the search continues in the disk index storage area. In a simple business environment, the data found can be returned to the user directly; in a complex business environment, they can be returned to the user after being processed by the task processing module and resource management module of the management node. The disk index storage module in Fig. 3 is the disk storage in Fig. 1.
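The fuzzy license plate query used as the example above can be illustrated with shell-style wildcard matching over the indexed plate values. This is a sketch under stated assumptions rather than the patent's SQL engine: `fuzzy_plate_search` and the sample index are hypothetical, `?` stands for one unreadable character, and `*` for any run of characters (an SQL engine would typically express the same query with `LIKE 'A12_45'`).

```python
from fnmatch import fnmatchcase


def fuzzy_plate_search(pattern, plate_index):
    """Return {plate: record ids} for every indexed plate matching the pattern."""
    return {
        plate: ids
        for plate, ids in plate_index.items()
        if fnmatchcase(plate, pattern)
    }


# Hypothetical index: plate -> positions of the matching raw records.
plate_index = {"A12345": [0, 2], "A12845": [5], "B12345": [1]}

# The fourth character of the plate could not be read.
matches = fuzzy_plate_search("A12?45", plate_index)
print(matches)
```

Because the match runs against index keys only, the raw records are read from disk just for the plates that actually match.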

As shown in Fig. 4, the task processing module requests resources from the resource management module according to the priority and complexity of the task; the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module, which dispatches the task to the corresponding computing node.

As shown in Fig. 1, the system comprises: a storage and index creation module, in which the data node stores the received data and, at the same time, creates an index according to the business scenario, first keeping the index file in the memory module and then gradually persisting it to disk; a retrieval module, in which the SQL engine rapidly retrieves data using the created index and outputs the data to a computing node; and a processing module, which performs task scheduling, requests and manages resources, and is also responsible for splitting tasks, processing them, merging results, and restarting failed tasks, until task execution is complete.

The processing module comprises: a resource management module, which manages the resources of the computing module and, through clients on the computing nodes, senses the resource usage of the computing nodes in a timely manner and stands ready to allocate resources to tasks dynamically at any time; a task processing module, which receives tasks and requests resources from the resource management module according to task priority and complexity, whereupon the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module, which passes the task to the designated computing module and is also responsible for splitting tasks, processing them, merging results, and restarting failed tasks, until task execution is complete; and a computing module, being the physical or virtual resource nodes that actually execute tasks.

The resource management module manages the resources of the computing module. Through the client on each machine, the resource management node senses the resource usage of the computing nodes in a timely manner; the resources covered include memory, CPU, disk, network, and so on. It thus has full knowledge of the resources that are schedulable in real time and stands ready to allocate resources to tasks dynamically at any time. A task is a specific application, for example matching incomplete license plate information entered at the front end against the vehicle records in the large database; such a task is first picked up by the task processing module. The computing module is the physical or virtual resource nodes that actually execute tasks. The resource management module obtains the load information, health information, and so on of the computing nodes through the clients deployed on the computing modules, and the task processing module dispatches tasks to the computing nodes. Every task is first received by the task processing module, which requests resources from the resource management module according to task priority and complexity; the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module. The task processing module passes the task to the designated resource processing module and is also responsible for splitting tasks, processing them, merging results, restarting failed tasks, and so on, until task execution is complete.
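The split, process, merge, and restart-on-failure duties of the task processing module can be sketched with a thread pool standing in for the computing nodes. Everything here is illustrative: `run_task`, its parameters, and the deliberately flaky worker `flaky_upper` are hypothetical, and retrying a failed chunk in place is only one possible reading of "restarting failed tasks".

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_task(data, worker, chunk_size=2, max_retries=2, max_workers=4):
    """Split data into chunks, process them concurrently, retry failures, merge results."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    def attempt(chunk):
        last_error = None
        for _ in range(max_retries + 1):
            try:
                return worker(chunk)
            except Exception as exc:   # failed sub-task: restart it
                last_error = exc
        raise last_error

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(attempt, chunks))   # map preserves chunk order
    return [item for part in partials for item in part]


# Hypothetical worker that fails the first time it sees each chunk,
# simulating a computing node that goes down once.
_seen = set()
_lock = threading.Lock()


def flaky_upper(chunk):
    key = tuple(chunk)
    with _lock:
        first_time = key not in _seen
        _seen.add(key)
    if first_time:
        raise RuntimeError("simulated node failure")
    return [s.upper() for s in chunk]


result = run_task(["a", "b", "c", "d", "e"], flaky_upper)
print(result)
```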

In summary, all processing in the present invention is executed concurrently, making maximum use of the computer's hardware and greatly improving processing efficiency, so that users obtain results faster when executing tasks.

Claims

1. A high-efficiency processing method for big data, comprising the following steps in order:

(1) a data node receives data to be stored;

(4) the task processing module of a management node performs task scheduling, requests resources from a resource management module, and determines an idle computing node, which processes the data;

(5) the final processed data are presented to the user.

2. The high-efficiency processing method for big data according to claim 1, characterized in that the data types received by the data node include structured, semi-structured, and unstructured data.

3. The high-efficiency processing method for big data according to claim 1, characterized in that, during data storage and index creation, index rules are first created according to the business scenario, and the received data are then stored on hard disk; at the same time, an index is built on top of the distributed file system using the Blur + Lucene components, index facets are established for the business application scenario, and the index data that were formed more recently and are used more heavily are selected and kept in the memory module.

4. The high-efficiency processing method for big data according to claim 1, characterized in that, during retrieval, a query request is submitted and the control module analyzes the input query information; the control module uses the SQL engine to first perform automatic semantic recognition on the query conditions and then searches for the target in the index held in the memory module; the raw data are fetched from disk using the index entry found, returned, and presented to the user; if the target is not found, the search continues in the disk index storage area.

5. The high-efficiency processing method for big data according to claim 1, characterized in that the task processing module requests resources from the resource management module according to the priority and complexity of the task; the resource management module supplies specific task processing resources according to a scheduling algorithm and returns them to the task processing module, which dispatches the task to the corresponding computing node.

6. The high-efficiency processing method for big data according to claim 3, characterized in that the index is first stored in the memory module; an in-memory working mechanism persists index files that exceed the memory capacity to disk, where they are stored as distributed files with multiple replicas; persistence to disk is driven by the size of the memory storage area, the order in which indexes were formed, and an index usage parameter; the indexes that were formed earliest and are used least are persisted to disk first; and index files persisted to disk are stored in a distributed manner.

7. A high-efficiency processing system for big data, characterized by comprising:

8. The high-efficiency processing system for big data according to claim 7, characterized in that the processing module comprises: