CN107341061A - A kind of data dispatch processing method and processing device - Google Patents
- Publication number
- CN107341061A (application number CN201710597512.4A)
- Authority
- CN
- China
- Prior art keywords
- processing unit
- task scheduling
- scheduling processing
- data
- pending data
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources to service a request
- G06F9/5027—Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
- G06F9/505—Allocation of resources to service a request, the resource being a machine, considering the load
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
Abstract
The present invention provides a data scheduling processing method and device. The method comprises the following steps: according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit; if the amount of data distributed to a task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed; and, according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data. In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
Description
Technical field
The present invention relates to the field of scheduling control, and in particular to a data scheduling processing method and device.
Background art
Apache Spark is a fast and efficient cluster computing system. Spark makes good use of the Hadoop and Mesos infrastructure: it combines seamlessly with Hadoop, but offers higher computing speed than Hadoop MapReduce, and it supports in-memory computing, iterative batch processing, ad-hoc queries, stream processing, graph computing, and other paradigms. For a big-data system such as Spark/Hadoop, a large data volume is not what is feared; what is feared is data skew. Data skew means that, in a data set processed in parallel, a certain portion of the data is significantly larger than the other portions, so that the processing speed of that portion becomes the bottleneck for processing the whole data set.

As shown in Fig. 1, during parallel processing of a data set, the data to be processed with the same key on each node is assigned to the corresponding scheduling task. For example: the data to be processed corresponding to keyword key1 is processed by scheduling task task1; the data corresponding to keyword key2 is processed by scheduling task task2; and the data corresponding to keyword key3 is processed by scheduling task task3.

However, in the above scheme, if the amount of data corresponding to a key in some scheduling task is especially large, that task runs very slowly, the processing speed of that portion becomes the bottleneck of the whole data set, and the performance of the whole system degrades or exceptions are thrown.

Therefore, there is an urgent need for a data scheduling processing method that solves the above technical problem.
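To make the data-skew bottleneck concrete, the following sketch (illustrative only; the record counts and key names are invented for the example) simulates partitioning by key and shows how a single hot key dominates the per-task workload:

```python
from collections import Counter

# Simulated records: key2 is a "hot" key carrying far more data than the others.
records = ([("key1", i) for i in range(10)]
           + [("key2", i) for i in range(10_000)]
           + [("key3", i) for i in range(10)])

# Partitioning by key puts every record with the same key into one task,
# so the per-task workload is proportional to the per-key record count.
counts = Counter(key for key, _ in records)

# The task holding key2 becomes the bottleneck of the whole stage.
bottleneck_key, bottleneck_size = counts.most_common(1)[0]
print(bottleneck_key, bottleneck_size)  # → key2 10000
```

In a real Spark job the same effect shows up as one straggler task in a stage while its peer tasks finish quickly.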
Summary of the invention

The present invention provides a data scheduling processing method and device to solve the above problem.

The present invention provides a data scheduling processing method comprising the following steps: according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
The present invention provides a data scheduling processing device comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:

according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
The present invention provides a data scheduling processing device comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module; wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.
The embodiments of the present invention provide the following technical solution: according to the keyword information of the data to be processed, the data to be processed is distributed to the corresponding task scheduling processing unit; if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, the scheduling information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit; and, according to the scheduling information, the task scheduling processing units are scheduled to process the redistributed data.

In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
Brief description of the drawings

The accompanying drawings described herein provide a further understanding of the present invention and form a part of this application. The schematic embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 shows the parallel data-set processing of the prior art;

Fig. 2 shows the data scheduling processing flow of Embodiment 2 of the present invention;

Fig. 3 shows the structure of the data scheduling processing device of Embodiment 3 of the present invention;

Fig. 4 shows the structure of the data scheduling processing device of Embodiment 4 of the present invention.
Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where they do not conflict, the embodiments in this application and the features in the embodiments may be combined with each other.

The core technical feature of the embodiments of the present invention is:

the large volume of data under the same keyword key is split in parallel across multiple task scheduling processing units (tasks) for processing, and the results are finally collected by a single task; this avoids degrading the performance of the whole Spark system, or causing out-of-memory (OOM) errors, when the data volume handled by one task is too large.
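One common way to realize this split-then-collect idea is key salting (a sketch under assumptions, not the patented implementation; the suffix format and split count are invented): append a random suffix to the hot key so its records spread over several sub-keys, process each sub-key independently, then strip the suffix and merge the partial results:

```python
import random
from collections import defaultdict

NUM_SPLITS = 4  # hypothetical number of parallel sub-tasks for a hot key

def split_hot_key(records, hot_key, num_splits=NUM_SPLITS):
    """Append a random suffix to the hot key so its records spread over
    num_splits sub-keys, each of which can be handled by a separate task."""
    out = []
    for key, value in records:
        if key == hot_key:
            out.append((f"{key}#{random.randrange(num_splits)}", value))
        else:
            out.append((key, value))
    return out

def process_and_merge(salted_records):
    """Per-sub-key partial aggregation (one 'task' per sub-key), followed by
    a final collect step that strips the suffix and sums partial results."""
    partial = defaultdict(int)
    for key, value in salted_records:       # stage 1: parallelizable per sub-key
        partial[key] += value
    final = defaultdict(int)
    for key, subtotal in partial.items():   # stage 2: collect per original key
        final[key.split("#")[0]] += subtotal
    return dict(final)

records = [("key2", 1)] * 1000 + [("key1", 1)] * 10
result = process_and_merge(split_hot_key(records, "key2"))
print(result)  # → {'key2': 1000, 'key1': 10}
```

The final result is identical to processing each key in one task; only the intermediate workload is rebalanced.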
Fig. 2 shows the data scheduling processing flow of Embodiment 2 of the present invention, comprising the following steps:

Step 201: according to the keyword information of the data to be processed, distribute the data to be processed to the corresponding task scheduling processing unit.

Further, data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword may be distributed across one or more servers.

Step 202: if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtain the scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit.

Step 203: according to the scheduling information, schedule the task scheduling processing units to process the redistributed data.

Further, the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.

Further, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.

Further, according to the scheduling information, the process of scheduling the task scheduling processing units to process the redistributed data is: scheduling task scheduling processing units in the idle state to process the redistributed data.
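Steps 201 to 203 can be sketched end to end as follows (a minimal illustration; the preset value, per-unit capacity, unit naming, and even-split policy are all assumptions, since the patent leaves them unspecified):

```python
import math

PRESET_VALUE = 100    # hypothetical threshold triggering redistribution
UNIT_CAPACITY = 50    # hypothetical data volume per scheduling unit

def build_schedule_info(key, total):
    """Step 202: derive scheduling info (unit names, unit count, per-unit
    redistributed amount) from the per-unit capacity and the total amount."""
    n_units = math.ceil(total / UNIT_CAPACITY)
    base, extra = divmod(total, n_units)
    return {
        "units": [f"{key}-unit-{i}" for i in range(n_units)],  # unit names
        "unit_count": n_units,
        # amount redistributed to each unit (first `extra` units get one more)
        "per_unit": [base + (1 if i < extra else 0) for i in range(n_units)],
    }

def schedule(per_key_amounts):
    """Steps 201 and 203: keys under the threshold keep one unit; keys at
    or above it are redistributed across several units."""
    plan = {}
    for key, total in per_key_amounts.items():
        if total >= PRESET_VALUE:              # step 202 trigger
            plan[key] = build_schedule_info(key, total)
        else:
            plan[key] = {"units": [f"{key}-unit-0"],
                         "unit_count": 1, "per_unit": [total]}
    return plan

plan = schedule({"key1": 10, "key2": 1000})
print(plan["key2"]["unit_count"])  # → 20 (key2 split across multiple units)
```

Note that the per-unit amounts always sum back to the original total for each key, so the redistribution changes placement, not content.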
Fig. 3 shows the structure of the data scheduling processing device of Embodiment 3 of the present invention, comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module.

The parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.

The parameter configuration module is used to add a per-unit data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file.

The scheduling module is used to obtain the per-unit data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module; it is further used to obtain the scheduling information of the task scheduling processing units according to the per-unit data volume configuration parameter and the total amount of data to be processed; and it is further used, according to the parallel scheduling algorithm parameter, to schedule task scheduling processing units in the idle state to process the redistributed data.

The scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit. The per-unit data volume configuration parameter specifies the data volume handled by a single task scheduling processing unit.

The data cutting recording module is used to record the scheduling information of the task scheduling processing units in the scheduling module and to establish the correspondence between the keyword and the scheduling information of the task scheduling processing units.

The summarizing module is used to obtain the processing result of each task scheduling processing unit from the scheduling module; it is further used to obtain the correspondence between the keyword and the scheduling information of the task scheduling processing units from the data cutting recording module; and it is further used to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.
Fig. 4 shows the structure of the data scheduling processing device of Embodiment 4 of the present invention, comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:

according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.

Further, data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.

Further, the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.

Further, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.

Further, according to the scheduling information, the process of scheduling the task scheduling processing units to process the redistributed data is: scheduling task scheduling processing units in the idle state to process the redistributed data.
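The "units in the idle state" scheduling rule recurring in these embodiments can be illustrated with a small pool-based sketch (the idle-tracking mechanism is an assumption for illustration; the patent does not specify one):

```python
from collections import deque

class IdleAwareScheduler:
    """Dispatches redistributed data chunks only to units currently idle;
    busy units are skipped until they are released back to the idle pool."""
    def __init__(self, unit_names):
        self.idle = deque(unit_names)   # units available for work
        self.busy = {}                  # unit name -> chunk being processed

    def dispatch(self, chunk):
        if not self.idle:
            return None                 # no idle unit: caller retries later
        unit = self.idle.popleft()
        self.busy[unit] = chunk
        return unit

    def release(self, unit):
        self.busy.pop(unit, None)       # unit finished: back to the idle pool
        self.idle.append(unit)

pool = IdleAwareScheduler(["unit-0", "unit-1"])
print(pool.dispatch("chunk-a"))  # → unit-0
print(pool.dispatch("chunk-b"))  # → unit-1
print(pool.dispatch("chunk-c"))  # → None (all units busy)
pool.release("unit-0")
print(pool.dispatch("chunk-c"))  # → unit-0
```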
The embodiments of the present invention provide the following technical solution: according to the keyword information of the data to be processed, the data to be processed is distributed to the corresponding task scheduling processing unit; if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, the scheduling information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit; and, according to the scheduling information, the task scheduling processing units are scheduled to process the redistributed data.

In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
1. A data scheduling processing method, characterized by comprising the following steps:
according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;
if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;
according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
2. The method according to claim 1, characterized in that data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.
3. The method according to claim 2, characterized in that the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.
4. The method according to claim 3, characterized in that, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.
5. The method according to claim 1, characterized in that the process of scheduling the task scheduling processing units to process the redistributed data according to the scheduling information is: scheduling task scheduling processing units in the idle state to process the redistributed data.
6. A data scheduling processing device, characterized by comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:
according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;
if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;
according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
7. The device according to claim 6, characterized in that data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.
8. The device according to claim 7, characterized in that the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.
9. The device according to claim 8, characterized in that, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.
10. The device according to claim 6, characterized in that the process of scheduling the task scheduling processing units to process the redistributed data according to the scheduling information is: scheduling task scheduling processing units in the idle state to process the redistributed data.
11. A data scheduling processing device, characterized by comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module; wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.
12. The device according to claim 11, characterized in that the parameter configuration module is used to add a per-unit data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file;
the scheduling module is used to obtain the per-unit data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module, to obtain the scheduling information of the task scheduling processing units according to the per-unit data volume configuration parameter and the total amount of data to be processed, and, according to the parallel scheduling algorithm parameter, to schedule task scheduling processing units in the idle state to process the redistributed data;
the data cutting recording module is used to record the scheduling information of the task scheduling processing units in the scheduling module and to establish the correspondence between the keyword and the scheduling information of the task scheduling processing units;
the summarizing module is used to obtain the processing result of each task scheduling processing unit from the scheduling module, to obtain the correspondence between the keyword and the scheduling information of the task scheduling processing units from the data cutting recording module, and to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710597512.4A CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710597512.4A CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107341061A true CN107341061A (en) | 2017-11-10 |
Family
ID=60217318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710597512.4A Pending CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107341061A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359295A (en) * | 2007-08-01 | 2009-02-04 | 阿里巴巴集团控股有限公司 | Batch task scheduling and allocating method and system |
CN102521051A (en) * | 2011-12-05 | 2012-06-27 | 中国联合网络通信集团有限公司 | Task scheduling method, device and system in Map Reduce system applied to nomography |
US20130198753A1 (en) * | 2012-01-30 | 2013-08-01 | International Business Machines Corporation | Full exploitation of parallel processors for data processing |
CN104714785A (en) * | 2015-03-31 | 2015-06-17 | 中芯睿智(北京)微电子科技有限公司 | Task scheduling device, task scheduling method and data parallel processing device |
CN105912399A (en) * | 2016-04-05 | 2016-08-31 | 杭州嘉楠耘智信息科技有限公司 | Task processing method, device and system |
CN106528275A (en) * | 2015-09-10 | 2017-03-22 | 网易(杭州)网络有限公司 | Processing method of data tasks and task scheduler |
- 2017
- 2017-07-20: CN application CN201710597512.4A (published as CN107341061A), status: Pending
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171110 |