CN107341061A - A kind of data dispatch processing method and processing device - Google Patents
- Publication number
- CN107341061A (application number CN201710597512.4A)
- Authority
- CN
- China
- Prior art keywords
- processing unit
- task scheduling
- scheduling processing
- data
- pending data
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources to service a request
- G06F9/5027—Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
- G06F9/505—Allocation of resources to service a request, the resource being a machine, considering the load
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
Abstract
The present invention provides a data scheduling processing method and device. The method comprises the following steps: according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit; if the amount of data distributed to a task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed; and, according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data. In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
Description
Technical field
The present invention relates to the field of scheduling control, and in particular to a data scheduling processing method and device.
Background art
Apache Spark is a fast and efficient cluster computing system. Spark makes good use of the Hadoop and Mesos infrastructure: it combines seamlessly with Hadoop, but offers higher computing speed than Hadoop MapReduce, and it supports in-memory computing, iterative batch processing, ad-hoc queries, stream processing, graph computing, and other paradigms. For a big-data system such as Spark/Hadoop, a large data volume is not what is feared; what is feared is data skew. Data skew means that, in a data set processed in parallel, a certain portion of the data is significantly larger than the other portions, so that the processing speed of that portion becomes the bottleneck for processing the whole data set.

As shown in Fig. 1, during parallel processing of a data set, the data to be processed with the same key on each node is assigned to the corresponding scheduling task. For example: the data to be processed corresponding to keyword key1 is processed by scheduling task task1; the data corresponding to keyword key2 is processed by scheduling task task2; and the data corresponding to keyword key3 is processed by scheduling task task3.

However, in the above scheme, if the amount of data corresponding to a key in some scheduling task is especially large, that task runs very slowly, the processing speed of that portion becomes the bottleneck of the whole data set, and the performance of the whole system degrades or exceptions are thrown.

Therefore, there is an urgent need for a data scheduling processing method that solves the above technical problem.
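To make the data-skew bottleneck concrete, the following sketch (illustrative only; the record counts and key names are invented for the example) simulates partitioning by key and shows how a single hot key dominates the per-task workload:

```python
from collections import Counter

# Simulated records: key2 is a "hot" key carrying far more data than the others.
records = ([("key1", i) for i in range(10)]
           + [("key2", i) for i in range(10_000)]
           + [("key3", i) for i in range(10)])

# Partitioning by key puts every record with the same key into one task,
# so the per-task workload is proportional to the per-key record count.
counts = Counter(key for key, _ in records)

# The task holding key2 becomes the bottleneck of the whole stage.
bottleneck_key, bottleneck_size = counts.most_common(1)[0]
print(bottleneck_key, bottleneck_size)  # → key2 10000
```

In a real Spark job the same effect shows up as one straggler task in a stage while its peer tasks finish quickly.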
Summary of the invention

The present invention provides a data scheduling processing method and device to solve the above problem.

The present invention provides a data scheduling processing method comprising the following steps: according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
The present invention provides a data scheduling processing device comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:

according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
The present invention provides a data scheduling processing device comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module; wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.
The embodiments of the present invention provide the following technical solution: according to the keyword information of the data to be processed, the data to be processed is distributed to the corresponding task scheduling processing unit; if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, the scheduling information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit; and, according to the scheduling information, the task scheduling processing units are scheduled to process the redistributed data.

In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
Brief description of the drawings

The accompanying drawings described herein provide a further understanding of the present invention and form a part of this application. The schematic embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 shows the parallel data-set processing of the prior art;

Fig. 2 shows the data scheduling processing flow of Embodiment 2 of the present invention;

Fig. 3 shows the structure of the data scheduling processing device of Embodiment 3 of the present invention;

Fig. 4 shows the structure of the data scheduling processing device of Embodiment 4 of the present invention.
Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where they do not conflict, the embodiments in this application and the features in the embodiments may be combined with each other.

The core technical feature of the embodiments of the present invention is:

the large volume of data under the same keyword key is split in parallel across multiple task scheduling processing units (tasks) for processing, and the results are finally collected by a single task; this avoids degrading the performance of the whole Spark system, or causing out-of-memory (OOM) errors, when the data volume handled by one task is too large.
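One common way to realize this split-then-collect idea is key salting (a sketch under assumptions, not the patented implementation; the suffix format and split count are invented): append a random suffix to the hot key so its records spread over several sub-keys, process each sub-key independently, then strip the suffix and merge the partial results:

```python
import random
from collections import defaultdict

NUM_SPLITS = 4  # hypothetical number of parallel sub-tasks for a hot key

def split_hot_key(records, hot_key, num_splits=NUM_SPLITS):
    """Append a random suffix to the hot key so its records spread over
    num_splits sub-keys, each of which can be handled by a separate task."""
    out = []
    for key, value in records:
        if key == hot_key:
            out.append((f"{key}#{random.randrange(num_splits)}", value))
        else:
            out.append((key, value))
    return out

def process_and_merge(salted_records):
    """Per-sub-key partial aggregation (one 'task' per sub-key), followed by
    a final collect step that strips the suffix and sums partial results."""
    partial = defaultdict(int)
    for key, value in salted_records:       # stage 1: parallelizable per sub-key
        partial[key] += value
    final = defaultdict(int)
    for key, subtotal in partial.items():   # stage 2: collect per original key
        final[key.split("#")[0]] += subtotal
    return dict(final)

records = [("key2", 1)] * 1000 + [("key1", 1)] * 10
result = process_and_merge(split_hot_key(records, "key2"))
print(result)  # → {'key2': 1000, 'key1': 10}
```

The final result is identical to processing each key in one task; only the intermediate workload is rebalanced.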
Fig. 2 shows the data scheduling processing flow of Embodiment 2 of the present invention, comprising the following steps:

Step 201: according to the keyword information of the data to be processed, distribute the data to be processed to the corresponding task scheduling processing unit.

Further, data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword may be distributed across one or more servers.

Step 202: if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtain the scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit.

Step 203: according to the scheduling information, schedule the task scheduling processing units to process the redistributed data.

Further, the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.

Further, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.

Further, according to the scheduling information, the process of scheduling the task scheduling processing units to process the redistributed data is: scheduling task scheduling processing units in the idle state to process the redistributed data.
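Steps 201 to 203 can be sketched end to end as follows (a minimal illustration; the preset value, per-unit capacity, unit naming, and even-split policy are all assumptions, since the patent leaves them unspecified):

```python
import math

PRESET_VALUE = 100    # hypothetical threshold triggering redistribution
UNIT_CAPACITY = 50    # hypothetical data volume per scheduling unit

def build_schedule_info(key, total):
    """Step 202: derive scheduling info (unit names, unit count, per-unit
    redistributed amount) from the per-unit capacity and the total amount."""
    n_units = math.ceil(total / UNIT_CAPACITY)
    base, extra = divmod(total, n_units)
    return {
        "units": [f"{key}-unit-{i}" for i in range(n_units)],  # unit names
        "unit_count": n_units,
        # amount redistributed to each unit (first `extra` units get one more)
        "per_unit": [base + (1 if i < extra else 0) for i in range(n_units)],
    }

def schedule(per_key_amounts):
    """Steps 201 and 203: keys under the threshold keep one unit; keys at
    or above it are redistributed across several units."""
    plan = {}
    for key, total in per_key_amounts.items():
        if total >= PRESET_VALUE:              # step 202 trigger
            plan[key] = build_schedule_info(key, total)
        else:
            plan[key] = {"units": [f"{key}-unit-0"],
                         "unit_count": 1, "per_unit": [total]}
    return plan

plan = schedule({"key1": 10, "key2": 1000})
print(plan["key2"]["unit_count"])  # → 20 (key2 split across multiple units)
```

Note that the per-unit amounts always sum back to the original total for each key, so the redistribution changes placement, not content.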
Fig. 3 shows the structure of the data scheduling processing device of Embodiment 3 of the present invention, comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module.

The parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.

The parameter configuration module is used to add a per-unit data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file.

The scheduling module is used to obtain the per-unit data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module; it is further used to obtain the scheduling information of the task scheduling processing units according to the per-unit data volume configuration parameter and the total amount of data to be processed; and it is further used, according to the parallel scheduling algorithm parameter, to schedule task scheduling processing units in the idle state to process the redistributed data.

The scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit. The per-unit data volume configuration parameter specifies the data volume handled by a single task scheduling processing unit.

The data cutting recording module is used to record the scheduling information of the task scheduling processing units in the scheduling module and to establish the correspondence between the keyword and the scheduling information of the task scheduling processing units.

The summarizing module is used to obtain the processing result of each task scheduling processing unit from the scheduling module; it is further used to obtain the correspondence between the keyword and the scheduling information of the task scheduling processing units from the data cutting recording module; and it is further used to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.
Fig. 4 shows the structure of the data scheduling processing device of Embodiment 4 of the present invention, comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:

according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;

if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;

according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.

Further, data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.

Further, the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.

Further, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.

Further, according to the scheduling information, the process of scheduling the task scheduling processing units to process the redistributed data is: scheduling task scheduling processing units in the idle state to process the redistributed data.
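The "units in the idle state" scheduling rule recurring in these embodiments can be illustrated with a small pool-based sketch (the idle-tracking mechanism is an assumption for illustration; the patent does not specify one):

```python
from collections import deque

class IdleAwareScheduler:
    """Dispatches redistributed data chunks only to units currently idle;
    busy units are skipped until they are released back to the idle pool."""
    def __init__(self, unit_names):
        self.idle = deque(unit_names)   # units available for work
        self.busy = {}                  # unit name -> chunk being processed

    def dispatch(self, chunk):
        if not self.idle:
            return None                 # no idle unit: caller retries later
        unit = self.idle.popleft()
        self.busy[unit] = chunk
        return unit

    def release(self, unit):
        self.busy.pop(unit, None)       # unit finished: back to the idle pool
        self.idle.append(unit)

pool = IdleAwareScheduler(["unit-0", "unit-1"])
print(pool.dispatch("chunk-a"))  # → unit-0
print(pool.dispatch("chunk-b"))  # → unit-1
print(pool.dispatch("chunk-c"))  # → None (all units busy)
pool.release("unit-0")
print(pool.dispatch("chunk-c"))  # → unit-0
```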
The embodiments of the present invention provide the following technical solution: according to the keyword information of the data to be processed, the data to be processed is distributed to the corresponding task scheduling processing unit; if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, the scheduling information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit; and, according to the scheduling information, the task scheduling processing units are scheduled to process the redistributed data.

In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids degrading the performance of the whole Spark system when the data volume handled by one task scheduling processing unit is too large.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (12)
1. A data scheduling processing method, characterized by comprising the following steps:
according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;
if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;
according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
2. The method according to claim 1, characterized in that data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.
3. The method according to claim 2, characterized in that the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.
4. The method according to claim 3, characterized in that, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.
5. The method according to claim 1, characterized in that the process of scheduling the task scheduling processing units to process the redistributed data according to the scheduling information is: scheduling task scheduling processing units in the idle state to process the redistributed data.
6. A data scheduling processing device, characterized by comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded by the processor and to perform the following:
according to the keyword information of the data to be processed, distributing the data to be processed to the corresponding task scheduling processing unit;
if the amount of data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining scheduling information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of data to be processed, wherein the scheduling information includes at least one of: the task scheduling processing unit name, the number of task scheduling processing units, and the amount of data redistributed to each task scheduling processing unit;
according to the scheduling information, scheduling the task scheduling processing units to process the redistributed data.
7. The device according to claim 6, characterized in that data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein data with the same keyword is distributed across one or more servers.
8. The device according to claim 7, characterized in that the scheduling information of the task scheduling processing units is recorded, and a correspondence between the keyword and the scheduling information of the task scheduling processing units is established.
9. The device according to claim 8, characterized in that, when the task scheduling processing units have finished processing the distributed data, the results of each task scheduling processing unit are collected according to the correspondence between the keyword and the scheduling information of the task scheduling processing units, and the final execution result is aggregated and output.
10. The device according to claim 6, characterized in that the process of scheduling the task scheduling processing units to process the redistributed data according to the scheduling information is: scheduling task scheduling processing units in the idle state to process the redistributed data.
11. A data scheduling processing device, characterized by comprising a parameter configuration module, a scheduling module, a data cutting recording module, and a summarizing module; wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data cutting recording module and to the summarizing module; and the data cutting recording module is connected to the summarizing module.
12. The device according to claim 11, characterized in that the parameter configuration module is used to add a per-unit data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file;
the scheduling module is used to obtain the per-unit data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module, to obtain the scheduling information of the task scheduling processing units according to the per-unit data volume configuration parameter and the total amount of data to be processed, and, according to the parallel scheduling algorithm parameter, to schedule task scheduling processing units in the idle state to process the redistributed data;
the data cutting recording module is used to record the scheduling information of the task scheduling processing units in the scheduling module and to establish the correspondence between the keyword and the scheduling information of the task scheduling processing units;
the summarizing module is used to obtain the processing result of each task scheduling processing unit from the scheduling module, to obtain the correspondence between the keyword and the scheduling information of the task scheduling processing units from the data cutting recording module, and to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710597512.4A CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710597512.4A CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107341061A true CN107341061A (en) | 2017-11-10 |
Family
ID=60217318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710597512.4A Pending CN107341061A (en) | 2017-07-20 | 2017-07-20 | A kind of data dispatch processing method and processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107341061A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359295A (en) * | 2007-08-01 | 2009-02-04 | 阿里巴巴集团控股有限公司 | Batch task scheduling and allocating method and system |
CN102521051A (en) * | 2011-12-05 | 2012-06-27 | 中国联合网络通信集团有限公司 | Task scheduling method, device and system in Map Reduce system applied to nomography |
US20130198753A1 (en) * | 2012-01-30 | 2013-08-01 | International Business Machines Corporation | Full exploitation of parallel processors for data processing |
CN104714785A (en) * | 2015-03-31 | 2015-06-17 | 中芯睿智(北京)微电子科技有限公司 | Task scheduling device, task scheduling method and data parallel processing device |
CN105912399A (en) * | 2016-04-05 | 2016-08-31 | 杭州嘉楠耘智信息科技有限公司 | Task processing method, device and system |
CN106528275A (en) * | 2015-09-10 | 2017-03-22 | 网易(杭州)网络有限公司 | Processing method of data tasks and task scheduler |
- 2017
- 2017-07-20: CN application CN201710597512.4A (published as CN107341061A), status: Pending
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171110 |