CN107341061A - A kind of data dispatch processing method and processing device - Google Patents

A kind of data dispatch processing method and processing device


Publication number
CN107341061A
Authority
CN
China
Prior art keywords
processing unit
task scheduling
scheduling processing
data
pending data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710597512.4A
Other languages
Chinese (zh)
Inventor
赵波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201710597512.4A
Publication of CN107341061A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Multi Processors (AREA)

Abstract

The present invention provides a data dispatch processing method and device. The method comprises the following steps: according to the keyword information of the pending data, distributing the pending data to the corresponding task scheduling processing unit; if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data; and, according to the schedule information, scheduling the task scheduling processing units to process the pending data of the secondary distribution. In this technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids reducing the performance of the whole Spark system because a single task scheduling processing unit processes too much data.

Description

A data dispatch processing method and device
Technical field
The present invention relates to the field of scheduling control, and in particular to a data dispatch processing method and device.
Background art
Apache Spark is a fast and efficient cluster computing system. Spark makes good use of the Hadoop and Mesos infrastructure and integrates seamlessly with Hadoop, yet it computes faster than Hadoop MapReduce. Spark also supports multiple paradigms, including in-memory computing, multi-iteration batch processing, ad hoc queries, stream processing and graph computation. For big data systems such as Spark and Hadoop, a large data volume is not the real danger; data skew is. Data skew means that, in a data set processed in parallel, one part of the data is significantly larger than the other parts, so that the processing speed of that part becomes the bottleneck for processing the whole data set.
As shown in Fig. 1, during parallel processing of a data set, the pending data with the same keyword key on each node is distributed to the corresponding scheduling task (task) for processing. For example, the pending data corresponding to keyword key1 is processed by scheduling task task1; the pending data corresponding to keyword key2 is processed by scheduling task task2; and the pending data corresponding to keyword key3 is processed by scheduling task task3.
However, in the above scheme, if the amount of pending data corresponding to the key of a scheduling task is especially large, that task runs very slowly, so that the processing speed of that part becomes the bottleneck for processing the whole data set. This can reduce the performance of the whole system or cause exceptions to be thrown.
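The skew described above can be illustrated with a minimal Python sketch. The record counts and the helper name `assign_by_key` are assumptions chosen for illustration; they do not come from the patent. Each distinct keyword gets its own task, as in Fig. 1, and the task holding the dominant keyword ends up with most of the work.

```python
from collections import defaultdict


def assign_by_key(records):
    """Group (key, value) records so that each distinct key has its own task."""
    tasks = defaultdict(list)
    for key, value in records:
        tasks[key].append(value)
    return dict(tasks)


# Hypothetical skewed input: key1 carries far more records than key2 or key3.
records = (
    [("key1", i) for i in range(9000)]
    + [("key2", i) for i in range(500)]
    + [("key3", i) for i in range(500)]
)

tasks = assign_by_key(records)
load = {key: len(values) for key, values in tasks.items()}

# The task with the largest load bounds the completion time of the whole job.
slowest = max(load, key=load.get)
```

In this sketch the task for key1 holds 90% of the records, so the other two tasks finish early and the job waits on a single task.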
Therefore, there is an urgent need for a data dispatch processing method that solves the above technical problem.
Summary of the invention
The present invention provides a data dispatch processing method and device to solve the above problem.
The present invention provides a data dispatch processing method, comprising the following steps: according to the keyword information of the pending data, distributing the pending data to the corresponding task scheduling processing unit;
if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
according to the schedule information, scheduling the task scheduling processing units to process the pending data of the secondary distribution.
The present invention provides a data dispatch processing device, comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded and executed by the processor to perform the following:
according to the keyword information of the pending data, distributing the pending data to the corresponding task scheduling processing unit;
if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
according to the schedule information, scheduling the task scheduling processing units to process the pending data of the secondary distribution.
The present invention further provides a data dispatch processing device, comprising a parameter configuration module, a scheduling module, a data split recording module and a summarizing module, wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data split recording module and to the summarizing module respectively; and the data split recording module is connected to the summarizing module.
The embodiments of the present invention provide the following technical solution: according to the keyword information of the pending data, the pending data is distributed to the corresponding task scheduling processing unit; if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, the schedule information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit; and, according to the schedule information, the task scheduling processing units are scheduled to process the pending data of the secondary distribution.
In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids reducing the performance of the whole Spark system because a single task scheduling processing unit processes too much data.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the present invention and form part of this application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows the parallel processing of a data set in the prior art;
Fig. 2 is a flow chart of the data dispatch processing method of Embodiment 2 of the present invention;
Fig. 3 is a structural diagram of the data dispatch processing device of Embodiment 3 of the present invention;
Fig. 4 is a structural diagram of the data dispatch processing device of Embodiment 4 of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
The core technical feature of the embodiments of the present invention is as follows:
The large volume of data under the same keyword key is split in parallel across multiple task scheduling processing units (tasks) for processing, and a single task finally summarizes the results. This avoids reducing the performance of the whole Spark system, or causing a memory overflow (Out Of Memory, OOM), because a single task processes too much data.
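As a rough illustration of this core feature, the following Python sketch splits the values of one heavily loaded keyword into fixed-size chunks, processes the chunks in parallel, and lets a single final step merge the partial results. It is only a sketch under stated assumptions: a thread pool stands in for Spark tasks, summation stands in for the real workload, and the names `split_into_chunks`, `process_chunk` and `process_hot_key` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, chunk_size):
    """Cut the data of one keyword into consecutive chunks of at most chunk_size."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def process_chunk(chunk):
    """Stand-in workload for one task scheduling processing unit (assumption: a sum)."""
    return sum(chunk)


def process_hot_key(values, chunk_size, max_workers=4):
    """Process the chunks in parallel, then merge the partial results in one step."""
    chunks = split_into_chunks(values, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    # A single final step summarizes the partial results of all units.
    return sum(partials)


hot_values = list(range(10_000))
total = process_hot_key(hot_values, chunk_size=2_500)
```

The split only preserves the result when the workload can be merged from partial results, as a sum can; workloads without that property would need a different merge step.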
Fig. 2 is a flow chart of the data dispatch processing method of Embodiment 2 of the present invention, which comprises the following steps:
Step 201: According to the keyword information of the pending data, distribute the pending data to the corresponding task scheduling processing unit;
Further, pending data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein the pending data with the same keyword is distributed across one or more servers.
Step 202: If the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtain the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
Step 203: According to the schedule information, schedule the task scheduling processing units to process the pending data of the secondary distribution.
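The threshold check and the schedule information of step 202 might be computed as in the following Python sketch. The patent states only that the schedule information is obtained from the data volume handled by a single unit and the total amount of pending data; the ceiling division, the unit naming pattern and the names `ScheduleInfo` and `build_schedule_info` are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class ScheduleInfo:
    unit_names: list   # names of the task scheduling processing units
    unit_count: int    # number of task scheduling processing units
    amounts: list      # amount of pending data per unit in the secondary distribution


def build_schedule_info(key, total_amount, per_unit_amount, preset_value):
    """Return schedule information, or None when no secondary distribution is needed."""
    if total_amount < preset_value:
        return None  # below the preset value: the original unit keeps the data
    # Assumption: the unit count is the total amount divided by the per-unit volume, rounded up.
    unit_count = math.ceil(total_amount / per_unit_amount)
    unit_names = [f"{key}-task-{i}" for i in range(unit_count)]
    amounts = [
        min(per_unit_amount, total_amount - i * per_unit_amount)
        for i in range(unit_count)
    ]
    return ScheduleInfo(unit_names, unit_count, amounts)


info = build_schedule_info("key1", total_amount=10_500, per_unit_amount=4_000, preset_value=8_000)
small = build_schedule_info("key2", total_amount=500, per_unit_amount=4_000, preset_value=8_000)
```

With these hypothetical numbers, key1 is spread over three units and key2 stays with its original unit because it is below the preset value.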
Further, the schedule information of the task scheduling processing units is recorded, and a correspondence between the keyword and the schedule information of the task scheduling processing units is established.
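One possible shape for this record is a small in-memory registry that maps each keyword to the schedule information of its units, sketched below in Python. The class name `ScheduleRegistry`, its methods and the sample unit names are hypothetical; the patent does not prescribe a storage format.

```python
class ScheduleRegistry:
    """Keeps the correspondence between a keyword and its schedule information."""

    def __init__(self):
        self._by_key = {}

    def record(self, key, unit_names, amounts):
        """Store the unit names, unit count and per-unit amounts for one keyword."""
        if len(unit_names) != len(amounts):
            raise ValueError("one amount is required per unit")
        self._by_key[key] = {
            "unit_names": list(unit_names),
            "unit_count": len(unit_names),
            "amounts": list(amounts),
        }

    def schedule_for(self, key):
        """Return the schedule information recorded for a keyword, or None."""
        return self._by_key.get(key)

    def key_of_unit(self, unit_name):
        """Find the keyword whose data a given unit is processing, or None."""
        for key, info in self._by_key.items():
            if unit_name in info["unit_names"]:
                return key
        return None


registry = ScheduleRegistry()
registry.record("key1", ["task1-a", "task1-b", "task1-c"], [4000, 4000, 2500])
registry.record("key2", ["task2"], [500])
```

Looking up `registry.schedule_for("key1")` then tells a later step which units to wait for before the results of key1 can be combined.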
Further, once the task scheduling processing units have finished processing the distributed pending data, the processing results of the individual task scheduling processing units are collected according to the correspondence between the keyword and the schedule information of the task scheduling processing units, and are summarized into the final execution result, which is output.
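A minimal Python sketch of this summarizing step follows. It assumes that the correspondence is available as a mapping from keyword to unit names and that finished units report their results in a dictionary; the function name `summarize`, the sample data and the use of `sum` as the merge function are illustrative assumptions.

```python
def summarize(key_to_units, unit_results, merge):
    """Merge per-unit results into one final result per keyword.

    A keyword is summarized only when every one of its units has reported a result.
    """
    final = {}
    for key, units in key_to_units.items():
        if all(unit in unit_results for unit in units):
            final[key] = merge([unit_results[unit] for unit in units])
    return final


# Hypothetical correspondence and results: key3 still has one unit running.
key_to_units = {
    "key1": ["task1-a", "task1-b"],
    "key2": ["task2"],
    "key3": ["task3-a", "task3-b"],
}
unit_results = {"task1-a": 10, "task1-b": 5, "task2": 7, "task3-a": 1}

final = summarize(key_to_units, unit_results, merge=sum)
```

Here key3 is left out of the output because one of its units has not finished yet.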
Further, scheduling the task scheduling processing units according to the schedule information to process the pending data of the secondary distribution proceeds as follows:
Task scheduling processing units in the idle state are scheduled to process the pending data of the secondary distribution.
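The idle-state rule can be sketched as a small deterministic simulation in Python. The patent does not specify how idleness is tracked, so the queue of idle units, the rule that the longest-running unit finishes first, and the function name `dispatch_to_idle` are all assumptions made for illustration.

```python
from collections import deque


def dispatch_to_idle(chunks, unit_names):
    """Hand each chunk to a unit that is idle, and return the (unit, chunk) pairs in order."""
    if chunks and not unit_names:
        raise ValueError("at least one unit is required")
    idle = deque(unit_names)   # units currently in the idle state
    pending = deque(chunks)    # chunks of the secondary distribution not yet assigned
    busy = {}                  # unit name -> chunk being processed
    assignments = []
    while pending or busy:
        # Assign work only to units that are idle.
        while pending and idle:
            unit = idle.popleft()
            busy[unit] = pending.popleft()
            assignments.append((unit, busy[unit]))
        # Simulation assumption: the unit that has been busy longest finishes first.
        finished = next(iter(busy))
        del busy[finished]
        idle.append(finished)
    return assignments


assignments = dispatch_to_idle(["c0", "c1", "c2", "c3", "c4"], ["u0", "u1"])
```

No unit receives a new chunk while it is still busy, which is the property the idle-state rule is meant to guarantee.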
Fig. 3 is a structural diagram of the data dispatch processing device of Embodiment 3 of the present invention, which comprises a parameter configuration module, a scheduling module, a data split recording module and a summarizing module;
wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data split recording module and to the summarizing module respectively; and the data split recording module is connected to the summarizing module.
The parameter configuration module is used to add a parallel data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file;
The scheduling module is used to obtain the parallel data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module; it is further used to obtain the schedule information of the task scheduling processing units according to the parallel data volume configuration parameter and the total amount of pending data; and it is further used to schedule, according to the parallel scheduling algorithm parameter, task scheduling processing units in the idle state to process the pending data of the secondary distribution.
The schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit; the parallel data volume configuration parameter specifies the data volume handled by a single task scheduling processing unit.
The data split recording module is used to record the schedule information of the task scheduling processing units in the scheduling module, and to establish the correspondence between the keyword and the schedule information of the task scheduling processing units;
The summarizing module is used to obtain the processing results of the individual task scheduling processing units from the scheduling module; it is further used to obtain the correspondence between the keyword and the schedule information of the task scheduling processing units from the data split recording module; and it is further used to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.
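The connections between the four modules can be sketched in Python as four small classes. This is a hedged illustration only: the class and field names are hypothetical, the preset-value check is omitted, the scheduling algorithm parameter is stored but not interpreted, and the units run one after another instead of in parallel.

```python
import math


class ParameterConfig:
    """Parameter configuration module; both field names are hypothetical."""

    def __init__(self, parallel_data_volume, scheduling_algorithm="idle_first"):
        self.parallel_data_volume = parallel_data_volume  # data volume per unit
        self.scheduling_algorithm = scheduling_algorithm  # not interpreted in this sketch


class SplitRecorder:
    """Data split recording module: keyword -> schedule information."""

    def __init__(self):
        self.key_to_schedule = {}

    def record(self, key, schedule):
        self.key_to_schedule[key] = schedule


class Scheduler:
    """Scheduling module: derives the schedule information and runs the units."""

    def __init__(self, config, recorder):
        self.config = config
        self.recorder = recorder
        self.results = {}  # unit name -> processing result

    def run(self, key, values, work):
        size = self.config.parallel_data_volume
        count = math.ceil(len(values) / size)
        names = [f"{key}-task-{i}" for i in range(count)]
        chunks = [values[i * size:(i + 1) * size] for i in range(count)]
        schedule = {
            "unit_names": names,
            "unit_count": count,
            "amounts": [len(chunk) for chunk in chunks],
        }
        self.recorder.record(key, schedule)
        # Sequential stand-in for parallel execution on idle units.
        for name, chunk in zip(names, chunks):
            self.results[name] = work(chunk)


class Summarizer:
    """Summarizing module: merges the unit results that belong to one keyword."""

    def __init__(self, scheduler, recorder):
        self.scheduler = scheduler
        self.recorder = recorder

    def final_result(self, key, merge):
        names = self.recorder.key_to_schedule[key]["unit_names"]
        return merge([self.scheduler.results[name] for name in names])


config = ParameterConfig(parallel_data_volume=3)
recorder = SplitRecorder()
scheduler = Scheduler(config, recorder)
summarizer = Summarizer(scheduler, recorder)

scheduler.run("key1", list(range(10)), work=sum)
final = summarizer.final_result("key1", merge=sum)
```

The object references mirror the connections in Fig. 3: the scheduler reads the configuration and writes to the recorder, and the summarizer reads from both the scheduler and the recorder.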
Fig. 4 is a structural diagram of the data dispatch processing device of Embodiment 4 of the present invention, which comprises a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded and executed by the processor to perform the following:
According to the keyword information of the pending data, distribute the pending data to the corresponding task scheduling processing unit;
If the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtain the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
According to the schedule information, schedule the task scheduling processing units to process the pending data of the secondary distribution.
Further, pending data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein the pending data with the same keyword is distributed across one or more servers.
Further, the schedule information of the task scheduling processing units is recorded, and a correspondence between the keyword and the schedule information of the task scheduling processing units is established.
Further, once the task scheduling processing units have finished processing the distributed pending data, the processing results of the individual task scheduling processing units are collected according to the correspondence between the keyword and the schedule information of the task scheduling processing units, and are summarized into the final execution result, which is output.
Further, scheduling the task scheduling processing units according to the schedule information to process the pending data of the secondary distribution proceeds as follows:
Task scheduling processing units in the idle state are scheduled to process the pending data of the secondary distribution.
The embodiments of the present invention provide the following technical solution: according to the keyword information of the pending data, the pending data is distributed to the corresponding task scheduling processing unit; if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, the schedule information of the task scheduling processing units is obtained according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit; and, according to the schedule information, the task scheduling processing units are scheduled to process the pending data of the secondary distribution.
In the above technical solution, the large volume of data under the same keyword is split in parallel across multiple task scheduling processing units for processing, which avoids reducing the performance of the whole Spark system because a single task scheduling processing unit processes too much data.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

1. A data dispatch processing method, characterized by comprising the following steps:
according to the keyword information of the pending data, distributing the pending data to the corresponding task scheduling processing unit;
if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
according to the schedule information, scheduling the task scheduling processing units to process the pending data of the secondary distribution.
2. The method according to claim 1, characterized in that pending data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein the pending data with the same keyword is distributed across one or more servers.
3. The method according to claim 2, characterized in that the schedule information of the task scheduling processing units is recorded, and a correspondence between the keyword and the schedule information of the task scheduling processing units is established.
4. The method according to claim 3, characterized in that, once the task scheduling processing units have finished processing the distributed pending data, the processing results of the individual task scheduling processing units are collected according to the correspondence between the keyword and the schedule information of the task scheduling processing units, and are summarized into the final execution result, which is output.
5. The method according to claim 1, characterized in that scheduling the task scheduling processing units according to the schedule information to process the pending data of the secondary distribution proceeds as follows:
task scheduling processing units in the idle state are scheduled to process the pending data of the secondary distribution.
6. A data dispatch processing device, characterized by comprising a processor adapted to execute instructions, and a storage device adapted to store a plurality of instructions, the instructions being adapted to be loaded and executed by the processor to perform the following:
according to the keyword information of the pending data, distributing the pending data to the corresponding task scheduling processing unit;
if the pending data distributed to the task scheduling processing unit is greater than or equal to a preset value, obtaining the schedule information of the task scheduling processing units according to the data volume handled by a single task scheduling processing unit and the total amount of pending data, wherein the schedule information includes at least one of: the task scheduling processing unit names, the number of task scheduling processing units, and the amount of pending data in the secondary distribution of each task scheduling processing unit;
according to the schedule information, scheduling the task scheduling processing units to process the pending data of the secondary distribution.
7. The device according to claim 6, characterized in that pending data with the same keyword is distributed to the corresponding task scheduling processing unit, wherein the pending data with the same keyword is distributed across one or more servers.
8. The device according to claim 7, characterized in that the schedule information of the task scheduling processing units is recorded, and a correspondence between the keyword and the schedule information of the task scheduling processing units is established.
9. The device according to claim 8, characterized in that, once the task scheduling processing units have finished processing the distributed pending data, the processing results of the individual task scheduling processing units are collected according to the correspondence between the keyword and the schedule information of the task scheduling processing units, and are summarized into the final execution result, which is output.
10. The device according to claim 6, characterized in that scheduling the task scheduling processing units according to the schedule information to process the pending data of the secondary distribution proceeds as follows:
task scheduling processing units in the idle state are scheduled to process the pending data of the secondary distribution.
11. A data dispatch processing device, characterized by comprising a parameter configuration module, a scheduling module, a data split recording module and a summarizing module, wherein the parameter configuration module is connected to the scheduling module; the scheduling module is connected to the data split recording module and to the summarizing module respectively; and the data split recording module is connected to the summarizing module.
12. The device according to claim 11, characterized in that the parameter configuration module is used to add a parallel data volume configuration parameter and a parallel task scheduling algorithm parameter to the configuration file;
the scheduling module is used to obtain the parallel data volume configuration parameter and the parallel scheduling algorithm parameter from the parameter configuration module; it is further used to obtain the schedule information of the task scheduling processing units according to the parallel data volume configuration parameter and the total amount of pending data; and it is further used to schedule, according to the parallel scheduling algorithm parameter, task scheduling processing units in the idle state to process the pending data of the secondary distribution;
the data split recording module is used to record the schedule information of the task scheduling processing units in the scheduling module, and to establish the correspondence between the keyword and the schedule information of the task scheduling processing units;
the summarizing module is used to obtain the processing results of the individual task scheduling processing units from the scheduling module; it is further used to obtain the correspondence between the keyword and the schedule information of the task scheduling processing units from the data split recording module; and it is further used to obtain the final execution result of the task scheduling processing units corresponding to the same keyword.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710597512.4A CN107341061A (en) 2017-07-20 2017-07-20 A kind of data dispatch processing method and processing device

Publications (1)

Publication Number Publication Date
CN107341061A true CN107341061A (en) 2017-11-10

Family

ID=60217318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710597512.4A Pending CN107341061A (en) 2017-07-20 2017-07-20 A kind of data dispatch processing method and processing device

Country Status (1)

Country Link
CN (1) CN107341061A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101359295A (en) * 2007-08-01 2009-02-04 阿里巴巴集团控股有限公司 Batch task scheduling and allocating method and system
CN102521051A (en) * 2011-12-05 2012-06-27 中国联合网络通信集团有限公司 Task scheduling method, device and system in Map Reduce system applied to nomography
US20130198753A1 (en) * 2012-01-30 2013-08-01 International Business Machines Corporation Full exploitation of parallel processors for data processing
CN104714785A (en) * 2015-03-31 2015-06-17 中芯睿智(北京)微电子科技有限公司 Task scheduling device, task scheduling method and data parallel processing device
CN105912399A (en) * 2016-04-05 2016-08-31 杭州嘉楠耘智信息科技有限公司 Task processing method, device and system
CN106528275A (en) * 2015-09-10 2017-03-22 网易(杭州)网络有限公司 Processing method of data tasks and task scheduler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171110