CN110825526B - Distributed scheduling method and device based on ER relationship, equipment and storage medium

Distributed scheduling method and device based on ER relationship, equipment and storage medium

Info

Publication number
CN110825526B
Authority
CN
China
Prior art keywords
task
tasks
computing
scheduling method
distributed scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911088140.8A
Other languages
Chinese (zh)
Other versions
CN110825526A (en)
Inventor
冯若寅
万仕龙
邹晓峰
仲跻炜
朱彭生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ouye Yunshang Co ltd
Original Assignee
Ouye Yunshang Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ouye Yunshang Co ltd filed Critical Ouye Yunshang Co ltd
Priority to CN201911088140.8A priority Critical patent/CN110825526B/en
Publication of CN110825526A publication Critical patent/CN110825526A/en
Application granted granted Critical
Publication of CN110825526B publication Critical patent/CN110825526B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/1805Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815Journaling file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462Approximate or statistical queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides an ER-relationship-based distributed scheduling method and apparatus, an electronic device, and a computer storage medium. The distributed scheduling method based on ER relationships is used to schedule tasks in a big data platform and includes the following steps: step S1, determining the range of computing tasks to be scheduled and generating an initial task set; step S2, hierarchically arranging all tasks in the initial task set based on ER relationships; step S3, estimating the memory and processor overhead of each task, computing respective scores through a resource overhead evaluation algorithm, and sorting by score to generate an execution task set; step S4, distributing the tasks in the execution task set to a plurality of computing nodes of the big data platform so that the computing nodes execute their respective tasks. By performing allocation control and overhead measurement of computing resources, the distributed scheduling method alleviates the problem of resource waste.

Description

Distributed scheduling method and device based on ER relationship, equipment and storage medium
Technical Field
The present invention relates to big data platform technology, and in particular to a method and apparatus for ER-relationship-based distributed scheduling of big data platform tasks, an electronic device, and a non-transitory computer storage medium.
Background
As the scope of business development gradually expands, a company's big data platform typically supports more and more data computing tasks and gradually becomes an important support platform for data services. Taking the big data platform of an e-commerce enterprise as an example: starting from roughly 30 data acquisition tasks and 20-30 computing tasks for the reporting needs of an e-commerce analysis center, it gradually grows to cover more than 400 data acquisition tasks and more than 700 data analysis and computing tasks for services such as consignment reports, financial reports, supply chain services, risk early warning, and GMV daily operation reports; meanwhile, the audience of the data services also shifts from the business middle office to business decision making. A big data base platform optimization project is therefore established, with the targeted optimization needs listed as project completion goals, to solve the corresponding application problems.
The scheduling methods currently applied have a practical problem: allocation rules are preset through parameter settings, but such rules do not match the differentiated overhead of actual computing tasks.
Further, there are also the following practical problems:
the granularity of resource overhead is coarse: large computing tasks are under-allocated while small computing tasks waste resources heavily, so large tasks compute slowly and the total number of concurrent small tasks is capped, affecting the overall volume and effectiveness of concurrent operation;
the preprocessing process and management approach are insufficient for specialized analysis of operation results;
rapid, platform-wide locating and troubleshooting of abnormal task runs cannot be achieved;
fault analysis and locating in the existing scheduler takes a long time, so the execution logs of every batch must be retained to leave enough time to locate fault information before the next scheduling run overwrites them; logs of the same hour-level computing task are stored 20+ times a day, so more than 15,000 computing log files are generated daily and the daily log volume exceeds 1 GB on distributed storage, producing many fragments and garbage files in operation and raising maintenance effort and cost.
Disclosure of Invention
In view of the above, the present invention addresses daily expanding business application scenarios. One objective of the present invention is to provide a distributed scheduling method that keeps the front-end query service of a result set available while computing tasks refresh the underlying data sets, so that the big data platform as a whole can run more concurrent computing tasks within a computing cycle. At the same time, the method analyzes and layers the full set of computing tasks: configuration information does not need to be defined manually but is automatically identified and generated by an algorithm that layers tasks based on ER association relationships, greatly reducing human resource investment and management time.
Another object of the present invention is to provide a distributed scheduling apparatus.
It is a further object of the present invention to provide an electronic device.
It is also an object of the invention to provide a non-transitory computer storage medium.
In order to solve the technical problems, the invention adopts the following technical scheme:
the distributed scheduling method according to the embodiment of the first aspect of the present invention is used for scheduling tasks in a big data platform, and is characterized in that,
the hierarchical tasks in the big data platform comprise:
a data acquisition task for acquiring data from a business system;
a data cleaning computation task for cleaning the acquired data;
a detail data computation task for computing the detail data of the big data platform's data warehouse;
an application data computation task for computing the application data of the big data platform's data warehouse,
the distributed scheduling method comprises the following steps:
step S1, determining the range of computing tasks to be scheduled and generating an initial task set;
step S2, hierarchically arranging all tasks in the initial task set based on ER relationships;
step S3, estimating the memory and processor overhead of each task, computing respective scores through a resource overhead evaluation algorithm, and sorting by score to generate an execution task set;
step S4, distributing the tasks in the execution task set to a plurality of computing nodes of the big data platform so that the computing nodes execute their respective tasks.
According to some embodiments of the invention, the distributed scheduling method further comprises the steps of:
step S5, during steps S1 to S4, generating a computing task log file and an alarm log file.
According to some embodiments of the present invention, in step S1, when there is a new task to be issued, the range of the computing task to be scheduled is re-determined, and the initial task set is updated.
According to some embodiments of the invention, the step S2 includes:
analyzing an ER association relation for the computing tasks in the initial task set;
and carrying out hierarchical arrangement based on the ER association relation.
Further, parsing the ER association relationship of each task's individual job comprises: extracting keywords from the text of the computing task and obtaining the computing task's ER association relationship from the keywords.
Further, parsing the ER association relationship of each task's individual job further includes:
performing association analysis on the reference fields of the data tables upstream of the computing task, and determining the source of the specific fields associated with the computing task;
and when a data structure field of the source business system changes, updating the computing task's corresponding fields and the association information of downstream layers in time.
Further, the keywords in the text information of the computing task are identified through a text recognition technique.
Further, the keyword includes one or more of a call field, a connection field, and a naming-rule-based field in the text information.
Further, before the ER association relationship is parsed, the method further comprises:
dividing the computing tasks into a plurality of batches according to a predetermined rule,
wherein, in the layering based on the ER association relationship, the computing tasks within each batch are layered based on their ER association relationships.
Further, the predetermined rules include partitioning by time echelon and/or partitioning by business logic.
Further, parsing the ER association relationship of the computing task further includes:
importing the ER association relationship of each computing task's individual job into the database through a structured query language.
Further, the layering based on the ER association relationship comprises:
sorting all data in the database and classifying it based on a predetermined rule;
calculating the layering value of each computing task based on the ER association relationships of all the computing tasks;
layering the computing tasks based on the layering values.
Further, the layering value of each computing task is calculated based on online analytical processing.
According to some embodiments of the invention, in step S3, the resource overhead evaluation algorithm calculates the respective scores according to the barrel principle.
Further, step S3 includes:
step S31: outputting the execution plan logs of all the computing tasks;
step S32: parsing the resource overhead information in the output execution plan logs and classifying the overhead of each application data computation task:
counting the processor resource overhead x of a single computing task in the log content,
counting the memory overhead of a single computing task on each computing node of the data platform in the log content and summing arithmetically to obtain the memory overhead y,
counting the total number of bytes scanned in the distributed file system by a single task in the log content and summing arithmetically to obtain the storage resource scan amount z,
and then calculating the resource overhead of each application data computation task according to the following ternary quadratic formula:
f(x, y, z) = [ternary quadratic expression in x, y and z; reproduced only as an image (Figure BDA0002266055100000041) in the original publication]
where the coefficient n is the total number of running nodes of the distributed system, x is the processor resource overhead, y is the memory overhead, and z is the storage resource scan amount;
step S33: arranging the computing tasks of the same batch and layer in reverse order of resource overhead f(x, y, z); the allocation method takes the total number of tasks as the dividend and the number of computing nodes (n-1) as the divisor, recording the remainder as variable c;
if the total number of tasks is less than or equal to (n-1), the task names are assigned to each node's execution task set in turn;
if the total number of tasks is greater than (n-1), the c tasks at the end of the sorted queue are assigned to the (n-1)-th execution task set, and the remaining tasks of the reverse-order queue are cyclically assigned to the (n-1) execution task sets in turn.
According to some embodiments of the present invention, in step S4, the tasks in the execution task set are distributed to the plurality of computing nodes of the big data platform in a balanced manner so that the computing nodes execute their respective tasks.
According to other embodiments of the present invention, in step S4, the allocable resources outside the component resource pool of each system node are measured, and the tasks in the execution task set are allocated according to the allocable resources.
Further, based on the allocable resources, memory overhead and processor computing resources are considered separately for allocation.
Further, a monitoring process runs in the system background and queries resource usage at predetermined intervals to obtain the base resource overhead amount; the allocable resources are calculated based on the following formula:
allocable resources = total resource amount - base resource overhead amount.
The distributed scheduling apparatus according to the second aspect of the present invention includes:
an initial task set generation module for determining the range of computing tasks that need to be scheduled and generating an initial task set;
a hierarchical arrangement module for hierarchically arranging all tasks in the initial task set based on ER relationships;
an execution task set generation module for estimating the memory and processor overhead of each task, computing respective scores through a resource overhead evaluation algorithm, and sorting by score to generate an execution task set;
an allocation module for distributing the tasks in the execution task set to a plurality of computing nodes of the big data platform so that the respective tasks are executed by the computing nodes.
The electronic equipment for big data distributed scheduling according to the third aspect of the invention comprises:
one or more processors;
one or more memories having computer readable code stored therein which, when executed by the one or more processors, performs the distributed scheduling method of any of the above embodiments.
A non-transitory computer storage medium according to a fourth aspect of the invention, wherein computer readable code is stored, which when executed by one or more processors performs the distributed scheduling method of any of the above embodiments.
The technical solution of the present invention has at least one of the following beneficial effects:
according to the ER-relationship-based distributed scheduling method, computing tasks are automatically layered based on analysis of ER association relationships; resource overhead statistics are computed through a combination of operating-system instruction sets, big data platform function interfaces, and automated operation-and-maintenance techniques; the task overhead of a batch is counted, the full set of computing tasks is pre-allocated, and the batch's total tasks are divided into per-node task sets, so the computing tasks in the overall task set can be layered efficiently, accurately, and automatically; on this basis, allocation control and overhead measurement of computing resources are realized and execution scheduling instructions are issued to the computing nodes, which alleviates resource waste;
meanwhile, for computing tasks with processing times of different lengths, resources are tilted toward large tasks according to the barrel principle and the concurrency of small tasks is improved, reducing the overall overhead time so that more computing tasks are completed within each one-hour scheduling interval;
because task log files and alarm log files are generated during the process and computing-task tracking and analysis functions are provided, abnormal conditions can be located quickly through task tracking; more specifically, by combining operating-system text commands such as cat and grep, the failed tasks among the hundreds of log files generated by one batch of scheduling can be identified within 2 seconds, key fault abnormality information obtained quickly, and error logs output, so the fault locating time of the distributed scheduling method is 99.3%-99.89% shorter than the troubleshooting time of conventional scheduling;
according to the distributed scheduling method, all resources are regarded as one large pool and the resource range of a starting task is not limited by component parameters; process resources are isolated through the operating system's cgroups, and computing tasks are allocated in a user-defined way by the resource allocation overhead algorithm, satisfying resource allocation for both large and small tasks, reducing allocation granularity, and yielding measurably better concurrency;
in addition, because the log management method and the statistics of task completion are optimized and the generation and maintenance burden of garbage files is reduced, the management approach and statistical methods are improved, no process fragments remain, and maintenance costs are reduced;
the method can meet the scheduling requirements of steadily growing data analysis and computing tasks, the hour-level data refresh frequency and concurrency requirements of business data services, and the various related data governance and management functions that follow.
Drawings
FIG. 1 is a flow chart of a distributed scheduling method according to the present invention;
fig. 2 is a schematic flowchart of layering based on ER association in a distributed scheduling method according to an embodiment of the present invention;
FIG. 3 is another schematic flow diagram for layering based on ER associations in accordance with the present invention;
FIG. 4 is a flowchart of overhead estimation and ordering in a distributed scheduling method according to an embodiment of the present invention;
FIG. 5 is a block diagram of a distributed scheduling apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of a distributed scheduling apparatus in accordance with one embodiment of the present invention;
fig. 7 illustrates an exemplary electronic device that may be used to implement the distributed scheduling methods of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the drawings. It is to be understood that the embodiments described are only some, not all, embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the described embodiments fall within the scope of the present invention.
First, a distributed scheduling method according to an embodiment of the present invention is described in detail with reference to the accompanying drawings.
The distributed scheduling method is used for scheduling tasks in a big data platform.
For example, hierarchical tasks in a big data platform may include:
0_ods layer data acquisition task -> 1_odsp layer data cleaning computation task -> 2_dw0 and 2_dw1 layer detail data computation tasks -> 3_dm0, 3_dm1, 3_dm2, 3_dm3 and 3_dm4 layer application data computation tasks -> 4_h2m layer data push task.
The tasks of each layer have a clear predecessor-successor computation relationship, as indicated by the arrows.
In order to schedule the layered tasks in the big data platform, as shown in fig. 1, the distributed scheduling method according to the present invention includes the following steps:
step S1, determining the range of the calculation tasks to be scheduled, and generating an initial task set.
The computing tasks to be scheduled include the data cleaning computation task, the detail data computation task, and the application data computation task; that is, these tasks are analyzed to determine the range of computing tasks to be scheduled.
Preferably, when a new task is issued, the range of the computing task to be scheduled is re-confirmed, and the initial task set is updated.
Step S2, hierarchically arranging all tasks in the initial task set based on the ER relationship.
According to some embodiments of the present invention, as shown in fig. 2, the performing hierarchical arrangement based on ER relationships may specifically include:
and step S21, analyzing the ER association relation of the calculation task. According to some embodiments of the invention, ER associations for individual jobs of each computing task are resolved.
Further, parsing the ER association relationships of each computing task's individual job may include:
a) extracting keywords from the text of the computing task and obtaining the computing task's ER association relationship from the keywords.
The keywords may be recognized, for example, by conventional text recognition techniques such as OCR. The keywords may include one or more of a call field (FROM), a connection field (JOIN), and naming-rule-based fields in the text.
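As a minimal illustrative sketch (not the patent's actual implementation), such FROM/JOIN keyword extraction might look as follows in Python; the regular expression, the schema-qualified table-naming convention, and the sample SQL are all assumptions:

```python
import re

# Assumed convention: upstream tables are schema-qualified, e.g. dw0.order_detail.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

def extract_upstream_tables(task_sql: str) -> set:
    """Return the upstream tables a computing task reads, i.e. its ER edges."""
    return set(TABLE_REF.findall(task_sql))

sql = """
INSERT INTO dm0.gmv_daily
SELECT d.day, SUM(o.amount)
FROM dw0.order_detail o
JOIN dw1.date_dim d ON o.day_id = d.id
GROUP BY d.day
"""
print(extract_upstream_tables(sql))  # {'dw0.order_detail', 'dw1.date_dim'}
```

The target of the INSERT, read against the FROM/JOIN sources, yields one directed ER edge per upstream table.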
Still further, parsing the ER association relationships of each computing task's individual job may further include:
b) performing association analysis on the reference fields of the data tables upstream of the computing task, and determining the source of the specific fields associated with the computing task;
c) when a data structure field of the source business system changes, updating the computing task's corresponding fields and the association information of downstream layers in time.
Thus, through association analysis of the reference fields of upstream data tables, the specific field sources associated with a computing task become explicit. When the field name, data type, or meaning description of a data structure field in the source business system changes, the data acquisition ods layer and the data cleaning odsp layer can manage the change effectively and guide the corresponding changes to the associated data dictionaries and index information of the downstream dw and dm layers, as well as the correction of the computing tasks, realizing fine-grained management of the big data platform.
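The following sketch illustrates, under assumed data structures, how such field-level association analysis could drive change propagation; the field-reference and output maps are hypothetical examples, not the patent's metadata model:

```python
from collections import deque

# Hypothetical: task -> the upstream table.field references it uses.
FIELD_REFS = {
    "odsp_clean_orders": {"ods.orders.amount", "ods.orders.status"},
    "dw0_order_detail":  {"odsp.clean_orders.amount"},
    "dm0_gmv_daily":     {"dw0.order_detail.amount"},
}

# Hypothetical: the fields each task produces for its downstream layers.
OUTPUTS = {
    "odsp_clean_orders": {"odsp.clean_orders.amount"},
    "dw0_order_detail":  {"dw0.order_detail.amount"},
    "dm0_gmv_daily":     {"dm0.gmv_daily.amount"},
}

def impacted_tasks(changed_field: str) -> list:
    """Tasks whose fields and association information must be updated
    after a source business-system field changes, in propagation order."""
    hit, frontier = [], deque([changed_field])
    while frontier:
        field = frontier.popleft()
        for task, refs in FIELD_REFS.items():
            if field in refs and task not in hit:
                hit.append(task)                        # update this task
                frontier.extend(OUTPUTS.get(task, ()))  # then its downstream
    return hit

print(impacted_tasks("ods.orders.amount"))
# ['odsp_clean_orders', 'dw0_order_detail', 'dm0_gmv_daily']
```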
Still further, the method may further include:
d) importing the ER association relationship of each computing task's individual job into the database through Structured Query Language (SQL), thereby providing a query channel over the data results via the system command line or SQL.
Step S22, layering based on the ER association relationship.
According to some embodiments of the present invention, after the ER association relationship of each computing task's individual job is parsed (for example, by extracting keywords such as FROM, JOIN, and structure-object naming rules from the text of the computing task definition), the upstream task set of the individual job can be determined from the ER association relationship. On this basis, after the ER association relationship of each computing task is imported into the database, the layer of each computing task can be computed and confirmed, and the tasks layered accordingly. Once the proper layer of every computing task is determined, the computing tasks in the big data platform can be layered in this way.
Specifically, the layering based on the ER association relationship may include:
1) sorting all data in the database and classifying it based on a predetermined rule.
After the ER relationship of each computing task is obtained through parsing, it is imported into the database through SQL, and the raw data is then sorted and classified based on predetermined rules.
2) calculating the layering value of each computing task based on the ER association relationships of all the computing tasks.
After the preliminary sorting and classification, ER association analysis and computation are performed to obtain each layering value.
Specifically, the layering value of each computing task may be calculated by an analysis algorithm based on OLAP (online analytical processing).
3) layering the computing tasks based on the layering values.
Specifically, for example, computing tasks with the same layering value may be output after duplicate values are removed, e.g., output to the configuration file of the big data platform's scheduling task set, thereby layering the computing tasks.
FIG. 3 illustrates an example of layering based on the ER associations.
As shown in fig. 3, after each layering value is calculated, filtering is performed layer by layer based on a search filtering rule (for example, equal layering values), and the computing tasks sharing a layering value are placed in the same layer.
In summary: after the computing tasks in the big data platform task set are extracted, the ER association information of each computing task's individual job is first parsed in step S21; thereafter, in step S22, the associations between the computing tasks are computed, layering values are obtained, and the tasks are layered based on those values.
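The patent computes layering values with OLAP-style analysis inside the database; the sketch below is a functionally similar computation in Python, assuming the dependency graph has already been extracted and that a task's layering value is one more than its deepest upstream task (one plausible reading of the scheme):

```python
from functools import lru_cache

# Hypothetical dependency graph recovered from the parsed ER associations:
# task -> set of upstream tasks it reads from.
UPSTREAM = {
    "ods_collect": set(),             # layer 0: data acquisition
    "odsp_clean":  {"ods_collect"},
    "dw0_detail":  {"odsp_clean"},
    "dw1_detail":  {"odsp_clean"},
    "dm0_gmv":     {"dw0_detail", "dw1_detail"},
}

@lru_cache(maxsize=None)
def layer(task: str) -> int:
    """Layering value: one more than the deepest upstream task's layer."""
    ups = UPSTREAM[task]
    return 0 if not ups else 1 + max(layer(u) for u in ups)

by_layer = {}
for t in UPSTREAM:
    by_layer.setdefault(layer(t), []).append(t)
print(by_layer)
# {0: ['ods_collect'], 1: ['odsp_clean'], 2: ['dw0_detail', 'dw1_detail'], 3: ['dm0_gmv']}
```

Tasks sharing a layering value form one batch layer, matching the duplicate-value removal and filtering described above.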
Step S3, estimating the memory and processor overhead of each task, computing respective scores through a resource overhead evaluation algorithm, and sorting by score to generate an execution task set.
After the computing tasks are layered, the overhead of each task is estimated, and an execution task set is generated based on the overhead for distributed scheduling.
Preferably, the resource overhead evaluation algorithm calculates the respective scores according to the barrel principle.
Further, as shown in fig. 4, step S3 includes:
step S31: outputting the execution plan logs of all the computing tasks;
step S32: parsing the resource overhead information in the output execution plan logs and classifying the overhead of each application data computation task:
counting the processor resource overhead x of a single computing task in the log content,
counting the memory overhead of a single computing task on each computing node of the data platform in the log content and summing arithmetically to obtain the memory overhead y,
counting the total number of bytes scanned in the distributed file system by a single task in the log content and summing arithmetically to obtain the storage resource scan amount z,
and then calculating the resource overhead of each application data computation task according to the following ternary quadratic formula:
f(x, y, z) = [ternary quadratic expression in x, y and z; reproduced only as an image (Figure BDA0002266055100000101) in the original publication]
where the coefficient n is the total number of running nodes of the distributed system, x is the processor resource overhead, y is the memory overhead, and z is the storage resource scan amount;
step S33: arranging the computing tasks of the same batch and layer in reverse order of resource overhead f(x, y, z); the allocation method takes the total number of tasks as the dividend and the number of computing nodes (n-1) as the divisor, recording the remainder as variable c;
if the total number of tasks is less than or equal to (n-1), the task names are assigned to each node's execution task set in turn;
if the total number of tasks is greater than (n-1), the c tasks at the end of the sorted queue are assigned to the (n-1)-th execution task set, and the remaining tasks of the reverse-order queue are cyclically assigned to the (n-1) execution task sets in turn.
Thus, for computing tasks with processing times of different lengths, resources are tilted toward large tasks according to the barrel principle and the concurrency of small tasks is improved, reducing the overall overhead time of the computing tasks; more computing tasks can therefore be completed within each unit-time scheduling interval.
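A sketch of the queue splitting in step S33 follows. The exact overhead formula f(x, y, z) is reproduced only as an image in the original publication, so the cost values below are placeholders; the assignment logic, however, follows the dividend/divisor/remainder rule described above:

```python
def allocate(costs: dict, n: int) -> list:
    """Split tasks into (n-1) execution task sets per step S33.

    costs maps task name -> resource overhead f(x, y, z); n is the total
    number of running nodes, so (n-1) execution task sets are built.
    """
    ordered = sorted(costs, key=costs.get, reverse=True)  # reverse order of cost
    sets = [[] for _ in range(n - 1)]
    total = len(ordered)
    if total <= n - 1:
        for i, name in enumerate(ordered):   # one task per node, in turn
            sets[i].append(name)
        return sets
    c = total % (n - 1)                      # remainder c
    if c:
        sets[-1].extend(ordered[-c:])        # queue tail -> (n-1)-th set
        ordered = ordered[:-c]
    for i, name in enumerate(ordered):       # remaining reverse-order queue,
        sets[i % (n - 1)].append(name)       # assigned cyclically
    return sets

costs = {"t1": 9.0, "t2": 7.5, "t3": 4.0, "t4": 2.0, "t5": 0.5}
print(allocate(costs, n=3))  # [['t1', 't3'], ['t5', 't2', 't4']]
```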
Step S4, distributing the tasks in the execution task set to a plurality of computing nodes of the big data platform so that the computing nodes execute their respective tasks.
According to the distributed scheduling method, the task overhead of a batch is counted, the whole set of tasks is pre-allocated, and the batch's total tasks are divided into per-node task sets, realizing allocation control and overhead measurement of computing resources; scheduling execution instructions are then issued to the computing nodes, alleviating resource waste. Moreover, the distributed scheduling method achieves reasonable distribution of resource overhead across the multiple nodes of the distributed architecture: reasonable overhead is allocated within the total range of large and small computing tasks, previously idle resources are put to use, and waiting caused by tasks of different durations is greatly reduced.
In step S4, the tasks in the execution task set may be distributed evenly to the plurality of computing nodes of the big data platform so that the computing nodes execute their respective tasks.
Preferably, in step S4, the allocable resources outside the component resource pool of each system node are measured, and the tasks in the execution task set are allocated according to the allocable resources. Specifically, for example, the system runs a monitoring process in the background that queries resource usage at predetermined intervals (e.g., every 5 seconds) to obtain the base resource overhead amount,
and the allocable resources are then calculated as follows:
allocable resources = total resource amount - base resource overhead amount.
Further, based on the allocable resources, memory overhead and processor computing resources are considered separately for allocation. Through this queue technique, i.e., allocating the tasks in the execution task set according to the allocable resources, the concurrent queue can be kept fully loaded at all times, further enhancing the actual effect.
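As an illustrative sketch of the background monitoring and queue feeding described above (the resource totals, the sampling function, and the dispatch hook are all assumptions; a real implementation would query the operating system or platform interfaces):

```python
import time

POLL_INTERVAL_S = 5   # the embodiment suggests e.g. a 5-second query interval
TOTAL = {"mem_mb": 65536, "cpu_cores": 16.0}   # hypothetical node totals

def sample_base_overhead() -> dict:
    # Placeholder: query the current base resource usage of the node.
    return {"mem_mb": 12000, "cpu_cores": 3.5}

def allocable() -> dict:
    """allocable resources = total resource amount - base resource overhead amount"""
    base = sample_base_overhead()
    return {k: TOTAL[k] - base[k] for k in TOTAL}

def monitor_loop(dispatch):
    """Background loop: re-measure allocable resources at each interval and
    keep the concurrent queue fully loaded by dispatching tasks that fit."""
    while True:
        dispatch(allocable())
        time.sleep(POLL_INTERVAL_S)

print(allocable())   # e.g. {'mem_mb': 53536, 'cpu_cores': 12.5}
```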
The existing scheduler performs allocation management of computing tasks in combination with the big data platform's resource pool and defines the resource range through 6 parameters: container initial memory, container increment memory, container upper-limit memory, container initial CPU core count, container increment CPU core count, and container CPU core count upper limit. Practical use shows that if the initial memory capacity and CPU core count are set too low, large computing tasks cannot start; raising the initial baseline, however, makes small computing tasks waste resources and hurts concurrency. Meanwhile, a low upper limit under-allocates large tasks and slows computation, while a high upper limit lets large tasks seize resources rapidly, again hurting concurrency.
According to the distributed scheduling method provided by the embodiments of the present invention, all resources are regarded as one large pool and the resource range of a starting task is not limited by component parameters; process resources are isolated through the operating system's cgroups, and computing tasks are allocated in a user-defined way by the resource allocation overhead algorithm, satisfying resource allocation for both large and small tasks, reducing allocation granularity, and yielding measurably better concurrency. In addition, the distributed scheduling method according to the above embodiments of the present invention may further include the following step:
step S5, in the steps S1 to S4, generating a task log file and an alarm log file. Therefore, a task tracking and analyzing function is provided, and abnormal conditions can be quickly located through task tracking.
Specifically, through the combination of text operation instructions cat, grep and the like of an operating system layer, tasks which fail to operate in hundreds of log files generated by a batch of scheduling can be rapidly checked within 2 seconds, the key abnormal information of fault errors can be rapidly acquired, and error logs are output, so that within the time of operating tasks and the total number of the operating tasks of a batch of scheduling, a plurality of operation alarms are generated, and the successful tasks and the failed tasks are intuitively output (some operation alarms do not influence the success of fault execution); the follow-up of the operation result of the scheduler is very convenient.
The conventional scheduler is graphical: the content of each scheduling result must be queried by clicking down layer by layer, alarms cannot be distinguished from execution failures, and it is inconvenient to tell whether the run as a whole is normal; fault locating and follow-up with the conventional scheduler takes on the order of 5-30 minutes, and a wide-ranging abnormality of more than 15 failed tasks needs over an hour to troubleshoot completely.
Therefore, the fault locating time of the distributed scheduling method is 99.3%-99.89% shorter than the troubleshooting time of conventional scheduling.
In addition, because the distributed scheduling and fault analysis/locating complete at the second level, each computing task only needs to keep its single latest log file, plus 5 hourly fault summary report logs; fewer than 1000 log files need to be generated per day, a 93% reduction, and the log volume no longer grows over time. The distributed scheduling method thus greatly reduces the number of historical garbage log files, makes it possible to verify at any time that all tasks are running normally, and greatly reduces maintenance effort and cost.
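A minimal sketch of such a batch-wide failure scan is shown below; the log directory layout and the failure markers are assumptions, since the patent itself describes operating-system commands such as cat and grep:

```python
from pathlib import Path

# Assumed failure markers; the patent only describes grep-style text matching.
FAIL_MARKERS = ("ERROR", "FAILED")

def scan_batch_logs(log_dir: str) -> dict:
    """Collect the matching lines of every failed task in a batch:
    a Python equivalent of the cat/grep combination described above."""
    failures = {}
    for log in sorted(Path(log_dir).glob("*.log")):
        hits = [line.rstrip() for line in log.open(errors="replace")
                if any(marker in line for marker in FAIL_MARKERS)]
        if hits:
            failures[log.name] = hits
    return failures

for name, lines in scan_batch_logs("/var/log/scheduler/batch_latest").items():
    print(name, lines[:3])   # key abnormality lines per failed task
```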
A distributed scheduling apparatus 100 according to the present invention is described below with reference to fig. 5.
As shown in fig. 5, the distributed scheduling apparatus 100 according to the present invention includes: an initial task set generating module 101, a hierarchical arrangement module 102, an execution task set generating module 103, and an allocation module 104.
An initial task set generating module 101, configured to determine a range of computing tasks that need to be scheduled, and generate an initial task set.
The hierarchical arrangement module 102 is configured to hierarchically arrange the tasks in all the initial task sets based on the ER relationship.
The execution task set generating module 103 is configured to estimate memory and processor overhead of each task, calculate respective scores through a resource overhead evaluation algorithm, and sort the scores to generate an execution task set.
The distribution module 104 is configured to distribute the tasks in the set of executing tasks to a plurality of computing nodes of the big data platform such that the respective tasks are executed by the plurality of computing nodes.
In the following, a distributed scheduling apparatus 100' according to an embodiment of the present invention is described with reference to fig. 6.
As shown in fig. 6, the distributed scheduling apparatus of the present embodiment includes an initial scheduling task set module 101 ', a hierarchical scheduling set module 102 ', a calculation module 1031, an optimization module 1032, and an allocation module 104 '.
The initial task set module 101' has the following function: when a computing task is issued, it performs code analysis on the newly added or updated task, parses the dependency relationships, and determines into which layer of the scheduling set the new task should be placed.
After that, building on the initial task set module 101', the layering module 102' layers the initial scheduling tasks based on each task's ER relationships to generate a layered scheduling set.
Next, the calculation module 1031 performs overhead estimation on the scheduling tasks of each layer, and the optimization module 1032 ranks the scheduling tasks according to the calculation results of the calculation module 1031 to obtain an execution task set.
Thereafter, the assignment module 104' assigns the tasks in the set of executing tasks to the respective nodes.
FIG. 7 illustrates an exemplary electronic device that may be used to implement the processing methods of the present disclosure.
The electronic device 1000 includes at least one processor 1002 that executes instructions stored in a memory 1004. The instructions may be, for example, instructions for implementing the functions described as being performed by one or more of the modules described above, or instructions for implementing one or more steps of the methods described above. The processor 1002 may access the memory 1004 via a system bus 1006. In addition to storing executable instructions, the memory 1004 may also store training data and the like. The processor 1002 may be any of a variety of devices having computing capabilities, such as a central processing unit (CPU) or a graphics processing unit (GPU). The CPU may be an X86 or ARM processor; the GPU may be integrated directly onto the motherboard, built into the motherboard's north bridge chip, or built into the CPU.
The electronic device 1000 also includes a data store 1008 that is accessible by the processor 1002 via the system bus 1006. Data store 1008 may include executable instructions, multi-image training data, and the like. The electronic device 1000 also comprises an input interface 1010 that allows external devices to communicate with the electronic device 1000. For example, the input interface 1010 may be used to receive instructions from an external computer device, from a user, or the like. The electronic device 1000 may also include an output interface 1012 that interfaces the electronic device 1000 with one or more external devices. For example, the electronic device 1000 may display images and the like through the output interface 1012. It is contemplated that external devices in communication with electronic device 1000 via input interface 1010 and output interface 1012 can be included in an environment that provides virtually any type of user interface with which a user can interact. Examples of user interface types include graphical user interfaces, natural user interfaces, and the like. For example, the graphical user interface may accept input from a user using input device(s) such as a keyboard, mouse, remote control, etc., and provide output on an output device such as a display. Further, the natural language interface may enable a user to interact with the electronic device 1000 in a manner that does not require the constraints imposed by input devices such as a keyboard, mouse, remote control, and the like. Instead, natural user interfaces may rely on speech recognition, touch and stylus recognition, gesture recognition on and near the screen, air gestures, head and eye tracking, speech and speech, vision, touch, gestures, and machine intelligence, among others.
The various illustrative logical blocks, modules, and circuits described may be implemented or performed with a general purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the present disclosure may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of tangible storage medium. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, and the like. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media.
The methods disclosed herein comprise one or more acts for implementing the described methods. The methods and/or acts may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions on a tangible computer-readable medium. The computer readable medium includes a computer readable storage medium. Computer readable storage media can be any available storage media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Additionally, propagated signals are not included within the scope of computer-readable storage media. Computer-readable media also includes communication media including any medium that facilitates transfer of a computer program from one place to another. The connection may be, for example, a communication medium. For example, if the software is transmitted from a web site, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of communication medium. Combinations of the above should also be included within the scope of computer-readable media. Alternatively or in addition, the functions described herein may be performed, at least in part, by one or more hardware logic components. For example, illustrative types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
Accordingly, a computer program product may perform the operations presented herein. For example, such a computer program product may be a computer-readable tangible medium having instructions stored (and/or encoded) thereon that are executable by one or more processors to perform the operations described herein. The computer program product may include packaged material.
Software or instructions may also be transmitted over a transmission medium. For example, the software may be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave.
Further, modules and/or other suitable means for carrying out the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk) so that the user terminal and/or base station can obtain the various methods when coupled to or providing storage means to the device. Further, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
Additionally, although the electronic device 1000 is shown as a single system, it is understood that the electronic device 1000 may be a distributed system and may be arranged as a cloud infrastructure (including a public cloud or a private cloud). Thus, for example, several devices may communicate over a network connection and may collectively perform tasks described as being performed by the electronic device 1000.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (18)

1. A distributed scheduling method for scheduling tasks in a big data platform,
the hierarchical tasks in the big data platform comprise:
a data acquisition task, which acquires data from a business system to obtain a data set;
a data cleaning calculation task for calculating the data to be cleaned in the data set;
a detail data calculation task to calculate the detail data in the dataset;
an application data computation task to compute application data in the dataset,
the distributed scheduling method comprises the following steps:
step S1, determining the range of computing tasks to be scheduled and generating an initial task set;
step S2, hierarchically arranging the computing tasks in the initial task set based on ER relationships;
step S3, estimating the memory and processor overhead of each task, computing respective scores through a resource overhead evaluation algorithm, and sorting by score to generate an execution task set;
step S4, distributing the tasks in the execution task set to a plurality of computing nodes of the big data platform so that the computing nodes execute their respective tasks; wherein step S2 comprises:
analyzing an ER association relation for the computing tasks in the initial task set;
performing hierarchical arrangement based on the ER association relationships; wherein,
for the computing tasks in the initial task set, parsing the ER association relationship comprises: parsing the ER association relationship of each task's individual job, wherein parsing the ER association relationship of each task's individual job comprises:
extracting keywords from the text of the computing task and obtaining the computing task's ER association relationship from the keywords;
performing association analysis on the reference fields of the data tables upstream of the computing task, and determining the source of the specific fields associated with the computing task;
and when a data structure field of the source business system changes, updating the computing task's corresponding fields and the association information of downstream layers in time.
2. The distributed scheduling method of claim 1, further comprising the steps of:
step S5, during steps S1 to S4, generating a computing task log file and an alarm log file.
3. The distributed scheduling method of claim 1 wherein, in step S1, when there is a new task issued, the range of the computing task to be scheduled is re-determined, and the initial task set is updated.
4. The distributed scheduling method of claim 1, wherein the extracting keywords from the text information of the computing task comprises: identifying the keywords in the text information of the computing task through a text recognition technology.
5. The distributed scheduling method of claim 4 wherein the keywords comprise one or more of call fields, connection fields, and naming-rule based fields in the textual information.
6. The distributed scheduling method of claim 5, further comprising the following steps before resolving the ER association relationship:
dividing the computing task into a plurality of batches according to a preset rule,
in the layering based on the ER association relationship, in each batch, the computing tasks are layered based on the ER association relationship.
7. The distributed scheduling method according to claim 6 wherein the predetermined rules include partitioning by time echelon and/or partitioning by service logic.
8. The distributed scheduling method of claim 1 wherein resolving ER associations for a single job for each task further comprises: and importing the ER association relation of the single job of each computing task into the database through a structured query language.
9. The distributed scheduling method of claim 8, wherein said hierarchically arranging based on the ER associations comprises:
sorting all data in the database and classifying it based on a predetermined rule;
calculating the layering value of each computing task based on the ER association relationships of all the computing tasks;
layering the computing tasks based on the layering values.
10. The distributed scheduling method of claim 9 wherein the layering value of each computing task is calculated based on online analytical processing.
11. The distributed scheduling method of claim 1 wherein in step S3, the resource cost evaluation algorithm calculates respective scores according to the barrel principle.
12. The distributed scheduling method of claim 11, wherein the step S3 includes:
step S31: outputting execution plan logs of all the computing tasks;
step S32: parsing resource overhead information in the output execution plan log, classifying the overhead of each application data computation task,
the processor resource overhead x of a single computational task in the log content is counted,
counting the memory overhead of each computing node of a single computing task in the log content on the data platform, carrying out arithmetic addition to obtain the memory overhead y,
counting the total number of bytes scanned by a single task in the log content in the distributed file system, performing arithmetic addition to obtain the scanning amount z of the storage resource,
and then, calculating the resource overhead of each application data calculation task according to the following ternary quadratic calculation formula:
Figure FDA0002684253090000031
wherein the coefficient n is the total number of running nodes of the distributed system, x is the processor resource overhead, y is the memory overhead, and z is the storage resource scanning amount;
step S33: arranging the computing tasks of the same batch and layer in reverse order of their resource overhead f(x, y, z); the distribution method takes the total number of tasks as the dividend and the number of computing nodes (n-1) as the divisor, and records the remainder as variable c;
if the total number of tasks is less than or equal to (n-1), the task names are assigned in turn to the executing task set of each node;
if the total number of tasks is greater than (n-1), the c tasks at the end of the sorted queue are assigned to the (n-1)-th executing task set, and the remaining tasks of the reverse-order queue are cyclically assigned in turn to the (n-1) executing task sets.
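Step S33 above is a concrete, checkable algorithm, sketched below in Python. Because f(x, y, z) is published only as an image, the task costs here are assumed inputs; the function name and sample data are illustrative only.

    # Minimal sketch of step S33: reverse-order, remainder-aware assignment
    # of tasks to (n-1) executing task sets.
    def distribute(tasks_with_cost, n):
        # tasks_with_cost: list of (name, cost); n: total running nodes.
        slots = n - 1
        queue = sorted(tasks_with_cost, key=lambda t: t[1], reverse=True)
        names = [name for name, _ in queue]
        sets = [[] for _ in range(slots)]
        if len(names) <= slots:
            for i, name in enumerate(names):  # one task per node, in turn
                sets[i].append(name)
            return sets
        c = len(names) % slots                # remainder, as in the claim
        if c:
            sets[slots - 1].extend(names[-c:])  # queue tail -> (n-1)-th set
            names = names[:-c]
        for i, name in enumerate(names):      # cycle the rest across sets
            sets[i % slots].append(name)
        return sets

    costs = [9, 7, 5, 4, 3, 2, 1]
    tasks = [("t%d" % i, cost) for i, cost in enumerate(costs)]
    print(distribute(tasks, 4))
    # [['t0', 't3'], ['t1', 't4'], ['t6', 't2', 't5']]

Placing the costliest tasks at the head of the cycle and the c leftover cheapest tasks in the last set keeps the per-set totals roughly balanced, which matches the reverse-order intent of the claim.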
13. The distributed scheduling method of claim 1, wherein, in step S4, the tasks in the executing task sets are distributed evenly among the plurality of computing nodes of the big data platform, and the respective tasks are executed by the plurality of computing nodes.
14. The distributed scheduling method of claim 1, wherein, in step S4, the allocable resources outside the component resource pool of each system node are measured, and the tasks in the executing task set are assigned according to the allocable resources.
15. The distributed scheduling method of claim 14, wherein the system background runs a monitoring process that queries the resource operation status at predetermined time intervals to obtain the base resource overhead amount, and calculates the allocable resources based on the following formula:
allocable resources = total resource amount - base resource overhead amount.
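The claim-15 calculation is simple enough to sketch as a polling loop. psutil is an assumed way to read live usage (the patent names no library), and the 60-second interval is an illustrative choice.

    # Minimal sketch of claim 15: allocable = total - base overhead,
    # recomputed at a predetermined interval.
    import time
    import psutil

    TOTAL_MEM = psutil.virtual_memory().total

    def allocable_memory():
        base_overhead = psutil.virtual_memory().used  # base resource overhead
        return TOTAL_MEM - base_overhead

    for _ in range(3):                 # poll at a predetermined interval
        print("allocable bytes:", allocable_memory())
        time.sleep(60)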
16. The distributed scheduling method of claim 14, wherein, within the allocable resources, the memory overhead and the processor computing resource overhead are considered separately.
17. An electronic device for big data distributed scheduling, comprising:
one or more processors;
one or more memories having computer-readable code stored therein which, when executed by the one or more processors, performs the distributed scheduling method of any of claims 1-16.
18. A non-transitory computer storage medium having stored therein computer readable code, which when executed by one or more processors performs the distributed scheduling method of any one of claims 1-16.
CN201911088140.8A 2019-11-08 2019-11-08 Distributed scheduling method and device based on ER relationship, equipment and storage medium Active CN110825526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911088140.8A CN110825526B (en) 2019-11-08 2019-11-08 Distributed scheduling method and device based on ER relationship, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110825526A CN110825526A (en) 2020-02-21
CN110825526B (en) 2020-10-30

Family

ID=69553621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911088140.8A Active CN110825526B (en) 2019-11-08 2019-11-08 Distributed scheduling method and device based on ER relationship, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110825526B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111427748B (en) * 2020-03-31 2023-06-23 携程计算机技术(上海)有限公司 Task alarm method, system, equipment and storage medium
CN113111071B (en) * 2021-05-11 2024-05-07 北京星辰天合科技股份有限公司 Object processing method, device, nonvolatile storage medium and processor
CN113535400A (en) * 2021-07-19 2021-10-22 闻泰通讯股份有限公司 Parallel computing resource allocation method and device, storage medium and terminal equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104899199B (en) * 2014-03-04 2018-12-28 阿里巴巴集团控股有限公司 A kind of data warehouse data processing method and system
US10176011B2 (en) * 2015-06-10 2019-01-08 Tata Consultancy Services Limited Automatically generating and executing a service operation implementation for executing a task
CN106775841B (en) * 2016-11-29 2021-02-19 深圳广电银通金融电子科技有限公司 Method, system and device for upgrading plug-in
CN106802947B (en) * 2017-01-13 2019-11-29 中国工商银行股份有限公司 The data processing system and method for entity relationship diagram
CN106919449B (en) * 2017-03-21 2020-11-20 联想(北京)有限公司 Scheduling control method of computing task and electronic equipment
CN107493205B (en) * 2017-07-13 2020-08-14 华为技术有限公司 Method and device for predicting capacity expansion performance of equipment cluster
CN108763565B (en) * 2018-06-04 2022-06-14 广东京信软件科技有限公司 Deep learning-based data automatic association matching construction method
CN109101492A (en) * 2018-07-25 2018-12-28 南京瓦尔基里网络科技有限公司 Usage history conversation activity carries out the method and system of entity extraction in a kind of natural language processing
CN109445928A (en) * 2018-11-14 2019-03-08 郑州云海信息技术有限公司 A kind of access request processing method, device, equipment and readable storage medium storing program for executing
CN110245023B (en) * 2019-06-05 2020-09-25 欧冶云商股份有限公司 Distributed scheduling method and device, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
CN110825526A (en) 2020-02-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant