CN111061762A - Distributed task processing method, related device, system and storage medium - Google Patents

Distributed task processing method, related device, system and storage medium Download PDF

Info

Publication number
CN111061762A
CN111061762A (application CN201911085929.8A)
Authority
CN
China
Prior art keywords
task
subtasks
processed
processing
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911085929.8A
Other languages
Chinese (zh)
Inventor
李温良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JD Digital Technology Holdings Co Ltd
Original Assignee
JD Digital Technology Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JD Digital Technology Holdings Co Ltd filed Critical JD Digital Technology Holdings Co Ltd
Priority to CN201911085929.8A priority Critical patent/CN111061762A/en
Publication of CN111061762A publication Critical patent/CN111061762A/en
Pending legal-status Critical Current

Classifications

    • G (PHYSICS) > G06 (COMPUTING; CALCULATING OR COUNTING) > G06F (ELECTRIC DIGITAL DATA PROCESSING) > G06F16/00 (Information retrieval; Database structures therefor; File system structures therefor) > G06F16/20 (of structured data, e.g. relational data)
    • G06F16/24552: Database cache management (Querying > Query processing > Query execution)
    • G06F16/24578: Query processing with adaptation to user needs, using ranking
    • G06F16/2462: Approximate or statistical queries (Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries)
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a distributed task processing method, related device, system and storage medium. The method comprises the following steps: a master server acquires a task to be processed; determines, according to configuration information of the data in a database cluster, at least one to-be-processed data table in the cluster that corresponds to the task; divides the to-be-processed data tables into data table queues corresponding to a plurality of subtasks according to task granularity configuration information; and distributes the data table queues corresponding to the subtasks to the task execution servers. A slave server receives the subtasks distributed by the master server; processes the subtasks with a thread pool executor and puts the processing results into a cache; and reads a first parameter and a second parameter, and, when the two are the same, summarizes the processing results in the cache and returns the execution result of the task to be processed.

Description

Distributed task processing method, related device, system and storage medium
Technical Field
The present invention relates to the field of database technologies, and in particular, to a distributed task processing method, a related device, a system, and a storage medium.
Background
With the rapid development of internet technology, the amount of data that various systems and applications rely on and generate has grown explosively. High-frequency systems, such as order centers, produce tens of millions of records per day, while medium- and low-frequency systems still produce hundreds of thousands or millions. To keep persistence and queries efficient for systems generating such volumes of data, the related art adopts a multi-database, multi-table clustering scheme, i.e., hashing the data across different tables of different databases.
Under a multi-database, multi-table scheme, the increment of a single database or a single data table is in the thousands or even tens of thousands, against a total data volume in the tens of millions. In the process of implementing the invention, the inventor found that the scheme also brings disadvantages: data can only be located through the hashed key attributes, so range queries and statistical queries cannot be performed directly; traversing all of the data is very difficult; and extracting all of the data before computing statistics sacrifices real-time performance. It is therefore highly desirable to simplify statistics and traversal operations over a database cluster as much as possible and to reduce the workload developers spend outside of business logic, while still meeting real-time requirements.
Disclosure of Invention
In order to solve the problems in the related art, embodiments of the present invention provide a distributed task processing method, related device, system, and storage medium, which simplify statistics and traversal operations over a database cluster and reduce the workload developers spend outside of business logic, while meeting real-time requirements.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides a distributed task processing method, which is applied to a main server and comprises the following steps:
acquiring a task to be processed;
determining at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster according to configuration information of data in the database cluster;
dividing the at least one data table to be processed into data table queues corresponding to a plurality of subtasks according to task granularity configuration information; each data table queue reflects the distribution, in the database cluster, of the to-be-processed data tables of the corresponding subtask;
and distributing the data table queues corresponding to the subtasks to the task execution servers.
In the foregoing solution, the allocating the data table queues corresponding to the respective subtasks to the respective task execution servers includes:
and distributing the data table queues corresponding to the subtasks to the task execution servers by using a polling mechanism.
The embodiment of the invention also provides a distributed task processing method, which is applied to the slave server and comprises the following steps:
receiving a subtask distributed by a main server;
processing the subtasks by using a thread pool executor, and putting a processing result into a cache;
and reading a first parameter and a second parameter, summarizing each processing result in the cache when the first parameter and the second parameter are the same, and returning an execution result of the task to be processed.
In the foregoing solution, the sub-task includes a plurality of tasks, and the processing the sub-task by using the thread pool executor includes:
and multithreading processing the tasks by utilizing a thread pool executor, and monitoring the multithreading processing process.
In the foregoing solution, before the processing of the subtasks by using the thread pool executor, the method further includes:
according to a preset configuration rule, performing repeated operation condition check on the subtasks;
when determining that no repeatedly run subtasks exist, locking the subtasks;
and the processing of the subtasks by using the thread pool executor includes:
and processing the locked subtasks by using the thread pool executor.
An embodiment of the present invention further provides a main server in a distributed task processing system, including:
the acquisition unit is used for acquiring the tasks to be processed;
the determining unit is used for determining at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster according to the configuration information of the data in the database cluster;
the dividing unit is used for dividing the at least one data table to be processed into a plurality of data table queues corresponding to the subtasks according to the task granularity configuration information; the data table queue can reflect the distribution condition of the data tables to be processed of the corresponding subtasks in the database cluster;
and the distribution unit is used for distributing the data table queues corresponding to the subtasks to the task execution servers.
An embodiment of the present invention further provides a slave server in a distributed task processing system, including:
the receiving unit is used for receiving the subtasks distributed by the main server;
the task processing unit is used for processing the subtasks by using the thread pool executor and putting processing results into a cache;
and the collection returning unit is used for reading the first parameter and the second parameter, collecting the processing results in the cache when the first parameter and the second parameter are the same, and returning the execution results of the tasks to be processed.
The embodiment of the invention also provides a distributed task processing system which comprises the master server and the slave server.
An embodiment of the present invention further provides a server, including:
a memory for storing executable instructions;
and the processor is used for implementing the distributed task processing method applied to the master server or implementing the distributed task processing method applied to the slave server provided by the embodiment of the invention when the executable instructions stored in the memory are executed.
An embodiment of the present invention further provides a storage medium, where the storage medium stores executable instructions, and when the executable instructions are executed by at least one processor, the distributed task processing method applied to a master server according to the embodiment of the present invention is implemented, or the distributed task processing method applied to a slave server according to the embodiment of the present invention is implemented.
The embodiment of the invention provides a distributed task processing method, related device, system and storage medium. The method comprises the following steps: a master server acquires a task to be processed; determines, according to configuration information of the data in a database cluster, at least one to-be-processed data table in the cluster that corresponds to the task; divides the to-be-processed data tables into data table queues corresponding to a plurality of subtasks according to task granularity configuration information, each data table queue reflecting the distribution, in the database cluster, of the to-be-processed data tables of its subtask; and distributes the data table queues corresponding to the subtasks to the task execution servers. A slave server receives the subtasks distributed by the master server; processes them with a thread pool executor and puts the processing results into a cache; and reads a first parameter and a second parameter, and, when the two are the same, summarizes the processing results in the cache and returns the execution result of the task to be processed. In the embodiment of the invention, the task to be processed is divided into subtasks in the form of data table queues so that each subtask is processed directly against the database cluster, which satisfies the real-time requirement. Meanwhile, by combining cluster-level distributed execution with single-machine multi-threaded task execution, the embodiment greatly simplifies statistics and traversal operations over the database cluster and improves task execution efficiency, thereby reducing the workload developers spend outside of business logic.
Drawings
FIG. 1 is a diagram illustrating a statistical operation or a ranged data operation performed in a database cluster having a multi-database, multi-table structure in the related art;
fig. 2 is a schematic diagram of a distributed task processing method applied to a server cluster in the embodiment of the present invention;
fig. 3 is a schematic flow chart of an implementation of a distributed task processing method applied to a master server according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an interface for configuring information for task granularity according to an embodiment of the present invention;
fig. 5 is a schematic flow chart of an implementation of a distributed task processing method applied to a slave server according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a main server in a distributed task processing system according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram illustrating a composition of a slave server in a distributed task processing system according to an embodiment of the present invention;
fig. 8 is a schematic diagram of a hardware structure of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention.
Fig. 1 is a diagram illustrating a statistical operation or a ranged data operation performed in a database cluster having a multi-database, multi-table structure in the related art. Such an operation needs to go through the following steps:
step a: compiling a complex database cluster traversing device;
in the related art, when a statistical or scoped task is executed in a multi-library and multi-table database cluster, a complex traversal program needs to be written in the program, and the execution sequence of the program itself is linear, and database positioning, table positioning, scope data query limitation, data temporary storage, exception handling, failure retry and the like often need to be processed.
Step b: setting the statistics logic, or the logic corresponding to the task to be executed, in the database cluster traversing device;
After the database cluster traversing device is built, only the positioning and traversal of all databases and data tables are solved; the actually required statistics or data processing logic still needs to be placed into the traversing device, and, depending on how the traversing device is designed, the data results it returns may need some adaptation to the statistics or task execution logic.
Step c: summarizing the statistics or task execution results.
After the statistics or task execution finishes, the service center gathers the temporarily stored intermediate results of each data table and returns the summary to the front end. The front end here can be understood as a page on which relevant personnel, such as the task initiator, can view the results of the task.
Based on the above processing steps, the related art has several disadvantages when performing statistical or ranged tasks in a multi-database, multi-table cluster:
1. A complex database cluster traversing device needs to be built: the developers of a business system should focus on the processing of business logic. The specific distribution of the database cluster need not be transparent to them, but that transparency only extends to locating a datum's storage position through hashing or another distribution mechanism; beyond that, developers have to build the traversing device around the specific layout of the databases and tables, which usually costs more work than handling the business itself. Moreover, for reasons of security and database availability, the database cluster administrator generally does not grant developers permission to inspect the cluster's specific distribution.
2. A single-machine program executes inefficiently: the database cluster traversing device a developer builds for such problems usually runs on a single server; that is, the server on which the task is triggered starts the traversal of the whole database, which executes linearly from database 1, table 1 through database N, table N. If the cluster is large and the number of tables reaches the tens of thousands, the data traversal alone takes very long, the logic itself also takes time to execute, and consequently the whole run is very slow and no result can be obtained in a short time.
3. The execution process lacks a monitoring mechanism: traversing a whole database cluster involves a large amount of data, and the data combed through in large-scale statistics is usually in the tens of millions of rows. In the prior art, rows whose logic fails to execute are difficult to count and to return in the final statistics; generally, only some characteristics of rows that explicitly failed can be checked in the logs.
The distributed task processing method, related device, system and storage medium of the embodiments of the present invention were conceived precisely to solve the above problems: by combining cluster-level distributed execution with single-machine multi-threaded task execution, they greatly simplify statistics and traversal operations over a database cluster and improve task execution efficiency, thereby reducing the workload developers spend outside of business logic.
The distributed task processing method in the embodiment of the invention is applied to a server cluster. Here, the server cluster includes a master (leader) server and at least one slave (follower) server. The master server mainly generates specific data processing tasks, distributes them, and may also execute them; a slave server mainly executes the data processing tasks distributed by the master server and summarizes the data. Fig. 2 is a schematic diagram of the distributed task processing method in the embodiment of the present invention applied to a server cluster.
It should be noted that: when distributing the data processing tasks, the master server may also allocate tasks to itself; after distribution, it executes the tasks allocated to itself, and in that role it behaves the same as a slave server.
An embodiment of the present invention provides a distributed task processing method, which is applied to a main server, and fig. 3 is a schematic diagram of an implementation flow of the distributed task processing method according to the embodiment of the present invention. As shown in fig. 3, the method comprises the steps of:
step 301: acquiring a task to be processed;
the task to be processed can be statistical operation or extensive data operation in a database cluster with a multi-library and multi-table structure. For example, the pending task may specifically be to raise the membership level of all users with points exceeding 1000 in a database cluster with a multi-bank and multi-table structure by one level.
Here, multi-database, multi-table means that data originally stored in one database is stored across a plurality of databases, and data originally stored in one data table is stored across a plurality of data tables. This includes vertical and horizontal segmentation of the original database. Vertical segmentation divides the data tables by functional module and closeness of relationship and deploys them to different databases; for example, a commodity definition table can be kept in a commodity database, a user data table in a user database, and so on. Horizontal segmentation means that when the amount of data in one table grows too large, the table's data can be divided according to some rule and stored across several tables of identical structure. A sketch of how data is typically located under such segmentation follows.
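As an illustrative aside, the following is a minimal sketch of hash-based routing under horizontal segmentation; the class, the constants and the naming scheme are hypothetical and not taken from the patent:

```java
// Minimal sketch of hash-based routing under horizontal segmentation.
// DB_COUNT, TABLES_PER_DB and the naming scheme are assumptions for illustration.
public final class ShardRouter {
    private static final int DB_COUNT = 4;
    private static final int TABLES_PER_DB = 16;

    /** Returns the database and table holding the row for the given key. */
    public static String route(long userId) {
        int slot = (int) (Math.abs(userId) % (DB_COUNT * TABLES_PER_DB));
        int db = slot / TABLES_PER_DB;    // which database
        int table = slot % TABLES_PER_DB; // which table inside that database
        return "user_db_" + db + ".user_" + table;
    }

    public static void main(String[] args) {
        System.out.println(route(123456789L)); // prints user_db_1.user_5
    }
}
```

As the background section notes, such a router can only locate a row by its hashed key attribute; it is of no help for range or statistical queries, which is the problem the embodiments address.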
In practical application, the master server may acquire the task to be processed by receiving an input task or by receiving a task directly assigned by an upstream system. Receiving an input task may mean receiving a task entered by relevant personnel through an input interface such as a keyboard or a mouse. The upstream system may be a higher-level system that assigns tasks to the master server.
Step 302: determining at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster according to configuration information of data in the database cluster;
The database cluster is one with a multi-database, multi-table structure. Here, the configuration information of the data in the cluster may include: the placement rules of the databases and data tables in the cluster, the specific locations of the data tables within the cluster, and so on.
In practical application, forming the data table queues from the task to be processed can be understood as the initialization stage of the task queue. In this stage, having acquired the task, the master server analyzes the specific situation of the whole database cluster, working out the number of databases, the number of data tables and the specific addresses of the relevant databases, and determines the at least one to-be-processed data table, in practice all of the to-be-processed data tables, that correspond to the task in the cluster.
Step 303: dividing the at least one data table to be processed into data table queues corresponding to a plurality of subtasks according to task granularity configuration information; each data table queue reflects the distribution, in the database cluster, of the to-be-processed data tables of the corresponding subtask;
Task granularity describes the relative size or coarseness of a task unit. For example, 10 data tables may serve as one unit of task granularity. Here, the task granularity configuration information may include: table structure and count, whether the task may re-enter while running, the task execution timeout, how task execution records are retained, the segmentation granularity of the task, whether each fetch reads in forward or reverse order, and so on. An interface for part of the task granularity configuration is shown in FIG. 4.
In practical application, the master server divides the task to be processed into a plurality of subtasks according to the task granularity configuration information, that is, it divides all the to-be-processed data tables into a plurality of data table queues. The plurality of subtasks constitute a task queue.
Once the clustering scheme of the data and the task granularity are configured, a statistical query or traversal task can be turned into a number of distributable execution tasks, and the granularity of those tasks can be changed by editing the task granularity configuration items.
Here, the to-be-processed data tables forming the execution range of one subtask may be a single data table in some database of the cluster, or several such data tables. That is, a data table queue may be a task in the form of a single data table or a task in the form of multiple data tables, as the sketch below illustrates.
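A minimal sketch (assumed structure, not the patent's code) of dividing the to-be-processed table locators into data table queues of a configured granularity:

```java
import java.util.ArrayList;
import java.util.List;

public final class TaskSplitter {
    /**
     * Splits the to-be-processed tables into queues of at most
     * `granularity` tables; each queue becomes one subtask.
     */
    public static List<List<String>> split(List<String> tables, int granularity) {
        List<List<String>> queues = new ArrayList<>();
        for (int i = 0; i < tables.size(); i += granularity) {
            queues.add(new ArrayList<>(
                    tables.subList(i, Math.min(i + granularity, tables.size()))));
        }
        return queues;
    }
}
```

With a granularity of 10, for example, 96 tables yield ten subtasks: nine covering ten tables each and one covering six.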
Here, the distribution of each subtask's to-be-processed data tables in the database cluster refers to the location information those tables have in the multi-database, multi-table cluster. In practical applications, the location information may include data table names, database link addresses, and the like.
In actual application, before step 302 and step 303, the master server needs to receive the configuration information of the data in the database cluster and the task granularity configuration information. In practice, both are configured in advance: the data configuration information can be entered and stored by relevant personnel, such as developers, according to the actual layout of the data in the cluster, and the task granularity configuration information can be set and stored by developers based on experience.
Step 304: distributing the data table queues corresponding to the subtasks to the task execution servers.
After the task queue is initialized, the subtasks exist in the task queue in the form of a data table queue, i.e., a single table or multiple tables.
The task execution server may be a standby task execution server in a cluster of servers. In actual application, the task execution server may be a master server or a slave server.
In practical application, a process of allocating the data table queue corresponding to each subtask to each standby task execution server in the server cluster may be understood as a cluster server task distribution stage.
In some embodiments, a polling mechanism is used to allocate the data table queues corresponding to the subtasks to the task execution servers.
In actual application, in the cluster task distribution stage, the master server may use a polling mechanism to quasi-uniformly hash the data table queues corresponding to the subtasks onto the task execution servers.
The specific logic for performing quasi-uniform hashing using a polling mechanism is illustrated here:
example 1: supposing that x subtasks are arranged, y standby task execution servers are arranged, and x is less than or equal to y; the specific logic of quasi-uniform hashing is: respectively executing 1 subtask on x servers in y standby task execution servers; and 0 subtasks are respectively executed on the rest y-x servers in the y standby task execution servers.
Example 2: suppose there are x subtasks and y standby task execution servers, with x > y. Then y − x % y of the servers each execute ⌊x/y⌋ subtasks, and the remaining x % y servers each execute ⌊x/y⌋ + 1 subtasks. Here, x % y denotes the remainder of dividing x by y, and ⌊x/y⌋ the integer part of x/y.
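A minimal sketch of such a polling distribution (simple round-robin; the representation of servers and subtasks by indices is an assumption for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public final class RoundRobinDispatcher {
    /** Assigns subtask indices to server slots in round-robin order. */
    public static List<List<Integer>> dispatch(int subtaskCount, int serverCount) {
        List<List<Integer>> perServer = new ArrayList<>();
        for (int s = 0; s < serverCount; s++) perServer.add(new ArrayList<>());
        for (int t = 0; t < subtaskCount; t++) {
            perServer.get(t % serverCount).add(t); // polling: next server each time
        }
        return perServer;
    }

    public static void main(String[] args) {
        // x = 7 subtasks over y = 3 servers: x % y = 1 server gets floor(7/3) + 1 = 3
        // subtasks; the other y - x % y = 2 servers get floor(7/3) = 2 each.
        System.out.println(dispatch(7, 3)); // [[0, 3, 6], [1, 4], [2, 5]]
    }
}
```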
In practical applications, the polling mechanism may be implemented, for example, with an open-source framework based on Remote Procedure Call (RPC) technology.
In some embodiments, subtasks may also be assigned according to the processing performance of each task execution server. For example, servers with stronger processing performance may be assigned more subtasks according to empirical values, and servers with weaker performance fewer.
So far, the master server's part of the distributed task processing method is complete: the master server acquires the task to be processed; determines, according to configuration information of the data in the database cluster, at least one to-be-processed data table in the cluster corresponding to the task; divides the to-be-processed data tables into data table queues corresponding to a plurality of subtasks according to task granularity configuration information, each queue reflecting the distribution of its subtask's to-be-processed data tables in the cluster; and distributes the data table queues corresponding to the subtasks to the task execution servers. In this way, each distributable subtask obtained by dividing the task to be processed is distributed to a slave server.
An embodiment of the present invention provides a distributed task processing method, which is applied to a slave server, and fig. 5 is a schematic diagram of an implementation flow of the distributed task processing method according to the embodiment of the present invention. As shown in fig. 5, the method comprises the steps of:
step 501: receiving a subtask distributed by a main server;
In actual application, the slave server receives the subtasks distributed by the master server. Here, the number of subtasks one slave server receives from the master server may be 0 to N (N being a positive integer ≥ 1). That is, a slave server receives a set of subtasks assigned by the master server.
When the number of standby task execution servers is greater than the number of subtasks, some standby servers are assigned 0 subtasks; in this case, those slave servers receive no subtask from the master server and do not perform the subsequent subtask processing steps.
It should be noted that, after the master server completes the assignment of the subtasks, the role may be switched from the master server to the slave server. That is, the master server may be considered a slave server after the assignment of the subtasks is completed.
In actual application, after receiving a subtask request and before entering its specific execution, each slave server checks whether the subtask is already running, to prevent the same subtask from being executed repeatedly.
Based on this, in some embodiments, before the processing of the subtasks with the thread pool executor, the method further comprises:
according to a preset configuration rule, performing repeated operation condition check on the subtasks;
when determining that no repeatedly run subtasks exist, locking the subtasks;
and the processing of the subtasks with the thread pool executor includes:
and processing the locked subtasks by using the thread pool executor.
Here, the preset configuration rule may be any configuration rule capable of checking whether a subtask is repeatedly running, e.g., a rule on whether the tasks in a data table queue may re-enter while running.
In actual application, the slave server may check, according to the rule on whether tasks in the data table queue may re-enter while running, whether a subtask is already running anywhere in the current server cluster. When the check raises no re-entry warning, that is, no identical subtask is already running in the cluster, the slave server locks the subtask to be executed through a Remote Dictionary Server (REDIS) cache lock; once locked, the subtask is ready to enter the specific execution stage.
It should be noted that in the embodiment of the present invention the re-entry configuration rule is set to disallow re-entry. That is, under this rule each subtask may enter execution only once, and when more than one execution is attempted for the same subtask, a re-entry warning is issued.
Here, REDIS is an open-source, log-type, key-value database written in ANSI C; it is memory-based with support for persistence and network access, and provides Application Programming Interfaces (APIs) in multiple languages.
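A minimal sketch of acquiring such a cache lock, assuming the Jedis client for Redis; the key names and the expiry value are hypothetical choices, not taken from the patent:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public final class SubtaskLock {
    /**
     * Tries to lock a subtask before execution. Returns false when another
     * server already holds the lock, i.e. the subtask would be re-entered.
     */
    public static boolean tryLock(Jedis redis, String subtaskId, String serverId) {
        // SET key value NX EX: succeeds only if the key does not exist yet,
        // and expires automatically so a crashed holder cannot block forever.
        String reply = redis.set("task:lock:" + subtaskId, serverId,
                SetParams.setParams().nx().ex(600));
        return "OK".equals(reply);
    }

    /** Releases the lock once the subtask's result has been cached. */
    public static void unlock(Jedis redis, String subtaskId) {
        redis.del("task:lock:" + subtaskId);
    }
}
```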
Step 502: processing the subtasks by using a thread pool executor, and putting a processing result into a cache;
in some embodiments, the subtask includes a plurality of tasks, and the processing the subtask with the thread pool executor includes:
and multithreading the tasks by utilizing a thread pool executor, and monitoring the multithreading process.
In practical application, when the number of the subtasks allocated by the slave server to the master server is multiple, the process that the slave server processes and monitors the allocated subtasks by using the thread pool executor can be understood as a multithreading execution phase of the slave task set by the cluster server. After the slave server locks the subtask sets, the data tables to be processed are read out according to the distribution conditions of the data tables to be processed of the corresponding subtasks reflected by the data table queues corresponding to the subtask sets in the database cluster, in order to ensure efficiency and avoid causing excessive reading pressure of the database cluster, 200 to 1000 pieces of data are generally read out at one time and read out for multiple times until all the data covered in the operation range of the corresponding subtask are read out, and after the data are read out, specific operation corresponding to each piece of data is put into a thread pool executor capable of monitoring results in a Guava concurrency package for specific data processing, namely, multi-thread data processing is carried out.
The thread pool executor is created after the subtasks distributed by the master server are received. Executing the data processing with it both improves the efficiency of the per-row processing logic and allows the specific operation on each row to be monitored.
Here, Guava contains several core libraries, such as: collections, caching, primitives support, concurrency libraries, common annotations, string processing, Input/Output (I/O), and so on.
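A minimal sketch of a monitorable executor built from Guava's concurrency utilities (ListenableFuture); the pool size and the per-row handler are assumptions for illustration:

```java
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListeningExecutorService;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public final class MonitoredBatchProcessor {
    private final ListeningExecutorService pool =
            MoreExecutors.listeningDecorator(Executors.newFixedThreadPool(8));
    private final AtomicLong success = new AtomicLong();
    private final AtomicLong failure = new AtomicLong();

    /** Submits one operation per row and counts successes and failures. */
    public void process(List<String> rows) {
        for (String row : rows) {
            ListenableFuture<Boolean> result = pool.submit(() -> handle(row));
            Futures.addCallback(result, new FutureCallback<Boolean>() {
                @Override public void onSuccess(Boolean ok) { success.incrementAndGet(); }
                @Override public void onFailure(Throwable t) { failure.incrementAndGet(); }
            }, MoreExecutors.directExecutor());
        }
    }

    private boolean handle(String row) {
        // The business logic for one row would go here (placeholder).
        return true;
    }
}
```

The callbacks give exactly the monitoring the embodiment calls for: failed rows are counted rather than silently lost to the logs.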
It should be noted that when the number of subtasks allocated to a slave server is one, there is no multi-subtask situation; the thread pool executor may still be used for that subtask's data processing and process monitoring, so as to improve processing efficiency and monitor the processing.
When the logic executing in the slave server's thread pool executor completes, the data operation results of each independent thread are all returned through the monitored result set; after the last execution result returns, the overall processing result needs to be put into the cache.
In actual application, the cache may be the REDIS cache. The total number of execution results and the number of failures of the allocated subtask are temporarily written into the REDIS cache under the task id; after the write, the thread executing the task releases the cache lock corresponding to that task id, and the next execution task becomes eligible to enter. Here, the task id is the number of each subtask in the subtask set; each subtask is numbered so that its specific operations can easily be distinguished during the multi-threaded execution stage of the subtask set.
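A minimal sketch of this reporting step, again assuming the Jedis client and hypothetical key names (jobId identifies the overall task to be processed, subtaskId one subtask):

```java
import redis.clients.jedis.Jedis;

public final class ResultReporter {
    /**
     * Temporarily writes this subtask's totals into the REDIS cache, bumps
     * the job-wide completion counter, and releases the subtask's cache lock
     * so the next execution task may proceed.
     */
    public static void report(Jedis redis, String jobId, String subtaskId,
                              long resultTotal, long failureTotal) {
        redis.incrBy("job:" + jobId + ":total", resultTotal);
        redis.incrBy("job:" + jobId + ":failed", failureTotal);
        redis.incr("job:" + jobId + ":done");  // one more subtask finished
        redis.del("task:lock:" + subtaskId);   // release this subtask's lock
    }
}
```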
In practical application, each slave server processes its own subtasks according to the above steps and puts its own processing results into the cache. The slave servers process at different speeds; the one that finishes its assigned subtasks last in time is called the slave server that last completed its assigned subtasks.
Step 503: and reading a first parameter and a second parameter, summarizing each processing result in the cache when the first parameter and the second parameter are the same, and returning an execution result of the task to be processed.
Here, the first parameter and the second parameter are used to determine whether the slave server currently reading the data is the one that last completed its assigned subtasks. The first parameter may be the total number of subtasks into which the current task to be processed was divided; the second parameter may be the number of subtasks completed so far by all slave servers.
In practical application, a slave server may read the total number of subtasks and the number of completed subtasks from the cache and compare them; when the two are the same, the slave server currently reading the data is determined to be the one that last completed its assigned subtasks.
In practical application, after a slave server completes its subtasks and puts the results into the cache, it reads the total number of subtasks and the number of completed subtasks from the cache, and learns by comparing them whether it is the slave server that last completed its assigned subtasks. For example, if a slave server learns from the REDIS cache that the total number of subtasks is M and the number of completed subtasks is also M, it is the last-finishing slave server. Here, M is a positive integer ≥ 1.
Finally, the slave server that last completed its assigned subtasks aggregates the data of each subtask result in the cache to obtain the execution result of the task to be processed. The data involved in the summary includes: the execution result of the task, the total number of rows involved, the total number of successful executions, the total number of failed executions, and so on. The returned execution result may further include the specific Internet Protocol (IP) address of the master server, the IPs of the slave servers that executed subtasks, and the number of subtasks each of them executed, so that the task initiator can conveniently review the execution.
It should be noted that when a slave server finds the first and second parameters it reads are not the same, it is not the server that last completed its assigned subtasks; such a server does not need to summarize the processing results in the cache or return the execution result of the task to be processed. A sketch of the check and summary follows.
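A minimal sketch of the last-finisher check and the final summary, continuing the hypothetical key names used above:

```java
import redis.clients.jedis.Jedis;

public final class Summarizer {
    /**
     * Reads the first parameter (the total number of subtasks the job was
     * divided into) and the second (the number of subtasks completed so far).
     * Only on the last-finishing slave server, where the two are the same,
     * does it aggregate the cached results and return the execution result.
     */
    public static String summarizeIfLast(Jedis redis, String jobId, long subtaskTotal) {
        String done = redis.get("job:" + jobId + ":done");
        if (done == null || Long.parseLong(done) != subtaskTotal) {
            return null; // not the last finisher: nothing to summarize
        }
        long total = Long.parseLong(redis.get("job:" + jobId + ":total"));
        long failed = Long.parseLong(redis.get("job:" + jobId + ":failed"));
        return String.format("job %s: %d rows processed, %d succeeded, %d failed",
                jobId, total, total - failed, failed);
    }
}
```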
So far, the slave server's part of the distributed task processing method is complete: the slave server receives the subtasks distributed by the master server; processes them with a thread pool executor and puts the processing results into the cache; and reads the first and second parameters, and, when they are the same, summarizes the processing results in the cache and returns the execution result of the task to be processed. By combining cluster-level distributed execution with single-machine multi-threaded task execution, the embodiment of the invention greatly simplifies statistics and traversal operations over a database cluster and improves task execution efficiency, thereby reducing the workload developers spend outside of business logic.
In addition, the embodiment of the invention writes the execution result of every participating server into the cache, and the task finally reports a summary over the execution results of all the servers that took part, providing the most intuitive and effective statistics of the execution.
In order to implement the method of the embodiment of the present invention, the embodiment of the present invention further provides a main server in the distributed task processing system. Fig. 6 is a schematic structural diagram of a main server 600 in a distributed task processing system, as shown in fig. 6, the main server 600 includes:
an obtaining unit 601, configured to obtain a task to be processed;
a determining unit 602, configured to determine, according to configuration information of data in a database cluster, at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster;
a dividing unit 603, configured to divide the at least one to-be-processed data table into data table queues corresponding to multiple sub-tasks according to the task granularity configuration information; the data table queue can reflect the distribution condition of the data tables to be processed of the corresponding subtasks in the database cluster;
the distributing unit 604 is configured to distribute the data table queues corresponding to the respective subtasks to the respective task execution servers.
In some embodiments, the distributing unit 604 is specifically configured to: and distributing the data table queues corresponding to the subtasks to the task execution servers by using a polling mechanism.
In actual application, the obtaining unit 601, the determining unit 602, the dividing unit 603, and the distributing unit 604 may be implemented by a processor in a main server in the distributed task processing system.
It should be noted that: the division of the master server into the program modules above is only illustrative; in practical applications the processing may be distributed to different program modules as needed, i.e., the internal structure of the master server may be divided into different modules to complete all or part of the processing described above. In addition, the master server in the distributed task processing system of the foregoing embodiments and the distributed task processing method applied to the master server belong to the same concept; the specific implementation is detailed in the method embodiments and is not repeated here.
In order to implement the method of the embodiment of the present invention, an embodiment of the present invention further provides a slave server in a distributed task processing system. Fig. 7 is a schematic structural diagram of a slave server 700 in a distributed task processing system, as shown in fig. 7, the slave server 700 includes:
a receiving unit 701, configured to receive a subtask allocated by a main server;
a task processing unit 702, configured to process the subtasks by using a thread pool executor, and place a processing result in a cache;
the summary returning unit 703 is configured to read a first parameter and a second parameter, summarize each processing result in the cache when the first parameter is the same as the second parameter, and return an execution result of the to-be-processed task.
In some embodiments, the task processing unit 702 is specifically configured to:
and multithreading processing the tasks by utilizing a thread pool executor, and monitoring the multithreading processing process.
In some embodiments, the slave server 700 further comprises a checking unit for: according to a preset configuration rule, performing repeated operation condition check on the subtasks; when determining that no repeatedly run subtasks exist, locking the subtasks;
the task processing unit 702 is further configured to:
and processing the locked subtasks by using the thread pool executor.
In actual application, the receiving unit 701, the checking unit, the task processing unit 702, and the summary returning unit 703 may be implemented by processors in slave servers in the distributed task processing system.
It should be noted that: the division of the slave server into the program modules above is only illustrative; in practical applications the processing may be distributed to different program modules as needed, i.e., the internal structure of the slave server may be divided into different modules to complete all or part of the processing described above. In addition, the slave server in the distributed task processing system of the foregoing embodiment and the distributed task processing method applied to the slave server belong to the same concept; the specific implementation is detailed in the method embodiment and is not repeated here.
In an exemplary embodiment, an embodiment of the present invention further provides a distributed task processing system, where the system includes: the master server 600 and the slave server 700.
In practice, the system includes a master server 600 and at least one slave server 700.
Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present invention, an embodiment of the present invention further provides a server 800, where the server 800 includes:
a memory 801 for storing executable instructions;
the processor 802 is configured to, when executing the executable instructions stored in the memory, implement the distributed task processing method applied to the master server according to the embodiment of the present invention, or implement the distributed task processing method applied to the slave server according to the embodiment of the present invention.
In practice, as shown in FIG. 8, the components of the server 800 are coupled together by a bus system 803, which enables communication among them. In addition to a data bus, the bus system 803 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as the bus system 803 in FIG. 8.
An embodiment of the present invention further provides a storage medium, where the storage medium stores executable instructions, and when the executable instructions are executed by at least one processor, the distributed task processing method applied to a master server according to the embodiment of the present invention is implemented, or the distributed task processing method applied to a slave server according to the embodiment of the present invention is implemented.
In some embodiments, the storage medium may be a memory such as a ferroelectric random access memory (FRAM), a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read Only Memory (CD-ROM); or may be any device including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In addition, the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.

Claims (10)

1. A distributed task processing method applied to a main server is characterized by comprising the following steps:
acquiring a task to be processed;
determining at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster according to configuration information of data in the database cluster;
dividing the at least one data table to be processed into data table queues corresponding to a plurality of subtasks according to task granularity configuration information; the data table queues can reflect the distribution of the to-be-processed data tables of the corresponding subtasks in the database cluster;
and distributing the data table queues corresponding to the subtasks to the task execution servers.
2. The method according to claim 1, wherein the allocating the data table queue corresponding to each subtask to each task execution server comprises:
and distributing the data table queues corresponding to the subtasks to the task execution servers by using a polling mechanism.
3. A distributed task processing method applied to a slave server, the method comprising:
receiving a subtask distributed by a main server;
processing the subtasks by using a thread pool executor, and putting a processing result into a cache;
and reading a first parameter and a second parameter, summarizing each processing result in the cache when the first parameter and the second parameter are the same, and returning an execution result of the task to be processed.
4. The method of claim 3, wherein the subtasks include a plurality of tasks, and wherein processing the subtasks using a thread pool executor comprises:
and multithreading processing the tasks by utilizing a thread pool executor, and monitoring the multithreading processing process.
5. The method of claim 3, wherein prior to the processing of the subtasks by using the thread pool executor, the method further comprises:
according to a preset configuration rule, performing repeated operation condition check on the subtasks;
when determining that no repeatedly run subtasks exist, locking the subtasks;
and the processing of the subtasks by using the thread pool executor comprises:
and processing the locked subtasks by using the thread pool executor.
6. A primary server in a distributed task processing system, the primary server comprising:
the acquisition unit is used for acquiring the tasks to be processed;
the determining unit is used for determining at least one to-be-processed data table corresponding to the to-be-processed task in the database cluster according to the configuration information of the data in the database cluster;
the dividing unit is used for dividing the at least one data table to be processed into a plurality of data table queues corresponding to the subtasks according to the task granularity configuration information; the data table queue can reflect the distribution condition of the data tables to be processed of the corresponding subtasks in the database cluster;
and the distribution unit is used for distributing the data table queues corresponding to the subtasks to the task execution servers.
7. A slave server in a distributed task processing system, the slave server comprising:
the receiving unit is used for receiving the subtasks distributed by the main server;
the task processing unit is used for processing the subtasks by using the thread pool executor and putting processing results into a cache;
and the collection returning unit is used for reading the first parameter and the second parameter, collecting the processing results in the cache when the first parameter and the second parameter are the same, and returning the execution results of the tasks to be processed.
8. A distributed task processing system comprising the master server of claim 6 and the slave server of claim 7.
9. A server, comprising:
a memory for storing executable instructions;
a processor configured to implement the distributed task processing method of any one of claims 1 to 2, or the distributed task processing method of any one of claims 3 to 5, when executing the executable instructions stored in the memory.
10. A storage medium storing executable instructions which, when executed by at least one processor, implement the distributed task processing method of any one of claims 1 to 2 or the distributed task processing method of any one of claims 3 to 5.
CN201911085929.8A 2019-11-08 2019-11-08 Distributed task processing method, related device, system and storage medium Pending CN111061762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911085929.8A CN111061762A (en) 2019-11-08 2019-11-08 Distributed task processing method, related device, system and storage medium

Publications (1)

Publication Number Publication Date
CN111061762A true CN111061762A (en) 2020-04-24

Family

ID=70298502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911085929.8A Pending CN111061762A (en) 2019-11-08 2019-11-08 Distributed task processing method, related device, system and storage medium

Country Status (1)

Country Link
CN (1) CN111061762A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580098A (en) * 2020-12-23 2021-03-30 光大兴陇信托有限责任公司 Business data comparison method and system
CN112835945A (en) * 2021-02-25 2021-05-25 平安消费金融有限公司 User data-based label processing method, system, device and storage medium
CN113407429A (en) * 2021-06-23 2021-09-17 中国建设银行股份有限公司 Task processing method and device
CN113901141A (en) * 2021-10-11 2022-01-07 京信数据科技有限公司 Distributed data synchronization method and system
CN114553877A (en) * 2022-01-14 2022-05-27 天津天地伟业智能安全防范科技有限公司 Network distributed server and resource allocation method thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844018A * 2015-12-07 2017-06-13 阿里巴巴集团控股有限公司 Task processing method, apparatus and system
CN108364162A * 2018-01-31 2018-08-03 深圳市买买提信息科技有限公司 Task interface management method, system and terminal device
CN109086138A * 2018-08-07 2018-12-25 北京京东金融科技控股有限公司 Data processing method and system
CN109145051A * 2018-07-03 2019-01-04 阿里巴巴集团控股有限公司 Data summarization method and device for a distributed database, and electronic equipment
CN109308214A * 2017-07-27 2019-02-05 北京京东尚科信息技术有限公司 Data task processing method and system
CN109766349A * 2018-12-13 2019-05-17 平安普惠企业管理有限公司 Task deduplication method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
Sethi et al. Presto: SQL on everything
CN111061762A (en) Distributed task processing method, related device, system and storage medium
Jeon et al. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
US11275743B2 (en) System and method for analyzing data records
EP2608066B1 (en) Many core algorithms for in-memory column store databases
Gautam et al. A survey on job scheduling algorithms in big data processing
US7870556B2 (en) Managing computing resources in graph-based computations
US7984043B1 (en) System and method for distributed query processing using configuration-independent query plans
Qadah et al. QueCC: A queue-oriented, control-free concurrency architecture
US10261888B2 (en) Emulating an environment of a target database system
CN113297057A (en) Memory analysis method, device and system
CN114416849A (en) Data processing method and device, electronic equipment and storage medium
US20200356885A1 (en) Service management in a dbms
Shi et al. Performance models of data parallel DAG workflows for large scale data analytics
Almeida et al. Performance analysis and optimization techniques for oracle relational databases
Fino et al. RStream: Simple and efficient batch and stream processing at scale
Issa et al. Exploiting symbolic execution to accelerate deterministic databases
Poldner et al. Skeletons for divide and conquer algorithms
CN113608891A (en) Distributed batch processing system, method, computer device and storage medium
Son et al. Parallel Job Processing Technique for Real-time Big-Data Processing Framework
Soethout et al. Path-Sensitive Atomic Commit: Local Coordination Avoidance for Distributed Transactions
Retnowo Multithread to Accelerate Process Data Sync Using MapReduce Model Programming
Li et al. Prajna: Cloud Service and Interactive Big Data Analytics
Patra High-Performance Database Management System Design for Efficient Query Scheduling
Lopes SDD4 Streaming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant after: Jingdong Technology Holding Co.,Ltd.

Address before: Room 221, 2nd floor, Block C, 18 Kechuang 11th Street, Daxing Economic and Technological Development Zone, Beijing, 100176

Applicant before: JINGDONG DIGITAL TECHNOLOGY HOLDINGS Co.,Ltd.