CN118093147A - Massive data summarizing method and system based on task chain and divide-and-conquer method - Google Patents

Massive data summarizing method and system based on task chain and divide-and-conquer method

Info

Publication number
CN118093147A
CN118093147A
Authority
CN
China
Prior art keywords
task
subtask
node
batch
chain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410525952.9A
Other languages
Chinese (zh)
Inventor
耿晗琳
李珂
万文军
倪伟健
丁祎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Rural Commercial Digital Technology Co ltd
Original Assignee
Zhejiang Rural Commercial Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Rural Commercial Digital Technology Co ltd filed Critical Zhejiang Rural Commercial Digital Technology Co ltd
Priority to CN202410525952.9A priority Critical patent/CN118093147A/en
Publication of CN118093147A publication Critical patent/CN118093147A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a system for summarizing mass data based on a task chain and the divide-and-conquer method, belonging to the technical field of data summarization. The method comprises the following steps: a main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of the nodes are defined; the dispatching center system sets a timing task that, at a fixed settlement time point, publishes a message with a specific topic and tag to the batch application system through a message queue and starts the batch summary main task; after a node in the batch application system receives a message matching the topic and tag from the MQ, it begins executing the batch summary main task chain. By decomposing the tasks sensibly and scheduling them optimally, the execution efficiency of the whole business process is improved; with the task chain, computing resources can be better allocated and utilized, reducing unnecessary waiting and bottlenecks.

Description

Massive data summarizing method and system based on task chain and divide-and-conquer method
Technical Field
The invention provides a method and a system for summarizing mass data based on a task chain and a divide-and-conquer method, and belongs to the technical field of data summarization.
Background
As a tightly linked part of the cashless payment chain, the acquiring (order receiving) service has expanded rapidly with the growth of cashless payments. With the popularization of mobile payment on the consumer side, acquiring has become one of the essential services for today's merchants.
Settlement is one of the key links in the acquiring business: the acquiring bank takes the transaction data of the merchant, deducts the fee calculated according to the rate, pays the remainder to the merchant, and withholds a certain proportion of commission from the transaction. Settlement can be divided into real-time settlement and batch settlement according to the settlement cycle. In real-time settlement, the merchant settlement amount is paid to the merchant in real time, transaction by transaction. In batch settlement, at the end of the day, the merchant settlement amount, merchant commission settlement amount, UnionPay brand service fee and the like in the transaction data whose expected settlement date has been reached are summarized according to certain summary conditions, and a single payment is then made to each income party such as the merchant, the merchant's partner and the bank.
With the huge demand of merchants for acquiring services, the number of onboarded merchants and the daily transaction volume keep growing, and so does the data that must be summarized in the end-of-day batch. Taking the full-channel acquiring orders of a certain bank as an example, the daily transaction volume can reach 12 million, of which about 30% is batch settlement, for a total of roughly 7.2 million settlement details. By this calculation, the first working day after a holiday can require summarizing tens of millions of records. Faced with such massive data, the pressure on the applications and databases during end-of-day batch summarization is very great. To solve this problem, we devised a way to summarize massive amounts of data based on the task chain pattern and the divide-and-conquer method.
Disclosure of Invention
The invention provides a method and a system for summarizing mass data based on a task chain and the divide-and-conquer method, to solve the problems mentioned in the background art.
The invention provides a mass data summarization method based on a task chain and a divide-and-conquer method, which comprises the following steps:
A main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined;
the dispatching center system sets a timing task that, at a fixed settlement time point, publishes a message with a specific topic and tag to the batch application system through a message queue and starts the batch summary main task;
after a node in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain;
after each subtask finishes, it summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set.
Further, the pre-configuring the main task chain and the sub task chain in the batch task chain flow definition table, and defining each node and the execution sequence of each node includes:
Determining the basic constitution of a main task chain, identifying key steps involved in the whole batch summarization process, and configuring main task chain nodes in detail;
and analyzing the concrete execution flow of the subtasks, defining necessary steps contained in each subtask, and configuring the subtask chain nodes in detail.
Further, configuring the main task chain nodes in detail includes: adding a task lock, idempotency checking, cleaning data, subtask splitting, subtask distribution, polling subtask execution status, summarizing subtask data, and handling refunds that were still in progress before the current day.
Further, configuring the subtask chain nodes in detail includes: adding a task lock, subtask pre-checking, subtask data query, recording in-progress refunds, and summarizing subtask data.
Further, the dispatching center system sets a timing task that, at a fixed settlement time point each day, publishes a message with a specific topic and tag to the batch application system through the message queue and starts the batch summary main task; this includes:
Creating a timing task in the dispatching center system and designating the execution period of the task;
Integrating the timing task with a target message queue system, and setting message content, a topic (topic) and a tag (tag) which need to be sent when the task is executed;
Defining the content of the message body; selecting or creating a specific topic and using a tag with semantic meaning;
Writing a script or program code in the execution part of the timing task that, when the preset time point is reached, automatically calls the message queue API and publishes the configured message content to the preset topic;
Configuring message consumers in the batch application system, subscribing to the tag under the corresponding topic, and starting the batch summary main task upon receiving a message issued by the dispatching center;
and writing message processing logic in the batch application system, analyzing the triggered message after acquiring the message triggered by the timing task from the message queue, and initializing and starting a main task chain.
Further, after a node in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain; this comprises the following steps:
After the batch application system is started, the node instance establishes a stable network connection with the message queue service, subscribes to the pre-configured topic, and sets a filtering tag;
The node instance enters a monitoring state and waits for the MQ server to push messages matching the topic and the tag; if the message is received, the node instance reads the message and confirms that the message is successfully received;
Analyzing the received original message content, and extracting necessary parameters and context information required by executing batch summarizing main tasks;
According to a main task chain structure configured in a batch task chain flow definition table in advance, a node instance starts to initialize a task chain, and corresponding execution environments and parameters are distributed for each node; sequentially executing detailed configuration main task chain node operation according to a pre-defined node execution sequence;
In the process of executing the main task chain, the node instance continuously monitors the execution condition of each subtask, and after all the subtasks are executed, the processing results of each subtask are collected and data summarization is carried out;
in the execution process, if an abnormal situation occurs, the node instance captures it and performs secondary handling.
Further, after the execution of each subtask is completed, each subtask summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set; this comprises the following steps:
each subtask node completes the processing of each data set according to preset logic and stores the processing result;
Each subtask node actively reports the respective processing result to the main task node through a message queue after completing the task; the main task node receives the result data reported by each subtask, stores the result data and preprocesses the stored result data;
The main task node combines the result data of each subtask according to a preset combining strategy to form a complete data set;
and summarizing and counting the combined data sets, generating a final summarized report or data view, and outputting.
The invention provides a system for realizing the above mass data summarization method based on a task chain and the divide-and-conquer method, which comprises the following modules:
A batch processing module: a main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined;
A batch starting module: the dispatching center system sets a timing task, publishes a message with a specific topic and tag to the batch application system through a message queue at a fixed settlement time point, and starts the batch summary main task;
A batch summarizing module: after a node in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain;
An integration module: after each subtask finishes, it summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set.
The invention provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements the above mass data summarization method based on a task chain and the divide-and-conquer method.
The invention provides a non-transitory computer readable storage medium, on which a computer program is stored, the program being executed by a processor to implement a method for summarizing mass data based on a task chain and a divide-and-conquer method as described in any one of the above.
The invention has the following beneficial effects: by decomposing the tasks sensibly and scheduling them optimally, the execution efficiency of the whole business process is improved; with the task chain, computing resources can be better allocated and utilized, reducing unnecessary waiting and bottlenecks; the task chain is highly visualizable, which makes the business process more transparent and easier for team members to understand and cooperate on; solidifying the business process into a task chain helps standardize the process and automate part or even all of it; and every step of the business process is strictly controlled through the task chain, ensuring compliance with relevant regulations and internal control requirements and reducing business risk. Following the design idea of the divide-and-conquer method, a task that must process millions of transaction records is split into many subtasks that each process a small amount of data. At the same time, a main task/subtask processing mode with asynchronous multithreading makes full use of machine resources and improves processing performance. Because all data of the same merchant falls into the same group, the merchant's settlement amount needs to be summarized only once, by the subtask with the corresponding group number, which greatly reduces the cost on the batch application and the database. In an actual production environment, summarizing 20 million records takes about 47 minutes when a merchant's data is scattered across different groups, and about 23 minutes when each merchant's data is in the same group.
Drawings
FIG. 1 is a diagram of the steps of the method of the present invention;
FIG. 2 is a block diagram of a system according to the present invention;
FIG. 3 is a diagram of the main task chain and subtask chain of the method of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, and the described embodiments are merely some, rather than all, embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
In one embodiment of the present invention, as shown in FIG. 1, a method for summarizing mass data based on a task chain and the divide-and-conquer method comprises:
S1, a main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined (an illustrative sketch of such a definition follows these steps);
S2, the dispatching center system sets a timing task, publishes a message with a specific topic and tag to the batch application system through a Message Queue (MQ) at a fixed settlement time point every day, and starts the batch summary main task;
S3, after a node (i.e., a running instance) in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain; this means that the system starts to operate in the predefined task chain node order;
S4, after each subtask finishes, it summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set.
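The patent does not prescribe a concrete representation for the batch task chain flow definition table; as a minimal illustration only, the Java sketch below models each row as a (chain name, order, node name) entry, using the node names listed later in this description. All type, constant and chain names are hypothetical assumptions, not part of the disclosure.

```java
import java.util.List;

// Hypothetical illustration of a batch task chain flow definition.
// The node names follow the main/subtask nodes described in this patent;
// the Java types and chain names are assumptions for illustration only.
public final class TaskChainDefinitions {

    /** One row of the flow definition table: chain name, step order, node name. */
    public record NodeDefinition(String chainName, int order, String nodeName) {}

    /** Main task chain nodes, in execution order. */
    public static final List<NodeDefinition> MAIN_CHAIN = List.of(
            new NodeDefinition("BATCH_SUMMARY_MAIN", 1, "addTaskLock"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 2, "idempotencyCheck"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 3, "cleanDirtyData"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 4, "splitSubtasks"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 5, "distributeSubtasks"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 6, "pollSubtaskStatus"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 7, "summarizeSubtaskData"),
            new NodeDefinition("BATCH_SUMMARY_MAIN", 8, "handlePriorInProgressRefunds"));

    /** Subtask chain nodes, in execution order. */
    public static final List<NodeDefinition> SUB_CHAIN = List.of(
            new NodeDefinition("BATCH_SUMMARY_SUB", 1, "addTaskLock"),
            new NodeDefinition("BATCH_SUMMARY_SUB", 2, "preCheck"),
            new NodeDefinition("BATCH_SUMMARY_SUB", 3, "querySubtaskData"),
            new NodeDefinition("BATCH_SUMMARY_SUB", 4, "recordInProgressRefunds"),
            new NodeDefinition("BATCH_SUMMARY_SUB", 5, "summarizeSubtaskData"));
}
```

A chain executor would then simply take the rows of one chain in ascending order and invoke the handler registered for each node name.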
The working principle of the technical scheme is as follows: in the batch task chain flow definition table, a system administrator designs and configures the main task chain and the subtask chains in advance. The main task chain represents the core flow of the whole batch data summarization process, and the subtask chain represents the concrete execution units decomposed from the main task. The task content and execution sequence of each node of each task chain are defined, so that the system advances the data processing work step by step according to the preset logic. The dispatching center system, as the control hub of the overall process, presets a timing task that executes at a fixed time point (e.g., the settlement time) every day. When the set time point is reached, the dispatching center system sends a message in a specific format to the batch application system through the Message Queue (MQ); the message carries a designated topic and tag indicating its destination and type. Nodes running in the batch application system (i.e., running instances) monitor the MQ in real time for messages with that topic and tag. When a node receives the specific message issued by the dispatching center, it immediately starts executing the batch summary main task chain. Once started, the main task chain executes its nodes one by one in the predefined order; the work involved may include data loading, preprocessing, subtask splitting, subtask execution and other links. The main task uses a divide-and-conquer strategy to cut the massive data into many manageable small blocks and distributes the subtasks to different computing resources for parallel execution, improving processing efficiency. Each subtask independently processes the portion of data assigned to it and completes its own data summarization. After a subtask finishes, it returns its processed result data to the main task. After receiving the result data returned by all subtasks, the main task integrates it, splicing the processing results of all subtasks together into a complete and consistent data set. The integration process may include operations such as data cleaning, deduplication and verification, to ensure that the final data set is accurate.
The technical scheme has the following effects: configuring the main task chain and subtask chains in the batch task chain flow definition table in advance fixes each node and its execution order, giving the whole data summarization process a high degree of normalization and controllability, reducing the error rate, and improving the accuracy and consistency of business processing. By setting a timing task, the dispatching center system automatically triggers the batch summary task at a fixed settlement time point every day without manual intervention, automating the business process, saving labor cost, and guaranteeing the timeliness of business processing. With the task chain pattern and the divide-and-conquer method, the system splits massive data into many subtasks and processes them in parallel on different computing resources, greatly improving data processing efficiency and shortening the summarization time, which is especially suitable for scenarios with large data volumes. As traffic grows, larger-scale processing requirements can be met simply by adding more nodes (i.e., running instances) that listen to the message queue, reflecting good elastic scalability and horizontal expansion capability. Each subtask is relatively independent and the tasks communicate through the message queue, decoupling the data processing: if one subtask fails, the others are not affected and the failed one can be retried or repaired, enhancing the fault tolerance and stability of the system. After each subtask finishes, its processing results are summarized, and the main task integrates the subtask results into a complete data set, ensuring the comprehensiveness and integrity of the data summarization and providing strong support for subsequent data analysis and decision making.
In one embodiment of the present invention, as shown in fig. 3, the pre-configuring the main task chain and the sub task chain in the batch task chain flow definition table, and defining each node and the execution sequence of each node, includes:
Determining the basic constitution of a main task chain, identifying key steps involved in the whole batch summarization process, and configuring main task chain nodes in detail;
and analyzing the concrete execution flow of the subtasks, defining necessary steps contained in each subtask, and configuring the subtask chain nodes in detail.
The detailed configuration of the main task chain nodes comprises:
(1) Adding a task lock: a record for the task node is inserted into the batch task concurrency lock table, in which the task name and task number form a composite unique index, so that only one thread can execute the task at a time and concurrency problems are avoided.
(2) Idempotency: when the task is executed, a record of the execution state is inserted into the batch task execution record table; the state may be in progress, success or failure. If the task has already executed successfully on the same day, re-execution returns directly at this node. The idempotency node ensures that the task is successfully performed only once per day.
(3) Cleaning data: the node clears the dirty data, including summary table and refund temporary table data updated by the previous batch run, that was produced by a previously failed execution (not all subtasks succeeded, or the task was manually re-run). Cleaning the data ensures the correctness of the batch summary data.
(4) Subtask splitting: the node generates 10,000 groups. When a settlement detail is persisted to the database, the hash code of its merchant number is taken modulo 10,000, and the remainder is the group number of that settlement detail; the effect is that all detail data of the same merchant falls into the same group (a sketch of this grouping and the distribution appears after this list).
(5) Subtask distribution: the node generates 10,000 subtasks from the 10,000 groups produced by the subtask splitting node and distributes them through the MQ to other batch nodes for consumption. To avoid putting excessive pressure on the database through concurrent subtask queries, a delay is set during distribution, so that all subtasks finish in about 8 minutes while the database CPU stays below 10%.
(6) Polling the subtask execution state: after all subtasks have been distributed, the node polls the execution state of all subtasks every 10 seconds. Once all subtasks have been polled as successful, execution continues with the next node.
(7) Summarizing subtask data: because the data of the same merchant is in the same group, the merchant settlement amount only needs to be summarized once, by the subtask, which greatly reduces the cost on the batch application and the database. The commission of the same partner institution, however, needs a second summarization by the main task: before this second summarization, the commission summary records inserted during the subtasks' first summarization are deleted, and finally the commission settlement amount is summarized into one record that is inserted into the summary table. For the logic that updates the settlement details after summarization, see the subtask summary data node.
(8) Handling refunds in progress before the current day: the node processes refunds in the refund temporary table whose state before the batch date is still in progress, handling each one according to its refund state.
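The splitting and distribution in nodes (4) and (5) can be sketched as follows. The topic/tag message model used throughout this description matches what Apache RocketMQ offers, but the patent does not name a concrete product; the sketch therefore assumes the RocketMQ Java client, and the topic, tag, payload format and delay values are illustrative assumptions rather than the patented configuration.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

// Hypothetical sketch of subtask splitting and distribution; only the group
// count (10,000) comes from the description, everything else is assumed.
public class SubtaskSplitter {

    private static final int GROUP_COUNT = 10_000;

    /** Group number of a settlement detail: hash of the merchant number modulo 10,000,
     *  so that all details of one merchant land in the same group. */
    public static int groupOf(String merchantNo) {
        return Math.floorMod(merchantNo.hashCode(), GROUP_COUNT);
    }

    /** Distribute one subtask message per group through the MQ; a delay level is
     *  set so the subtasks do not all hit the database at the same instant. */
    public static void distribute(DefaultMQProducer producer, String batchDate) throws Exception {
        IntStream.range(0, GROUP_COUNT).forEach(groupNo -> {
            try {
                String body = batchDate + "|" + groupNo;                  // subtask payload
                Message msg = new Message("BATCH_SUMMARY_TOPIC", "SUB_TASK",
                        body.getBytes(StandardCharsets.UTF_8));
                msg.setDelayTimeLevel(1 + groupNo % 16);                  // spread load over time
                producer.send(msg);
            } catch (Exception e) {
                throw new RuntimeException("failed to distribute subtask " + groupNo, e);
            }
        });
    }
}
```

Staggering the delay levels is one simple way to stretch the 10,000 subtasks over several minutes instead of letting them query the settlement detail table simultaneously.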
The detailed configuration of the subtask chain nodes includes:
(1) Adding a task lock: the same task lock mechanism as in the main task chain.
(2) Subtask pre-check: the node checks whether the main task exists and is executing; if the main task is in any other state, execution of the subtask is skipped.
(3) Subtask data query: according to its own subtask number, the node queries the settlement detail table for the details in the corresponding group whose expected settlement date is less than or equal to the current batch date and which need to be summarized.
(4) Recording refunds in progress: during batch summarization, an in-progress refund is treated as a successful refund. The node records the in-progress refund in the refund temporary table, and the main task later handles it according to the final state recorded there.
(5) Subtask summary data: the summary dimension of each settlement type is configured in the settlement type configuration table; the table is loaded into a cache when the batch application starts, and the subtasks summarize the settlement amount of each type according to the configuration. When a new settlement type is added, only one row of configuration needs to be added; no new type-specific logic code is required. After a subtask has summarized, a summary table id is pre-generated with the snowflake algorithm and the summary record is inserted into the batch summary table. The batch summary table id also needs to be written back into the settlement details, with one summary id corresponding to multiple settlement details. Each subtask node updates with multiple concurrent threads; to avoid excessive database pressure, a delay is added and the database is updated in batches (a sketch of this node follows this list).
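The following is a compressed, non-authoritative sketch of the subtask summary node: details are grouped by a summary dimension, each group is reduced to one summary record whose id is pre-generated (e.g., by a snowflake generator, represented here only as a LongSupplier), and the write-back of summary ids into the details would then happen in small delayed batches. All type and field names are assumptions.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch of the "subtask summary data" node. SettlementDetail,
// SummaryRecord and the dimension key are illustrative; the real system reads
// the summary dimensions from the settlement type configuration table (cached at startup).
public class SubtaskSummarizer {

    public record SettlementDetail(String merchantNo, String settleType, BigDecimal amount) {
        /** Summary dimension; in the real system this comes from configuration per settleType. */
        String dimensionKey() { return merchantNo + "|" + settleType; }
    }

    public record SummaryRecord(long summaryId, String dimensionKey, BigDecimal total,
                                List<SettlementDetail> members) {}

    /** Aggregate details by dimension and pre-generate a summary id (e.g., snowflake). */
    public static List<SummaryRecord> summarize(List<SettlementDetail> details, LongSupplier idGen) {
        Map<String, List<SettlementDetail>> byDim = new LinkedHashMap<>();
        details.forEach(d -> byDim.computeIfAbsent(d.dimensionKey(), k -> new ArrayList<>()).add(d));

        List<SummaryRecord> records = new ArrayList<>();
        byDim.forEach((dim, group) -> {
            BigDecimal total = group.stream().map(SettlementDetail::amount)
                    .reduce(BigDecimal.ZERO, BigDecimal::add);
            records.add(new SummaryRecord(idGen.getAsLong(), dim, total, group));
        });
        // The summary records are then inserted into the batch summary table, and the
        // summary id is written back to each member detail in small delayed batches
        // to keep database pressure low.
        return records;
    }
}
```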
The working principle of the technical scheme is as follows: in the batch task chain flow definition table, the designer first determines the basic constitution of the main task chain, identifies the key steps involved in the batch summarization process, such as data preparation, task decomposition, subtask execution and result summarization, and converts these steps into concrete main task chain nodes. For the subtasks, the designer analyzes their execution flow in depth and defines the steps each subtask should have, such as subtask pre-checking (ensuring data integrity and correctness), subtask data query (acquiring the data fragment to be processed), subtask execution (summarizing the subtask's data), recording in-progress refunds (managing the special case of refund transactions) and subtask data summarization (collecting the subtask's internal processing results). After the main task chain is started, the task lock operation is executed first, ensuring that only one task instance runs at a time and preventing data races and confusion. In the clean data phase, the system clears invalid or no longer needed data in preparation for subsequent processing. The main task chain then performs subtask splitting, decomposing the massive data into many subtasks according to preset rules and distributing them to different computing resources for parallel execution to improve efficiency. While the subtasks are executing, the main task chain monitors the progress of each one by polling its execution state, ensuring that all subtasks complete successfully. After the subtasks finish, the main task chain is responsible for summarizing the result data of each subtask and integrating it into a complete summary result. The main task chain also gives special attention to handling refunds that were in progress before the current day, ensuring that refund transactions are handled correctly in the summary. A task locking mechanism is provided in the nodes of both the main task chain and the subtask chains to ensure the consistency of data processing in a distributed environment, and the idempotent design ensures that even if a task is repeatedly executed for some reason, the accuracy of the final summary result is not affected.
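The task lock and idempotency behavior just described can be illustrated with the following minimal sketch. The table and column names (batch_task_lock, batch_task_exec_record, etc.) are hypothetical; the only elements taken from the description are that the task name and task number form a composite unique index on the concurrency lock table and that a daily execution record guards against re-running a task that already succeeded.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

// Hypothetical sketch: a task lock via a unique-index insert, and an idempotency
// check against the batch task execution record table. Table/column names are assumed.
public class TaskGuards {

    /** Try to take the task lock; the composite unique index on (task_name, task_no)
     *  guarantees only one thread acquires it. Returns false if already locked. */
    public static boolean tryLock(Connection conn, String taskName, String taskNo) throws SQLException {
        String sql = "INSERT INTO batch_task_lock(task_name, task_no) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, taskName);
            ps.setString(2, taskNo);
            ps.executeUpdate();
            return true;
        } catch (SQLIntegrityConstraintViolationException duplicateKey) {
            return false;                      // another thread already holds the lock
        }
    }

    /** Idempotency: if the task already succeeded for this batch date, skip re-execution. */
    public static boolean alreadySucceeded(Connection conn, String taskName, String batchDate)
            throws SQLException {
        String sql = "SELECT COUNT(*) FROM batch_task_exec_record "
                   + "WHERE task_name = ? AND batch_date = ? AND status = 'SUCCESS'";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, taskName);
            ps.setString(2, batchDate);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() && rs.getInt(1) > 0;
            }
        }
    }
}
```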
The technical scheme has the following effects: configuring the main task chain and subtask chains in advance in the batch task chain flow definition table, and defining each node and its execution order, gives the whole data summarization process a clear structure and an orderly flow, which helps reduce the error rate and improve both working efficiency and data processing quality. The main task chain is responsible for splitting and distributing subtasks, allowing the system to process them in parallel across multiple servers or threads, which greatly improves data processing speed, fully exploits the available computing resources, and improves the throughput and response speed of the system. The task lock and idempotency mechanisms configured on the main task chain nodes guarantee the consistency and accuracy of data processing in a highly concurrent environment, avoiding data conflicts and duplicate processing and ensuring the accuracy of the summary result. The subtask splitting and distribution mechanisms split massive data into manageable subsets dispersed to different computing resources, which is critical for processing large-scale data, effectively relieves single-point pressure, and improves the stability and scalability of the whole system. By polling the subtask execution states, the main task keeps track of task progress in real time, schedules resources flexibly, and effectively controls the task execution process, which helps problems to be found and handled quickly and ensures that the task completes efficiently. The pre-check, data query, refund recording and subtask summary nodes of the subtask chain ensure integrity and accuracy throughout data processing while also covering special business scenarios such as refunds, so that the final summary accurately and comprehensively reflects the actual business situation. Through this refined task decomposition and node configuration, resource utilization is improved, and, thanks to the nature of the task chain, the failure of a single subtask does not affect the execution of the whole task: it can be re-executed or recovered as needed, improving the fault tolerance and self-repair capability of the system.
In one embodiment of the present invention, the dispatching center system sets a timing task, publishes a message with a specific topic and tag to the batch application system through a message queue at a fixed settlement time point every day, and starts the batch summary main task, including:
creating a timing task in the dispatching center system and designating its execution period; the execution period is a fixed time point each day, such as midnight.
Integrating the timing task with the target message queue system, and setting the message content, topic and tag to be sent when the task executes, ensuring that the instruction can be correctly sent to the message queue when the task is triggered.
Defining the content of the message body; the message content comprises a task type identifier, business parameters and a start timestamp, so that the receiver can accurately start the batch summary main task based on this information. Selecting or creating a specific topic to carry the batch-summary-related messages, and using a tag with semantic meaning, to make it easy for the consumer side to filter and process this particular type of summary task.
Writing a script or program code in the execution part of the timing task that, when the preset time point is reached, automatically calls the message queue API and publishes the configured message content to the preset topic (see the sketch after this list);
Configuring message consumers in the batch application system, subscribing to the tag under the corresponding topic, and starting the batch summary main task upon receiving a message issued by the dispatching center;
and writing message processing logic in the batch application system that, after obtaining the message triggered by the timing task from the message queue, parses the message and initializes and starts the main task chain.
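A minimal sketch of the dispatch-center side is shown below. The topic/tag model in the description matches Apache RocketMQ, but the patent names no specific product, so the RocketMQ Java client, the JSON body layout, the group/topic/tag names and the name server address are all assumptions; a plain ScheduledExecutorService stands in for the dispatching center's own scheduler.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

// Hypothetical sketch of the dispatch-center side: a timed task fires at the fixed
// settlement time each day and publishes a start message with a specific topic and tag.
public class BatchSummaryTrigger {

    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("DISPATCH_CENTER_GROUP");
        producer.setNamesrvAddr("127.0.0.1:9876");   // assumed name server address
        producer.start();

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = Duration.between(LocalDateTime.now(),
                LocalDate.now().plusDays(1).atStartOfDay()).toSeconds();   // next midnight

        // In production the period would come from the dispatch center's scheduler;
        // here a plain ScheduledExecutorService stands in for it.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                String body = "{\"taskType\":\"BATCH_SUMMARY\",\"batchDate\":\""
                        + LocalDate.now() + "\",\"startTs\":" + System.currentTimeMillis() + "}";
                Message msg = new Message("BATCH_SUMMARY_TOPIC", "MAIN_TASK",
                        body.getBytes(StandardCharsets.UTF_8));
                producer.send(msg);               // published to the pre-agreed topic/tag
            } catch (Exception e) {
                e.printStackTrace();              // real code would alert and retry
            }
        }, initialDelay, TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
    }
}
```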
The working principle of the technical scheme is as follows: a timing task is set up in the dispatching center system and triggers automatically at a fixed settlement time point every day (for example, midnight), so that the system performs the batch data summarization task on time without manual intervention. The timing task is integrated with the message queue system; when it triggers, a message is sent to the batch application system through the message queue, carrying a specific topic and tag so that the target system can identify and handle it. The message body is designed to carry the key information, including the task type identifier, business parameters and start timestamp, which is enough for the batch application system to understand and execute the corresponding batch summarization task accurately. When the preset time point arrives, the dispatching center system calls the message queue API to publish the configured message content to the preset topic and waits for the batch application system to consume it. The batch application system configures message consumers in advance to subscribe specifically to the topic and tag associated with the batch summary task; upon receiving a message issued by the dispatching center, a consumer triggers the corresponding processing logic. In the batch application system, the message processing logic is written as script or program code: when a message triggered by the timing task is received, the system parses it and, based on its content, initializes and starts the main task chain. The main task chain then executes its nodes in the pre-configured order, including but not limited to locking, idempotency checking, data cleaning, subtask splitting, subtask distribution, subtask status monitoring, subtask data summarization and refund processing.
The technical scheme has the following effects: by setting a timing task in the dispatching center system, the batch summary main task is triggered automatically at a fixed time point every day (for example, midnight), which greatly reduces manual intervention, raises the automation level of business processing, and guarantees the timeliness and accuracy of the data summarization work. Using message queues for inter-system communication decouples the dispatching center system and the batch application system asynchronously; they do not need to operate synchronously and can each focus on their own core functions, which improves response speed, reduces inter-system dependencies, and enhances stability and scalability. Running the batch task in an off-peak period (such as the early morning) avoids contending for system resources at peak times and makes full use of idle resources to process large amounts of data, improving server resource utilization. Including the task type identifier, business parameters and start timestamp in the message body allows the batch application system to identify and execute the corresponding task accurately, ensuring the correctness of the data summarization task. Using a specific topic and tag to distinguish different kinds of batch summarization tasks makes it easy for the consumer side to filter and handle them according to business needs and makes it easy to add more kinds of batch tasks in the future. The message queue itself provides message persistence and retry mechanisms, so even if the batch application system fails while receiving or processing a message, the message can still be delivered reliably and the task will eventually execute, enhancing the fault tolerance and robustness of the system.
In one embodiment of the present invention, after a node in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain; this comprises the following steps:
After the batch application system is started, the node instance establishes a stable network connection with the message queue service, subscribes to the pre-configured topic, and sets the filtering tag (see the consumer sketch after this list);
The node instance enters a monitoring state and waits for the MQ server to push messages matching the topic and the tag; if the message is received, the node instance reads the message and confirms that the message is successfully received;
Analyzing the received original message content, and extracting necessary parameters and context information required by executing batch summarizing main tasks;
According to a main task chain structure configured in a batch task chain flow definition table in advance, a node instance starts to initialize a task chain, and corresponding execution environments and parameters are distributed for each node; sequentially executing detailed configuration main task chain node operation according to a pre-defined node execution sequence;
In the process of executing the main task chain, the node instance continuously monitors the execution condition of each subtask, and after all the subtasks are executed, the processing results of each subtask are collected and data summarization is carried out;
In the execution process, if an abnormal situation occurs, the node instance captures it and performs secondary handling.
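A minimal sketch of such a listening node instance is shown below, again assuming the Apache RocketMQ push consumer; the group, topic and tag names and the name server address are illustrative assumptions, and the call that would start the main task chain is represented by a placeholder print statement.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyContext;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

// Hypothetical sketch of a batch application node instance listening for the
// start message. Topic, tag and group names are assumptions for illustration.
public class BatchSummaryListener {

    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("BATCH_APP_GROUP");
        consumer.setNamesrvAddr("127.0.0.1:9876");                // assumed name server address
        consumer.subscribe("BATCH_SUMMARY_TOPIC", "MAIN_TASK");   // topic with tag filter

        consumer.registerMessageListener((MessageListenerConcurrently) (List<MessageExt> msgs,
                ConsumeConcurrentlyContext ctx) -> {
            for (MessageExt msg : msgs) {
                String body = new String(msg.getBody(), StandardCharsets.UTF_8);
                // Parse parameters/context and start the pre-configured main task chain;
                // the print statement is a placeholder for the application's own entry point.
                System.out.println("starting batch summary main task chain for: " + body);
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;     // acknowledge receipt
        });

        consumer.start();
    }
}
```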
The working principle of the technical scheme is as follows: after the batch application system starts, a node instance establishes a network connection with the message queue service, subscribes to the pre-configured topic, and filters by tag so that it only receives notifications related to the batch summary main task. The node instance enters a listening state and waits for the MQ server to push messages matching the topic and tag. When a message arrives, the node instance reads it and acknowledges successful receipt, which also prevents repeated delivery of the message. The node instance parses the received message content and extracts the parameters and context information needed to execute the batch summary main task. According to the configuration in the batch task chain flow definition table, the node instance initializes the main task chain, allocates an execution environment and parameters to each node, and executes the nodes one by one in the predefined order. While the main task chain executes, the node instance continuously monitors the execution state of each subtask, for example by polling the subtask execution states to follow the progress. After the subtasks have finished, the node instance collects their processing results and performs the data summarization. If an abnormal situation occurs during execution, such as a node failing or timing out, the node instance captures it and takes corresponding secondary action, for example attempting to recover the task or resubmitting it to the message queue to wait for re-execution, ensuring that the task is eventually completed and that data processing is reliable.
The technical scheme has the following effects: triggering tasks asynchronously through the message queue decouples the batch application system from the dispatching center system, making the architecture more loosely coupled and easier to extend and maintain; at the same time, node instances process subtasks in parallel, realizing distributed computation and greatly improving data processing efficiency and the concurrency of the system. The dispatching center system sets a timing task to publish messages automatically, and the batch application nodes listen for and respond to them, so the batch summary main task chain is started automatically without manual intervention, lowering operating cost and improving the timeliness and accuracy of business processing. The node instance subscribes to a specific topic and sets a filtering tag, ensuring that only messages related to the batch summary main task are processed, avoiding interference from irrelevant messages, and improving the focus and efficiency of message processing. The task chain pattern gives the whole data processing procedure a clear execution order and logic; node instances execute each node of the task chain in order, which makes monitoring and management easy, and the handling of abnormal conditions (such as recovering or retrying after a node fails or times out) guarantees the stability and success rate of task execution. Executing subtasks in parallel across node instances uses system resources effectively, achieves load balancing, prevents any single node from becoming a bottleneck, and improves the overall processing capacity and availability of the system. The message acknowledgment mechanism prevents repeated message delivery and keeps data processing consistent; in addition, abnormal conditions during node execution can be captured and handled in time, improving the fault tolerance and self-healing capability of the system and safeguarding business continuity and data security.
In one embodiment of the present invention, after the execution of each subtask is completed, each subtask summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set, including:
Each subtask node completes the processing of its data set according to preset logic, including data cleaning, statistical calculation and verification, and stores the processing results, including intermediate results, summary statistics and verification reports;
After completing its task, each subtask node actively reports its processing result to the main task node through the message queue; the main task node receives the result data reported by each subtask, stores it and preprocesses it, where preprocessing includes removing duplicate values and consistency verification;
The main task node merges the result data of the subtasks according to a preset merging strategy, for example based on fields such as the timestamp and the transaction ID, to form a complete data set (see the merging sketch after this list);
and the merged data set is summarized and counted, for example accumulating the total transaction amount, the average transaction amount and various other indicators, and a final summary report or data view is generated and output.
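The following is a minimal sketch of the merge-and-summarize step: subtask results are combined into one data set, de-duplicated by transaction ID, and reduced to a few aggregate figures. The record types, field names and the "later report wins" de-duplication rule are illustrative assumptions, not the patented merging strategy.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the main task merging subtask results into a complete
// data set and computing summary statistics. Types and fields are assumptions.
public class ResultMerger {

    public record SubtaskResult(int groupNo, List<Txn> txns) {}
    public record Txn(String txnId, long timestamp, BigDecimal amount) {}

    /** Merge subtask results, removing duplicate transaction IDs (later reports win). */
    public static Map<String, Txn> merge(List<SubtaskResult> results) {
        Map<String, Txn> merged = new LinkedHashMap<>();
        results.stream()
               .flatMap(r -> r.txns().stream())
               .forEach(t -> merged.put(t.txnId(), t));   // de-duplicate by transaction ID
        return merged;
    }

    /** Summary statistics over the merged data set: count, total and average amount. */
    public static String summarize(Map<String, Txn> merged) {
        BigDecimal total = merged.values().stream()
                .map(Txn::amount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
        BigDecimal avg = merged.isEmpty() ? BigDecimal.ZERO
                : total.divide(BigDecimal.valueOf(merged.size()), 2, RoundingMode.HALF_UP);
        return "count=" + merged.size() + ", total=" + total + ", avg=" + avg;
    }
}
```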
The working principle of the technical scheme is as follows: each subtask node obtains the data it must process from the message queue and processes it independently according to preset logic; this stage can cover data cleaning (removing invalid, erroneous or redundant data), statistical calculation (such as sums, averages, maxima and minima) and data verification (ensuring the data satisfies business rules and constraints). After a subtask finishes, its processing results (possibly including intermediate forms of the original data, the computed summary statistics and a report of the verification results) are persisted and, at the same time, actively reported to the main task node through the message queue. The main task node acts as the summarizing layer: it collects the result data reported by all subtask nodes and performs preliminary preprocessing, including but not limited to deduplication (removing identical records to keep the data unique) and consistency checking (ensuring that data processed by different subtasks remains consistent on the key fields). The main task node then integrates the result data from the different subtask nodes according to the preset merging strategy, for example based on key identifying fields such as the timestamp and the transaction ID, constructing a complete and conflict-free data set. Once the data set is complete, the main task node performs higher-level summary statistical analysis over it, such as calculating the total transaction amount, the average transaction amount and other business indicators, forming a statistical report or visual data view that comprehensively reflects the operation of the system. Finally, the summarized statistical results are output to the relevant business systems or decision makers for decision support, business monitoring, report display and other purposes, so that large amounts of data are processed and analyzed efficiently, accurately and in a timely manner.
The technical scheme has the following effects: splitting the whole task into many subtasks makes full use of the advantages of distributed computing and parallel processing, significantly increasing the data processing speed and shortening the time needed for the whole flow. Subtasks can run independently on different computing resources according to the characteristics and processing requirements of their data sets, so computing resources are allocated more reasonably and no single node is overloaded. Each subtask includes data cleaning, statistical calculation and verification, which guarantees the quality of each portion of the data and reduces the chance of later errors or misleading conclusions caused by data quality problems. The main task node, by receiving the processing results of the subtask nodes and performing consistency verification, keeps the data consistent and prevents conflicts or inconsistencies. Because results are transmitted asynchronously through the message queue, if a subtask fails or is delayed, the main task node can reschedule it without affecting the other subtasks, improving the robustness and scalability of the system. Integrating the subtask results with a preset merging strategy (for example, based on the timestamp and the transaction ID) accurately assembles a complete business data set and avoids data fragmentation and redundancy. Deep summary statistics over the merged data set extract more valuable information from the global view, such as key indicators like the total transaction amount and the average transaction amount, providing detailed data support for decision making. The final summary report or data view is easy for users to understand and use intuitively, helping them make business decisions quickly and gain insight into business trends.
In one embodiment of the present invention, as shown in fig. 2, a system for implementing the method for summarizing mass data based on task chains and divide-and-conquer method includes:
A batch processing module: a main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined;
A batch starting module: the dispatching center system sets a timing task, publishes a message with a specific topic and tag to the batch application system through a Message Queue (MQ) at a fixed settlement time point, and starts the batch summary main task;
A batch summarizing module: after a node (i.e., a running instance) in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain;
An integration module: after each subtask finishes, it summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set.
The working principle of the technical scheme is as follows: in the batch task chain flow definition table, a system administrator designs and configures the main task chain and the subtask chains in advance. The main task chain represents the core flow of the whole batch data summarization process, and the subtask chain represents the concrete execution units decomposed from the main task. The task content and execution sequence of each node of each task chain are defined, so that the system advances the data processing work step by step according to the preset logic. The dispatching center system, as the control hub of the overall process, presets a timing task that executes at a fixed time point (e.g., the settlement time) every day. When the set time point is reached, the dispatching center system sends a message in a specific format to the batch application system through the Message Queue (MQ); the message carries a designated topic and tag indicating its destination and type. Nodes running in the batch application system (i.e., running instances) monitor the MQ in real time for messages with that topic and tag. When a node receives the specific message issued by the dispatching center, it immediately starts executing the batch summary main task chain. Once started, the main task chain executes its nodes one by one in the predefined order; the work involved may include data loading, preprocessing, subtask splitting, subtask execution and other links. The main task uses a divide-and-conquer strategy to cut the massive data into many manageable small blocks and distributes the subtasks to different computing resources for parallel execution, improving processing efficiency. Each subtask independently processes the portion of data assigned to it and completes its own data summarization. After a subtask finishes, it returns its processed result data to the main task. After receiving the result data returned by all subtasks, the main task integrates it, splicing the processing results of all subtasks together into a complete and consistent data set. The integration process may include operations such as data cleaning, deduplication and verification, to ensure that the final data set is accurate.
The technical scheme has the following effects: configuring the main task chain and subtask chains in the batch task chain flow definition table in advance fixes each node and its execution order, giving the whole data summarization process a high degree of normalization and controllability, reducing the error rate, and improving the accuracy and consistency of business processing. By setting a timing task, the dispatching center system automatically triggers the batch summary task at a fixed settlement time point every day without manual intervention, automating the business process, saving labor cost, and guaranteeing the timeliness of business processing. With the task chain pattern and the divide-and-conquer method, the system splits massive data into many subtasks and processes them in parallel on different computing resources, greatly improving data processing efficiency and shortening the summarization time, which is especially suitable for scenarios with large data volumes. As traffic grows, larger-scale processing requirements can be met simply by adding more nodes (i.e., running instances) that listen to the message queue, reflecting good elastic scalability and horizontal expansion capability. Each subtask is relatively independent and the tasks communicate through the message queue, decoupling the data processing: if one subtask fails, the others are not affected and the failed one can be retried or repaired, enhancing the fault tolerance and stability of the system. After each subtask finishes, its processing results are summarized, and the main task integrates the subtask results into a complete data set, ensuring the comprehensiveness and integrity of the data summarization and providing strong support for subsequent data analysis and decision making.
An embodiment of the invention provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor; when the processor executes the program, it implements the above mass data summarization method based on a task chain and the divide-and-conquer method.
In one embodiment of the present invention, a non-transitory computer readable storage medium has stored thereon a computer program that is executed by a processor to implement a task chain and divide-and-conquer method-based massive data summarization method as described in any one of the above.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A mass data summarization method based on a task chain and a divide-and-conquer method is characterized by comprising the following steps:
A main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined;
the dispatching center system sets a timing task, publishes a message with a specific topic and tag to the batch application system through a message queue at a fixed settlement time point, and starts the batch summary main task;
after a node in the batch application system receives a message matching the topic and tag from the MQ, it starts executing the batch summary main task chain;
after each subtask finishes, it summarizes the result data it has processed, and the main task integrates the subtask results to form a complete data set.
2. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 1, wherein the pre-configuring the main task chain and the sub task chain in the batch task chain flow definition table and defining each node and the execution sequence of each node comprises:
Determining the basic constitution of a main task chain, identifying key steps involved in the whole batch summarization process, and configuring main task chain nodes in detail;
and analyzing the concrete execution flow of the subtasks, defining necessary steps contained in each subtask, and configuring the subtask chain nodes in detail.
3. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 2, wherein the configuring the main task chain node in detail comprises: adding a task lock, idempotent checking, cleaning data, subtask splitting, subtask distribution, polling subtask execution status, summarizing subtask data, and processing the current day's pending refunds.
4. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 2, wherein the detailed configuration of the subtask chain nodes comprises: adding a task lock, pre-task checking, subtask data query, and recording in-progress refunds and subtask summary data.
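As a concrete illustration of claims 2 to 4, the following sketch shows one possible in-memory shape of the batch task chain flow definition table, with the main task chain and subtask chain nodes listed in execution order. The chain codes, node codes and the ChainNode record are assumptions introduced for illustration; in practice such a definition would typically live in a database table.

```java
import java.util.List;

/**
 * One possible in-memory shape of the batch task chain flow definition table
 * (claims 2-4). Chain codes and node codes are illustrative assumptions.
 */
public class TaskChainDefinition {

    record ChainNode(String chainCode, int order, String nodeCode) {}

    // Main task chain nodes in their configured execution order (claim 3).
    static final List<ChainNode> MAIN_CHAIN = List.of(
            new ChainNode("MAIN_SUMMARY", 1, "ADD_TASK_LOCK"),
            new ChainNode("MAIN_SUMMARY", 2, "IDEMPOTENT_CHECK"),
            new ChainNode("MAIN_SUMMARY", 3, "CLEAN_DATA"),
            new ChainNode("MAIN_SUMMARY", 4, "SPLIT_SUBTASKS"),
            new ChainNode("MAIN_SUMMARY", 5, "DISPATCH_SUBTASKS"),
            new ChainNode("MAIN_SUMMARY", 6, "POLL_SUBTASK_STATUS"),
            new ChainNode("MAIN_SUMMARY", 7, "SUMMARIZE_SUBTASK_DATA"),
            new ChainNode("MAIN_SUMMARY", 8, "HANDLE_PENDING_REFUNDS"));

    // Subtask chain nodes in their configured execution order (claim 4).
    static final List<ChainNode> SUB_CHAIN = List.of(
            new ChainNode("SUB_SUMMARY", 1, "ADD_TASK_LOCK"),
            new ChainNode("SUB_SUMMARY", 2, "PRE_TASK_CHECK"),
            new ChainNode("SUB_SUMMARY", 3, "QUERY_SUBTASK_DATA"),
            new ChainNode("SUB_SUMMARY", 4, "RECORD_REFUND_IN_PROGRESS"),
            new ChainNode("SUB_SUMMARY", 5, "RECORD_SUBTASK_SUMMARY"));
}
```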
5. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 1, wherein the dispatching center system sets timing tasks, issues a message of specific topic and tag to the batch application system through the message queue at a fixed settlement time point every day, and starts batch summarizing main tasks, comprising:
Creating a timing task in the dispatching center system and designating the execution period of the timing task;
integrating the timing task with a target message queue system, and setting message content, topic and tag to be sent when the task is executed;
Defining the content of the message body; selecting or creating a specific topic, and using a tag with semantic meaning;
Writing a script or a program code in an actual execution action part of the timing task, automatically calling an API (application program interface) of a message queue when a preset time point is reached, and releasing the configured message content to a preset topic;
Configuring message consumers in the batch application system, subscribing to the tag under the corresponding topic, and starting the batch summarizing main task when a message issued by the dispatching center is received;
and writing message processing logic in the batch application system, analyzing the triggered message after acquiring the message triggered by the timing task from the message queue, and initializing and starting a main task chain.
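A minimal sketch of the timed trigger in claim 5, assuming a RocketMQ-style message queue with topic and tag filtering. The topic name, tag, producer group, name-server address and the 01:00 settlement time are illustrative assumptions; a production dispatching center would normally rely on a dedicated job scheduler rather than a bare ScheduledExecutorService.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

/** Dispatching-center side: fire the batch summary trigger at the daily settlement time. */
public class SettlementTrigger {

    // Illustrative names: topic and tag are assumptions, not fixed by the patent.
    static final String TOPIC = "BATCH_SETTLE_TOPIC";
    static final String TAG   = "DAILY_SUMMARY";

    public static void main(String[] args) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("settle_trigger_group");
        producer.setNamesrvAddr("127.0.0.1:9876");
        producer.start();

        // Assumed fixed settlement point: 01:00 every day.
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long initialDelay = Duration.between(
                LocalDateTime.now(),
                LocalDateTime.now().toLocalDate().plusDays(1).atTime(LocalTime.of(1, 0))
        ).toSeconds();

        timer.scheduleAtFixedRate(() -> {
            try {
                // Message body carries the context needed to start the main task chain.
                String body = "{\"bizDate\":\"" + java.time.LocalDate.now() + "\"}";
                Message msg = new Message(TOPIC, TAG, body.getBytes(StandardCharsets.UTF_8));
                producer.send(msg); // consumed by the batch application system
            } catch (Exception e) {
                e.printStackTrace(); // in practice: alert the dispatching center
            }
        }, initialDelay, TimeUnit.DAYS.toSeconds(1), TimeUnit.SECONDS);
    }
}
```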
6. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 1, wherein, after a node in the batch application system monitors a message corresponding to the topic and the tag in the MQ, starting to execute the batch summarizing main task chain comprises the following steps:
After the batch application system is started, the node instance and the message queue service establish stable network connection, subscribe a pre-configured topic, and set a filtering tag;
The node instance enters a monitoring state and waits for the MQ server to push messages matching the topic and the tag; if the message is received, the node instance reads the message and confirms that the message is successfully received;
Analyzing the received original message content, and extracting necessary parameters and context information required by executing batch summarizing main tasks;
According to the main task chain structure configured in advance in the batch task chain flow definition table, the node instance starts to initialize the task chain and allocates a corresponding execution environment and parameters for each node; the main task chain node operations configured in detail are then executed sequentially according to the pre-defined node execution sequence;
In the process of executing the main task chain, the node instance continuously monitors the execution condition of each subtask, and after all the subtasks are executed, the processing results of each subtask are collected and data summarization is carried out;
In the executing process, if an abnormal condition occurs, the node instance captures the exception and performs the corresponding handling operations.
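The listening side of claim 6 can be sketched as follows, again assuming a RocketMQ-style consumer: the node instance subscribes to the pre-configured topic, filters on the tag, parses the trigger message and hands it to the main task chain; on an exception the message is redelivered so the node can retry. The group, topic and tag names and the runMainTaskChain placeholder are assumptions.

```java
import java.nio.charset.StandardCharsets;

import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.common.message.MessageExt;

/** Batch-application side: a node instance listening for the trigger message (claim 6). */
public class BatchSummaryListener {

    public static void main(String[] args) throws Exception {
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("batch_app_group");
        consumer.setNamesrvAddr("127.0.0.1:9876");
        // Subscribe to the pre-configured topic and filter on the tag.
        consumer.subscribe("BATCH_SETTLE_TOPIC", "DAILY_SUMMARY");

        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, ctx) -> {
            for (MessageExt msg : msgs) {
                // Parse the trigger message and extract the execution context.
                String body = new String(msg.getBody(), StandardCharsets.UTF_8);
                try {
                    runMainTaskChain(body);
                } catch (Exception e) {
                    // On an abnormal condition, let the MQ redeliver so the node can retry.
                    return ConsumeConcurrentlyStatus.RECONSUME_LATER;
                }
            }
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
        System.out.println("Node instance listening for batch summary trigger...");
    }

    /** Placeholder for initializing and executing the pre-configured main task chain. */
    static void runMainTaskChain(String triggerBody) {
        // 1. initialise the chain from the batch task chain flow definition table
        // 2. execute the nodes in the configured order
        // 3. poll subtask status, then summarise the subtask results
    }
}
```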
7. The method for summarizing mass data based on task chains and divide-and-conquer method according to claim 1, wherein after the execution of each subtask is completed, the subtask summarizes the result data processed by each subtask, and the main task integrates the subtask results to form a complete data set, comprising:
each subtask node completes the processing of each data set according to preset logic and stores the processing result;
Each subtask node actively reports the respective processing result to the main task node through a message queue after completing the task; the main task node receives the result data reported by each subtask, stores the result data and preprocesses the stored result data;
The main task node combines the result data of each subtask according to a preset combining strategy to form a complete data set;
and summarizing and counting the combined data sets, generating a final summarized report or data view, and outputting.
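A small sketch of the merge step in claim 7, in which the main task combines the partial summaries reported by each subtask into one complete data set, here by summing settlement amount and commission per merchant. The SubtaskSummary record and its fields are illustrative assumptions; the real merging strategy is whatever is pre-configured in the system.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Main-task side merge step (claim 7): combine the partial summaries reported by
 * each subtask into one complete data set. Field names are illustrative assumptions.
 */
public class SubtaskResultMerger {

    record SubtaskSummary(String merchantId, long settleAmount, long commission) {}

    /** Merge strategy: sum the partial amounts per merchant across all subtasks. */
    static Map<String, long[]> merge(List<List<SubtaskSummary>> reportedResults) {
        Map<String, long[]> complete = new HashMap<>();
        for (List<SubtaskSummary> oneSubtask : reportedResults) {
            for (SubtaskSummary s : oneSubtask) {
                long[] acc = complete.computeIfAbsent(s.merchantId(), k -> new long[2]);
                acc[0] += s.settleAmount();
                acc[1] += s.commission();
            }
        }
        return complete;
    }

    public static void main(String[] args) {
        List<List<SubtaskSummary>> reported = List.of(
                List.of(new SubtaskSummary("M001", 2000, 12)),
                List.of(new SubtaskSummary("M001", 1500, 9), new SubtaskSummary("M002", 5000, 30)));
        merge(reported).forEach((m, v) ->
                System.out.println(m + " settle=" + v[0] + " commission=" + v[1]));
    }
}
```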
8. A system for implementing the task chain and divide-and-conquer method based mass data summarization method of claim 1, comprising:
a batch processing module: a main task chain and a subtask chain are pre-configured in a batch task chain flow definition table, and each node and the execution sequence of each node are defined;
A batch starting module: the dispatching center system sets a timing task, issues a message of a specific topic and tag to the batch application system through a message queue at a fixed settlement time point, and starts a batch summary main task;
a batch summarizing module: after a node in the batch application system monitors messages corresponding to the topic and the tag in the MQ, starting to execute a batch summarizing main task chain;
an integration module: after each subtask is executed, the subtasks summarize the result data processed by each subtask, and the main task integrates the subtask results to form a complete data set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the task chain and divide-and-conquer method based mass data summarization method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program is executed by a processor to implement the task chain and divide-and-conquer method based mass data summarization method according to any one of claims 1-7.