CN117786008A - Method, apparatus, device, storage medium and program product for processing batch data - Google Patents


Info

Publication number
CN117786008A
CN117786008A (application CN202311830847.8A)
Authority
CN
China
Prior art keywords
processing
task
batch
data
batch data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311830847.8A
Other languages
Chinese (zh)
Inventor
廖宸
李东丽
刘启明
邓华丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202311830847.8A priority Critical patent/CN117786008A/en
Publication of CN117786008A publication Critical patent/CN117786008A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method for processing batch data, relates to the field of cloud computing technology, and is applicable to the field of financial technology. The method comprises the following steps: in response to a processing instruction for batch data, acquiring the number of batch data records to be processed and the historical processing performance data of the computing nodes; determining the maximum number of transactions a single node can process within a preset processing time according to the historical processing performance data of the computing node; dividing the batch task according to the maximum number of transactions and the number of batch data records to be processed; storing the divided task information in a distributed cache; starting computing nodes to retrieve the task information from the distributed cache; and executing the batch data tasks according to the task information. The present disclosure also provides a batch data processing apparatus, device, storage medium, and program product.

Description

Method, apparatus, device, storage medium and program product for processing batch data
Technical Field
The present disclosure relates to the field of cloud computing technology, and in particular, to the field of distributed technology, and more particularly, to a method, an apparatus, a device, a storage medium, and a program product for processing batch data.
Background
A cloud-native distributed database, such as the Gauss database, is deployed in a distributed manner. In general, data access must be distributed to the data nodes through coordination nodes, and each access must carry the sharding field of the table structure so that the coordination node can route the request to the specified data node according to the sharding field. However, for end-of-day batch processing, in which all daytime transaction records must be traversed for centralized, unified processing, no sharding field can be specified. For end-of-day processing of a small data volume, a single job can be started to traverse the full volume of data; but for a large batch of data, starting only one batch node creates a performance bottleneck, while starting multiple batch nodes risks repeated processing of the same data. In addition, for a database whose transaction record sequence numbers are discontinuous and severely skewed, the processing time is uncontrollable if the data are simply sliced evenly.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
In view of the above, the present disclosure provides a method, apparatus, device, storage medium, and program product for processing batch data, to solve the problem of concurrent processing of massive data for end-of-day applications deployed on a cloud-native distributed database.
According to a first aspect of the present disclosure, there is provided a method of processing batch data, the method comprising:
in response to a processing instruction for batch data, acquiring the number of batch data records to be processed and the historical processing performance data of the computing nodes;
determining the maximum number of transactions a single node can process within a preset processing time according to the historical processing performance data of the computing node;
dividing the batch task according to the maximum number of transactions and the number of batch data records to be processed;
storing the divided task information in a distributed cache;
starting computing nodes to retrieve the task information from the distributed cache; and
executing the batch data tasks according to the task information.
According to an embodiment of the disclosure, dividing the batch task according to the maximum number of transactions and the number of batch data records to be processed includes:
determining the upper and lower limits of the transaction sequence numbers of the batch data to be processed according to the number of batch data records to be processed; and
determining a transaction sequence number interval for each task according to the maximum number of transactions and the upper and lower limits of the transaction sequence numbers of the batch data to be processed.
According to an embodiment of the disclosure, determining the maximum number of transactions a single node can process within the preset processing time according to the historical processing performance data of the computing node includes:
calculating the number of transactions processed by a single node per unit time according to the historical processing performance data of the computing node; and
calculating the maximum number of transactions that can be processed within the preset processing time according to the number of transactions processed by a single node per unit time.
According to an embodiment of the present disclosure, starting the computing nodes to retrieve the task information from the distributed cache includes:
starting a plurality of computing nodes in response to a task trigger instruction issued by the batch executor console; and
sequentially retrieving the task information from the distributed cache based on distributed task locks.
According to an embodiment of the present disclosure, sequentially retrieving the task information from the distributed cache based on distributed task locks includes:
acquiring the distributed task lock corresponding to a piece of task information before retrieving that task information; and
setting the state of the task information whose distributed task lock has been acquired to claimed.
According to an embodiment of the disclosure, executing the batch data task according to the task information includes:
determining target service data in a data node according to the transaction sequence number interval in the task information; and
processing the target service data.
A second aspect of the present disclosure provides an apparatus for processing batch data, the apparatus comprising:
an acquisition module, configured to acquire, in response to a processing instruction for batch data, the number of batch data records to be processed and the historical processing performance data of the computing nodes;
a determining module, configured to determine the maximum number of transactions a single node can process within a preset processing time according to the historical processing performance data of the computing node;
a task segmentation module, configured to divide the batch task according to the maximum number of transactions and the number of batch data records to be processed;
a storage module, configured to store the divided task information in a distributed cache;
a task retrieval module, configured to start computing nodes to retrieve the task information from the distributed cache; and
a task execution module, configured to execute the batch data tasks according to the task information.
According to an embodiment of the present disclosure, the task segmentation module includes a first determining sub-module and a second determining sub-module.
The first determining sub-module is configured to determine the upper and lower limits of the transaction sequence numbers of the batch data to be processed according to the number of batch data records to be processed; and
the second determining sub-module is configured to determine the transaction sequence number interval of each task according to the maximum number of transactions and the upper and lower limits of the transaction sequence numbers of the batch data to be processed.
According to an embodiment of the disclosure, the determining module includes a first calculation sub-module and a second calculation sub-module.
The first calculation sub-module is configured to calculate the number of transactions processed by a single node per unit time according to the historical processing performance data of the computing node; and
the second calculation sub-module is configured to calculate the maximum number of transactions that can be processed within the preset processing time according to the number of transactions processed by a single node per unit time.
According to an embodiment of the disclosure, the task retrieval module includes a computing node starting sub-module and a task retrieval sub-module.
The computing node starting sub-module is configured to start a plurality of computing nodes in response to a task trigger instruction issued by the batch executor console; and
the task retrieval sub-module is configured to sequentially retrieve the task information from the distributed cache based on distributed task locks.
According to an embodiment of the present disclosure, the task retrieval sub-module includes an acquisition unit and a status update unit.
The acquisition unit is configured to acquire the distributed task lock corresponding to a piece of task information before retrieving that task information; and
the status update unit is configured to set the state of the task information whose distributed task lock has been acquired to claimed.
According to an embodiment of the present disclosure, the task execution module includes a third determining sub-module and a processing sub-module.
The third determining sub-module is configured to determine target service data in a data node according to the transaction sequence number interval in the task information; and
the processing sub-module is configured to process the target service data.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of processing bulk data as described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the method of processing bulk data as described above.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the method of processing batch data as described above.
According to the batch data processing method provided by the embodiments of the present disclosure, in response to a processing instruction for batch data, the maximum number of transactions a single node can process within a preset processing time is determined according to the historical processing performance data of the computing nodes, and the batch task is divided according to this maximum number of transactions and the number of batch data records to be processed. Dividing the batch task makes the upper limit of the processing time of each task controllable, so the overall processing time is controllable and the concurrent processing capacity of the cloud-native distributed database is improved. The divided task information is stored in a distributed cache; computing nodes are started to retrieve the task information from the distributed cache; and the batch data tasks are executed according to the task information. Because the batch executors actively claim tasks, the computing load can be effectively balanced among them. Compared with the related art, the method provided by the embodiments of the present disclosure logically splits a large batch of data to be processed into small processing tasks, which prevents the computing nodes from processing the same data repeatedly; and because the processing time of each computing node is controllable, the processing efficiency of the batch data is improved.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a method, apparatus, device, storage medium, and program product for processing batch data according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a system architecture diagram of a batch data processing apparatus provided in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a method for processing batch data provided in accordance with an embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a batch task segmentation method provided in accordance with another embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of processing batch data provided in accordance with another embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of a batch data processing apparatus according to an embodiment of the present disclosure; and
fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a method of processing bulk data according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C" are used, they should generally be interpreted according to the meaning commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
The terms appearing in the embodiments of the present disclosure will first be explained:
cloud primary database: the cloud infrastructure is fully utilized, the distributed database is constructed on the basis of the cloud infrastructure, the calculation and storage separation of the application is realized, the cloud deployment of the application storage is satisfied, and the dynamic online capacity expansion characteristic is realized.
Database CN node: a coordination Node (Coordinator Node) in charge of receiving an access request from an application and returning an execution result to a client; and the task decomposition is responsible for decomposing the task, and the task segmentation is scheduled to be executed on each DN node in parallel.
Database DN node: the Data Node (Data Node) is responsible for storing service Data, executing Data query tasks and returning execution results to the CN.
Functional calculation node: and (3) applying for computing resources when the server settlement node is used, recovering the resources after the resources are used up, and realizing optimal dynamic allocation of the resources.
Under a sharded (sub-database, sub-table) architecture, each cluster has independent database nodes, the data of different clusters are isolated from one another, and the data of all clusters together form the full volume of business data. Each cluster deploys an independent batch executor to process its own data, and the number of batch processing nodes per cluster is fixed. A cloud-native distributed database (such as the Gauss database) adopts distributed deployment: in general, data access must be distributed by a Coordination Node (CN) to the Data Nodes (DNs), and the access must carry the sharding field of the table structure, so that the CN routes the request to the specified DN according to the sharding field; otherwise all DNs are accessed by broadcast. The cloud-native database requires the application to connect to the CN node and does not allow direct connection to the DN nodes, and for end-of-day batch processing, in which all daytime transaction records must be traversed for centralized, unified processing, no sharding field can be specified. For end-of-day processing of a small data volume, one job is started to traverse the full volume of data; but for a large batch of data, starting only one batch node creates a performance bottleneck, while starting multiple batch nodes risks repeated processing of the same data. And for a database whose transaction record sequence numbers are discontinuous and severely skewed, the processing time is uncontrollable if the data are simply sliced evenly.
In the related art, one approach modifies the online transaction so that a data processing shard, such as a logical shard number, is specified in the online transaction, and the logical shard field must be incorporated into an index; as a result, the concurrent processing capability cannot be dynamically expanded, the deployment wastes resources, and a performance bottleneck remains. Another approach queries the upper and lower boundaries of the large batch of data to be processed, divides that range into equal sections according to the number of available executors, and assigns one section to each batch executor. If the unique transaction sequence numbers are continuous and monotonically increasing, this scheme works; but if the sequence numbers are discontinuous and severely skewed, some executors receive few or no transaction records while others receive many, so the overall batch processing takes a long time and the required processing time cannot be met.
Based on the above technical problems, an embodiment of the present disclosure provides a method for processing batch data, where the method includes: in response to a processing instruction for batch data, acquiring the number of batch data records to be processed and the historical processing performance data of the computing nodes; determining the maximum number of transactions a single node can process within a preset processing time according to the historical processing performance data; dividing the batch task according to the maximum number of transactions and the number of batch data records to be processed; storing the divided task information in a distributed cache; starting computing nodes to retrieve the task information from the distributed cache; and executing the batch data tasks according to the task information.
FIG. 1 schematically illustrates an application scenario diagram of a method, apparatus, device, storage medium, and program product for processing batch data according to an embodiment of the present disclosure.
As shown in fig. 1, the application scenario 100 according to this embodiment may include a processing scenario of batch data. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only) may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a batch executor console server and may execute the batch data processing method provided by the embodiments of the present disclosure: in response to a processing instruction for batch data, it acquires the number of batch data records to be processed and the historical processing performance data of the computing nodes; determines the maximum number of transactions a single node can process within a preset processing time according to the historical processing performance data; divides the batch task according to the maximum number of transactions and the number of batch data records to be processed; stores the divided task information in a distributed cache; starts computing nodes to retrieve the task information from the distributed cache; and executes the batch data tasks according to the task information.
It should be noted that, the method for processing batch data provided by the embodiment of the present disclosure may be generally performed by the server 105. Accordingly, the processing device for batch data provided by the embodiments of the present disclosure may be generally disposed in the server 105. The method of processing bulk data provided by the embodiments of the present disclosure may also be performed by a server or a cluster of servers that are different from the server 105 and that are capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the processing apparatus for batch data provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the method and apparatus for processing batch data provided by the embodiments of the present disclosure may be used in the field of internet technology, in the field of financial technology, and in any other field; the embodiments of the present disclosure do not limit the application field of the method and apparatus for processing batch data.
Fig. 2 schematically illustrates a system architecture diagram of a batch data processing apparatus provided according to an embodiment of the present disclosure. As shown in fig. 2, the batch data processing apparatus provided in the embodiment of the present disclosure mainly includes a batch executor console and functional batch computing nodes. Before issuing a batch data processing task, the console computes and divides the batch data to be processed, splitting the large batch task evenly according to the preset processing time; for example, each task's transaction sequence number interval may cover 50,000 transaction records. The divided task information is then cached in a Redis cache. Next, the batch executor console starts the corresponding functional computing nodes according to the task calculation result of the first step. The more tasks there are, the more nodes are started, but the number of started nodes generally does not exceed the number of tasks; the exact number can be controlled through parameters, with an upper limit set. After receiving the console dispatch instruction, each functional computing node queues to claim a task from the Redis cache, queries the corresponding service data in the cloud-native database according to the claimed task information, and processes the service data.
The method for processing batch data according to the embodiments of the present disclosure will be described in detail below with reference to fig. 3 to 5 based on the application scenario described in fig. 1 and the system architecture shown in fig. 2.
FIG. 3 schematically illustrates a flow chart of a method for processing batch data according to an embodiment of the present disclosure. As shown in fig. 3, the batch data processing method of this embodiment includes operations S210 to S260, and the method may be performed by a server or other computing device.
In operation S210, in response to the processing instruction for batch data, the number of batch data records to be processed and the historical processing performance data of the computing nodes are acquired.
In operation S220, the maximum number of transactions a single node can process within a preset processing time is determined according to the historical processing performance data of the computing node.
In operation S230, the batch task is divided according to the maximum number of transactions and the number of batch data records to be processed.
In one example, for end-of-day batch processing, when the batch executor console receives a batch data processing instruction, it acquires the number of batch data records to be processed and the historical processing performance data of the computing nodes. The maximum number of transactions a single node can process within a preset processing time is then determined from the historical processing performance data. The preset processing time may be, for example, 5 or 10 minutes; the maximum number of transactions a single node can process within 5 minutes, calculated from the historical performance data, may be 50,000, 100,000, and so on.
In one example, because the data transaction sequence numbers of the cloud-native distributed database are discontinuous and severely skewed, simply sharding the batch task evenly could make the execution times of the batch tasks differ greatly, which affects the execution efficiency of the whole batch. Therefore, in the embodiments of the present disclosure, the preset processing time can be configured according to actual requirements: by determining the maximum number of transactions a single node can process within the preset processing time and dividing the batch task according to that maximum, the upper time limit for a single node to process each batch task is fixed. For example, if the preset processing time is 10 minutes, a single node takes at most 10 minutes to complete a batch task; because of data skew, the data volume in some divided tasks is unsaturated, so a single node may finish those tasks in less than 10 minutes. Either way, the maximum duration for which each node executes a task is guaranteed to be controllable.
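The division logic of operations S220-S230 can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation; the names (`split_batch`, `max_tx`) and the assumption that the transaction sequence numbers lie in a known inclusive range are hypothetical:

```python
def split_batch(lower, upper, max_tx):
    """Split the inclusive sequence-number range [lower, upper] into
    tasks of at most max_tx transactions each, so that the processing
    time of any single task stays within the preset processing time."""
    tasks = []
    start = lower
    while start <= upper:
        end = min(start + max_tx - 1, upper)
        tasks.append({"seq_start": start, "seq_end": end, "status": "pending"})
        start = end + 1
    return tasks

# e.g. sequence numbers 1..120,000 split into tasks of at most 50,000 each
tasks = split_batch(1, 120_000, 50_000)
```

Because of sequence-number skew, an interval may contain far fewer real records than its nominal width; the interval width only bounds the per-task work from above, which is exactly what makes the per-task processing time controllable.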
In operation S240, the divided task information is saved in the distributed cache.
In one example, after the large batch data task is divided and calculated, the divided task information is cached in a distributed cache. The task information includes the transaction sequence number interval of the data to be processed and is used to locate the service data when the batch task is executed. The distributed cache may be, for example, a Redis database.
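Operation S240 can be sketched with an ordinary dict standing in for the distributed cache; in a real deployment these writes would be commands issued through a Redis client, and the key names here are illustrative assumptions, not the patent's:

```python
import json

def save_tasks(cache, tasks):
    """Cache each divided task under its own key, so that any
    computing node can later claim it independently.  `cache` is an
    in-memory stand-in for a distributed cache such as Redis."""
    for i, task in enumerate(tasks):
        # Each entry carries the sequence-number interval used to
        # locate the service data when the task is executed.
        cache[f"batch:task:{i}"] = json.dumps(task)

cache = {}
save_tasks(cache, [{"seq_start": 1, "seq_end": 50_000},
                   {"seq_start": 50_001, "seq_end": 100_000}])
```

Keeping one key per task (rather than one list for the whole batch) is what later allows a per-task distributed lock to be attached to each entry.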
In operation S250, computing nodes are started to retrieve the task information from the distributed cache.
In operation S260, a batch data task is performed according to the task information.
According to the embodiment of the disclosure, the target service data in a data node is determined according to the transaction sequence number interval in the task information, and the target service data is then processed.
In one example, the batch console issues a task trigger instruction to the computing nodes, and a corresponding number of computing nodes is started. The number of started computing nodes is related to the number of tasks: the more tasks, the more nodes are started, but generally no more than the number of tasks, and the number started is controlled through parameters with an upper limit set. After receiving the console dispatch instruction, each functional computing node starts batch processing. Specifically, the computing nodes queue to retrieve task information from the distributed cache, determine the target service data in a data node according to the transaction sequence number interval in the task information, and process the target service data. When a node completes the task it has claimed, it continues by claiming the next task in Redis; if no task can be claimed, the executor terminates. Once no executor can claim any further task, the large batch job has been executed completely.
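The claim-and-lock sequence described above can be sketched as follows, again with a dict standing in for Redis; the `setdefault` call mimics the atomic set-if-absent (SETNX-style) behavior of a distributed task lock, and all key and worker names are hypothetical:

```python
import json

def claim_next_task(cache, worker_id):
    """Claim one unclaimed task: acquire its per-task lock first
    (dict.setdefault stands in for an atomic Redis SETNX), then set
    the task state to 'claimed' and return the task.  Returns None
    when every task is already claimed, at which point the executor
    terminates."""
    for key in sorted(k for k in cache if k.endswith(":info")):
        lock_key = key.replace(":info", ":lock")
        if lock_key not in cache and cache.setdefault(lock_key, worker_id) == worker_id:
            task = json.loads(cache[key])
            task["status"] = "claimed"   # state update after lock acquisition
            cache[key] = json.dumps(task)
            return task
    return None

# two divided tasks cached under illustrative keys
cache = {f"batch:task:{i}:info": json.dumps({"seq_start": i * 50_000 + 1,
                                             "seq_end": (i + 1) * 50_000})
         for i in range(2)}
first = claim_next_task(cache, "node-A")
second = claim_next_task(cache, "node-B")
leftover = claim_next_task(cache, "node-A")   # no task left: executor stops
```

Because each node acquires the lock before reading the task and marks it claimed, no two executors process the same interval, while a faster executor naturally claims more tasks, balancing the load.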
According to the batch data processing method provided by the embodiment of the disclosure, in response to a processing instruction for batch data, the maximum number of transactions a single node can process within the preset processing time is determined according to the historical processing performance data of the computing nodes, and the batch task is divided according to that maximum and the number of batch data records to be processed. Dividing the batch task in this way makes the upper limit of each task's processing time controllable, so the overall processing time is controllable and the concurrent processing capacity of the cloud native distributed database is improved. The divided task information is stored in a distributed cache; computing nodes are started to retrieve the task information from the distributed cache; and batch data tasks are executed according to the task information. Because the batch executors actively claim tasks, the computational load is effectively balanced among them. Compared with the related art, the method logically splits a large batch of data to be processed into small processing tasks, prevents computing nodes from processing the same data repeatedly, and, because the processing time of each computing node is controllable, improves the processing efficiency of the batch data.
FIG. 4 schematically illustrates a flow chart of a batch task segmentation method provided in accordance with another embodiment of the present disclosure. As shown in fig. 4, operations S310 to S340 are included.
In operation S310, the number of transactions processed by a single node per unit time is calculated from the historical processing performance data of the computing node.
In operation S320, the maximum number of transactions processed within the preset processing time is calculated from the number of transactions processed by the single node per unit time.
In operation S330, an upper limit and a lower limit of the transaction sequence number of the batch data to be processed are determined according to the record number of the batch data to be processed.
In operation S340, a transaction sequence number interval of each task is determined according to the maximum transaction number and the upper and lower limits of the transaction sequence numbers of the batch data to be processed.
In one example, the historical processing performance data of the computing nodes is analyzed statistically to determine the number of transactions a single node can process per unit time, such as 100 transactions per second or 10,000 transactions per minute; the maximum number of transactions within the preset processing time is then calculated from that rate. Suppose a single node processes a task of 100,000 records within the preset processing time of about 5 to 10 minutes; the records to be processed can then be divided evenly into tasks of 100,000 records each. The upper and lower limits of the transaction sequence numbers of the batch data to be processed are determined from the number of records: assuming the minimum transaction sequence number is Seqmin and the maximum is Seqmax, the transaction sequence number interval of each task is: [Seqmin, Seqmin+10w], (Seqmin+10w, Seqmin+20w], (Seqmin+20w, Seqmin+30w], ..., (Seqmin+n·10w, Seqmax], where 10w denotes 100,000 records.
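The interval computation in this example can be sketched as follows (a minimal sketch assuming contiguous half-open chunks and a 100,000-record chunk size; the function name and concrete bounds are hypothetical):

```python
def split_intervals(seq_min: int, seq_max: int, chunk: int):
    """Split [seq_min, seq_max] into per-task sequence number intervals.

    The last interval absorbs the remainder, mirroring (Seqmin+n*10w, Seqmax].
    """
    intervals = []
    lo = seq_min
    while lo < seq_max:
        hi = min(lo + chunk, seq_max)  # clamp the final task at Seqmax
        intervals.append((lo, hi))
        lo = hi
    return intervals

# Seqmin = 0, Seqmax = 250,000, 100,000 records per task:
# -> [(0, 100000), (100000, 200000), (200000, 250000)]
tasks = split_intervals(0, 250_000, 100_000)
```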
Fig. 5 schematically illustrates a flow chart of a method of processing batch data provided in accordance with another embodiment of the present disclosure. As shown in fig. 5, operation S250 includes operations S251 to S252.
In operation S251, a plurality of computing nodes are started in response to a task trigger instruction issued by the batch executor console.
In operation S252, the task information is sequentially retrieved from the distributed cache based on the distributed task lock.
According to the embodiment of the disclosure, a distributed task lock corresponding to task information is acquired before the task information is acquired; and setting the task information state of the acquired distributed task lock as claimed.
In one example, after a task trigger instruction issued by the batch executor console is received, a plurality of computing nodes are started. Because the nodes start simultaneously and compete to claim tasks, a Redis distributed lock is required when retrieving a task from the distributed cache, to prevent data from being processed repeatedly through duplicate claims. For example, the first functional computing node acquires the first task by default, i.e., processes the batch service data in the range [Seqmin, Seqmin+10w], and so on; a task is marked as claimed before it is retrieved, so other executors cannot claim it again. When a node finishes its claimed task, it continues to claim the next task in Redis; because of task contention, a node must first acquire the task lock, may set the task as claimed only after acquiring the lock, and then processes that batch task.
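The claim-with-lock step can be sketched as follows (a threading lock simulates the distributed lock; a real deployment would use something like Redis SET NX semantics, and the key and field names here are assumptions):

```python
import threading

cache_lock = threading.Lock()  # stand-in for a Redis distributed task lock
tasks = {
    "task:0": {"seq": (0, 100_000), "status": "unclaimed"},
    "task:1": {"seq": (100_000, 200_000), "status": "unclaimed"},
}

def claim_next():
    """Atomically claim one unclaimed task, marking it before lock release."""
    with cache_lock:  # acquire the task lock before reading task state
        for key, info in tasks.items():
            if info["status"] == "unclaimed":
                info["status"] = "claimed"  # mark so no other executor re-claims it
                return key, info["seq"]
    return None  # nothing left to claim: the executor terminates

first = claim_next()
second = claim_next()
third = claim_next()
```

Because the status is flipped inside the critical section, two executors calling `claim_next` concurrently can never receive the same interval.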
Based on the batch data processing method, the disclosure also provides a batch data processing device. The device will be described in detail below in connection with fig. 6.
Fig. 6 schematically illustrates a block diagram of a batch data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 6, the processing apparatus 600 for batch data of this embodiment includes an acquisition module 610, a determination module 620, a task segmentation module 630, a storage module 640, a task retrieval module 650, and a task execution module 660.
The obtaining module 610 is configured to obtain the number of batch data records to be processed and the historical processing performance data of the computing node in response to the processing instruction of the batch data. In an embodiment, the obtaining module 610 may be configured to perform the operation S210 described above, which is not described herein.
The determining module 620 is configured to determine, according to the historical processing performance data of the computing node, a maximum number of processing transactions of the single node within a preset processing time. In an embodiment, the determining module 620 may be configured to perform the operation S220 described above, which is not described herein.
The task segmentation module 630 is configured to segment the batch task according to the maximum transaction number and the batch data record number to be processed. In an embodiment, the task segmentation module 630 may be used to perform the operation S230 described above, which is not described herein.
The storage module 640 is configured to store the segmented task information in a distributed cache. In an embodiment, the storage module 640 may be used to perform the operation S240 described above, which is not described herein.
The task retrieval module 650 is configured to initiate a computing node to retrieve the task information from the distributed cache. In an embodiment, the task retrieval module 650 may be configured to perform the operation S250 described above, which is not described herein.
The task execution module 660 is configured to execute batch data tasks according to the task information. In an embodiment, the task execution module 660 may be configured to execute the operation S260 described above, which is not described herein.
According to an embodiment of the present disclosure, the task segmentation module includes: a first determination sub-module and a second determination sub-module.
And the first determining submodule is used for determining the upper limit and the lower limit of the transaction sequence number of the batch data to be processed according to the batch data record number to be processed. In an embodiment, the first determining sub-module may be used to perform the operation S330 described above, which is not described herein.
And the second determining submodule is used for determining the transaction sequence number interval of each task according to the maximum processing transaction number and the upper and lower limits of the batch data transaction sequence numbers to be processed. In an embodiment, the second determining sub-module may be used to perform the operation S340 described above, which is not described herein.
According to an embodiment of the disclosure, the determination module includes a first calculation sub-module and a second calculation sub-module.
And the first calculation sub-module is used for calculating the number of the processing transactions in the unit time of the single node according to the historical processing performance data of the calculation node. In an embodiment, the first computing sub-module may be used to perform the operation S310 described above, which is not described herein.
And the second calculation sub-module is used for calculating the maximum processing transaction number in the preset processing time according to the processing transaction number in the unit time of the single node. In an embodiment, the second computing sub-module may be used to perform the operation S320 described above, which is not described herein.
According to an embodiment of the disclosure, the task retrieval module includes a compute node promoter module, a task retrieval sub-module.
And the computing node starting sub-module is used for responding to the task trigger instruction issued by the batch executor console and starting a plurality of computing nodes. In an embodiment, the computing node promoter module may be used to perform the operation S251 described above, which is not described herein.
And the task acquisition sub-module is used for sequentially acquiring the task information from the distributed cache based on the distributed task locks. In an embodiment, the task retrieval sub-module may be used to perform the operation S252 described above, which is not described herein.
According to an embodiment of the present disclosure, a task retrieval sub-module includes an acquisition unit and a status update unit.
The acquisition unit is used for acquiring the distributed task lock corresponding to the task information before acquiring the task information. In an embodiment, the obtaining unit may be configured to perform the operation S252 described above, which is not described herein.
And the state updating unit is used for setting the task information state for acquiring the distributed task lock as claimed. In an embodiment, the status updating unit may be used to perform the operation S252 described above, which is not described herein.
According to an embodiment of the present disclosure, the task execution module includes: a third determination sub-module and a processing sub-module.
And the third determining submodule is used for determining target service data in the data node according to the transaction sequence number interval in the task information. In an embodiment, the task execution module 660 may be configured to execute the operation S260 described above, which is not described herein.
And the processing sub-module is used for processing the target business data. In an embodiment, the task execution module 660 may be configured to execute the operation S260 described above, which is not described herein.
Any of the acquisition module 610, the determination module 620, the task segmentation module 630, the storage module 640, the task retrieval module 650, and the task execution module 660 may be combined into one module for implementation, or any one of them may be split into multiple modules, according to embodiments of the present disclosure. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the acquisition module 610, the determination module 620, the task segmentation module 630, the storage module 640, the task retrieval module 650, and the task execution module 660 may be implemented at least in part as hardware circuitry, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system in a package, or an application specific integrated circuit (ASIC), or as hardware or firmware in any other reasonable manner of integrating or packaging circuitry, or as any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the acquisition module 610, the determination module 620, the task segmentation module 630, the storage module 640, the task retrieval module 650, and the task execution module 660 may be at least partially implemented as a computer program module that, when executed, performs the corresponding functions.
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a method of processing bulk data according to an embodiment of the disclosure.
As shown in fig. 7, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. The processor 901 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 901 may also include on-board memory for caching purposes. Processor 901 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. The processor 901 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the program may be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage section 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card or a modem. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is installed on the drive 910 as needed, so that a computer program read therefrom is installed into the storage section 908 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs that, when executed, implement a method of processing batch data according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 902 and/or RAM 903 and/or one or more memories other than ROM 902 and RAM 903 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs in a computer system, the program code is used for enabling the computer system to realize the batch data processing method provided by the embodiment of the disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 901. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal on a network medium. The computer program may include program code that may be transmitted using any appropriate medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from the network via the communication section 909 and/or installed from the removable medium 911.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A method for processing batch data, the method comprising:
responding to a processing instruction of batch data, and acquiring the record number of the batch data to be processed and the historical processing performance data of the computing node;
determining the maximum processing transaction number of a single node in preset processing time according to the historical processing performance data of the computing node;
dividing batch tasks according to the maximum transaction processing number and the batch data record number to be processed;
storing the segmented task information into a distributed cache;
starting a computing node to obtain the task information from the distributed cache; and
and executing batch data tasks according to the task information.
2. The method of claim 1, wherein the partitioning of the batch task according to the maximum number of transactions processed and the number of batch data records to be processed comprises:
determining the upper limit and the lower limit of the transaction sequence number of the batch data to be processed according to the batch data record number to be processed; and
and determining a transaction sequence number interval of each task according to the maximum transaction processing number and the upper and lower limits of the batch data transaction sequence numbers to be processed.
3. The method of claim 1, wherein determining a maximum number of processing transactions for a single node within a preset processing time based on the computing node historical processing performance data comprises:
calculating the number of transactions processed per unit time by a single node according to the historical processing performance data of the computing node; and
and calculating the maximum processing transaction number in the preset processing time according to the number of transactions processed per unit time by the single node.
4. The method of claim 1, wherein the initiating a computing node to retrieve the task information from the distributed cache comprises:
responding to a task trigger instruction issued by a batch executor console, and starting a plurality of computing nodes; and
and sequentially retrieving the task information from the distributed cache based on the distributed task lock.
5. The method of claim 4, wherein the sequentially retrieving the task information from the distributed cache based on the distributed task lock comprises:
acquiring a distributed task lock corresponding to the task information before acquiring the task information; and
the task information state of the acquired distributed task lock is set as claimed.
6. The method of claim 5, wherein performing a batch data task based on the task information comprises:
determining target service data in a data node according to a transaction sequence number interval in the task information; and
and processing the target service data.
7. A batch data processing apparatus, the apparatus comprising:
the acquisition module is used for responding to the processing instruction of the batch data and acquiring the record number of the batch data to be processed and the historical processing performance data of the computing node;
the determining module is used for determining the maximum processing transaction number of the single node in the preset processing time according to the historical processing performance data of the computing node;
the task segmentation module is used for segmenting batch tasks according to the maximum transaction processing number and the batch data record number to be processed;
the storage module is used for storing the segmented task information into the distributed cache;
the task retrieval module is used for starting a computing node to retrieve the task information from the distributed cache; and
and the task execution module is used for executing batch data tasks according to the task information.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
Application CN202311830847.8A, filed 2023-12-28; published as CN117786008A on 2024-03-29.

