CN116955378A - Data processing method and system - Google Patents


Info

Publication number
CN116955378A
Authority
CN
China
Prior art keywords: data, processed, actor, batches, processing
Legal status: Pending (assumed by Google Patents; not a legal conclusion)
Application number
CN202310897616.2A
Other languages
Chinese (zh)
Inventor
李凯
宋磊
Current Assignee: Picc Information Technology Co ltd (as listed by Google Patents; accuracy not guaranteed)
Original Assignee: Picc Information Technology Co ltd
Application filed by Picc Information Technology Co ltd
Priority to CN202310897616.2A
Publication of CN116955378A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor, of structured data, e.g. relational data
    • G06F16/23: Updating
    • G06F16/2308: Concurrency control
    • G06F16/2336: Pessimistic concurrency control approaches, e.g. locking or multiple versions without time stamps
    • G06F16/2343: Locking methods, e.g. distributed locking or locking implementation details
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor, of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2455: Query execution
    • G06F16/24564: Applying rules; Deductive queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor, of structured data, e.g. relational data
    • G06F16/28: Databases characterised by their database models, e.g. relational or object models
    • G06F16/284: Relational databases
    • G06F16/285: Clustering or classification
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a data processing method and system in the field of computers. The data processing method comprises the following steps: acquiring the credential group code of task data to be processed and the distributed lock corresponding to the credential group code; when the distributed lock corresponding to the credential group code is successfully acquired, determining whether the data volume of the task data to be processed is greater than a threshold; when the data volume is greater than the threshold, determining the number of batches N of the task data according to the data volume; classifying the data to be processed into N batches according to a target classification rule; and performing processing operations on the N batches in parallel to obtain a processing result.

Description

Data processing method and system
Technical Field
The application belongs to the field of computers, and particularly relates to a data processing method and system.
Background
With the continued development of information technology, ever more data must be processed and classified. In the insurance field, for example, during a company's internal financial accounting the original certificates of each business transaction must be classified, recorded, audited, adjusted, and summarized according to accounting principles and accounting specifications, finally forming accounting certificates that reflect the company's financial condition and operating results.
Taking the insurance industry as an example, the related art typically uses an Oracle database: the data are stored, then classified and merged according to accounting principles and accounting specifications by stored procedures, i.e. precompiled programs consisting of a series of SQL statements and PL/SQL code.
Because all of the data are processed directly in the Oracle database, this approach suffers from low efficiency.
Disclosure of Invention
The embodiments of the application provide a data processing method and a data processing system that address the low efficiency of the related art.
In a first aspect, an embodiment of the present application provides a data processing method, including:
acquiring the credential group code of the task data to be processed and the distributed lock corresponding to the credential group code;
when the distributed lock corresponding to the credential group code is successfully acquired, determining whether the data volume of the task data to be processed is greater than a threshold;
when the data volume is greater than the threshold, determining the number of batches N of the task data according to the data volume;
classifying the data to be processed into N batches according to a target classification rule;
and performing processing operations on the N batches in parallel to obtain a processing result.
In a second aspect, an embodiment of the present application provides a data processing system comprising an acquisition module, a judgment module, and a processing module;
the acquisition module is configured to acquire the credential group code of the task data to be processed and the distributed lock corresponding to the credential group code;
the judgment module is configured to determine, when the distributed lock corresponding to the credential group code is successfully acquired, whether the data volume of the task data to be processed is greater than a threshold;
the processing module is configured to determine, when the data volume is greater than the threshold, the number of batches N of the task data according to the data volume; classify the data to be processed into N batches according to a target classification rule; and perform processing operations on the N batches in parallel to obtain a processing result.
In the embodiments of the application, the credential group code of the task data to be processed and the corresponding distributed lock are acquired; when the lock is successfully acquired, the data volume of the task data is compared against a threshold; when the volume exceeds the threshold, the number of batches N is determined from the data volume, the data are classified into N batches according to a target classification rule, and the N batches are processed in parallel to obtain the processing result. By classifying the data to be processed and processing the N batches in parallel, this avoids the low efficiency of the related art, in which all data are processed directly through an Oracle database.
Drawings
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the overall architecture of a data compute engine provided by an embodiment of the present application;
FIG. 3 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 5 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 7 is a block diagram of a data processing system according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application are described below with reference to the drawings. The described embodiments are clearly only some, not all, of the embodiments of the present application; all other embodiments obtained by a person skilled in the art on the basis of these embodiments fall within the scope of protection of the present application.
The terms "first", "second", and the like in the description and claims distinguish similar elements and do not necessarily describe a particular sequence or chronological order. It is to be understood that terms so used are interchangeable where appropriate, so that the embodiments of the present application may be implemented in orders other than those illustrated or described here; objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited, e.g. the first object may be one or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
With the continued development of information technology, ever more data must be processed and classified. In the insurance field, for example, during a company's internal financial accounting the original certificates of each business transaction must be classified, recorded, audited, adjusted, and summarized according to accounting principles and accounting specifications, finally forming accounting certificates that reflect the company's financial condition and operating results.
Taking the insurance industry as an example, the related art typically uses an Oracle database: the data are stored, then classified and merged according to accounting principles and accounting specifications by stored procedures, i.e. precompiled programs consisting of a series of SQL statements and PL/SQL code. When there is more data to be processed, for example millions of detail records, a run may take 10-12 hours or even abort abnormally, so the timeliness requirements of financial accounting cannot be met; a large amount of database-server resources is consumed, affecting the system's other normal services; and resource utilization is low and performance poor.
Meanwhile, because stored procedures run on the database server, they burden that server. Debugging and testing a stored procedure requires a connection to the database server, which indirectly increases the difficulty of debugging and testing. Stored procedures also occupy considerable server resources: if a stored procedure is not optimized or is poorly designed, database performance may suffer, and maintenance costs are high, requiring a database administrator or professional developer. Security problems such as SQL injection attacks may even arise if the stored procedure is insufficiently protected.
The data processing method provided by the application comprises: acquiring the credential group code of the task data to be processed and the distributed lock corresponding to the credential group code; when the distributed lock is successfully acquired, determining whether the data volume of the task data exceeds a threshold; when it does, determining the number of batches N from the data volume; classifying the data into N batches according to a target classification rule; and processing the N batches in parallel to obtain a processing result. By classifying the data to be processed and processing the N batches in parallel, this avoids the low efficiency of the related art, in which all data are processed directly through an Oracle database.
Meanwhile, the data processing method provided by the embodiments of the application can be applied with an Actor model, which is briefly described below.
Concurrent programming in Java requires special attention to thread problems such as locking and memory atomicity. An Actor, by contrast, maintains its own internal state: its internal data can be modified only by the Actor itself, with state changes driven by message passing, so concurrent programming with the Actor model avoids these problems. An Actor consists of three parts: state, behavior, and mailbox. Wherein:
State: the state of an Actor is the variable information of the Actor object. Because the state is managed by the Actor alone, problems such as locking and memory atomicity in a concurrent environment are avoided.
Behavior: the behavior specifies the computational logic of the Actor; by receiving messages, the behavior changes the Actor's state.
Mailbox: the mailbox is the communication bridge between Actors. It stores messages from senders in a FIFO queue, from which the receiving Actor takes them for processing.
Using the Actor model has the following benefits:
First, event-driven: communication between Actors is asynchronous, so after sending a message an Actor can handle other work without blocking or waiting.
Second, strong isolation: an Actor's methods cannot be called directly from outside; everything happens through message passing, which avoids data sharing between Actors, and one Actor can observe or query another's state change only via messages.
Third, location transparency: to the code, an Actor's address may be local or remote.
Fourth, lightweight: an Actor is a very lightweight computing unit; a single Actor occupies only some 400 bytes, so high concurrency can be achieved with a small amount of memory.
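The three-part structure just described (private state, message-driven behavior, FIFO mailbox) can be sketched in plain Java. This is a minimal illustrative sketch with assumed class and method names, not the patent's implementation; state is touched only by the actor's own worker thread, so no explicit lock on the state is needed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal actor: private state, behavior driven by messages, FIFO mailbox.
public class CounterActor {
    private int state = 0;                                      // state: owned by this actor only
    private final BlockingQueue<Integer> mailbox = new LinkedBlockingQueue<>(); // FIFO mailbox
    private final Thread worker = new Thread(this::run);

    public CounterActor() { worker.start(); }

    private void run() {                                        // behavior: handle one message at a time
        try {
            while (true) {
                int msg = mailbox.take();
                if (msg < 0) break;                             // negative message acts as a stop signal
                state += msg;                                   // state changes only via messages
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void tell(int msg) { mailbox.add(msg); }             // asynchronous send: never blocks the sender

    public int stop() {                                         // stop the actor and read its final state
        mailbox.add(-1);
        try { worker.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return state;
    }

    public static void main(String[] args) {
        CounterActor actor = new CounterActor();
        actor.tell(1);
        actor.tell(2);
        actor.tell(3);
        System.out.println(actor.stop()); // prints 6
    }
}
```

Because `tell` only enqueues, the sender is never blocked, which is the "event-driven" benefit listed above; the single worker thread gives the "strong isolation" benefit.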
For large-volume accounting documents, the embodiments of the application can adopt a scheme of concurrent processing of data shards. The front end supports configuring sharding rules, thresholds, and the like; an accounting-document merging task automatically computes the number of split batches and the per-batch data volume from this configuration and shards the data accordingly. A fault-tolerant, highly scalable Actor model is applied: a main Actor distributes, schedules, and collects the sub-Actors' processing results and guarantees final data consistency; Redis distributed locks control concurrency; and failed subtasks are compensated automatically. This greatly improves the timeliness of merging, certificate creation, and write-back for large volumes of accounting detail, and makes the process configurable and automated.
It should be understood that discussions of the same term or situation in different embodiments may be read together: as long as there is no logical conflict, the discussion of a term or situation in one embodiment also applies to descriptions of that term or situation in other embodiments.
The method provided by the embodiment of the application is described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.
The data processing method provided by the embodiments of the application applies to any scenario in which data need to be processed, such as the processing of accounting documents in the insurance industry. The method can be executed by a target device, which may be an electronic device such as a terminal (e.g. a notebook computer or tablet) or a server.
In the embodiments of the application, batching and parallel operation are implemented with distributed locks. A distributed lock is a mechanism, implemented on a distributed system, for mutually exclusive access to a shared resource across the participating systems.
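The mutual-exclusion behavior a distributed lock provides can be sketched as follows. A `ConcurrentHashMap` stands in for the shared lock store (in the patent's scheme this role is played by Redis, e.g. an atomic "set if not exists" on a key); the class and method names here are assumptions for illustration only.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of distributed-lock semantics: one holder per key, and only the
// holder's own token can release the lock (compare-and-delete).
public class SimpleLockStore {
    private final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // try to take the lock for `key` with the caller's unique token
    public boolean tryLock(String key, String token) {
        return store.putIfAbsent(key, token) == null;   // atomic "set if not exists"
    }

    // release only if this caller still holds the lock
    public boolean unlock(String key, String token) {
        return store.remove(key, token);                // atomic compare-and-delete
    }

    public static void main(String[] args) {
        SimpleLockStore locks = new SimpleLockStore();
        boolean first  = locks.tryLock("combCode:123", "worker-A");  // true
        boolean second = locks.tryLock("combCode:123", "worker-B");  // false: already held
        locks.unlock("combCode:123", "worker-A");
        boolean third  = locks.tryLock("combCode:123", "worker-B");  // true after release
        System.out.println(first + " " + second + " " + third); // prints true false true
    }
}
```

A production Redis lock additionally needs an expiry time so a crashed holder cannot block the key forever; that detail is omitted here.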
Fig. 1 is a flowchart of a data processing method according to an embodiment of the present application. As shown in Fig. 1, the method may include the following steps:
Step 110: acquire the credential group code of the task data to be processed and the distributed lock corresponding to the credential group code;
Step 120: when the distributed lock corresponding to the credential group code is successfully acquired, determine whether the data volume of the task data to be processed is greater than a threshold;
Step 130: when the data volume is greater than the threshold, determine the number of batches N of the task data according to the data volume;
Step 140: classify the data to be processed into N batches according to a target classification rule;
Step 150: perform processing operations on the N batches in parallel to obtain a processing result.
In the embodiments of the present application, the task data to be processed may be any data that needs to be processed and merged, such as order data in e-commerce or credential detail data in insurance; the credential group code may be any code that is unique within the task data, such as the accounting credential group code (CombCode) used in the insurance industry.
In step 120, if acquisition of the distributed lock corresponding to the credential group code fails, reacquisition may be attempted. When the data volume of the task data is not greater than the threshold, the data can be processed directly, without steps 130-140, to obtain the processing result.
In step 140, the target classification rule may be user-defined. For example, when the task data to be processed are accounting task data in the insurance field, the target classification rule may be a fee classification rule for those data, and classifying the data into N batches in step 140 may include: classifying the accounting task data into N batches according to the cost identifiers of the accounting task data.
In one embodiment, performing the processing operations on the N batches in parallel in step 150 includes, for each batch: determining whether its subtask is already being processed; and, if not, adding a distributed lock to the subtask, performing the processing operation to obtain the subtask's processing result, and releasing the subtask's distributed lock.
In the embodiments of the application, the credential group code of the task data to be processed and the corresponding distributed lock are acquired; when the lock is successfully acquired, the data volume of the task data is compared against a threshold; when the volume exceeds the threshold, the number of batches N is determined from the data volume, the data are classified into N batches according to a target classification rule, and the N batches are processed in parallel to obtain the processing result. By classifying the data to be processed and processing the N batches in parallel, this avoids the low efficiency of the related art, in which all data are processed directly through an Oracle database.
In one embodiment of the application, a fault-tolerant, highly scalable Actor (role) computing-unit model can be adopted. A main Actor computing unit distributes, schedules, and collects the processing results of sub-Actor computing units and guarantees final data consistency; distributed locks control concurrency; and failed subtasks are compensated automatically, greatly improving the timeliness of merging, certificate creation, and write-back for large volumes of accounting detail while making the process configurable and automated. Each Actor computing unit is an independent computing unit and can execute computing tasks in parallel. All computing tasks interact through message passing, and each Actor computing unit receives and processes tasks asynchronously. Each Actor computing unit also has a state memory storing its current state information: when an Actor computing unit fails, the system automatically restores it to the last saved state, ensuring stability and reliability. The system further adopts distributed computing, distributing tasks to different computing nodes for parallel execution, which improves computing efficiency and scalability. Distributed locks ensure that only one Actor computing unit can access a given resource at a time; locking is applied to the main Actor computing unit and to each sub-Actor computing unit separately. If a subtask fails to execute, the system records its state, the batch process is recalled, and redistribution of the subtask is attempted; if failures persist up to the configured number of times, operations staff are notified to intervene.
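The state-memory behavior described above, saving a snapshot of the unit's state and rolling back to it after a failure, can be sketched as follows. The class and method names are hypothetical; the patent does not specify a concrete API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of an Actor computing unit's state memory: snapshot() persists the
// current state, recover() restores the last persisted state after a failure.
public class StatefulUnit {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> lastSnapshot = new HashMap<>();

    public void update(String key, int value) { state.put(key, value); }

    public void snapshot() { lastSnapshot = new HashMap<>(state); } // save current state

    public void recover() { state = new HashMap<>(lastSnapshot); }  // roll back after a failure

    public Integer get(String key) { return state.get(key); }

    public static void main(String[] args) {
        StatefulUnit unit = new StatefulUnit();
        unit.update("processedBatches", 3);
        unit.snapshot();                       // last stored state: 3
        unit.update("processedBatches", 4);    // in-flight change, then the unit fails
        unit.recover();                        // restored to the last stored state
        System.out.println(unit.get("processedBatches")); // prints 3
    }
}
```

In a real deployment the snapshot would live in durable storage rather than in-process memory, so that a crashed unit can be restored on another node.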
FIG. 2 is a schematic diagram of the overall architecture of a data compute engine provided by an embodiment of the present application. As shown in Fig. 2, the data processing flow based on the Actor model and distributed locks may be as follows: a single certificate and a single sub-Actor computing unit are each locked with a distributed lock; the certificate-merging (main) Actor computing unit receives a batch-processing request, queries the threshold information, and, after the sub-Actor computing units finish processing, calls the data-saving Actor computing unit to save the data. During processing, the sub-Actor computing units and the data-saving Actor computing unit write the data into the database for subsequent queries. This data compute engine design, based on the Actor model and distributed locks, supports fault tolerance, high scalability, high concurrency, and automatic compensation.
In one embodiment of the present application, the data processing method shown in Fig. 1 may be executed by a first Actor computing unit, and acquiring the credential group code of the task data in step 110 may include receiving the code from a third Actor computing unit. Before determining in step 120 whether the data volume exceeds the threshold, the first Actor computing unit may obtain the threshold from a second Actor computing unit. The first Actor computing unit has a state memory storing its state information; when the first Actor computing unit fails, its state is restored to the last saved state according to the state information in the state memory.
It should be understood that, in the embodiments of the present application, the first Actor computing unit may be the main Actor computing unit. The first, second, and third Actor computing units may each correspond to a network-side server; in some cases they may instead be three computing units in the same server.
Fig. 3 is a flowchart of a data processing method according to an embodiment of the present application. As shown in Fig. 3, the method may include the following steps:
Step 310: acquire the credential group code of the accounting task data and the distributed lock corresponding to the credential group code;
Step 320: when the distributed lock corresponding to the credential group code is successfully acquired, determine whether the data volume of the accounting task data is greater than a threshold;
Step 330: when the data volume is greater than the threshold, determine the number of batches N of the accounting task data according to the data volume;
Step 340: classify the accounting task data into N batches according to the cost identifiers of the accounting task data;
Step 350: for each of the N batches, acquire its cost list, obtaining N cost lists;
Step 360: save the N cost lists and the N batch identifiers corresponding to them into a database, where each of the N batch identifiers uniquely identifies one batch of data;
Step 370: perform processing operations on the N batches in parallel to obtain a processing result.
In the embodiments of the present application, in step 340, classifying the accounting task data according to their cost identifiers may be done by taking a remainder, for example dividing each record's cost identifier by the number of batches and using the remainder as its batch index.
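The remainder-based split above can be sketched as follows: each record lands in batch (costId mod N), which spreads records roughly evenly over the N batches. `costId` is an illustrative numeric cost identifier and the class name is an assumption, not the patent's API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 340: assign each record to batch (costId % n).
public class BatchSplitter {
    public static List<List<Long>> split(List<Long> costIds, int n) {
        List<List<Long>> batches = new ArrayList<>();
        for (int i = 0; i < n; i++) batches.add(new ArrayList<>());
        for (long costId : costIds) {
            int batch = (int) (costId % n);     // remainder picks the batch index
            batches.get(batch).add(costId);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> costIds = List.of(101L, 102L, 103L, 104L, 105L, 106L);
        List<List<Long>> batches = split(costIds, 3);
        System.out.println(batches); // prints [[102, 105], [103, 106], [101, 104]]
    }
}
```

When the identifiers are roughly uniformly distributed, the modulo split produces the "more average groups" the description aims for.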
In the embodiments of the application, classifying the accounting task data according to their cost identifiers yields relatively even groups, which partitions the data, shortens processing time, and improves efficiency. The N cost lists and their N corresponding batch identifiers are saved in a database to support subsequent operations: if a subtask fails to execute, its cost lists and batch identifiers are read again and the subtask is re-executed until it succeeds. At the same time, the number of executions and the execution time of each subtask are recorded for later statistics and analysis, realizing automatic compensation of failed sub-Actor computing-unit tasks.
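The automatic-compensation behavior, re-running a failed subtask up to a configured limit and escalating after persistent failure, can be sketched as follows. The `Subtask` interface and the method names are illustrative assumptions.

```java
// Sketch of automatic compensation: retry a failed subtask until it succeeds
// or the configured attempt limit is exhausted, then escalate.
public class RetryRunner {
    public interface Subtask { boolean run(); }   // returns true on success

    // returns the number of attempts used, or -1 if the limit was exhausted
    public static int runWithRetry(Subtask task, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (task.run()) return attempt;       // success: stop retrying
        }
        return -1;  // persistent failure: notify operations staff to intervene
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Subtask flaky = () -> ++calls[0] >= 3;    // fails twice, then succeeds
        System.out.println(runWithRetry(flaky, 5)); // prints 3
    }
}
```

Recording the attempt count per subtask, as the description suggests, is what lets later analysis distinguish flaky subtasks from persistently broken ones.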
Fig. 4 is a flowchart of a data processing method provided by an embodiment of the present application, and as shown in fig. 4, the data processing method provided by the embodiment of the present application may include the following steps:
step 410, acquiring a credential group code of task data to be processed and a distributed lock corresponding to the credential group code;
step 420, judging whether the data volume of the task data to be processed is greater than a threshold value or not under the condition that the distributed lock corresponding to the credential group code is successfully acquired;
step 430, determining the number N of batches of the task data to be processed according to the data amount of the task data to be processed when the data amount of the task data to be processed is greater than a threshold value;
step 440, classifying the data to be processed into N batches of data according to a target classification rule;
step 450, for each batch of data, determining whether a subtask is in process;
step 460, adding a distributed lock to the subtask for each batch of data when the subtask is not in process, executing a processing operation on the subtask to obtain a processing result of the subtask, and releasing the distributed lock of the subtask;
step 470, obtaining N sub-processing results through N batches of data;
step 480, merging the N sub-processing results to obtain a final processing result, and unlocking the distributed lock corresponding to the credential group code.
In the embodiment of the application, a distributed lock is acquired for each batch of data, and each batch is processed separately, which reduces the amount of data handled in each processing pass. After the output result of a subtask is obtained, the distributed lock of that subtask is released; after the N sub-processing results are merged into the final processing result, the distributed lock corresponding to the credential group code is unlocked. Locking and unlocking the distributed locks at each stage of processing realizes high-performance processing of the task data to be processed.
Fig. 5 is a flowchart of a data processing method provided by an embodiment of the present application, and as shown in fig. 5, the data processing method provided by the embodiment of the present application may include the following steps:
step 510, the first Actor calculation unit receives, from the third Actor calculation unit, a credential group code of accounting task data and a distributed lock corresponding to the credential group code;
step 515, obtaining the threshold value from the second Actor calculation unit;
step 520, judging whether the data size of the accounting task data is larger than a threshold value or not under the condition that the distributed lock corresponding to the credential group code is successfully acquired;
step 525, determining the batch number N of the accounting task data according to the data amount of the accounting task data when the data amount of the accounting task data is greater than a threshold value;
step 530, classifying the accounting task data into N batches of data according to the cost identification of the accounting task data;
step 535, obtaining a cost list for each batch of data in the N batches of data, to obtain N cost lists;
step 540, saving the N expense lists and N batch identifiers corresponding to the N expense lists in a database; wherein each of the N lot identifications is configured to uniquely identify a lot of data;
step 545, for each batch of data, determining whether a subtask is in process;
step 550, adding a distributed lock to the subtask for each batch of data when the subtask is not in process, executing a processing operation on the subtask to obtain a processing result of the subtask, and releasing the distributed lock of the subtask;
step 555, performing processing operation on the N batches of data in parallel through N Actor calculation units to obtain a processing result; wherein, an Actor calculation unit corresponds to the processing of a batch of data;
step 560, obtaining N sub-processing results through N batches of data;
step 565, merging the N sub-processing results to obtain a final processing result, and unlocking the distributed lock corresponding to the credential group code.
It should be understood that each step in fig. 5 may be performed by the first Actor calculation unit. The first Actor calculation unit may be a main Actor calculation unit. In this case, the N number of Actor calculation units in step 555 may be all or most of the Actor calculation units, and each of the Actor calculation units may return the processing result to the main Actor calculation unit after performing the processing operation. Of course, in one embodiment, after each sub-Actor calculation unit performs the processing operation, the processing result may also be stored in the database, and then the main Actor calculation unit may obtain the processing result from the database.
In one embodiment of the present application, processing operations may be performed on the N batches of data in parallel by using N Actor calculation units, so as to obtain a processing result; wherein an Actor calculation unit corresponds to the processing of a batch of data. The N Actor calculation units may be on different servers or on the same server, which is not particularly limited in the embodiment of the present application.
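The parallel per-batch processing (one calculation unit per batch) can be sketched with a thread pool standing in for the N Actor calculation units (an illustrative assumption; the patent does not prescribe a specific runtime):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(batch):
    # Placeholder for a sub-Actor's work: here, summing fee amounts.
    return sum(batch)

def process_in_parallel(batches):
    # One worker per batch, mirroring "one Actor calculation unit
    # corresponds to the processing of a batch of data".
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        # map preserves batch order, so sub-results line up with batches.
        return list(pool.map(process_batch, batches))

results = process_in_parallel([[1, 2], [3, 4], [5]])
# results == [3, 7, 5]
```

The main unit would then merge the returned sub-results (here a flat list) into the final processing result, as in steps 560 and 565.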
For a better understanding of the present application, the following illustrates a data processing method provided in the embodiments of the present application; it should be understood that the present application is not limited thereto.
Fig. 6 is a flowchart of a data processing method according to an embodiment of the present application. As shown in fig. 6, the procedure of the data merging processing method provided by the embodiment of the present application may be as follows:
the third Actor calculation unit cyclically fetches accounting task data and pushes it to a receiving interface (for example, the GlRestPort.bizComb() method) in the first Actor calculation unit, wherein the accounting task data includes information such as a credential group code (for example, combCode, the accounting credential group number).
The GlRestPort.bizComb() receiving interface calls the management interface of the sub-Actor calculation units in the first Actor calculation unit (i.e., CombineManager) to receive the accounting task data, and transmits information such as the combCode grouping to the CombineManager.
In CombineManager [Actor], an attempt is made to acquire the combCode's distributed lock by the combCode value. If acquisition succeeds, the operation continues; if acquisition fails, the method returns.
After successful acquisition, CombineManager [Actor] checks the combCode grouping and other information, and obtains and stores the data volume of the accounting task data, the cost identifier of the accounting task data (such as the cost id), and other information. It then obtains the threshold parameter configured at the front end through the second Actor calculation unit, namely the minimum data volume at which batching/slicing is required. For example, with the default threshold configuration of 100,000, when the detail data of the accounting task data exceeds 100,000 records, the credential id [postingId] of the combCode grouping is generated (a unique identifier representing this processing run of the accounting task data of the combCode grouping), and the credential merging Actor calculation unit (i.e., GlInterfaceBatchGenerator [Actor]) shown in fig. 2 is called.
GlInterfaceBatchGenerator [Actor] obtains the slice base configured at the front end from the second Actor calculation unit.
The data volume of the accounting task data is divided by the slice base to determine that this round of accounting task data processing will be split into N batches. For example, when the data volume of the accounting task data is 250,000 and the slice base is 10,000, the calculation [250000/10000 = 25] yields 25 batches, that is, N is 25. The cost identifier of each record is then divided by the number of batches, and the remainder determines the batch to which the record belongs.
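The batch-count calculation can be sketched as ceiling division (the text shows only an exactly divisible example, so rounding up for a partial final slice is an assumption):

```python
def batch_count(data_volume, slice_base):
    """Number of batches N for a given data volume and slice base.

    Uses ceiling division: a partial final slice still needs its own
    batch (an assumption; the patent's example divides evenly).
    """
    return -(-data_volume // slice_base)

assert batch_count(250000, 10000) == 25   # the example from the text
assert batch_count(250001, 10000) == 26   # one extra record adds a batch
```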
A cost list of cost ids for each batch/slice is obtained (i.e., List<FeeId>).
Meanwhile, each batch/group generates a batch identifier (i.e., batchId) that uniquely identifies a batch of data, and the batch identifier is saved to a database table such as T_GL_COMB_SUB_BATCH. The correspondence between List<FeeId> and the batchId is stored in the database T_GL_POSTING_DETAIL table.
Each slice is iterated over with a slice sequence number (i.e., batchNo) for this round of accounting task data processing, and the batchNo and the credential id [postingId] are pushed to the conventional credential merging slice Actor calculation unit (i.e., GlInterfaceBatchSubGenerator [SubActor]) shown in fig. 2. A batchNo and a credential id [postingId] together represent one batch of data in the accounting task data; similarly, the batchId can serve as a replacement unique identifier representing a batch of data, which is not specifically limited in the embodiment of the present application.
In GlInterfaceBatchSubGenerator, according to the combined key value of combCode and batchId (i.e., key [combCode+batchId]), an attempt is made to acquire the distributed lock of the sub-Actor calculation unit; if acquisition succeeds, the operation continues. If acquisition fails, this indicates the batch may already be in process, and it is skipped.
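The try-acquire-or-skip locking pattern on the composite key can be sketched with an in-memory stand-in for the distributed lock service (a real deployment would use something like Redis SET NX; all names here are illustrative):

```python
import threading

class TryLockRegistry:
    """In-memory stand-in for a distributed lock service."""
    def __init__(self):
        self._held = set()
        self._mutex = threading.Lock()

    def try_acquire(self, key):
        # Non-blocking: return False instead of waiting, so the caller
        # can skip a batch that another unit is already processing.
        with self._mutex:
            if key in self._held:
                return False
            self._held.add(key)
            return True

    def release(self, key):
        with self._mutex:
            self._held.discard(key)

locks = TryLockRegistry()
key = "combCode123:batch7"              # combCode + batchId composite key
first = locks.try_acquire(key)          # True: this unit may proceed
second = locks.try_acquire(key)         # False: already in process, skip
locks.release(key)                      # unlock when the batch is done
```

The same registry serves both lock levels: the main unit locks on combCode alone, and each sub-unit locks on combCode+batchId.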
The T_GL_POSTING_DETAIL table [POSTING-DETAIL mapping table] and the T_GL_BIZ_DETAIL table [accounting credential details table] are queried by the credential group code (combCode), the batchId, and the fee service type (bussinesstype), and the grouping result List<GlBizInterface> of the batch data is then obtained by grouping the data via GROUP BY group_code.
The List<GlBizInterface> data is saved to the database T_GL_BIZ_INTERFACE_INNER_BUDGET table.
When processing completes, control returns to GlInterfaceBatchGenerator [Actor], and the distributed lock of that sub-Actor is released.
GlInterfaceBatchGenerator [Actor] waits for all GlInterfaceBatchSubGenerator [SubActor] processing to complete, and queries and summarizes the T_GL_BIZ_INTERFACE_INNER_BUDGET [credential interface table - temporary] data in the database.
The data is saved to the T_GL_BIZ_INTERFACE table [ credential INTERFACE table ].
After the data is processed, the distributed lock corresponding to the credential group code is released.
The following is specific information of a partial table structure in the database, and it should be emphasized that the following table structure is merely exemplary, and not limiting. The consolidated data split batch status table is shown in table 1-1:
TABLE 1-1
The non-commission accounting document details are shown in tables 1-2:
TABLE 1-2
The POSTING-DETAIL mapping table is shown in Table 1-3:
tables 1 to 3
The credential interface table is shown in tables 1-4:
tables 1 to 4
From the above, the data processing method provided by the embodiment of the application may have the following characteristics:
1. infrastructure architecture
The data computing engine adopts an Actor model as an infrastructure, and each Actor computing unit is an independent computing unit and can execute computing tasks in parallel. Meanwhile, the system also adopts a distributed computing technology, and can distribute computing tasks to different computing nodes for parallel computing, so that the computing efficiency and the scalability of the system are improved.
2. Fault tolerant mechanism
In order to ensure the fault tolerance of the system, each Actor computing unit is provided with a state memory for storing the state information of the current Actor computing unit. When one Actor computing unit fails, the system automatically restores the state information of the Actor computing unit to the last stored state, so that the stability and the reliability of the system are ensured.
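The snapshot-and-restore behavior of the state memory can be sketched as follows (a minimal illustration; class and field names are assumptions):

```python
import copy

class StatefulActor:
    """Actor unit with a state memory: it snapshots its state so that,
    after a failure, it can roll back to the last saved state."""
    def __init__(self):
        self.state = {"processed": 0}
        self._snapshot = copy.deepcopy(self.state)

    def save_state(self):
        # Persist the current state (here, just a deep copy in memory).
        self._snapshot = copy.deepcopy(self.state)

    def restore_state(self):
        # Recover to the last saved state after a failure.
        self.state = copy.deepcopy(self._snapshot)

actor = StatefulActor()
actor.state["processed"] = 5
actor.save_state()
actor.state["processed"] = 9     # work that then fails mid-way
actor.restore_state()            # roll back to the saved state (5)
```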
3. High scalability
The data computing engine can dynamically expand computing resources according to actual computing demands, and the computing capacity of the system can be improved by increasing the number of Actor computing units. Meanwhile, the system also adopts a distributed computing technology, and can distribute computing tasks to different computing nodes for parallel computing, so that the computing efficiency and the scalability of the system are improved.
4. High concurrency
To support the high concurrent computing demands, the data computation engine employs an asynchronous messaging mechanism, where each Actor may asynchronously receive and process computing tasks, thereby increasing the concurrent processing capacity of the system. Meanwhile, the system also adopts a re-entrant calculation model, and can process a plurality of calculation tasks at the same time, thereby further improving the concurrency performance of the system.
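The asynchronous messaging mechanism can be sketched as a minimal mailbox actor, where senders enqueue messages and never block on processing (names are illustrative, not from the patent):

```python
import queue
import threading

class MailboxActor:
    """Minimal actor: messages land in a mailbox queue and are handled
    one at a time by a dedicated thread, so sending is asynchronous."""
    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.results = []
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:            # poison pill: stop the actor
                break
            self.results.append(self._handler(msg))

    def send(self, msg):
        self.mailbox.put(msg)          # returns immediately; no blocking

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()

actor = MailboxActor(lambda m: m * 2)
for m in (1, 2, 3):
    actor.send(m)
actor.stop()
# actor.results == [2, 4, 6], in send order
```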
5. Distributed lock
In order to control concurrency, the data computing engine adopts distributed locks, ensuring that only one Actor calculation unit processes a given piece of data at a time. This includes locking at the main Actor calculation unit level and at the sub-Actor calculation unit level, respectively.
For the main Actor calculation unit, a combCode value is pushed per batch run, and the unit first tries to acquire the combCode's distributed lock. If acquisition succeeds, the merging calculation continues; if acquisition fails, the method returns and waits for the next batch call to try again.
For each sub-Actor calculation unit under each combCode's data, a key [combCode+batchId] is pushed, and an attempt is made to acquire the distributed lock of that key. If acquisition succeeds, the sub-Actor calculation unit's computation runs; if acquisition fails, the computation is already in progress and is skipped. After the sub-Actor calculation unit finishes its computation, its distributed lock is unlocked. Control then returns to the main Actor calculation unit, and after the main Actor calculation unit completes its computation, its distributed lock is unlocked.
6. Automatic compensation
To support automatic compensation for failed subtasks, the data calculation engine will save the results of execution of each subtask in a database. If a subtask fails to execute, the system will attempt to re-execute the subtask until execution is successful. Meanwhile, the system also records the execution times and execution time of each subtask so as to facilitate subsequent statistics and analysis.
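The automatic-compensation retry loop can be sketched as follows (the cap on attempts is an assumption for illustration; the text says retries continue until success, and the recorded timing is omitted here for brevity):

```python
def run_with_compensation(subtask, max_attempts=3):
    """Re-execute a failing subtask, recording the attempt count."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            return subtask(), attempts   # success: result + attempt count
        except Exception:
            continue                     # failure: try again next pass
    raise RuntimeError("subtask failed after %d attempts" % attempts)

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds, simulating a transient fault.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("transient failure")
    return "ok"

result, attempts = run_with_compensation(flaky)
# result == "ok" on the third attempt
```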
7. Reading configuration
The front end supports configuration of the slicing rules, thresholds, and so on; the slicing conditions are configured through the page, and the back-end Actor obtains the front-end configuration via a call and slices as required.
FIG. 7 is a frame diagram of a data processing system according to an embodiment of the present application, where, as shown in FIG. 7, the data processing system according to the embodiment of the present application includes:
An obtaining module 710, configured to obtain a credential group code of task data to be processed and a distributed lock corresponding to the credential group code;
a judging module 720, configured to judge whether the data size of the task data to be processed is greater than a threshold value when the distributed lock corresponding to the credential group code is successfully acquired;
a processing module 730, configured to determine, according to the data amount of the task data to be processed, the number N of batches of the task data to be processed, if the data amount of the data to be processed is greater than a threshold value; classifying the data to be processed into N batches of data according to a target classification rule; and executing processing operation on the N batches of data in parallel to obtain a processing result.
In the embodiment of the application, the data volume of the task data to be processed is compared with the threshold value; when the data volume of the data to be processed is greater than the threshold value, the data to be processed is classified into N batches of data according to the target classification rule, and the N batches of data are processed in parallel to obtain the operation result. By classifying the data to be processed and processing the N batches of data in parallel, the low-efficiency problem that easily arises in the related art, where all data are processed directly through an Oracle database, is alleviated.
Optionally, in an embodiment of the present application, the method is performed by a first Actor calculation unit, and the obtaining module 710 is specifically configured to: and before judging whether the data volume of the task data to be processed is larger than a threshold value, acquiring the threshold value from a second Actor calculation unit.
Optionally, in an embodiment of the present application, the method is performed by a first Actor calculation unit, and the obtaining module 710 is specifically configured to: a credential group code of the task data to be processed is received from the third Actor calculation unit.
Optionally, in an embodiment of the present application, the task data to be processed is accounting task data in an insurance field; the processing module 730 is specifically configured to:
and classifying the accounting task data into N batches of data according to the cost identification of the accounting task data.
Optionally, in an embodiment of the present application, the task data to be processed is accounting task data in an insurance field; the processing module 730 is specifically configured to:
acquiring a cost list aiming at each batch of data in the N batches of data to obtain N cost lists;
storing the N expense lists and N batch identifiers corresponding to the N expense lists into a database;
Wherein each of the N batch identifications is for uniquely identifying a batch of data.
Optionally, in one embodiment of the present application, the processing module 730 is configured to:
for each batch of data, the following operations are performed:
determining whether a subtask is in process;
and under the condition that the subtasks are not in processing, adding a distributed lock to the subtasks, executing processing operation on the subtasks to obtain the processing result of the subtasks, and releasing the distributed lock of the subtasks.
Optionally, in one embodiment of the present application, the processing module 730 is configured to:
obtaining N sub-processing results through N batches of data;
and merging the N sub-processing results to obtain a final processing result, and unlocking the distributed lock corresponding to the credential group code.
Optionally, in one embodiment of the present application, the method is performed by a first Actor calculation unit provided with a state memory, the state memory being configured to store state information of the first Actor calculation unit;
and when the first Actor computing unit fails, recovering the state information of the first Actor computing unit to the last saved state according to the state information stored in the state memory.
Optionally, in one embodiment of the present application, the processing module 730 is configured to:
processing operation is carried out on the N batches of data in parallel through N Actor computing units, so that a processing result is obtained;
wherein an Actor calculation unit corresponds to the processing of a batch of data.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are to be protected by the present application.

Claims (10)

1. A method of data processing, comprising:
acquiring a credential group code of task data to be processed and a distributed lock corresponding to the credential group code;
judging whether the data volume of the task data to be processed is larger than a threshold value or not under the condition that the distributed lock corresponding to the credential group code is successfully acquired;
determining the batch number N of the task data to be processed according to the data volume of the task data to be processed under the condition that the data volume of the data to be processed is larger than a threshold value;
classifying the data to be processed into N batches of data according to a target classification rule;
and executing processing operation on the N batches of data in parallel to obtain a processing result.
2. The method according to claim 1, wherein the method is performed by a first Actor calculation unit, and wherein prior to the determining whether the amount of data of the task data to be processed is greater than a threshold, the method further comprises:
the threshold is obtained from a second Actor calculation unit.
3. The method of claim 2, wherein the acquiring credential group code for the task data to be processed comprises:
a credential group code of the task data to be processed is received from the third Actor calculation unit.
4. The method of claim 1, wherein the task data to be processed is accounting task data of an insurance domain; the data to be processed is classified into N batches of data according to the target classification rule, and the method comprises the following steps:
and classifying the accounting task data into N batches of data according to the cost identification of the accounting task data.
5. The method of claim 4, wherein after the classifying the accounting task data into N batches of data, the method further comprises:
acquiring a cost list aiming at each batch of data in the N batches of data to obtain N cost lists;
storing the N expense lists and N batch identifiers corresponding to the N expense lists into a database;
wherein each of the N batch identifications is for uniquely identifying a batch of data.
6. The method of claim 1, wherein performing processing operations on the N batches of data in parallel results in a processing result, comprising:
for each batch of data, the following operations are performed:
determining whether a subtask is in process;
and under the condition that the subtasks are not in processing, adding a distributed lock to the subtasks, executing processing operation on the subtasks to obtain the processing result of the subtasks, and releasing the distributed lock of the subtasks.
7. The method of claim 6, wherein the performing processing operations on the N batches of data in parallel results in a processing result, further comprising:
obtaining N sub-processing results through N batches of data;
and merging the N sub-processing results to obtain a final processing result, and unlocking the distributed lock corresponding to the credential group code.
8. The method according to claim 2, wherein the first Actor calculation unit is provided with a state memory for storing state information of the first Actor calculation unit; the method further comprises the steps of:
and when the first Actor computing unit fails, recovering the state information of the first Actor computing unit to the last saved state according to the state information stored in the state memory.
9. The method of claim 1, wherein performing processing operations on the N batches of data in parallel results in a processing result, comprising:
processing operation is carried out on the N batches of data in parallel through N Actor computing units, so that a processing result is obtained;
wherein an Actor calculation unit corresponds to the processing of a batch of data.
10. A data processing system, comprising:
the acquisition module is used for acquiring the credential group codes of the task data to be processed and the distributed locks corresponding to the credential group codes;
the judging module is used for judging whether the data volume of the task data to be processed is larger than a threshold value or not under the condition that the distributed lock corresponding to the credential group code is successfully acquired;
the processing module is used for determining the batch number N of the task data to be processed according to the data volume of the task data to be processed under the condition that the data volume of the data to be processed is larger than a threshold value; classifying the data to be processed into N batches of data according to a target classification rule; and executing processing operation on the N batches of data in parallel to obtain a processing result.
CN202310897616.2A 2023-07-20 2023-07-20 Data processing method and system Pending CN116955378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310897616.2A CN116955378A (en) 2023-07-20 2023-07-20 Data processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310897616.2A CN116955378A (en) 2023-07-20 2023-07-20 Data processing method and system

Publications (1)

Publication Number Publication Date
CN116955378A true CN116955378A (en) 2023-10-27

Family

ID=88452404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310897616.2A Pending CN116955378A (en) 2023-07-20 2023-07-20 Data processing method and system

Country Status (1)

Country Link
CN (1) CN116955378A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination