CN116991562A - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116991562A
CN116991562A (application CN202311268839.9A)
Authority
CN
China
Prior art keywords
data
node
cluster
tasks
nodes
Prior art date
Legal status
Granted
Application number
CN202311268839.9A
Other languages
Chinese (zh)
Other versions
CN116991562B (en)
Inventor
罗盛
严思齐
张辰
陈萌
尹棋
陈璐
Current Assignee
Bank Of Ningbo Co ltd
Original Assignee
Bank Of Ningbo Co ltd
Priority date
Filing date
Publication date
Application filed by Bank Of Ningbo Co ltd
Priority to CN202311268839.9A
Publication of CN116991562A
Application granted
Publication of CN116991562B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 Allocation of resources to service a request, the resource being the memory

Abstract

The disclosure provides a data processing method and apparatus, an electronic device, and a storage medium, applied in the technical field of data processing. The method comprises: acquiring change data information from different types of data sources; creating, through a first interface, at least two tasks corresponding to the change data information on at least two nodes in a cluster, and storing each task on its node; each node in the cluster backing up tasks created by at least one other node, so that every node in the cluster holds a task set comprising the tasks it created and the tasks it backed up; each node in the cluster processing in parallel the tasks it created in its task set; and, when any node in the cluster satisfies a first condition, processing tasks in its task set that were created by other nodes. In this way, real-time batch processing of multi-source data is achieved, meeting the demands of current business scenarios for large data volumes and real-time performance.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of data processing, and in particular to a data processing method and apparatus, an electronic device, and a storage medium.
Background
When implementing business requirements with traditional data sources, it is found that, because the data sources are numerous and the data volume is huge, a great many data tables must be managed and joined to satisfy a requirement, and performance is poor. In addition, traditional data sources rely on T+1 batch runs, so the timeliness of the data cannot meet the real-time and near-real-time requirements of current business scenarios.
Disclosure of Invention
The present disclosure provides a data processing method, apparatus, electronic device, and storage medium, so as to at least solve the above technical problems in the prior art.
According to a first aspect of the present disclosure, there is provided a data processing method comprising:
acquiring change data information from different types of data sources; the data sources comprise at least a real-time data source and a non-real-time data source;
creating, through a first interface, at least two tasks corresponding to the change data information on at least two nodes in a cluster, and storing the tasks on the nodes; the first interface is an interface corresponding to a distributed in-memory database;
a node in the cluster backing up tasks created by at least one node in the cluster other than itself, so that each node in the cluster holds a task set; the task set comprises tasks created by the node and tasks backed up by the node;
each node in the cluster processing in parallel the tasks it created in its task set; and, when any node in the cluster satisfies a first condition, processing tasks in the task set created by other nodes;
the first condition includes: the node is in an idle state.
According to a second aspect of the present disclosure, there is provided a data processing apparatus comprising:
a multi-data-source extraction module, configured to acquire change data information from different types of data sources, the data sources comprising at least a real-time data source and a non-real-time data source; to create, through a first interface, at least two tasks corresponding to the change data information on at least two nodes in a cluster and store each task on its node, the first interface being an interface corresponding to a distributed in-memory database; and to cause a node in the cluster to back up tasks created by at least one node other than itself, so that each node in the cluster holds a task set; the task set comprises tasks created by the node and tasks backed up by the node;
a parallel computing module, configured to cause each node in the cluster to process in parallel the tasks it created in its task set, and, when any node in the cluster satisfies a first condition, to cause that node to process tasks in the task set created by other nodes;
The first condition includes: the node is in an idle state.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods described in the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of the present disclosure.
According to the data processing method of the present disclosure, change data information is acquired from different types of data sources, the data sources comprising at least a real-time data source and a non-real-time data source; at least two tasks corresponding to the change data information are created, through a first interface, on at least two nodes in a cluster and stored on those nodes, the first interface being an interface corresponding to a distributed in-memory database; nodes in the cluster back up tasks created by at least one node other than themselves, so that each node in the cluster holds a task set comprising the tasks it created and the tasks it backed up; each node processes in parallel the tasks it created in its task set, and when any node satisfies a first condition it processes tasks in the task set created by other nodes. In this way, data is extracted from the real-time and non-real-time data sources; the extracted data can be uniformly summarized and collated, tasks created, and the tasks distributed to different nodes of the cluster. Each node locks a task upon receiving it and then invokes the data processing scheme designated by the task to process the data. After a task completes, the data is output to the target data source. Real-time batch processing of multi-source data is thereby achieved, meeting the demands of current business scenarios for large data volumes and real-time performance.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The above, as well as additional purposes, features, and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description when read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 illustrates an alternative flow diagram of a data processing method provided by an embodiment of the present disclosure;
FIG. 2 shows another alternative flow diagram of a data processing method provided by an embodiment of the present disclosure;
FIG. 3 is a data diagram illustrating a data processing method provided by an embodiment of the present disclosure;
FIG. 4 shows an alternative architecture diagram of a data processing apparatus provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of a composition structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, features, and advantages of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described clearly below in conjunction with the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person skilled in the art based on the embodiments of this disclosure without inventive effort fall within the scope of protection of this disclosure.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used in the present disclosure is for the purpose of describing embodiments of the present disclosure only and is not intended to be limiting of the present disclosure.
It should be understood that, in the various embodiments of the present disclosure, the magnitude of the sequence number of each implementation process does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present disclosure.
In the related art, the following problems are found in the process of implementing business requirements using conventional data sources (e.g., oracle):
1. Many data sources, large data volume, poor performance. In some business scenarios, queries span numerous data sources with large data volumes, and very many related data tables must often be joined to complete an operation; performance is poor when a traditional relational database is used.
2. Differing timeliness, batch processing, poor real-time performance. Traditional Oracle data relies on T+1 batch runs; the timeliness of the data is poor and does not meet the real-time and near-real-time requirements of business scenarios.
In view of the drawbacks of the related art, the present disclosure provides a data processing method to solve some or all of the above technical problems.
Fig. 1 shows an alternative flowchart of a data processing method according to an embodiment of the present disclosure, and will be described according to the steps.
Step S101, acquiring change data information from different types of data sources.
In some embodiments, the data sources include at least a real-time data source and a non-real-time data source.
In some embodiments, the carrier implementing the data processing method of the embodiments of the present disclosure acquires change data information from different types of data sources; optionally, different types of data sources are acquired in different ways.
The change data information includes change information for at least two pieces of data.
Step S102, at least two tasks corresponding to the change data information are created in at least two nodes in the cluster through a first interface, and the tasks are stored in the nodes.
In some embodiments, the first interface is an interface corresponding to a distributed in-memory database. The carrier, in a distributed manner, creates tasks corresponding to the at least two pieces of change data information on the at least two nodes respectively, and stores each task on its node. When the number of pieces of change data information is greater than or equal to the number of nodes in the cluster, each node creates at least one corresponding task.
Because each node creates its tasks from different data information, the tasks stored on each node differ. The nodes include application nodes.
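The creation of tasks across cluster nodes can be sketched as follows. This is an illustrative stand-in for the patent's Ignite-based distribution, not its actual implementation: the round-robin policy and the names `distribute_tasks`, `change_records`, and `node_ids` are assumptions for the sketch.

```python
from collections import defaultdict

def distribute_tasks(change_records, node_ids):
    """Assign each piece of change data to a cluster node as a task.

    Round-robin is an illustrative placement policy; the patent only
    requires that each node receive at least one task when records
    outnumber nodes.
    """
    assignment = defaultdict(list)
    for i, record in enumerate(change_records):
        node = node_ids[i % len(node_ids)]  # spread records across nodes
        assignment[node].append({"task_id": i, "payload": record})
    return assignment
```

With five records and two nodes, one node stores three tasks and the other two, so every node holds at least one task, as Step S102 requires.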
Step S103, the nodes in the cluster backup tasks created by at least one node except the nodes in the cluster, so that each node in the cluster comprises a task set.
In some embodiments, nodes in the cluster back up tasks created by other nodes in the cluster, which on the one hand avoids data loss when a node fails and goes offline, and on the other hand avoids the problem of data corruption on a single node.
In some embodiments, each node in the cluster may backup tasks of all other nodes, and may also backup tasks of some other nodes.
In some embodiments, after each node backs up tasks created by at least one node other than itself, a task set is formed in the node, the task set including the tasks created by the node itself and the backed up tasks.
Step S104, each node in the cluster processes in parallel the tasks it created in its task set.
In some embodiments, each node in the cluster processes tasks in its task set in parallel, preferentially processing the tasks it created itself; when a node is in an idle state, it processes tasks in the task set created by other nodes.
Thus, with the data processing method provided by the present disclosure, data is extracted from real-time and non-real-time data sources; the extracted data is uniformly summarized and collated, tasks are created, and the tasks are distributed to different nodes of the cluster. Each node locks a task upon receiving it and then invokes the data processing scheme designated by the task to process the data. After a task completes, the data is output to the target data source. Real-time batch processing of multi-source data is thereby achieved, meeting the demands of current business scenarios for large data volumes and real-time performance.
Fig. 2 shows another alternative flow diagram of the data processing method provided by the embodiment of the present disclosure, and fig. 3 shows a data diagram of the data processing method provided by the embodiment of the present disclosure, which will be described according to the respective steps.
As shown in fig. 3, the Kafka system may obtain the change data information in the application database based on OGG (short for Oracle GoldenGate, which supports replicating data between an Oracle database and other supported heterogeneous databases) or change data capture (Change Data Capture, CDC), obtain the change data information in the business application based on producer instructions, and obtain the change data information in the real-time platform based on an application programming interface (Application Programming Interface, API). Further, the multi-data-source extraction module acquires the change data information of the real-time data source from the Kafka system through Kafka-Ignite-stream, and acquires the change data information of the non-real-time data source through Jdbc-Template or hadoop-source. After task distribution and parallel batch processing produce the processed data, the processed data is sent to different types of target data sources: for example, to an Elasticsearch data source based on the hypertext transfer protocol (Hypertext Transfer Protocol, HTTP), to an Oracle data source based on Jdbc-Template, and to an HBase data source based on hadoop-source.
The specific implementation process is as follows:
Step S201, acquiring change data information from different types of data sources.
In some embodiments, different data sources correspond to different acquisition modes.
In implementation, in response to the data source being an Oracle database, the carrier creates a new OGG user in the Oracle database, obtains change data information (i.e., the before and after images of the changed data) from the logs corresponding to the Oracle database (such as the online redo log or archive log), and generates a queue (Trail) file; a data-push component at the source end reads the queue file and pushes it to the target-end OGG user; the target end receives the queue file through a data-receiving component, reads it with a data-replication component, and pushes the data changes to the Kafka system; the data-source extraction module then obtains the Kafka messages from the Kafka system (i.e., consumes the Kafka messages) to obtain the change data information corresponding to the data source.
In some alternative embodiments, the above method may also be used to obtain the change data information of an Hbase data source and an Impala data source.
Alternatively, in implementation, in response to the data source being an application program: after the application performs a business operation involving a data update (such as a password update, identity-information update, or user-account update), the mobile or desktop terminal generates change data information. The front end's business operation is submitted to the channel back end in the form of a form; the channel back end then performs preliminary processing including, but not limited to, permission verification, session verification, and addition of user identity information. Once this logic completes, the channel back end calls the relevant business middle-platform interfaces (customer center, marketing center, management center, basic services). When the request reaches the middle platform, permissions and data are checked first, the business logic is then completed, and finally the data is modified. The modification process generates change data information, which is assembled into a Kafka message and pushed to the Kafka system in the identity of a Kafka producer; the data-source extraction module obtains the Kafka message from the Kafka system to obtain the change data information corresponding to the data source.
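An assembled change-data message might look like the following. The schema here (field names, before/after row images, timestamp) is an assumption for illustration; the patent does not specify the Kafka message format, and `build_change_message` is a hypothetical helper.

```python
import json
import time

def build_change_message(table, key, before, after, op="UPDATE"):
    """Assemble a change-data record as a Kafka producer might.

    Carries both the pre-change and post-change row images, which is
    what the text above calls "information before and after change".
    """
    return json.dumps({
        "table": table,
        "key": key,
        "op": op,          # e.g. INSERT / UPDATE / DELETE
        "before": before,  # row image prior to the change
        "after": after,    # row image after the change
        "ts": int(time.time()),
    })
```

The serialized string would then be handed to a Kafka producer and consumed downstream by the data-source extraction module.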
Alternatively, in implementation, the data source is a real-time platform, such as a data system. The overall flow resembles that of the business application, except that data processing is performed by several data-processing nodes rather than a series of back-end applications: the front end's business operation is submitted to the data-processing nodes, a data-processing flow newly created on the real-time platform triggers a series of data-processing logic on the specified data change, and after stream processing completes, the assembled change data information is sent by calling the Kafka API; the data-source extraction module obtains it from the Kafka system to obtain the change data information corresponding to the data source.
Step S202, creating tasks in each node of the cluster.
In some embodiments, after the carrier obtains the change data information of the different types of data sources, it creates, through the first interface of a distributed in-memory database (Apache Ignite), at least two tasks corresponding to the change data information on at least two nodes in the cluster; specifically, the data processing policy is taken as a parameter, and the tasks are created on different nodes in the cluster.
The first interface may be an IgniteClosure interface, and a node performs a task based on its data processing policy, that is, processes the node's change data information based on that policy.
In some embodiments, the multi-source data processing application corresponding to the carrier is deployed as a cluster with multiple nodes; the change data information comprises at least two pieces of data information, and different nodes create and store tasks corresponding to different data information, realizing distributed storage.
Step S203, the nodes in the cluster backup tasks.
In some embodiments, a node in the cluster backs up tasks created by at least one node in the cluster other than itself, such that each node in the cluster includes a task set; the task set comprises tasks created by the node and tasks backed up by the node.
In implementation, with the number of backup copies configured as N, the cluster creates N+1 copies of each task: the copy on the node that created the task is the primary copy, and the other N copies are backups. N is a positive integer smaller than the number of nodes in the cluster.
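The N+1 copy rule can be illustrated with a small placement function. The ring-order choice of backup nodes is an assumption for this sketch (a real Ignite affinity function works differently), and all names are hypothetical.

```python
def place_copies(task_id, node_ids, n_backups):
    """Return (primary, backups) for a task's N+1 copies.

    The primary copy lives on the creating node (chosen here by
    task_id modulo cluster size, an illustrative stand-in); the N
    backups go to the next N nodes in ring order, so n_backups must
    be smaller than the cluster size.
    """
    assert 0 < n_backups < len(node_ids)
    p = task_id % len(node_ids)
    primary = node_ids[p]
    backups = [node_ids[(p + k) % len(node_ids)]
               for k in range(1, n_backups + 1)]
    return primary, backups
```

Because the backups are distinct nodes, losing any single node still leaves at least one live copy of every task.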
Step S204, each node in the cluster processes in parallel the tasks it created in its task set.
In some embodiments, after each node in the cluster performs task backup, each node obtains a task set, and each node in the cluster preferentially processes the task created by itself.
In some embodiments, to ensure timely data processing, each node uses multi-threaded concurrent processing, and the situation where multiple threads (or nodes) repeatedly execute the same task must be avoided. To that end, the embodiments of the present disclosure guarantee that each task executes only once by locking it: when a data-processing thread (node) attempts to start a new task, it first tries to lock the task; if locking fails, the task is being executed by another thread (or processed by another node), and the thread tries to lock the next task instead; if locking succeeds, it takes the task and processes it.
In specific implementation, each node in the cluster performs the following operations: performing a locking operation on a task to be executed in a task thread; in response to the locking succeeding, executing the locked task based on the data processing policy corresponding to it; and in response to the locking failing, which indicates that the task is being executed by another node, performing the locking operation on other tasks to be executed until a locking operation succeeds, then executing that locked task based on its corresponding data processing policy.
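The try-lock discipline described above can be sketched in a single process. `threading.Lock` is an illustrative stand-in for the distributed lock obtained through the in-memory database, and the names `TaskSet` and `run_worker` are hypothetical.

```python
import threading

class TaskSet:
    """Tasks guarded by per-task locks so each task runs exactly once."""

    def __init__(self, names):
        self.tasks = [{"name": n, "lock": threading.Lock()} for n in names]

    def run_worker(self, results):
        # Walk the task set; a failed non-blocking acquire means another
        # worker already owns (or has finished) the task, so move on.
        for task in self.tasks:
            if task["lock"].acquire(blocking=False):
                # Stand-in for invoking the task's data-processing policy.
                results.append(task["name"])
                # The lock is deliberately never released: the task ran once.
```

Running several workers over the same task set yields each task exactly once, regardless of thread interleaving, which is the single-execution guarantee the locking is meant to provide.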
In some embodiments, to prevent idle nodes from wasting resources, the embodiments of the present disclosure support cross-node acquisition and execution of tasks: an idle node may proactively attempt to acquire additional tasks from busier nodes and execute them.
In implementation, in response to any node becoming idle, i.e., all of its self-created tasks have been executed, the node processes tasks in its task set created by other nodes. Before processing, it performs a locking operation on the task to be executed in the task thread; if locking succeeds, it executes the task; if locking fails, the task is being executed by another node, and it locks another unexecuted task in the task set.
Step S205, task reassignment.
In some embodiments, to ensure high scalability and high availability of the cluster, distributed task storage and multi-node backup guarantee that each piece of data is stored on multiple nodes. When a node leaves the cluster, to ensure that the data originally belonging to that node is not lost, the cluster promotes one backup partition to primary and starts rebalancing the data, so that data remains evenly distributed over the remaining nodes. Similarly, when a new node joins the cluster, the cluster rebalances some data onto the new node, so that the data stays evenly distributed across all nodes.
In implementation, in response to a first node being removed from the cluster, the tasks corresponding to the first node (i.e., the tasks it created) are identified and distributed to the other nodes in the cluster; or, in response to a second node being newly added to the cluster, tasks corresponding to the other nodes in the cluster (i.e., tasks created by other nodes) are evenly redistributed so that the second node stores a portion of them.
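The rebalancing behavior can be sketched as a full even redistribution over the surviving node set; real systems move only the partitions they must, so this exhaustive reshuffle is a simplifying assumption, and `rebalance` is a hypothetical name.

```python
def rebalance(assignment, node_ids):
    """Evenly redistribute all tasks over the current node set.

    After a node joins or leaves, every task is re-spread so that
    per-node counts differ by at most one and no task is lost.
    """
    all_tasks = [t for n in sorted(assignment) for t in assignment[n]]
    new_assignment = {n: [] for n in node_ids}
    for i, task in enumerate(all_tasks):
        new_assignment[node_ids[i % len(node_ids)]].append(task)
    return new_assignment
```

For example, when node "c" leaves a three-node cluster, its tasks are absorbed by "a" and "b" with the totals preserved.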
In some embodiments, rebalancing may be performed by both Synchronous (SYNC) and Asynchronous (ASYNC).
In some alternative embodiments, to adapt to servers with different configurations, when the application corresponding to a service is deployed to a server, the server's configuration, such as the number of CPU cores M, is read. The application's thread-initialization method then creates a thread pool with M-1 core threads based on the configuration obtained. This ensures the application can be rapidly deployed on differently configured servers and directly scaled out quickly. The application here is the application corresponding to the task.
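The M-1 sizing rule above can be sketched directly with the standard library; `make_worker_pool` is a hypothetical helper, and the fallback for an unknown core count is an assumption.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def make_worker_pool():
    """Create a pool with (CPU cores - 1) worker threads.

    Reading the core count at startup lets the same application adapt
    to differently configured servers; one core is left over for the
    rest of the process. os.cpu_count() may return None, hence the
    fallback and the floor of one worker.
    """
    cores = os.cpu_count() or 2
    return ThreadPoolExecutor(max_workers=max(1, cores - 1))
```

Tasks submitted to the pool then run with a degree of parallelism matched to the host, which is what makes horizontal scaling onto bigger servers immediate.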
In some alternative embodiments, task creation accepts the data-processing policy as an input parameter, which simplifies the access cost of subsequent new business scenarios and the learning cost for developers. A new scenario can reuse the application's entire architecture without additional construction; meanwhile, a developer need not attend to other implementation details and only needs to focus on the data-processing steps of their own business scenario.
In some alternative embodiments, task reassignment and the adaptive-configuration characteristics ensure that, when the cluster hits a performance bottleneck, it can quickly increase task-processing capacity by scaling servers out horizontally, sharing the cluster's load through the rebalancing policy.
Step S206, processing data corresponding to processing tasks of all nodes in the cluster are obtained, and the processing data are sent to different types of target data sources based on service requirements.
In some embodiments, the target data source may include multiple types, mirroring the multiple types of source data; the present disclosure can output the data processed in parallel by each node to different types of target data sources as the business scenario requires. Because different types of target data sources are written in different ways, adapted output methods are required.
In implementation, in response to the target data source being a first-type data source, the carrier encapsulates the processed data, sends it to the Kafka system, and loads a first application based on a first index; the first application obtains the processed data from the Kafka system and modifies the data in the target data source based on a first request and the processed data.
Specifically, if the target data source is an Elasticsearch (ES) database, the changed data is encapsulated as a Kafka message and pushed; an application (i.e., the first application) loaded for the ES index (the first index) consumes it, issues an HTTP request (the first request), and modifies the data in ES.
Elasticsearch is a distributed, highly scalable, near-real-time search and data-analysis engine. It conveniently gives large volumes of data the capabilities of search, analysis, and exploration.
Specifically, if the target data source is an HBase database: unlike the Elasticsearch case, HBase, as a distributed big-data engine built on a Hadoop-based distributed file system, cannot update data by directly consuming Kafka messages; a tool that can operate on the file system is needed to process the data. In the embodiments of the present disclosure, a Hadoop Distributed File System (Hadoop Distributed File System, HDFS) component provided by Hadoop obtains the processed data by monitoring changes to files of a specified type (tailDir Source), and the data update is completed through that component.
In implementation, in response to the target data source being a second type data source, the cluster and the target data source are connected based on a first driver, and the processing data is acquired from the cluster based on a structured query language, so as to modify the data in the target data source based on the processing data.
Specifically, if the target data source is an Oracle database, the cluster is connected to the target data source through JDBC (the first driver), and the data in the data source is modified in the form of Structured Query Language (SQL) statements.
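As a rough illustration of this second output path, the sketch below applies processed change rows to a relational target with SQL. It is a minimal stand-in only: sqlite3 replaces Oracle-over-JDBC for self-containment, and the `accounts` table and the `(id, balance)` change format are hypothetical, not the patent's schema.

```python
import sqlite3

def apply_changes(conn, changes):
    """Apply processed change records to a relational target via SQL.

    In the patent the target is Oracle reached over JDBC; sqlite3 is
    used here only as a self-contained stand-in. `changes` is a
    hypothetical list of (id, balance) rows produced by cluster nodes.
    """
    with conn:  # one transaction per batch of changes
        conn.executemany(
            "UPDATE accounts SET balance = ? WHERE id = ?",
            [(balance, row_id) for row_id, balance in changes],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
apply_changes(conn, [(1, 15.0), (2, 25.0)])
```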
Thus, with the data processing method provided by the embodiment of the present disclosure, data is extracted from both real-time and non-real-time data sources; a suitable extraction mode is selected for each kind of data source, and the extracted data is uniformly gathered by the data source extraction module of the multi-source big data application. The multi-source big data application sorts the extracted data, generates tasks, and distributes them. Each data processing node locks a task after receiving it, and then invokes the data processing scheme designated by the task to process the data. After task processing is completed, the data is output to the target data source. In this way, data processing performance and timeliness under large data volumes can be improved.
For different service scenarios with multiple data sources, processing the data with the data processing method provided by the embodiment of the disclosure yields the following improvements:
1) In the business opportunity distribution center scenario, a large number of data tables in the same library are joined and query performance is poor. Based on the data processing method provided by the embodiment of the disclosure, query performance can be improved and timeliness becomes real-time; compared with the related art, query time is less than 1 second and query performance is improved by 500%.
2) In the customer visit record scenario, a large number of data tables in a database are joined and query performance is poor. The data processing method provided by the embodiment of the disclosure can improve query performance and make timeliness real-time; compared with the related art, query time is less than 1 second and query performance is improved by 400%.
3) In the business opportunity distribution batch-import scenario, with multi-system cross-database data processing and joins across a large number of data tables, query performance is poor. The data processing method provided by the embodiment of the disclosure can improve query performance, changing timeliness from T+1 to real-time (less than 30 seconds); compared with the related art, query time is less than 500 milliseconds and query performance is improved by 300%.
For different service scenarios with large data volumes, processing the data with the data processing method provided by the embodiment of the disclosure yields the following improvements:
1) The data processing method of the embodiment of the disclosure can improve query performance and timeliness, changing timeliness from T+1 to real-time; compared with the related art, query time is less than 2 seconds and query performance is improved by 400%.
2) In the deposit monitoring scenario, tens of millions of rows are joined (with row multiplication), then grouped and aggregated for query, so timeliness and query performance are poor. The data processing method provided by the embodiment of the disclosure can improve both: timeliness changes from T+1 to real-time, and compared with the related art, query time is less than 1 second and query performance is improved by 400%.
Fig. 4 shows an alternative structural schematic diagram of a data processing apparatus provided in an embodiment of the present disclosure, described below by its respective modules.
In some embodiments, the data processing apparatus includes a multiple data source extraction module 301 and a parallel computing (Ignite Compute) module 302.
The multiple data source extraction module 301 is configured to obtain change data information of different types of data sources; the data sources at least comprise a real-time data source and a non-real-time data source; to create, through a first interface, at least two tasks corresponding to the change data information at at least two nodes in the cluster, and to store the tasks at the nodes; the first interface is an interface corresponding to the distributed memory database; a node in the cluster is used to back up tasks created by at least one node in the cluster other than itself, so that each node in the cluster contains a task set; the task set comprises tasks created by the node and tasks backed up by the node;
The parallel computing module 302 is configured to enable each node in the cluster to process, in parallel, the tasks it created in its own task set; and, when any node in the cluster satisfies a first condition, to enable that node to process tasks created by other nodes in its task set;
the first condition includes: the node is in an idle state.
The multiple data source extraction module 301 is specifically configured to obtain the change data information based on a log corresponding to any data source and generate a queue file; the data pushing component at the data source end reads the queue file and pushes it to the target end; the target end receives the queue file through a data receiving component, reads it based on a data copying component, and pushes the change data information to the Kafka system; the multiple data source extraction module 301 consumes the Kafka message to obtain the change data information corresponding to the data source.
In implementation, an OGG user is created in the source database, data change information is obtained based on the online redo log or archive log of the source database, and a Trail file is generated. The data pushing component (DataPump) reads the Trail file and pushes it to the target-end OGG; the target end receives the Trail file pushed by the source end through a data receiving component (Collector); the target-end data copying component (Replicat) reads the Trail file and pushes the change data information to Kafka; the multiple data source extraction module 301 consumes the Kafka message to obtain the change data information.
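A minimal sketch of the consumption step, assuming an OGG-style JSON message shape; the field names (`table`, `op_type`, `after`) follow the typical shape of OGG-to-Kafka output and are assumptions, not the patent's exact schema. In production the bytes would come from a Kafka consumer loop.

```python
import json

def parse_change_message(raw: bytes) -> dict:
    """Parse one change-data-capture message as pushed to Kafka by OGG.

    The field names are assumptions standing in for the real schema.
    """
    msg = json.loads(raw)
    return {
        "table": msg["table"],        # source table of the change
        "op": msg["op_type"],         # I / U / D: insert, update, delete
        "row": msg.get("after", {}),  # row image after the change
    }

# A literal message stands in for a record consumed from Kafka.
raw = b'{"table": "SRC.ACCOUNTS", "op_type": "U", "after": {"id": 1, "balance": 15.0}}'
change = parse_change_message(raw)
```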
The multiple data source extraction module 301 is also used for task generation.
In some embodiments, after the multiple data source extraction module 301 obtains the change data information, at least two tasks corresponding to the change data information are created in at least two nodes in the cluster through the first interface by using the distributed memory database, and the tasks are stored in the nodes.
In practice, after the multiple data source extraction module 301 obtains the data, the data is passed to the task generation module. Using Apache Ignite, by implementing the IgniteClosure interface, the data processing method is passed in as a parameter, and data processing tasks are created on different cluster nodes.
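The closure style can be sketched in plain Python (the real system uses Apache Ignite's IgniteClosure in Java); the `Task` class, `create_tasks`, and the round-robin node assignment below are illustrative assumptions, not the patent's implementation:

```python
from typing import Any, Callable

class Task:
    """A data-processing task: a payload plus the method that processes it.

    Mirrors the patent's use of IgniteClosure: the processing method is
    passed as a parameter at task creation, so a new business scenario
    only supplies a new function.
    """
    def __init__(self, payload: Any, process: Callable[[Any], Any]):
        self.payload = payload
        self.process = process

    def run(self):
        return self.process(self.payload)

def create_tasks(changes, process, node_count):
    """Create one task per change record and assign tasks to nodes
    round-robin; this is a simplification of Ignite's affinity-based
    placement."""
    nodes = [[] for _ in range(node_count)]
    for i, change in enumerate(changes):
        nodes[i % node_count].append(Task(change, process))
    return nodes

nodes = create_tasks([1, 2, 3, 4], lambda x: x * 10, node_count=2)
results = [t.run() for node in nodes for t in node]
```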
The multiple data source extraction module 301 is also used for task distributed storage.
Because the multi-source data processing application is deployed as a cluster with multiple nodes, tasks are created on different nodes, and a distributed storage mode is adopted as a whole.
The multiple data source extraction module 301 is also used for task multiple node backup.
By configuring the number of backup copies N, the cluster creates N+1 shards of each task: one is the primary shard, whose node is called the primary node, and the remaining shards are backup shards, whose nodes are called backup nodes.
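A toy placement function illustrating the N+1 copies: with `backups=N`, one primary shard plus N backup shards are chosen. The modular hashing used here is an assumption standing in for Ignite's rendezvous affinity function.

```python
def place_shards(task_id: int, nodes: list, backups: int):
    """Choose the primary node and `backups` backup nodes for a task.

    With N configured backups the cluster keeps N + 1 copies
    (1 primary + N backups). Simple modular placement, for
    illustration only.
    """
    if backups >= len(nodes):
        raise ValueError("need more nodes than backup copies")
    start = task_id % len(nodes)
    primary = nodes[start]
    backup_nodes = [nodes[(start + i) % len(nodes)] for i in range(1, backups + 1)]
    return primary, backup_nodes

primary, backup_nodes = place_shards(7, ["n0", "n1", "n2", "n3"], backups=2)
```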
In some embodiments, the data processing apparatus further comprises a task distribution (Ignite Cache) module 303.
The task distribution module 303 is configured to, in response to deletion of a first node in the cluster, confirm a task corresponding to the first node, and distribute the task corresponding to the first node to other nodes in the cluster; or in response to the newly added second node in the cluster, uniformly distributing tasks corresponding to other nodes in the cluster to the second node.
Specifically, the task distribution module 303 is further configured to pull tasks.
After task creation and shard backup are completed, each node holds its own set of tasks to be processed. A task processing node preferentially pulls and executes the tasks corresponding to its own node; after all of them are executed, in order to avoid the resource waste caused by idle nodes, the scheme supports cross-node acquisition and execution of tasks: an idle node actively attempts to acquire additional tasks from other, busier nodes through the task distribution module 303.
The task distribution module 303 is further configured to lock tasks.
To ensure the timeliness of data processing, each node processes tasks with multiple concurrent threads, and the situation in which multiple threads repeatedly execute the same task must be avoided. For this, tasks are locked so that each task can be executed only once: when a data processing thread attempts to start a new task, the task distribution module 303 first attempts to lock that task; if locking fails, the task is being executed by another thread, and the module attempts to lock the next task; if locking succeeds, the thread acquires the task for processing.
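The try-lock discipline can be sketched with Python threads: a worker attempts a non-blocking lock on each task, skips tasks whose lock is already held by another thread, and so each task runs exactly once. The `TaskQueue` class is an illustrative assumption, not the patent's implementation.

```python
import threading

class TaskQueue:
    """Tasks shared by worker threads; each task may run exactly once.

    Mirrors the patent's locking step: a worker try-locks a task,
    moves on when locking fails (another thread owns it), and
    processes it when locking succeeds.
    """
    def __init__(self, tasks):
        self.tasks = [(t, threading.Lock()) for t in tasks]

    def next_task(self):
        for task, lock in self.tasks:
            if lock.acquire(blocking=False):  # lock failure => skip
                return task
        return None  # nothing left to claim

executed = []
q = TaskQueue(["t1", "t2", "t3"])

def worker():
    while (task := q.next_task()) is not None:
        executed.append(task)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```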
The task distribution module 303 is further configured to reassign tasks.
To ensure the high scalability and high availability of the cluster, task distributed storage and multi-node backup guarantee that each piece of data is stored by multiple nodes. When a node leaves the cluster, in order that the data originally belonging to it is not lost, the task distribution module 303 promotes one of its backup shards to primary and starts a data rebalance, ensuring that data remains evenly distributed across the remaining nodes. Similarly, when a new node joins the cluster, the task distribution module 303 rebalances part of the data onto the new node, so that data stays evenly distributed across all nodes.
There are two rebalancing modes: SYNC and ASYNC.
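The rebalance on node join or leave can be illustrated with a minimal sketch; the even modular redistribution below is an assumption standing in for Ignite's partition rebalancing, and the SYNC/ASYNC distinction is not modelled.

```python
def rebalance(assignment: dict, nodes: list) -> dict:
    """Re-spread task ids evenly over the current node list.

    A toy version of the rebalance the patent describes on node
    join/leave: collect all tasks, then deal them out round-robin
    over whichever nodes remain.
    """
    tasks = sorted(t for ts in assignment.values() for t in ts)
    new_assignment = {n: [] for n in nodes}
    for i, task in enumerate(tasks):
        new_assignment[nodes[i % len(nodes)]].append(task)
    return new_assignment

before = {"n0": [1, 4], "n1": [2, 5], "n2": [3, 6]}
# n2 leaves the cluster: its tasks are spread over the survivors.
after_leave = rebalance(before, ["n0", "n1"])
```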
The parallel computing module 302 is further configured to obtain processing data corresponding to processing tasks of each node in the cluster, and send the processing data to different types of target data sources based on service requirements.
The parallel computing module 302 is specifically configured to, in response to the target data source being a first type data source, encapsulate the processed data and send it to the Kafka system, and load a first application for consumption based on a first index; the first application obtains the processing data from the Kafka system so as to modify data in the target data source based on a first request and the processing data; and, in response to the target data source being a second type data source, connect the cluster and the target data source based on the first driver, and acquire the processing data from the cluster based on the structured query language, so as to modify the data in the target data source based on the processing data.
The parallel computing module 302 is further configured for adaptive configuration.
Specifically, to adapt to servers with different configurations, when the application is deployed to a server the parallel computing module 302 reads the server's configuration, such as the number of CPU cores M. The application's thread initialization method creates a thread pool with M-1 core threads according to the obtained configuration. This characteristic ensures that the application can be rapidly deployed on servers of different configurations and directly and rapidly scaled out.
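A sketch of the adaptive sizing rule: read the core count M at startup and build a pool of M-1 worker threads; the `max(1, ...)` guard for single-core hosts is an added safety assumption, not stated in the patent.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def build_pool() -> ThreadPoolExecutor:
    """Create a worker pool sized to the host, per the patent's rule:
    with M CPU cores, use M - 1 core threads, leaving one core for the
    rest of the process. The max(1, ...) guard is an added assumption
    so single-core hosts still get one worker."""
    cores = os.cpu_count() or 1
    return ThreadPoolExecutor(max_workers=max(1, cores - 1))

pool = build_pool()
results = list(pool.map(lambda x: x * x, range(5)))
pool.shutdown()
```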
The parallel computing module 302 is also used for method-oriented programming.
Specifically, the data processing method can be passed in as a parameter at task creation. This characteristic reduces the access cost of subsequent new service scenarios and the learning cost for developers: a new scenario can reuse the whole architecture of the application without additional construction, and a developer need not pay attention to any other implementation, only to the data processing step of his or her own service scenario.
The parallel computing module 302 is also used for fast expansion.
This capability depends on the task redistribution and adaptive configuration characteristics: when the cluster hits a performance bottleneck, task processing capacity can be quickly increased by scaling out servers, and the cluster load is shared through the rebalancing strategy.
The parallel computing module 302 is also used for multi-source output.
This functional characteristic corresponds to the multiple data source extraction module: the core idea of the embodiment of the disclosure is to acquire large amounts of data from multiple data sources and, after parallel processing, output the data to multiple data sources according to the service scenario. Under current business needs, data is output to Elasticsearch and Oracle. Since the ways of writing data to different data sources differ, a suitable output way must be selected. For Elasticsearch, the changed data is encapsulated as a Kafka message and pushed to Kafka; the application loaded for the ES index then consumes it, makes an HTTP request, and modifies the ES data. Oracle data is modified in the form of SQL over a JDBC connection.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device and a readable storage medium.
Fig. 5 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as a data processing method. For example, in some embodiments, the data processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When a computer program is loaded into RAM 803 and executed by computing unit 801, one or more steps of the data processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the data processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely specific embodiments of the disclosure, but the protection scope of the disclosure is not limited thereto; any changes or substitutions that a person skilled in the art could readily conceive within the technical scope of the disclosure shall be covered by the protection scope of the disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of data processing, the method comprising:
acquiring change data information of different types of data sources; the data sources at least comprise a real-time data source and a non-real-time data source;
creating at least two tasks corresponding to the change data information in at least two nodes in a cluster through a first interface, and storing the tasks in the nodes; the first interfaces are interfaces corresponding to the distributed memory databases;
the nodes in the cluster backup tasks created by at least one node except the nodes in the cluster, so that each node in the cluster comprises a task set; the task set comprises tasks created by the node and tasks backed up by the node;
each node in the cluster processes, in parallel, the tasks created by itself in the task set; under the condition that any node in the cluster meets a first condition, processing tasks created by other nodes in the task set;
the first condition includes: the node is in an idle state.
2. The method of claim 1, wherein the acquiring change data information of different types of data sources comprises:
acquiring change data information based on a log corresponding to any data source, and generating a queue file;
The data pushing component based on the data source end reads the queue file and pushes the queue file to the target end;
the target end receives the queue file through the data receiving component, reads the queue file based on the data copying component, and pushes the change data information to the Kafka system;
and the data source extraction module obtains change data information corresponding to the data source based on the Kafka system.
3. The method according to claim 1, wherein creating at least two tasks corresponding to the change data information at at least two nodes in a cluster and storing the tasks at the nodes via the first interface comprises:
after the change data information is obtained, at least two tasks corresponding to the change data information are created in at least two nodes in a cluster through a first interface by using a distributed memory database, and the tasks are stored in the nodes;
wherein each change data information corresponds to a task.
4. The method of claim 1, wherein each node in the cluster processes its own created task in the set of tasks in parallel, comprising each node:
Executing locking operation on the task to be executed;
responding to successful locking operation, and executing the locked task based on a data processing strategy corresponding to the locked task;
and responding to the failure of the locking operation, representing that the task to be executed is executed by other nodes, executing the locking operation on other tasks to be executed until the locking operation is successful, and executing the locked task based on the data processing strategy corresponding to the locked task.
5. The method according to claim 1, wherein the method further comprises:
in response to the deletion of a first node in the cluster, confirming a task corresponding to the first node, and distributing the task corresponding to the first node to other nodes in the cluster;
or in response to the newly added second node in the cluster, uniformly distributing tasks corresponding to other nodes in the cluster to the second node.
6. The method according to claim 1, wherein the method further comprises:
processing data corresponding to processing tasks of all nodes in the cluster are obtained, and the processing data are sent to different types of target data sources based on service requirements.
7. The method of claim 6, wherein said sending the processed data to a different type of target data source comprises:
responding to the target data source being a first type data source, encapsulating the processing data and sending it to a Kafka system, and loading a first application to consume the encapsulated processing data based on a first index; the first application obtains the encapsulated processing data from the Kafka system to modify data in a target data source based on a first request and the encapsulated processing data;
and responding to the target data source as a second type data source, connecting the cluster and the target data source based on the first drive, and acquiring the processing data from the cluster based on a structured query language so as to modify the data in the target data source based on the processing data.
8. A data processing apparatus, the apparatus comprising:
the multi-data source extraction module is used for acquiring change data information of different types of data sources; the data sources at least comprise a real-time data source and a non-real-time data source; for creating, through a first interface, at least two tasks corresponding to the change data information at at least two nodes in the cluster and storing the tasks at the nodes; the first interface is an interface corresponding to the distributed memory database; a node in the cluster is used to back up tasks created by at least one node in the cluster other than itself, so that each node in the cluster contains a task set; the task set comprises tasks created by the node and tasks backed up by the node;
the parallel computing module is used for enabling each node in the cluster to process, in parallel, the tasks created by itself in its task set; and, under the condition that any node in the cluster meets a first condition, enabling that node to process tasks created by other nodes in the task set;
the first condition includes: the node is in an idle state.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
CN202311268839.9A 2023-09-28 2023-09-28 Data processing method and device, electronic equipment and storage medium Active CN116991562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311268839.9A CN116991562B (en) 2023-09-28 2023-09-28 Data processing method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN116991562A true CN116991562A (en) 2023-11-03
CN116991562B CN116991562B (en) 2023-12-26

Family

ID=88521765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311268839.9A Active CN116991562B (en) 2023-09-28 2023-09-28 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116991562B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1744554A (en) * 2005-10-13 2006-03-08 华中科技大学 Expandable dynamic fault-tolerant method for cooperative system
US20090198918A1 (en) * 2008-02-01 2009-08-06 Arimilli Lakshminarayana B Host Fabric Interface (HFI) to Perform Global Shared Memory (GSM) Operations
CN103310460A (en) * 2013-06-24 2013-09-18 安科智慧城市技术(中国)有限公司 Image characteristic extraction method and system
US20150312335A1 (en) * 2014-04-28 2015-10-29 Arizona Board Of Regents On Behalf Of Arizona State University Peer-to-peer architecture for processing big data
CN114238137A (en) * 2021-12-22 2022-03-25 建信金融科技有限责任公司 Batch processing task testing method and device, storage medium and program product
CN114638391A (en) * 2020-12-16 2022-06-17 顺丰科技有限公司 Waybill risk scene identification processing method and device, computer equipment and medium
CN114866570A (en) * 2022-04-18 2022-08-05 北京快乐茄信息技术有限公司 Information processing method and device, electronic equipment and storage medium
CN115438056A (en) * 2022-09-01 2022-12-06 中国农业银行股份有限公司 Data acquisition method, device, equipment and storage medium
CN115562911A (en) * 2022-12-07 2023-01-03 中科方德软件有限公司 Virtual machine data backup method, device, system, electronic equipment and storage medium
CN115964372A (en) * 2023-01-03 2023-04-14 中国科学院空间应用工程与技术中心 Space station payload distributed event extraction method and system
CN115981826A (en) * 2023-01-16 2023-04-18 中国人民财产保险股份有限公司 Task scheduling processing method and device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN116991562B (en) 2023-12-26

Similar Documents

Publication Publication Date Title
CN105824957A (en) Query engine system and query method of distributive memory column-oriented database
CN111897633A (en) Task processing method and device
CN113568938B (en) Data stream processing method and device, electronic equipment and storage medium
CN112463290A (en) Method, system, apparatus and storage medium for dynamically adjusting the number of computing containers
CN113364877B (en) Data processing method, device, electronic equipment and medium
Zhi et al. Research of Hadoop-based data flow management system
CN112000734A (en) Big data processing method and device
CN113590433A (en) Data management method, data management system, and computer-readable storage medium
CN114218135A (en) Source end flow control method and system based on Redis cache
CN113900810A (en) Distributed graph processing method, system and storage medium
CN115550354A (en) Data processing method and device and computer readable storage medium
CN103197920B (en) A kind of concurrency control method, control node and system
CN116991562B (en) Data processing method and device, electronic equipment and storage medium
CN111767126A (en) System and method for distributed batch processing
CN115794262A (en) Task processing method, device, equipment, storage medium and program product
CN114090234A (en) Request scheduling method and device, electronic equipment and storage medium
CN112306695A (en) Data processing method and device, electronic equipment and computer storage medium
CN113742073A (en) LSB interface-based cluster control method
CN111310260A (en) BIM (building information modeling) model version storage conversion method based on distributed storage architecture
CN113760836B (en) Wide table calculation method and device
CN112100283B (en) Linux platform based time-sharing multiplexing method for android virtual machine
CN116756460B (en) Combined data acquisition method and device and related equipment
CN115037803B (en) Service calling method, electronic equipment and storage medium
US11520781B2 (en) Efficient bulk loading multiple rows or partitions for a single target table
KR101542605B1 (en) Parallel processing apparatus and processing apparatus for semantic heterogeneity of ontology matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant