CN116628082A - Data synchronization method, electronic device and computer readable storage medium - Google Patents
- Publication number
- CN116628082A (application number CN202310457266.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- database
- description information
- processing node
- synchronization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
- G06F16/275—Synchronous replication
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application discloses a data synchronization method, an electronic device and a computer readable storage medium. In the method, a processing node responds to the triggering of a data synchronization task by acquiring first data description information from a metadata base; the first data description information comprises second data description information of the master database and third data description information of the synchronization database; the processing node consumes the data in the master database according to the second data description information; the processing node partitions the consumed data to obtain a temporary data set; and the processing node synchronizes the temporary data set to the synchronization database according to the third data description information. By this method, the technical problem of low extensibility caused by each business database having different business rules during database synchronization can be solved.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a data synchronization method, an electronic device, and a computer readable storage medium.
Background
With the rapid development of business, various data synchronization scenarios have arisen in which data in a master database must be synchronized quickly into individual business databases to provide data support for their functions. This is usually implemented by consuming the data in the master database and synchronizing it to the corresponding business database. However, each business database may be bound to its own rules, so the extensibility of the service is poor: if a new table structure is needed in a business database, the related code must be modified and cannot be updated in real time.
Disclosure of Invention
The application mainly aims to provide a data synchronization method, an electronic device and a computer readable storage medium, which can solve the technical problem of low extensibility caused by each business database having different business rules during database synchronization.
In order to solve the technical problems, the first technical scheme adopted by the application is as follows: a data synchronization method is provided. The method is applied to a distributed system which comprises a processing node, a metadata base, a main database and a synchronous database. The method comprises the steps that a processing node responds to data synchronization task triggering to obtain first data description information in a metadata base; the first data description information comprises second data description information of the main database and third data description information of the synchronous database; the processing node consumes the data in the main database according to the second data description information; the processing node partitions the consumed data to obtain a temporary data set; the processing node synchronizes the temporary data set to the synchronization database according to the third data description information.
In order to solve the technical problems, a second technical scheme adopted by the application is as follows: an electronic device is provided. The electronic device comprises a memory for storing program data executable by the processor for implementing the method as in the first aspect.
In order to solve the technical problems, a third technical scheme adopted by the application is as follows: a computer-readable storage medium is provided. The computer readable storage medium stores program data executable by a processor to implement the method as in the first aspect.
The beneficial effects of the application are as follows: when the data synchronization task is triggered, the second data description information (collected from the master database) and the third data description information (collected from the synchronization database) pre-stored in the metadata base are pulled; the unsynchronized data in the master database is read using the second data description information, and the read data is synchronized into the synchronization database using the third data description information, thereby realizing data synchronization between the master database and the synchronization database. Because the metadata base holds the data description information of the downstream synchronization databases, the scheme can support data synchronization of multiple kinds of downstream synchronization databases at the same time, and flexibly supports updating, expanding and adding downstream synchronization databases.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a first embodiment of a data synchronization method of the present application;
FIG. 2 is a schematic diagram of a distributed system to which the present application is applied;
FIG. 3 is a flow chart of a second embodiment of the data synchronization method of the present application;
FIG. 4 is a flow chart of a third embodiment of the data synchronization method of the present application;
FIG. 5 is a flow chart of a fourth embodiment of the data synchronization method of the present application;
FIG. 6 is a schematic diagram of an embodiment of a data synchronization process;
FIG. 7 is a schematic view of the structure of a first embodiment of the electronic device of the present application;
fig. 8 is a schematic structural view of a first embodiment of the computer-readable storage medium of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like in this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to fig. 1, fig. 1 is a flowchart of a first embodiment of a data synchronization method according to the present application. The method is applied to a distributed system. The distributed system includes a processing node, a metadata base, a master database and a synchronization database, as shown in fig. 2, which is a schematic diagram of a distributed system to which the present application is applied. The master database, the metadata base and the synchronization database each establish a communication connection with the processing node. The method comprises the following steps:
S11: The processing node responds to the triggering of the data synchronization task and acquires the first data description information in the metadata base.
In response to the data synchronization task, the processing node requests the service of the metadata base and acquires the first data description information in it. The first data description information includes the second data description information of the master database and the third data description information of the synchronization database. The data description information is equivalent to metadata in a database. Metadata, also called intermediate data or relay data, is data that describes data, mainly information describing data attributes, and is used to support functions such as indicating storage locations, historical data, resource searching and file recording.
S12: and the processing node consumes the data in the main database according to the second data description information.
The processing node uses the acquired second data description information of the master database to consume the data in the master database, reading the data that has not yet been synchronized in preparation for the subsequent synchronization to the synchronization database. Consumption can be understood as the act of acquiring data.
S13: The processing node partitions the consumed data to obtain a temporary data set.
The data in the master database is partitioned according to a preset rule, and the data in the same area is taken as a temporary data set. The obtained temporary data sets are then synchronized, writing their data into the synchronization database.
S14: the processing node synchronizes the temporary data set to the synchronization database according to the third data description information.
When the temporary data set is synchronized, it is synchronized according to the third data description information of the synchronization database, so that the data in the temporary data set conforms to the data storage rules of the synchronization database.
In this embodiment, a metadata base is constructed to store the data description information of the master database and the synchronization database. When a data synchronization task is triggered, the second data description information (collected from the master database) and the third data description information (collected from the synchronization database) pre-stored in the metadata base are pulled; the unsynchronized data in the master database is read using the second data description information, and the read data is synchronized into the synchronization database using the third data description information, realizing data synchronization between the master database and the synchronization database. Because the metadata base holds the data description information of the downstream synchronization databases, the scheme can support data synchronization of multiple kinds of downstream synchronization databases at the same time, and flexibly supports updating, expanding and adding downstream synchronization databases.
In an embodiment, the distributed system uses ZooKeeper for cluster coordination, which helps ensure good operation of the database cluster.
In one embodiment, before the processing node responds to the data synchronization task, the method further comprises constructing a metadata base, and storing data description information of the main database and the synchronization database into the metadata base.
Specifically, take Kafka as the master database and Hive and HBase as the synchronization databases as an example. A metadata information base is built in an RDS database, and when the program runs for the first time, data description information is pulled from each database. Metadata such as all topics, partitions and the offsets of all partitions in Kafka is acquired using a Kafka consumer; the source information returned by the Kafka request is parsed to obtain metadata such as the number of partitions and the consumer groups, and all of this metadata is written into the Kafka data description information table in the metadata information base. The schema information in the Hive database is then acquired and written into the Hive data description information table in the metadata information base, and the schema information in the HBase database is acquired and written into the HBase data description information table in the metadata base.
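The first-run metadata pull described above can be sketched as follows. This is an illustrative sketch only: the function name and row layout are assumptions modelled on the description tables later in this description, and the fake consumer stands in for a real Kafka client, whose `topics()`, `partitions_for_topic()` and `end_offsets()` calls a production version would use.

```python
from datetime import datetime

def collect_kafka_metadata(consumer, group):
    """Build rows for the Kafka data description table: one row per
    (topic, partition) with its current end offset."""
    rows = []
    for topic in sorted(consumer.topics()):
        partitions = sorted(consumer.partitions_for_topic(topic))
        end_offsets = consumer.end_offsets([(topic, p) for p in partitions])
        for p in partitions:
            rows.append({
                "topic": topic,
                "partitions": len(partitions),   # total partitions of the topic
                "partitionid": p,
                "offset": end_offsets[(topic, p)],
                "createtime": datetime.now(),
                "group": group,
            })
    return rows

class FakeConsumer:
    """Stand-in for a Kafka consumer client, for demonstration only."""
    def topics(self):
        return {"orders"}
    def partitions_for_topic(self, topic):
        return {0, 1}
    def end_offsets(self, tps):
        return {tp: 100 + tp[1] for tp in tps}

rows = collect_kafka_metadata(FakeConsumer(), "sync-group")
print([(r["topic"], r["partitionid"], r["offset"]) for r in rows])
# → [('orders', 0, 100), ('orders', 1, 101)]
```

In a real deployment these rows would be inserted into the RDS table; here they are only printed.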
In one embodiment, after the processing node partitions the consumed data to obtain the temporary data set, the method further comprises: performing a persistence operation on the temporary data set. The persistence operation can buffer the data read from the master database into memory or onto disk, and can also keep multiple backup copies, reducing the time consumed by task retries after data loss or failure of a downstream task.
In another embodiment, after the processing node synchronizes the temporary data set to the synchronization database according to the third data description information, further comprising updating the second data description information according to attribute information of the temporary data set in response to the temporary data set of the same batch completing the synchronization.
After a batch of temporary data sets has been synchronized from the master database to the synchronization database, the second data description information of the master database is updated according to the partition, offset and state information of the processed batch, so that the next data synchronization can proceed according to the updated second data description information and the task progress is kept current. If updating the second data description information fails after a batch has completed, the next scheduled synchronization may continue according to the stale second data description information, causing repeated consumption of data.
When the processing node performs data synchronization, a retry is performed if the synchronization fails, and the second data description information is updated only after the whole batch of data has succeeded, ensuring strong consistency of the data and avoiding data loss.
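The commit-after-success behaviour described here can be sketched as below: the offset record in the metadata base is advanced only after the whole batch succeeds, so a failed batch is retried from the same offsets instead of being lost. All names (`sync_with_retry`, `MAX_RETRIES`, the dict standing in for the metadata base) are illustrative assumptions, not from the patent.

```python
MAX_RETRIES = 3

def sync_with_retry(batch, sync_batch, meta, topic):
    """Write one batch to the sync database; commit the offset only on success."""
    for attempt in range(MAX_RETRIES):
        try:
            sync_batch(batch)                    # write to the sync database
        except Exception:
            continue                             # retry the same batch
        meta[topic] = batch[-1]["offset"] + 1    # commit only after success
        return True
    return False                                 # offsets untouched, batch re-consumed next run

meta = {"orders": 0}
calls = {"n": 0}

def flaky(batch):
    """Simulated sync target: the first attempt fails, the second succeeds."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("sync database unavailable")

ok = sync_with_retry([{"offset": 41}, {"offset": 42}], flaky, meta, "orders")
print(ok, meta["orders"])  # → True 43
```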
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the data synchronization method according to the present application. The method is a further extension of step S12, comprising the steps of:
S21: The processing node acquires, according to a preset topic, the initial offset of the preset topic for the processing node from the second data description information.
When acquiring the data that needs to be synchronized from the master database, the second data description information is acquired first, and the position from which to start reading is determined according to the data offset in it. The master database stores many types of data, so the corresponding data is determined according to a preset topic, and the initial offset corresponding to that data for the processing node is acquired, so that the processing node processes the data under the preset topic based on the initial offset.
S22: The data to be consumed is determined according to the initial offset and the current offset interval of the preset topic in the master database.
After the initial offset corresponding to the preset topic is determined, the data that finally needs to be consumed is determined according to the current offset interval of the topic in the master database. The current offset interval of the master database is the data offset interval of the master database at the current point in time.
In an embodiment, in response to the initial offset being located inside the current offset interval, the data from the initial offset to the maximum endpoint of the current offset interval is used as the data to be consumed; in response to the initial offset being located outside the current offset interval, the data corresponding to the whole current offset interval is used as the data to be consumed, or an abnormality reminder is issued.
When the initial offset lies inside the current offset interval, part of the data has already been consumed: the offset produced by previously consumed data was stored in the metadata base as the initial offset. The unconsumed data is the data from the initial offset to the maximum endpoint of the current offset interval, so only that data is used as the data to be consumed.
When the initial offset is located outside the current offset interval, data expiration or a data abnormality has occurred, and the offset needs to be reset to keep the data system running normally: the initial offset is reset to the minimum endpoint of the current offset interval, consumption starts from there, and the data corresponding to the whole current offset interval becomes the data to be consumed. Specifically, when the initial offset is smaller than the minimum endpoint of the current offset interval, this is the normal case: the data in the master database has usually expired because it was not consumed for a long time, and consuming the range from the initial offset to the minimum endpoint would cause the master database to throw an exception, so the data corresponding to the current offset interval is consumed instead, or whether to throw the exception is chosen according to the actual situation. When the initial offset is greater than the maximum endpoint of the current offset interval, this is an abnormal case: an uncleaned data offset remained in the metadata base, or the data in the metadata base was modified manually, producing inconsistent results; here too the data corresponding to the current offset interval is consumed, or whether to throw an exception is chosen according to the actual situation.
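The three offset cases above reduce to a small piece of logic. The following is a minimal sketch under assumed names; `raise_on_reset` stands for the "throw the exception or not" choice mentioned in the text.

```python
def resolve_consume_range(start_offset, oldest_offset, end_offset,
                          raise_on_reset=False):
    """Return the (from, to) offsets to consume for one partition."""
    if oldest_offset <= start_offset <= end_offset:
        # Normal case: the stored offset is valid, consume the increment only.
        return (start_offset, end_offset)
    if raise_on_reset:
        raise ValueError("stored offset %d outside [%d, %d]"
                         % (start_offset, oldest_offset, end_offset))
    # Expired data (start < oldest) or stale/edited metadata (start > end):
    # reset to the minimum endpoint and consume the whole current interval.
    return (oldest_offset, end_offset)

print(resolve_consume_range(50, 10, 100))   # → (50, 100)
print(resolve_consume_range(5, 10, 100))    # → (10, 100)
print(resolve_consume_range(120, 10, 100))  # → (10, 100)
```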
Further, in response to the number of partitions corresponding to the initial offset being less than the current number of partitions in the master database, all data of the newly added partitions and the incremental data of the original partitions in the master database are added to the data to be consumed. In response to the number of partitions corresponding to the initial offset being greater than the current number of partitions, all data of all partitions in the master database is used as the data to be consumed.
When the number of partitions corresponding to the initial offset is smaller than the current number of partitions, the master database has triggered its scale-out mechanism and the partitions holding the data have increased. Since data also flows into the expanded partitions, the newly added partitions need full consumption, while the original partitions, which also received part of the expansion, need incremental consumption.
When the number of partitions corresponding to the initial offset is greater than the current number of partitions, this is an abnormal case indicating that an uncleaned data offset exists in the metadata base, so the data offset is modified to be consistent with the current number of partitions and the data corresponding to the current offset interval is consumed, or whether to throw an exception is chosen according to the actual situation.
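The partition-count handling above can be sketched as a pure function. The names are illustrative assumptions, and restarting from offset 0 when metadata is stale is one possible reading of "modify the data offset to be consistent with the current offset interval".

```python
def plan_partitions(stored_offsets, current_partitions):
    """stored_offsets: {partition_id: next_offset} from the metadata base.
    current_partitions: partition ids that currently exist in the master database."""
    if len(stored_offsets) > len(current_partitions):
        # Stale metadata (more partitions recorded than exist):
        # restart every current partition from scratch.
        return {p: 0 for p in current_partitions}
    plan = {}
    for p in current_partitions:
        # Known partitions resume from their stored offset (incremental);
        # newly added partitions start at 0 (full consumption).
        plan[p] = stored_offsets.get(p, 0)
    return plan

# A topic that grew from 2 to 4 partitions:
print(plan_partitions({0: 40, 1: 55}, [0, 1, 2, 3]))
# → {0: 40, 1: 55, 2: 0, 3: 0}
```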
Referring to fig. 4, fig. 4 is a flowchart of a third embodiment of the data synchronization method according to the present application. The method is a further extension of step S13, comprising the steps of:
S31: The processing node partitions the data according to topic category.
After the processing node obtains the data to be consumed from the master database, the data has been acquired according to preset topics, so it can be distinguished by topic category, and the data of the same topic category is treated as the same partition.
S32: data under the same partition is parsed to generate a temporary dataset.
The data in the same partition is processed uniformly, and the master database's data in the same partition is parsed into a temporary data set for subsequent tasks.
Because metadata information such as the offsets of the master database is stored in the metadata base, data under different topics can be processed simultaneously; multiple topics can therefore be consumed during reading, and the data under multiple topics can be read.
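Steps S31 and S32 can be sketched as grouping the consumed records by topic and parsing each group into one temporary data set. The record layout and the JSON parse step are assumptions for demonstration only.

```python
import json
from collections import defaultdict

def build_temp_datasets(records):
    """records: (topic, raw_json_value) pairs read from the master database.
    Returns one parsed temporary data set per topic partition."""
    by_topic = defaultdict(list)
    for topic, raw in records:
        by_topic[topic].append(json.loads(raw))  # parse within the partition
    return dict(by_topic)

datasets = build_temp_datasets([
    ("orders", '{"id": 1}'),
    ("users",  '{"id": 7}'),
    ("orders", '{"id": 2}'),
])
print(sorted(datasets), [r["id"] for r in datasets["orders"]])
# → ['orders', 'users'] [1, 2]
```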
Referring to fig. 5, fig. 5 is a flowchart of a fourth embodiment of the data synchronization method according to the present application. The method is a further extension of step S14, comprising the steps of:
S41: The processing node maps the data in the temporary data set to the first type of data according to the third data description information.
The data type of the first type of data is the same as the data type of the data in the synchronization database. The processing node converts the data of the temporary data set to a data type suitable for storage in the synchronization database, in accordance with the third data description information.
In one embodiment, each computing node in the processing node receives broadcasted third data description information, which includes metadata information related to the synchronization database. The computing node processes the temporary data set in accordance with the received information.
S42: the first type of data is synchronized to a synchronization database.
After the data type conversion of the data in the temporary data set is finished, the data is stored into the corresponding synchronization database.
In an embodiment, there are several types of synchronization databases, and the first type of data accordingly includes several kinds of data; after the temporary data set is processed, each kind of data is stored into its corresponding synchronization database according to the correspondence.
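Steps S41 and S42 can be sketched as a type-mapping pass driven by the third data description information (field name to field type, as in the Hive/HBase description tables). The `CASTS` table and the field names are illustrative assumptions.

```python
# Map target-database type names to Python-side conversions (assumed subset).
CASTS = {"int": int, "bigint": int, "double": float, "string": str}

def map_to_target_schema(record, field_types):
    """field_types: {fieldname: fieldtype} pulled from the metadata base,
    i.e. the third data description information of one target table."""
    return {name: CASTS[ftype](record[name])
            for name, ftype in field_types.items()}

schema = {"id": "bigint", "amount": "double", "city": "string"}
row = map_to_target_schema({"id": "42", "amount": "3.5", "city": "NY"}, schema)
print(row)  # → {'id': 42, 'amount': 3.5, 'city': 'NY'}
```

The converted rows would then be written to the synchronization database matching that schema.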
The following describes the technical scheme of the present application in detail by taking a specific embodiment.
In this embodiment, Kafka serves as the master database, Hive and HBase serve as the synchronization databases, RDS serves as the metadata base, Spark serves as the processing node, and ZooKeeper is used for cluster coordination.
Before starting the data synchronization, a metadata base is built in the RDS database.
The whole data consumption process can be divided into three parts: an input end, a consumption end and an output end. Each end contains different components, and the metadata each is responsible for maintaining differs between processes. The component metadata of the input end must guarantee that the same message is transmitted each time, i.e. the idempotence of the message, that messages are not lost, and that the system keeps running normally even when messages accumulate. The component metadata of the consumption end must be read during consumption, so that the consumption progress is maintained and at-least-once semantics are guaranteed. The component metadata of the output end must ensure the persistence of the data, or the real-time availability and persistence of the metadata.
And then the program starts to run for the first time so as to pull metadata information in the Kafka database, the Hive database and the HBase database and write the metadata information into the metadata database.
For the metadata acquisition flow of the Kafka database, the Kafka domain names are first acquired through ZooKeeper. Since Kafka's consistency service is handled by ZooKeeper, each Kafka broker records its own address in ZooKeeper, so Kafka's domain name information can be obtained there; acquiring it from ZooKeeper also reduces the number of configuration parameters. Kafka is then connected using each domain name: a domain name that connects successfully is added to the effective set, and a domain name that fails to connect is discarded. Then a Kafka consumer is used to obtain metadata such as all topics, partitions and the offsets of each partition; the source information returned by the Kafka request is parsed to obtain metadata such as the number of partitions and the consumer groups, and all of this metadata is written into the Kafka data description information table in the metadata information base. The table structure of the Kafka data description information table may be as shown in table 1.
TABLE 1
| Name | Type | Length | Not null | Comment |
| --- | --- | --- | --- | --- |
| topic | varchar | 128 | √ | kafka topic |
| partitions | int | 16 | √ | Number of partitions |
| partitionid | int | 16 | √ | Partition id |
| offset | varchar | 255 | √ | Offset |
| createtime | datetime | 0 | √ | Creation time |
| group | varchar | 255 | √ | group name |
Here topic represents the Kafka topic, partitions identifies the total number of partitions of the current topic, partitionid represents the partition id within the current topic, offset represents the offset of the current topic, createtime represents the time the row was inserted, and group represents the Kafka consumer group.
For the metadata acquisition flow of the Hive database, the schema information of the tables in the Hive database is acquired and written into the Hive data description information table in the metadata base. The table structure of the Hive data description information table may be as shown in table 2.
TABLE 2
| Name | Type | Length | Not null | Comment |
| --- | --- | --- | --- | --- |
| db | varchar | 255 | √ | Database name |
| tb | varchar | 255 | √ | Table name |
| fieldname | varchar | 255 | √ | Field name |
| fieldtype | varchar | 255 | √ | Field type |
Where db represents a Hive database name, tb represents a Hive table name, fieldname represents a Hive table field name, fieldtype represents a Hive table field type.
For the metadata acquisition flow of the HBase database, the schema information of the tables in the HBase database is acquired and written into the HBase data description information table in the metadata base. The table structure of the HBase data description information table can be as shown in table 3.
TABLE 3
Wherein columncmaster represents an HBase database name, tb represents an HBase table name, fieldname represents an HBase table field name, fieldtype represents an HBase table field type.
As shown in fig. 6, fig. 6 is a schematic diagram of an embodiment of the data synchronization execution process. After the information of the data description tables in the metadata database is obtained, data synchronization is executed: in response to the triggering of the data synchronization task, the processing node requests the RDS service to read the Kafka description information table, the Hive table description information table, and the HBase table description information table, obtaining metadata such as the relevant table names, table fields, and topic offsets.
Then a Kafka consumer is created to consume all data of the topic from the offset recorded in RDS up to the current point in time. Assume the start offset recorded for the topic in RDS is startOffset, and the offset interval of the data in Kafka at the current point in time is [oldestOffset, endOffset], where oldestOffset is the smallest offset in Kafka and endOffset is the largest offset in Kafka. The cases that typically occur are:
startOffset is greater than oldestOffset and less than endOffset: the data has been consumed before, and the offset produced by the previously consumed data is stored in RDS as startOffset. The corresponding processing is to consume the incremental data from startOffset to endOffset.
startOffset is less than oldestOffset: this case can arise in normal operation; typically the Kafka data has not been consumed for a long time, so some data in Kafka has expired. If the data between startOffset and oldestOffset were consumed, Kafka would throw an exception. The corresponding processing is therefore to consume all the data in Kafka; whether to throw an exception can also be chosen as the situation requires.
startOffset is greater than endOffset: this is one of the abnormal cases, typically caused by RDS retaining a stale, un-cleaned startOffset, or by the RDS data being modified manually, resulting in an inconsistency. The corresponding processing is to consume all the data in Kafka and record the maximum offset of each partition in RDS. Alternatively, an exception can be thrown and the task stopped.
The number of partitions recorded with startOffset is less than the current number of partitions in Kafka: this indicates that Kafka triggered its expansion mechanism, under which the number of partitions increases. Data also flows into the expanded partitions, so the expanded partitions require full consumption, while the pre-existing partitions require incremental consumption.
The number of partitions recorded with startOffset is greater than the current number of partitions in Kafka: this is the same as the third case, caused by stale startOffset entries remaining in RDS; all the data in Kafka can be consumed, or an exception can be thrown and the task stopped as the case may be.
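The per-partition offset cases above can be sketched as a small decision function. This is an illustrative sketch, not the patent's implementation: `plan_consumption` is a hypothetical name, the "throw an exception instead" options are omitted, and partition-count changes are handled separately and not modeled here.

```python
def plan_consumption(start_offset, oldest_offset, end_offset):
    """Decide the consumption range for one partition.

    start_offset  - offset recorded in RDS
    oldest_offset - smallest offset currently in Kafka
    end_offset    - largest offset currently in Kafka
    Returns a (range_start, range_end) tuple.
    """
    if oldest_offset <= start_offset <= end_offset:
        # normal case: incremental consumption from the recorded offset
        return (start_offset, end_offset)
    if start_offset < oldest_offset:
        # recorded offset has expired out of Kafka: consume everything
        # (an exception could be raised instead, as the text notes)
        return (oldest_offset, end_offset)
    # start_offset > end_offset: stale or hand-edited RDS record;
    # consume everything and let the caller re-record the maximum offset
    return (oldest_offset, end_offset)
```

The caller would then seek a Kafka consumer to `range_start` and read up to `range_end` for that partition.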
After the data is read out of Kafka, the Kafka data is partitioned, cleaned, and packaged in batches to generate a temporary data set. When Spark reads the data, it may be data of a single Kafka topic or of multiple Kafka topics. When the read data is processed, the data is acquired from the specified offset position in Kafka, the information of the batch is obtained, the batch is partitioned by topic, the data in the same partition after partitioning is processed uniformly, and the Kafka data is parsed per partition to generate a temporary data set for subsequent tasks.
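The partition-by-topic step above can be sketched as a simple grouping. This is an illustrative sketch under assumed names: `partition_by_topic` is hypothetical, and records are modeled as plain `(topic, payload)` pairs rather than Spark/Kafka record objects.

```python
from collections import defaultdict

def partition_by_topic(records):
    """Group raw (topic, payload) records read from Kafka by topic, so
    that records in the same partition can be processed uniformly and
    parsed per partition into a temporary data set."""
    grouped = defaultdict(list)
    for topic, payload in records:
        grouped[topic].append(payload)
    return dict(grouped)
```

In Spark this grouping would be expressed as a repartition/groupBy over the RDD or DataFrame rather than an in-memory dict.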
After the temporary data set is generated, Spark performs a persistence operation on the data. The persistence operation caches the data read from Kafka in memory or on disk, and multiple replicas can further be kept as backups, reducing the time consumed when data is lost or when a task is retried because a downstream task failed.
After the temporary data set is acquired, the metadata corresponding to the table names and table fields of HBase and Hive is transmitted to each computing node as a broadcast variable. This guarantees that a single copy of the variable is stored in each Executor; tasks use this variable when they execute, which greatly reduces the memory overhead of the Executors.
Each Spark computing node then maps the data set into the HBase data types through the broadcast variable and inserts it into the HBase database, and maps the data set into the Hive data types through the broadcast variable and inserts it into the Hive database.
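The type-mapping step above can be sketched with a read-only metadata dict standing in for the broadcast variable. This is an illustrative sketch only: in Spark the dict would be wrapped with `sparkContext.broadcast(...)`, and `FIELD_TYPES`, `map_to_target_row`, and the field names are assumptions, not from the patent.

```python
# Stand-in for broadcast field metadata, e.g. built from the Hive/HBase
# data description information tables (illustrative field names).
FIELD_TYPES = {"id": int, "name": str}

def map_to_target_row(raw_row, field_types=FIELD_TYPES):
    """Cast each field of a parsed Kafka record to the target column
    type before insertion into the synchronization database."""
    return {field: caster(raw_row[field]) for field, caster in field_types.items()}
```

Because the metadata is broadcast once per Executor rather than shipped with every task, each task can look up field types locally at low memory cost.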
After the data is stored in the HBase database and the Hive database, the second data description information of the main database is updated according to the partition, offset, and state information of the batch of processed data. The second data description information includes the offset information of the Kafka database, so that the next data synchronization can proceed according to the offset information in the updated second data description information, and the task progress is updated. If updating the second data description information fails after a batch's temporary data set has been processed, the next synchronization schedule may continue according to the un-updated second data description information, causing repeated consumption of data.
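The offset update described above can be sketched as a single transaction so that a failure cannot leave a half-updated record. This is an illustrative sketch: SQLite stands in for the RDS service, and the table, column, and function names are assumptions.

```python
import sqlite3

# In-memory stand-in for the metadata store with a simplified
# Kafka description table (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kafka_description (topic TEXT, partitionid INT, offset TEXT)"
)
conn.execute("INSERT INTO kafka_description VALUES ('orders', 0, '42')")

def record_batch_offsets(conn, new_offsets):
    """Persist the offsets of a processed batch in one transaction; if
    any update raises, the whole batch update rolls back, so the next
    schedule either sees all new offsets or all old ones."""
    with conn:  # commits on success, rolls back on exception
        for (topic, partition_id), offset in new_offsets.items():
            conn.execute(
                "UPDATE kafka_description SET offset = ? "
                "WHERE topic = ? AND partitionid = ?",
                (str(offset), topic, partition_id),
            )

record_batch_offsets(conn, {("orders", 0): 99})
```

A partial update would otherwise leave stale offsets behind, which is exactly the repeated-consumption risk the text describes.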
As shown in fig. 7, fig. 7 is a schematic structural diagram of a first embodiment of the electronic device of the present application.
The electronic device comprises a processor 110 and a memory 120.
The processor 110 controls the operation of the electronic device; the processor 110 may also be referred to as a CPU (Central Processing Unit). The processor 110 may be an integrated circuit chip with signal processing capability. The processor 110 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Memory 120 stores instructions and program data required for operation of processor 110.
The processor 110 is configured to execute the instructions to implement the method provided by any one of, or a possible combination of, the foregoing first to third embodiments of the data synchronization method of the present application.
As shown in fig. 8, fig. 8 is a schematic structural diagram of a first embodiment of a computer readable storage medium according to the present application.
An embodiment of the readable storage medium of the present application includes a memory 210 storing program data which, when executed, implements the method provided by any one of, or a possible combination of, the first to third embodiments of the data synchronization method of the present application.
The memory 210 may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program instructions; or it may be a server storing the program instructions, and the server may send the stored program instructions to other devices for execution, or may execute the stored program instructions itself.
In summary, a metadata database is constructed to store the data description information of the main database and of the synchronization database. When the data synchronization task is triggered, the second data description information previously acquired from the main database and the third data description information acquired from the synchronization database are pulled from the metadata database; the un-synchronized data in the main database is read using the second data description information, and the read data is synchronized to the synchronization database using the third data description information, thereby realizing data synchronization between the main database and the synchronization database. Because the metadata database holds the data description information of the downstream synchronization databases, the scheme can support data synchronization to multiple kinds of downstream synchronization databases at the same time, and flexibly supports updating, expanding, and adding downstream synchronization databases.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, e.g., the division of modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above may be stored in a computer-readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is only illustrative of the present application and is not to be construed as limiting the scope of the application, and all equivalent structures or equivalent flow modifications which may be made by the teachings of the present application and the accompanying drawings or which may be directly or indirectly employed in other related art are within the scope of the application.
Claims (10)
1. A method of data synchronization, characterized by being applied to a distributed system comprising a processing node, a metadata database, a master database, and a synchronization database, the method comprising:
the processing node responds to the triggering of the data synchronization task to acquire first data description information in the metadata base; the first data description information comprises second data description information of the main database and third data description information of the synchronous database;
the processing node consumes the data in the main database according to the second data description information;
the processing node partitions the consumed data to obtain a temporary data set;
and the processing node synchronizes the temporary data set to a synchronous database according to the third data description information.
2. The method of claim 1, wherein consuming data in the master database according to the second data description information by the processing node comprises:
the processing node acquires, according to a preset theme, the initial offset of the preset theme from the second data description information;
and determining the data to be consumed according to the initial offset and the current offset interval of the preset theme in the main database.
3. The method of claim 2, wherein the determining the data to be consumed based on the starting offset and a current offset interval of the preset theme in the master database comprises:
in response to the initial offset being located within the current offset interval, taking the data from the initial offset to the maximum endpoint of the current offset interval as the data to be consumed;
and in response to the initial offset being located outside the current offset interval, taking the data corresponding to the current offset interval as the data to be consumed, or giving an abnormality reminder.
4. A method according to claim 3, wherein said determining said data to be consumed based on said starting offset and a current offset interval of said preset theme in said master database comprises:
responding to the fact that the partition number corresponding to the initial offset is smaller than the partition number corresponding to the minimum endpoint in the current offset interval, adding all data of a newly added partition in the main database and incremental data of an original partition to the data to be consumed;
and responding to the fact that the partition number corresponding to the initial offset is larger than the partition number corresponding to the minimum endpoint in the current offset interval, and taking all data of all partitions in the main database as the data to be consumed.
5. The method of claim 2, wherein the processing node partitioning the consumed data to obtain a temporary data set, comprising:
the processing node partitions the data according to the topic categories;
and analyzing the data under the same partition to generate the temporary data set.
6. The method of claim 1, wherein the processing node synchronizing the temporary data set to the synchronization database according to the third data description information further comprises:
and in response to the temporary data sets of the same batch completing synchronization, updating the second data description information according to the attribute information of the temporary data sets.
7. The method of claim 1, wherein the processing node partitioning the data consumed further comprises, after obtaining a temporary data set:
and performing persistence operation on the temporary data set.
8. The method of claim 1, wherein the processing node synchronizing the temporary data set to the synchronization database according to the third data description information comprises:
the processing node maps the data in the temporary data set into first type data according to the third data description information; wherein the data type of the first type data is the same as the data type of the data in the synchronous database;
synchronizing the first type of data to the synchronization database.
9. An electronic device comprising a memory and a processor, the memory for storing program data, the program data being executable by the processor to implement the method of any one of claims 1-8.
10. A computer readable storage medium storing program data executable by a processor to implement the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310457266.8A CN116628082A (en) | 2023-04-21 | 2023-04-21 | Data synchronization method, electronic device and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116628082A true CN116628082A (en) | 2023-08-22 |
Family
ID=87635594
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310457266.8A Pending CN116628082A (en) | 2023-04-21 | 2023-04-21 | Data synchronization method, electronic device and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116628082A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||