CN114385760A - Method and device for real-time synchronization of incremental data, computer equipment and storage medium - Google Patents

Method and device for real-time synchronization of incremental data, computer equipment and storage medium

Info

Publication number
CN114385760A
Authority
CN
China
Prior art keywords
data
hbase
real
oracle
incremental
Prior art date
Legal status
Pending
Application number
CN202210043391.XA
Other languages
Chinese (zh)
Inventor
罗开畅
Current Assignee
Ping An E Wallet Electronic Commerce Co Ltd
Original Assignee
Ping An E Wallet Electronic Commerce Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An E Wallet Electronic Commerce Co Ltd
Priority to CN202210043391.XA
Publication of CN114385760A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275 Synchronous replication
    • G06F16/23 Updating
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and provides a method for real-time synchronization of incremental data. A tool parses the logs of an Oracle data table into JSON messages, which are stored in a distributed message queue; this provides the basis for synchronizing incremental data from the Oracle data table to a big data environment. The JSON messages in the distributed message queue are stored in HBase to obtain an HBase data table. Changed row keys are recorded according to the HBase data table and an Oracle mirror table previously obtained in HBase. All field data of the row corresponding to each changed row key is restored from HBase, which completes the data for database updates, and is stored in a data warehouse as incremental data. The incremental data is merged with the historical data to obtain data warehouse data that is updated in real time. Updates in the Oracle database are thereby extracted in real time, resource usage on the Oracle database is reduced, and data update time in the data warehouse is shortened.

Description

Method and device for real-time synchronization of incremental data, computer equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a method and a device for real-time synchronization of incremental data, computer equipment and a storage medium.
Background
Data synchronization generally refers to the process of importing the business data of a traditional relational database (such as MySQL or Oracle) into a big data Hive warehouse. In a traditional offline processing scenario, a data import tool such as Sqoop reads data from the online database in batches every day and imports it into the Hive warehouse. As traffic keeps growing, batch synchronization occupies a large amount of database resources within a short time, so the upstream database raises frequent alarms and may even trigger protective circuit breaking.
In the prior art, to reduce the burden of batch synchronization, a tool such as OGG (Oracle GoldenGate) generally parses the database logs into JSON messages and delivers them to the big data Hive warehouse. However, the parsed JSON messages cannot be delivered in real time, so the data still occupies resources of the upstream database, and unchanged fields often exist in the JSON messages, so the changed information in the database cannot be obtained completely. Data synchronization from Oracle to HDFS in the prior art therefore cannot be both real-time and complete.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, an apparatus, a computer device, and a storage medium for real-time synchronization of incremental data to solve the problem that the incremental data cannot be synchronized in real time and completely synchronized.
A first aspect of an embodiment of the present application provides a method for real-time synchronization of incremental data, including:
parsing, with a tool, the logs of an Oracle data table into JSON messages and storing the JSON messages in a distributed message queue, where each JSON message represents a change event of the database;
storing, based on a preset first computing engine, the JSON messages in the distributed message queue in HBase to obtain an HBase data table;
recording, through HBase operations, the changed row keys in the HBase data table according to the HBase data table and an Oracle mirror table previously obtained in HBase;
restoring, based on a preset second computing engine, all field data of the row corresponding to each changed row key, and storing all the field data in a data warehouse to obtain incremental data; and
merging, by a scheduling task, the incremental data with the historical data in the data warehouse to obtain real-time updated data in the data warehouse.
A second aspect of the embodiments of the present application provides an apparatus for real-time synchronization of incremental data, including:
a parsing unit, configured to parse, with a tool, the logs of an Oracle data table into JSON messages and store the JSON messages in a distributed message queue, where each JSON message represents a change event of the database;
a storage unit, configured to store, based on a preset first computing engine, the JSON messages in the distributed message queue in HBase to obtain an HBase data table;
a change unit, configured to record, through HBase operations, the changed row keys in the HBase data table according to the HBase data table and an Oracle mirror table previously obtained in HBase;
a restoration unit, configured to restore, based on a preset second computing engine, all field data of the row corresponding to each changed row key, and store all the field data in a data warehouse to obtain incremental data; and
a merging unit, configured to merge, by a scheduling task, the incremental data with the historical data in the data warehouse to obtain real-time updated data in the data warehouse.
A third aspect of embodiments of the present application provides a computer device, including: a memory, a processor, and computer readable instructions stored in the memory and executable on the processor for causing the computer to perform the steps of the method for real-time synchronization of incremental data.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium having a computer program stored thereon, the computer program being for execution by a processor of steps of a method for real-time synchronization of incremental data.
The method for real-time synchronization of the incremental data provided by the embodiment of the application has the following beneficial effects:
The invention relates to artificial intelligence technology and provides a method for real-time synchronization of incremental data. A tool parses the logs of an Oracle data table into JSON messages, which are stored in a distributed message queue; this provides the basis for synchronizing incremental data from the Oracle data table to a big data environment. The JSON messages in the distributed message queue are stored in HBase to obtain an HBase data table. Changed row keys are recorded according to the HBase data table and an Oracle mirror table previously obtained in HBase. All field data of the row corresponding to each changed row key is restored from HBase, which completes the data for database updates, and is stored in a data warehouse as incremental data. The incremental data is merged with the historical data to obtain data warehouse data that is updated in real time. Updates in the Oracle database are thereby extracted in real time, resource usage on the Oracle database is reduced, and data update time in the data warehouse is shortened.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a method for real-time synchronization of incremental data according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for real-time synchronization of incremental data according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating an implementation of a method for real-time incremental data synchronization according to yet another embodiment of the present application;
FIG. 4 is a flowchart illustrating an implementation of a method for real-time incremental data synchronization according to another embodiment of the present application;
FIG. 5 is a block diagram of an apparatus for real-time synchronization of incremental data according to an embodiment of the present application;
FIG. 6 is a block diagram of a server-side device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for real-time synchronization of incremental data is applied to the field of artificial intelligence and can be executed by a server side.
As shown in fig. 1, a method for real-time synchronization of incremental data includes:
S11: parsing, with a tool, the logs of an Oracle data table into JSON messages and storing the JSON messages in a distributed message queue;
In step S11, a structured replication tool parses the logs of the Oracle data table: it captures the database logs, analyzes, sorts, and filters them to obtain the data to be processed, stores that data as a file in a specific format, and transmits the file to the target. Here, the structured replication tool converts the logs of the Oracle data table into JSON messages and sends them to the distributed message queue. JSON is a lightweight data-interchange format that stores and represents data as text, completely independent of any programming language. Its compact, clear hierarchy makes it well suited to data exchange: it is easy for people to read and write, easy for machines to parse and generate, and efficient to transmit over a network.
In this embodiment, the deployed structured replication tool captures data from the logs of the Oracle data table. Log capture has two stages: an initial data loading stage, which extracts the existing data from the Oracle data table, and a change-data capture stage, in which an extraction process captures the configured change data, including DDL and DML statements, from the Oracle data table logs. The extraction process maintains an internal checkpoint mechanism, similar to that of a database, which periodically checks and records the positions at which the process reads and writes logs, and it stores the data it reads as binary trail files in an internal format. If the extraction process terminates, normally or abnormally, or the whole operating system goes down, then after the process restarts the structured replication tool recovers the recorded extraction position and resumes extracting logs from that position, so that no data is lost or captured twice.
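The checkpoint behaviour described above can be illustrated with a minimal sketch. It is not the structured replication tool's actual trail or checkpoint format: the function name `extract_with_checkpoint`, the JSON checkpoint file, and the `limit` parameter (used only to simulate a process that stops early) are assumptions made for illustration.

```python
import json
import os
import tempfile


def extract_with_checkpoint(log_records, checkpoint_path, limit=None):
    """Extract log records starting at the last checkpointed position.

    The position is persisted after every record, so a restart resumes
    exactly where the previous run stopped (hypothetical file format).
    """
    position = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            position = json.load(f)["position"]

    end = len(log_records) if limit is None else min(len(log_records), position + limit)
    extracted = []
    for index in range(position, end):
        extracted.append(log_records[index])
        # Record the next read position before moving on.
        with open(checkpoint_path, "w") as f:
            json.dump({"position": index + 1}, f)
    return extracted


log = ["INSERT row 1", "UPDATE row 1", "DELETE row 2", "INSERT row 3"]
checkpoint_file = os.path.join(tempfile.mkdtemp(), "extract.chk")

# First run stops after two records (simulated termination).
first_run = extract_with_checkpoint(log, checkpoint_file, limit=2)
# Restarted run resumes from the recorded position.
second_run = extract_with_checkpoint(log, checkpoint_file)
```

Together the two runs cover each log record exactly once, which is the property the checkpoint mechanism is meant to guarantee.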
Some operations of a running Oracle database instance need to consult the contents of the control file. Because the data in the control file represents the state of the database at a given moment, the control file uses a binary representation to record this state information reliably, which helps ensure the integrity of the database. The control file is created when the database is created; by default, at least one control file is generated during database creation.
The change events obtained from the Oracle database are converted into JSON messages and sent to the distributed message queue for storage. Physically, the messages of different Topics are stored separately; logically, the messages of one Topic are stored on one or more servers, but a user only needs to specify the Topic to produce or consume data and does not need to care where the data is stored. The JSON string messages are input into a Kafka-based message publishing system and stored in a Topic created in advance in that system.
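As a rough illustration of this step, the sketch below serializes a change event to a JSON string and appends it to a named Topic. It uses an in-memory stand-in rather than a real Kafka client; the class `InMemoryMessageQueue`, the Topic name `oracle_changes`, and the message fields `table`, `op_type`, and `after` are hypothetical and are not taken from the source.

```python
import json
from collections import defaultdict


class InMemoryMessageQueue:
    """In-memory stand-in for a distributed message queue (not Kafka itself)."""

    def __init__(self):
        self._topics = defaultdict(list)

    def publish(self, topic, message):
        # The producer names only the Topic; where it is stored is hidden.
        self._topics[topic].append(message)
        return len(self._topics[topic]) - 1  # offset of the new message

    def read(self, topic, offset=0):
        return self._topics[topic][offset:]


def change_event_to_json(table, op_type, row):
    """Serialize one change event as a JSON string (hypothetical field names)."""
    return json.dumps({"table": table, "op_type": op_type, "after": row}, sort_keys=True)


queue = InMemoryMessageQueue()
message = change_event_to_json("ACCOUNT", "U", {"ID": "1001", "name": "alice"})
offset = queue.publish("oracle_changes", message)
stored = json.loads(queue.read("oracle_changes")[0])
```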
As an embodiment of the present application, step S11 specifically includes:
sending the logs in the Oracle data table to a preset storage system for storage to obtain target logs; and analyzing the target log by using a tool to obtain a JSON message, and storing the JSON message to a distributed message queue.
In this embodiment, the preset storage system is a third-party storage system. The logs of the Oracle data table are sent to the third-party storage system and parsed there, so the Oracle database can keep generating logs while memory pressure and bottlenecks on the main server are reduced. The resulting target logs are parsed by the tool into JSON messages, which are stored in the distributed message queue.
S12: based on a preset first computing engine, storing the JSON message in the distributed message queue to HBase to obtain an HBase data table;
In step S12, the data stream interface of the preset first computing engine supports transformations (such as filters, aggregations, and window functions) on bounded or unbounded data streams and can be used from Java and Scala. HBase is an open-source, highly reliable, column-oriented, scalable database system. Data in HBase is retrieved from a cell by row key, column family, column qualifier, and timestamp, where the column qualifier and timestamp are optional. A row key represents one row of data, and row operations are guaranteed to be atomic. HBase can receive the JSON messages obtained from the distributed message queue in real time.
In this embodiment, the preset first computing engine is the Flink computing engine. During execution, a data stream has one or more stream partitions, and each operator has one or more operator subtasks, which run independently of one another in different threads, on different physical machines, or in different containers. The number of subtasks of a particular operator is its parallelism, and the parallelism of a data stream is always that of its producing operator. Parallelism is a dynamic concept and can be configured through the parallelism parameter. The preset first computing engine consumes the JSON messages in the distributed message queue and stores them in HBase to obtain the corresponding HBase data table.
As an embodiment of the present application, step S12 specifically includes:
an interface corresponding to HBase is obtained by declaring and assigning an interface in a preset first calculation engine program; and storing the JSON message in the distributed message queue to the HBase through a corresponding interface in the HBase to obtain an HBase data table.
In this embodiment, the variables, parameters, and functions called by the data stream class of the first computing engine are declared and assigned. The parallelism of the data source, that is, the maximum number of instructions or data items executed in parallel, is configured. The JSON messages are then consumed through the configured interface and stored in HBase: the JSON messages are delivered to the corresponding areas of a consumer created in advance to obtain the consumed data, and the consumed data is stored in HBase through the corresponding HBase interface to obtain the HBase data table.
A new consumer is created and its consumption Topic is set to the Topic created in advance, so that the consumer takes the data in that Topic for distribution. Other attributes may also be set, including the address of the servers that store the Topic and the consumption policy of the consumer, for example earliest. Each Topic contains one or more partitions; under this policy, if a committed consumption offset exists for a partition, consumption resumes from that offset, and if none exists, consumption starts from the beginning, so restarting the program does not lose data. The data stream interface of the first computing engine is then called. In it, the consumer is defined as the data source and points to the pre-created Topic in the message publishing system, where the real-time data is stored in JSON format, and the interface executes a sink operation from the data source to the destination, that is, it stores the real-time data in the HBase database.
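The consumption policy described above can be sketched in plain Python without a message queue client. The functions `start_offset` and `consume_to_sink` and the list used as an HBase stand-in are hypothetical; the sketch only illustrates the rule "resume from the committed offset if there is one, otherwise start from the beginning".

```python
def start_offset(committed_offset, policy="earliest", log_end=0):
    """Choose where a consumer starts reading one partition."""
    if committed_offset is not None:
        return committed_offset  # resume from the committed consumption point
    if policy == "earliest":
        return 0                 # no committed point: start from the beginning
    if policy == "latest":
        return log_end           # no committed point: only read new messages
    raise ValueError(f"unknown policy: {policy}")


def consume_to_sink(partition, committed_offset, sink, policy="earliest"):
    """Copy unread messages into the sink and return the new committed offset."""
    begin = start_offset(committed_offset, policy, len(partition))
    sink.extend(partition[begin:])
    return len(partition)


partition = ["m0", "m1", "m2"]
hbase_sink = []  # stand-in for the HBase table written by the sink operation

committed = consume_to_sink(partition, None, hbase_sink)       # first start
partition.append("m3")
committed = consume_to_sink(partition, committed, hbase_sink)  # after a restart
```

After the simulated restart the sink holds every message once, with nothing lost or repeated.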
It should be noted that the filtering condition for JSON messages is set according to actual needs. For example, with the message format {"ID": "", "name": ""}, where ID is the primary key of each message, the filtering condition can be set to require a valid ID: only JSON messages containing a valid ID are kept, and the rest are discarded. In some embodiments, a preset ID validity rule requires, for example, that the ID contain no special characters or forbidden words. The ID field of each JSON message is compared against the preset ID validity rule; when the ID field conforms to the rule, the JSON string message is stored in the HBase database through the data source, and otherwise the JSON message is discarded.
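A minimal sketch of such a filter follows. The concrete rule is an assumption: the source only says the ID must not contain special characters or forbidden words, so the pattern `ID_PATTERN` and the set `FORBIDDEN_WORDS` are hypothetical placeholders.

```python
import json
import re

# Hypothetical validity rule: letters, digits, underscore, and hyphen only,
# and the whole ID must not be a forbidden word.
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]+")
FORBIDDEN_WORDS = {"null", "drop"}


def is_valid_id(value):
    """Return True when the ID field satisfies the (assumed) validity rule."""
    if not isinstance(value, str) or not ID_PATTERN.fullmatch(value):
        return False
    return value.lower() not in FORBIDDEN_WORDS


def filter_messages(raw_messages):
    """Keep only JSON messages whose ID field is valid; discard the rest."""
    kept = []
    for raw in raw_messages:
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not JSON at all: discard
        if isinstance(message, dict) and is_valid_id(message.get("ID")):
            kept.append(message)
    return kept


raw = [
    '{"ID": "1001", "name": "alice"}',
    '{"ID": "10#01", "name": "bob"}',    # special character in ID
    '{"ID": "null", "name": "carol"}',   # forbidden word
    '{"name": "missing id"}',
    'not json',
]
kept = filter_messages(raw)
```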
After the preset first computing engine has stored the JSON messages from the distributed message queue in HBase and the HBase data table has been obtained, the row key field of the HBase data table is determined according to the number of partitions in the HBase data table.
S13: recording, through HBase operations, the changed row keys in the HBase data table according to the HBase data table and an Oracle mirror table previously obtained in HBase;
In step S13, the Oracle mirror table holds the historical data of the logs in the Oracle data table; a copy of this historical data is stored in HBase so that data can be compared and the changed data identified. Data in the HBase data table is retrieved from a cell by row key, column family, column qualifier, and timestamp, where the column qualifier and timestamp are optional. HBase operations include basic operations such as adding, deleting, modifying, and querying data. A row key represents one row of data, and the changed row key of the row in which the modified data is located is obtained from these operations.
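One way to picture the comparison is sketched below, with both tables modelled as dictionaries from row key to column values. This is an illustrative assumption, not the HBase client API, and the function name `record_changed_row_keys` is hypothetical. Deleted rows are not handled in this sketch.

```python
def record_changed_row_keys(hbase_table, mirror_table):
    """Return row keys whose incoming cells differ from the mirror table.

    Both tables are modelled as {row_key: {column: value}} dictionaries.
    """
    changed = []
    for row_key in sorted(hbase_table):
        new_cells = hbase_table[row_key]
        old_cells = mirror_table.get(row_key)
        if old_cells is None:
            changed.append(row_key)  # row does not exist in the mirror: new row
        elif any(old_cells.get(col) != val for col, val in new_cells.items()):
            changed.append(row_key)  # at least one column value differs
    return changed


mirror_table = {
    "r1": {"name": "alice", "city": "SZ"},
    "r2": {"name": "bob", "city": "SH"},
}
hbase_table = {
    "r1": {"city": "BJ"},                   # updated column
    "r2": {"name": "bob"},                  # no effective change
    "r3": {"name": "carol", "city": "GZ"},  # new row
}
changed_keys = record_changed_row_keys(hbase_table, mirror_table)
```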
S14: restoring, based on a preset second computing engine, all field data of the row corresponding to each changed row key, and storing all the field data in a data warehouse to obtain incremental data;
In step S14, the preset second computing engine uses each changed row key to look up the corresponding row key in the Oracle mirror table in HBase and restores from HBase the incremental data containing all field values. The incremental data is then stored in the data warehouse in the form of a data table.
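The restoration step can be sketched as follows. How the changed fields and the mirror row are combined is an assumption here: the sketch overlays the changed columns on the full mirror row, and the function name `restore_full_rows` and the dictionary model of the tables are hypothetical.

```python
def restore_full_rows(changed_row_keys, hbase_table, mirror_table):
    """Rebuild every field of each changed row.

    Assumption: the mirror row supplies the unchanged fields and the
    HBase data table supplies the changed ones.
    """
    incremental = {}
    for row_key in changed_row_keys:
        full_row = dict(mirror_table.get(row_key, {}))  # all known fields
        full_row.update(hbase_table.get(row_key, {}))   # overlay changed fields
        incremental[row_key] = full_row
    return incremental


mirror_table = {"r1": {"name": "alice", "city": "SZ"}}
hbase_table = {
    "r1": {"city": "BJ"},
    "r3": {"name": "carol", "city": "GZ"},
}
incremental_rows = restore_full_rows(["r1", "r3"], hbase_table, mirror_table)
```

The row for `r1` keeps the unchanged `name` field even though the change message carried only `city`, which is the point of restoring all fields before writing to the data warehouse.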
In this embodiment, the preset second computing engine is the Spark computing engine. The data sources read by the second computing engine may be data sources built into the engine, data sources provided by third-party platforms, or user-defined data sources. The second computing engine is a memory-based computing framework: it provides a cache mechanism that keeps data needed for repeated iterative computation or multiple uses in memory, so after the data is first loaded into memory, later uses do not require disk reads and data reading overhead is reduced. The second computing engine reads the Oracle mirror table data from HBase and restores the incremental data containing all field values.
It should be noted that, based on configuration and plug-ins, the interface through which the second computing engine reads data sources is unified, and all data sources read by the second computing engine are defined in the configuration file, so that each data source corresponds to one plug-in in the configuration file and each plug-in has its own configuration items.
S15: merging, by a scheduling task, the incremental data with the historical data in the data warehouse to obtain real-time updated data in the data warehouse.
In step S15, the scheduling task merges the incremental data with the historical data and refreshes the data in the data warehouse in real time to obtain real-time updated data. Merging includes adding data, deleting data, changing data, and the like.
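A minimal sketch of such a merge is shown below, assuming each incremental record carries an operation type. The `op` and `row` field names and the function `merge_incremental` are hypothetical; a real data warehouse merge would typically be expressed in the warehouse's own query language.

```python
def merge_incremental(historical, incremental):
    """Apply add, delete, and change operations to a copy of the historical data."""
    merged = dict(historical)
    for row_key, change in incremental.items():
        op = change["op"]
        if op == "delete":
            merged.pop(row_key, None)        # delete data
        elif op in ("insert", "update"):
            merged[row_key] = change["row"]  # add or change data
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


historical = {
    "r1": {"name": "alice", "city": "SZ"},
    "r2": {"name": "bob", "city": "SH"},
}
incremental = {
    "r1": {"op": "update", "row": {"name": "alice", "city": "BJ"}},
    "r2": {"op": "delete"},
    "r3": {"op": "insert", "row": {"name": "carol", "city": "GZ"}},
}
merged = merge_incremental(historical, incremental)
```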
When a scheduling task involves many steps, it can be processed at a finer granularity: the scheduling task is divided into one or more finer executable tasks, and an execution method is determined for each. A finer task is an executable task with a smaller scope that is refined from the scheduling task; it may be a specific step in implementing the scheduling task or a subtask subordinate to it. The embodiments of the present application do not limit the degree of refinement, and a finer task may be the smallest executable task. After the scheduling task is divided, the scheduling domain analyzes the finer tasks and obtains their task information, the technical domain executes them according to that information, and the scheduling task is completed by executing the finer tasks. For example, when the incremental data and the historical data are merged, the merge task includes an add task, a delete task, and a change task, so the merge task is refined into three finer tasks.
It should be noted that a condition for automatically triggering a scheduling task may be preset. In one possible implementation, a monitoring index is set for the operating data according to the information in the scheduling task, and when the operating data exceeds the monitoring index, the corresponding scheduling task is triggered. For example, a scheduling task may be triggered once the incremental data reaches a certain quantity, so that the incremental data and the historical data are merged and real-time updated data is obtained in the data warehouse. Alternatively, a time interval may be set, and the incremental data and the historical data are merged at that interval to obtain real-time updated data in the data warehouse.
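The two trigger conditions mentioned above (a quantity threshold and a time interval) can be combined in a small predicate. The function name and the default thresholds are hypothetical values chosen for illustration only.

```python
def should_trigger_merge(pending_rows, seconds_since_last_merge,
                         max_pending_rows=1000, max_interval_seconds=300):
    """Trigger the merge task when either monitoring index is reached."""
    reached_quantity = pending_rows >= max_pending_rows
    reached_interval = seconds_since_last_merge >= max_interval_seconds
    return reached_quantity or reached_interval


quantity_trigger = should_trigger_merge(pending_rows=1500, seconds_since_last_merge=10)
interval_trigger = should_trigger_merge(pending_rows=20, seconds_since_last_merge=300)
no_trigger = should_trigger_merge(pending_rows=20, seconds_since_last_merge=10)
```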
Referring to fig. 2, fig. 2 is a flowchart illustrating an implementation of a method for real-time incremental data synchronization according to another embodiment of the present application. With respect to the embodiment shown in fig. 1, the method for real-time synchronization of incremental data provided in this embodiment further includes step S21 after step S12. The details are as follows:
S21: determining the row key field corresponding to the HBase data table according to the number of partitions in the HBase data table.
In this embodiment, the number of partitions occupied by the data to be stored is determined from the amount of data stored in the HBase data table. The current time of data storage is determined, and the first half of the row key for the data to be stored is determined from the current partition number. A random universally unique identifier (UUID) is generated and used as the second half of the row key. The two halves are combined to give the row key design value for the data to be stored. With this segmented row key design, the first half is a sequential string derived from the number of partitions, so the data forms a contiguous whole and the read efficiency of HBase improves; the second half is a random UUID, so the data is stored discretely within a region and parallel write efficiency improves. The design thus improves both the write performance and the read performance of the data.
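A minimal sketch of a two-part row key follows. The source does not say how the partition number is chosen for a given record, so the modulo over a sequence number, the zero-padded prefix width, and the underscore separator are all assumptions, as is the function name `build_row_key`.

```python
import uuid


def build_row_key(sequence_number, partition_count, prefix_width=3):
    """Build a row key: partition prefix (first half) plus random UUID (second half)."""
    partition = sequence_number % partition_count  # assumed partition choice
    prefix = str(partition).zfill(prefix_width)    # sequential, sortable prefix
    suffix = uuid.uuid4().hex                      # 32 random hex characters
    return f"{prefix}_{suffix}"


key_a = build_row_key(sequence_number=7, partition_count=4)
key_b = build_row_key(sequence_number=7, partition_count=4)
```

Both keys share the prefix for partition 3, so they sort next to each other, while the random suffix keeps them distinct.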
Referring to fig. 3, fig. 3 is a flowchart illustrating an implementation of a method for real-time incremental data synchronization according to yet another embodiment of the present application. With respect to the embodiment shown in fig. 1, the method for real-time synchronization of incremental data provided in this embodiment further includes step S31 before step S13. The details are as follows:
S31: storing, with a tool, the data in the Oracle table on the cluster as a data table in the data warehouse; converting, based on the preset second computing engine, the data of that data table into the binary files used by the underlying HBase storage; and loading the binary files into HBase by data loading to obtain the Oracle mirror table in HBase.
In this embodiment, the Oracle mirror table must be obtained before the changed row keys can be obtained. A tool, in this embodiment a database export tool, stores the data of the Oracle table on the cluster as a data table in the data warehouse so that the full data of the Oracle table can be synchronized to HBase: the data in the warehouse data table is converted into binary files, and the binary files are loaded into HBase by data loading to obtain the Oracle mirror table in HBase. HBase is column-oriented, and the export tool exports a set of files from Oracle to the Oracle mirror table in HBase. The target table must already exist in the database. The input files are read and parsed into a set of records according to the specified delimiters. By default, these records are converted into a set of statements that insert them into the database. The database export tool can also generate statements that replace existing records in the database, and in call mode it issues a stored procedure call for each record.
It should be noted that the export is executed by multiple writers in parallel, and each writer uses its own connection and transaction. The database export tool inserts up to 10 records per statement using multi-row syntax. Every 10 statements, the current transaction within the task is committed, so a commit occurs every 100 rows. This ensures that the transaction buffer does not grow without bound and cause an out-of-memory condition.
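The batching arithmetic can be tried out with a small sketch, assuming 10 rows per statement and a commit every 100 rows as the figures above suggest. It uses Python's built-in `sqlite3` module purely as a stand-in target; the table `mirror`, its columns, and the function `export_rows` are hypothetical and do not represent the database export tool itself.

```python
import sqlite3


def export_rows(connection, rows, rows_per_statement=10, statements_per_commit=10):
    """Insert rows with multi-row statements and commit at a fixed cadence."""
    statements = 0
    commits = 0
    for start in range(0, len(rows), rows_per_statement):
        batch = rows[start:start + rows_per_statement]
        placeholders = ", ".join(["(?, ?)"] * len(batch))
        flat_values = [value for row in batch for value in row]
        connection.execute(
            f"INSERT INTO mirror (id, name) VALUES {placeholders}", flat_values
        )
        statements += 1
        if statements % statements_per_commit == 0:
            connection.commit()  # bounds the size of the open transaction
            commits += 1
    if statements % statements_per_commit != 0:
        connection.commit()      # commit the final partial group
        commits += 1
    return statements, commits


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE mirror (id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"name-{i}") for i in range(250)]
statements, commits = export_rows(connection, rows)
row_count = connection.execute("SELECT COUNT(*) FROM mirror").fetchone()[0]
```

With 250 rows this issues 25 statements and 3 commits: two after 100 rows each and one for the remaining 50.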
It should be noted that the row key serves as the primary key during add, delete, and modify operations and, as in many databases, uniquely identifies one row of records. A row key can be any character string and is stored in HBase as a byte array. Data is stored in the lexicographic order of the row keys; to take full advantage of this sorted storage, rows that are often read together should be stored together. The Oracle mirror table in HBase is compared with the HBase data table to obtain the changed row keys.
Referring to fig. 4, fig. 4 is a flowchart illustrating an implementation of a method for real-time incremental data synchronization according to another embodiment of the present application. With respect to the embodiment shown in fig. 1, the method for real-time synchronization of incremental data provided in this embodiment further includes step S41 before step S15. The details are as follows:
S41: Set the merge scheduling task by using a timed scheduling tool.
In this embodiment, a corresponding scheduling task needs to be set before the data is merged. The task is executed at a specified time, which completes the merging of the incremental data with the historical data in the data warehouse.
In this embodiment, the scheduling tasks may be deployed in a decentralized manner. For example, in distributed mode, an independently deployable distributed job cluster is implemented based on quartz, zk, and the like. The console can then deploy each scheduling task independently, and the scheduling tasks do not affect one another. A deployed scheduling task can be understood as a job; for example, a timed scheduling task deployed on a distributed server cluster may be referred to as a distributed job.
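As a rough illustration of executing a task at a specified time, the sketch below computes the next run time of a daily merge task. It is not the scheduling tool of this embodiment; the function `next_run` and the choice of a once-per-day schedule are assumptions made for the example.

```python
from datetime import datetime, timedelta


def next_run(now, hour, minute=0):
    """Return the next time a daily task scheduled at hour:minute should fire."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate


before_slot = next_run(datetime(2022, 1, 14, 1, 30), hour=2)
after_slot = next_run(datetime(2022, 1, 14, 3, 0), hour=2)
```

A scheduler loop would sleep until the returned time, run the merge, and then call `next_run` again.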
The invention provides a method for real-time synchronization of incremental data. JSON messages obtained by using a tool to parse the logs of an Oracle data table are stored in a distributed message queue, which provides the basis for synchronizing incremental data from the Oracle data table to the big data environment. The JSON messages in the distributed message queue are stored to HBase to obtain an HBase data table; this allows incremental data to be synchronized to the data warehouse efficiently while avoiding the limitations of operating directly in the data warehouse. Change row keys are recorded according to the HBase data table and the pre-obtained Oracle mirror table in HBase. All field data of a row is restored from HBase according to its change row key, which completes the data of update operations, and all the field data is stored into the data warehouse to obtain the incremental data of the data warehouse. The incremental data is merged with the historical data to obtain data warehouse data that is updated in real time. In this way, data in the Oracle database is extracted as it is updated, resource usage on the Oracle database is reduced, and the time needed to update data in the data warehouse is shortened.
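The final merge of incremental and historical data can be sketched as a keyed upsert. This is an illustration under stated assumptions, not the merge logic of the embodiment: rows are plain dictionaries, the `row_key` and `op` field names are hypothetical, and incremental rows are assumed to arrive in change order so that later changes win.

```python
def merge(historical, incremental):
    """Apply incremental rows on top of historical rows, keyed by row key."""
    merged = {row["row_key"]: row for row in historical}
    for row in incremental:
        if row.get("op") == "DELETE":
            merged.pop(row["row_key"], None)
        else:
            # Inserts and updates both replace the stored row with the full
            # restored field data, dropping the bookkeeping "op" field.
            merged[row["row_key"]] = {k: v for k, v in row.items() if k != "op"}
    return [merged[key] for key in sorted(merged)]


historical = [
    {"row_key": "001", "name": "a"},
    {"row_key": "002", "name": "b"},
]
incremental = [
    {"row_key": "002", "name": "B", "op": "UPDATE"},
    {"row_key": "003", "name": "c", "op": "INSERT"},
    {"row_key": "001", "op": "DELETE"},
]
result = merge(historical, incremental)
```

Because each incremental row carries all fields of the changed row, the merge can replace rows wholesale instead of patching individual columns.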
Referring to fig. 5, fig. 5 is a structural block diagram of a device for real-time synchronization of incremental data according to an embodiment of the present application. In this embodiment, the device includes 5 units for executing the steps in the embodiments corresponding to fig. 1 to 4; for details, refer to fig. 1 to 4 and the related descriptions in the corresponding embodiments. For convenience of explanation, only the portions related to this embodiment are shown. Referring to fig. 5, the device 50 for real-time synchronization of incremental data includes a parsing unit 51, a storage unit 52, a changing unit 53, a restoring unit 54, and a merging unit 55, wherein:
the parsing unit 51: configured to store JSON messages, obtained by using a tool to parse the logs in an Oracle data table, to a distributed message queue, wherein the JSON messages represent change events of the database;
the storage unit 52: configured to store, based on a preset first computing engine, the JSON messages in the distributed message queue to HBase to obtain an HBase data table;
the changing unit 53: configured to record the change row keys in the HBase data table through HBase operation processing, according to the HBase data table and the pre-obtained Oracle mirror table in HBase;
the restoring unit 54: configured to restore, based on a preset second computing engine and according to the change row key, all field data of the row of data corresponding to the change row key, and to store all the field data into the data warehouse to obtain incremental data;
the merging unit 55: configured to merge the incremental data with the historical data in the data warehouse by using a scheduling task, to obtain real-time updated data in the data warehouse.
As an embodiment of the present application, the device 50 for real-time synchronization of incremental data further includes:
the determination unit 56: configured to determine the row key field corresponding to the HBase data table according to the number of partitions in the HBase data table.
The loading unit 57: configured to store the data in the Oracle table to a cluster, as a data table in a data warehouse, by using a tool; to convert, based on a preset second computing engine, the data in that data warehouse table in the cluster into binary files of the form stored at the HBase underlying layer; and to load the binary files into HBase by data loading, to obtain the Oracle mirror table in HBase.
The setting unit 58: configured to set the merge scheduling task by using a timed scheduling tool.
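The text does not spell out how the row key is derived from the number of partitions. One common technique, shown here only as an assumed example, is to prefix the key with a hash bucket so that rows spread evenly across partitions; the function `salted_row_key` and the key layout are hypothetical.

```python
import hashlib


def salted_row_key(primary_key: str, partitions: int) -> bytes:
    """Build a row key prefixed with a bucket derived from the partition count."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % partitions
    # Zero-pad the bucket so that all keys share the same prefix width and
    # sort by bucket first in lexicographic byte order.
    width = len(str(partitions - 1))
    return f"{bucket:0{width}d}_{primary_key}".encode("utf-8")


key = salted_row_key("1001", partitions=8)
bucket = int(key.split(b"_")[0])
```

The same primary key always maps to the same bucket, so a row can still be looked up directly once its primary key is known.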
As an embodiment of the present application, the device 50 for real-time synchronization of incremental data further includes:
the first execution unit 59: configured to send the JSON messages to the corresponding area in a consumer created in advance, to obtain consumption data; and to store the consumption data to HBase through the corresponding interface in HBase, to obtain the HBase data table.
As an embodiment of the present application, the parsing unit 51 is specifically configured to send the logs in the Oracle data table to a preset storage system for storage, to obtain target logs; and to parse the target logs with a tool to obtain JSON messages and store the JSON messages to the distributed message queue.
As an embodiment of the present application, the storage unit 52 is specifically configured to obtain the interface corresponding to HBase by declaring and assigning an interface in a preset first computing engine program; and to store the JSON messages in the distributed message queue to HBase through the corresponding interface in HBase, to obtain the HBase data table.
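The flow from parsed change events to the HBase data table can be sketched end to end. This is a simplified stand-in rather than the described system: Python's `queue.Queue` takes the place of the distributed message queue, a dictionary takes the place of the HBase table, and the JSON field names (`op`, `row_key`, `columns`) are assumptions, since the text does not specify the message layout.

```python
import json
import queue


def publish(change_events, message_queue):
    """Serialize each change event as a JSON message and enqueue it."""
    for event in change_events:
        message_queue.put(json.dumps(event))


def consume(message_queue, table):
    """Drain the queue and apply each change event to the keyed table."""
    while not message_queue.empty():
        event = json.loads(message_queue.get())
        row_key = event["row_key"].encode("utf-8")  # row keys are byte arrays
        if event["op"] == "DELETE":
            table.pop(row_key, None)
        else:
            # An update may carry only the changed columns, so merge them
            # into whatever is already stored for this row.
            table.setdefault(row_key, {}).update(event["columns"])
    return table


events = [
    {"op": "INSERT", "row_key": "001", "columns": {"name": "a", "city": "x"}},
    {"op": "UPDATE", "row_key": "001", "columns": {"city": "y"}},
    {"op": "INSERT", "row_key": "002", "columns": {"name": "b"}},
    {"op": "DELETE", "row_key": "002", "columns": {}},
]
mq = queue.Queue()
publish(events, mq)
table = consume(mq, {})
```

The update event in this example carries only the changed column, which is why a later step has to restore all field data of the row before it is written to the data warehouse.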
It should be understood that, in the structural block diagram of the device for real-time synchronization of incremental data shown in fig. 5, each unit is used to execute the steps in the embodiments corresponding to fig. 1 to 4. These steps have been explained in detail in the above embodiments; for details, refer to fig. 1 to 4 and the descriptions in the corresponding embodiments, which are not repeated here.
In one embodiment, a computer device is provided. The computer device is a server, and its internal structure may be as shown in fig. 6. The computer device 60 includes a processor 61, an internal memory 63, and a network interface 64 connected by a system bus 62. The processor 61 of the computer device provides computing and control capabilities. The memory of the computer device 60 includes a readable storage medium 65 and the internal memory 63. The readable storage medium 65 stores an operating system 66, computer readable instructions 67, and a database 68. The internal memory 63 provides an environment for running the operating system 66 and the computer readable instructions 67 stored in the readable storage medium 65. The database 68 of the computer device 60 stores data relating to the method for real-time synchronization of incremental data. The network interface 64 of the computer device 60 is used to communicate with an external terminal through a network connection. The computer readable instructions 67, when executed by the processor 61, implement a method for real-time synchronization of incremental data. The readable storage medium 65 provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the above embodiments may be implemented by computer readable instructions directing the relevant hardware. The computer readable instructions may be stored in a non-volatile readable storage medium or a volatile readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for real-time synchronization of incremental data, comprising:
storing JSON messages, obtained by using a tool to parse logs in an Oracle data table, to a distributed message queue, wherein the JSON messages represent change events of the database;
storing, based on a preset first computing engine, the JSON messages in the distributed message queue to HBase to obtain an HBase data table;
recording change row keys in the HBase data table through HBase operation processing, according to the HBase data table and a pre-obtained Oracle mirror table in HBase;
restoring, based on a preset second computing engine and according to the change row key, all field data of the row of data corresponding to the change row key, and storing all the field data into a data warehouse to obtain incremental data;
and merging the incremental data with historical data in the data warehouse by using a scheduling task, to obtain real-time updated data in the data warehouse.
2. The method for real-time synchronization of incremental data according to claim 1, wherein the step of storing the JSON messages to the distributed message queue comprises:
sending the logs in the Oracle data table to a preset storage system for storage, to obtain target logs;
and parsing the target logs with a tool to obtain the JSON messages, and storing the JSON messages to the distributed message queue.
3. The method for real-time synchronization of incremental data according to claim 1, wherein the step of storing, based on the preset first computing engine, the JSON messages in the distributed message queue to HBase to obtain the HBase data table comprises:
obtaining an interface corresponding to HBase by declaring and assigning an interface in a preset first computing engine program;
and storing the JSON messages in the distributed message queue to HBase through the corresponding interface in HBase, to obtain the HBase data table.
4. The method for real-time synchronization of incremental data according to claim 3, wherein the step of storing the JSON messages in the distributed message queue to HBase through the corresponding interface in HBase to obtain the HBase data table comprises:
sending the JSON messages to a corresponding area in a consumer created in advance, to obtain consumption data;
and storing the consumption data to HBase through the corresponding interface in HBase, to obtain the HBase data table.
5. The method for real-time synchronization of incremental data according to claim 1, wherein after the step of storing, based on the preset first computing engine, the JSON messages in the distributed message queue to HBase to obtain the HBase data table, the method further comprises:
determining a row key field corresponding to the HBase data table according to the number of partitions in the HBase data table.
6. The method for real-time synchronization of incremental data according to claim 1, wherein before the step of recording the change row keys in the HBase data table through HBase operation processing, according to the HBase data table and the pre-obtained Oracle mirror table in HBase, the method further comprises:
storing the data in the Oracle table to a cluster, as a data table in a data warehouse, by using a tool;
developing a batch computing program based on the preset second computing engine, and converting the data in the data warehouse data table in the cluster into binary files of the form stored at the HBase underlying layer;
and loading the binary files into HBase by data loading, to obtain the Oracle mirror table in HBase.
7. The method for real-time synchronization of incremental data according to claim 1, wherein before the step of merging the incremental data with the historical data in the data warehouse by using the scheduling task to obtain the real-time updated data in the data warehouse, the method further comprises:
setting the merge scheduling task by using a timed scheduling tool.
8. A device for real-time synchronization of incremental data, comprising:
a parsing unit, configured to store JSON messages, obtained by using a tool to parse logs in an Oracle data table, to a distributed message queue, wherein the JSON messages represent change events of the database;
a storage unit, configured to store, based on a preset first computing engine, the JSON messages in the distributed message queue to HBase to obtain an HBase data table;
a changing unit, configured to record change row keys in the HBase data table through HBase operation processing, according to the HBase data table and a pre-obtained Oracle mirror table in HBase;
a restoring unit, configured to restore, based on a preset second computing engine and according to the change row key, all field data of the row of data corresponding to the change row key, and to store all the field data into a data warehouse to obtain incremental data;
a merging unit, configured to merge the incremental data with historical data in the data warehouse by using a scheduling task, to obtain real-time updated data in the data warehouse.
9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a computer to perform the steps of the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210043391.XA CN114385760A (en) 2022-01-14 2022-01-14 Method and device for real-time synchronization of incremental data, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210043391.XA CN114385760A (en) 2022-01-14 2022-01-14 Method and device for real-time synchronization of incremental data, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114385760A true CN114385760A (en) 2022-04-22

Family

ID=81202127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210043391.XA Pending CN114385760A (en) 2022-01-14 2022-01-14 Method and device for real-time synchronization of incremental data, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114385760A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579667A (en) * 2022-04-28 2022-06-03 深圳市华曦达科技股份有限公司 Method, device and system for incremental synchronization of HBase data
CN116414902A (en) * 2023-03-31 2023-07-11 华能信息技术有限公司 Quick data source access method
CN117453730A (en) * 2023-12-21 2024-01-26 深圳海智创科技有限公司 Data query method, device, equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579667A (en) * 2022-04-28 2022-06-03 深圳市华曦达科技股份有限公司 Method, device and system for incremental synchronization of HBase data
CN116414902A (en) * 2023-03-31 2023-07-11 华能信息技术有限公司 Quick data source access method
CN116414902B (en) * 2023-03-31 2024-06-04 华能信息技术有限公司 Quick data source access method
CN117453730A (en) * 2023-12-21 2024-01-26 深圳海智创科技有限公司 Data query method, device, equipment and storage medium
CN117453730B (en) * 2023-12-21 2024-03-08 深圳海智创科技有限公司 Data query method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN114385760A (en) Method and device for real-time synchronization of incremental data, computer equipment and storage medium
US11604804B2 (en) Data replication system
US8938421B2 (en) Method and a system for synchronizing data
US20210081358A1 (en) Background dataset maintenance
CN111209126A (en) Data transmission method and device between microservices and electronic equipment
CN109669976A (en) Data service method and equipment based on ETL
CN115544007A (en) Label preprocessing method and device, computer equipment and storage medium
CN110019169B (en) Data processing method and device
CN107577809A (en) Offline small documents processing method and processing device
CN113704267A (en) Data query method, system, equipment and storage medium based on elastic search
CN112965939A (en) File merging method, device and equipment
CN112948504A (en) Data acquisition method and device, computer equipment and storage medium
CN113986942B (en) Message queue management method and device based on man-machine conversation
CN108846002B (en) Label real-time updating method and system
CN113326401B (en) Method and system for generating field blood relationship
CN111026764B (en) Data storage method and device, electronic product and storage medium
CN115203260A (en) Abnormal data determination method and device, electronic equipment and storage medium
JP2023546818A (en) Transaction processing method, device, electronic device, and computer program for database system
CN113612832A (en) Streaming data distribution method and system
CN111274316B (en) Method and device for executing multi-level data stream task, electronic equipment and storage medium
CN114490865A (en) Database synchronization method, device, equipment and computer storage medium
CN115221125A (en) File processing method and device, electronic equipment and readable storage medium
CN113568966A (en) Data processing method and system used between ODS layer and DW layer
CN113868138A (en) Method, system, equipment and storage medium for acquiring test data
WO2019134238A1 (en) Method for executing auxiliary function, device, storage medium, and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination