CN112948486A - Batch data synchronization method and system and electronic equipment - Google Patents


Info

Publication number
CN112948486A
Authority
CN
China
Prior art keywords
data
synchronized
batch
synchronization
file system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110156442.5A
Other languages
Chinese (zh)
Inventor
闫宇新
袁孝锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN202110156442.5A priority Critical patent/CN112948486A/en
Publication of CN112948486A publication Critical patent/CN112948486A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1734 Details of monitoring file system events, e.g. by the use of hooks, filter drivers, logs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526 Plug-ins; Add-ons

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a batch data synchronization method, system, electronic device and computer readable medium. The method comprises the following steps: a big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system; a heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; a data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time so as to realize batch synchronization of the data to be synchronized. The batch data synchronization method, system, electronic device and computer readable medium can solve the problem of poor timeliness of batch data synchronization in the prior art, and can synchronize batch data quickly and accurately without increasing the network burden.

Description

Batch data synchronization method and system and electronic equipment
Technical Field
The present disclosure relates to the field of computer information processing, and in particular, to a batch data synchronization method, system, electronic device, and computer readable medium.
Background
Database operations often become the bottleneck of a system. The read pressure of a typical system is far greater than the write pressure, so system performance can be improved by separating database reads from writes. The master database is responsible for write operations and the slave database for read operations; multiple slave databases can be deployed according to the load to improve read speed, which improves the overall performance of the system. To realize read-write separation, the problem of data synchronization between the master database and the slave databases must be solved, so that the data in the slave databases is updated after data is written to the master database.
Sqoop (SQL-to-Hadoop) is a bridge between traditional relational databases and Hadoop. It is used for importing data from a relational database into the Hadoop ecosystem (such as HDFS, HBase and Hive); data can also be extracted from the Hadoop ecosystem and exported to a relational database. Sqoop uses MapReduce to speed up data transfer and transmits data in batch mode. However, synchronizing data through Sqoop consumes a large number of access control list resources in the security-isolated environment, and data transmission is slow when data is synchronized offline.
To solve the dilemma in the prior art, the present disclosure provides a new batch data synchronization method, system, electronic device, and computer readable medium.
The above information disclosed in this background section is only for enhancement of understanding of the background of the disclosure and therefore it may contain information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
In view of this, the present disclosure provides a batch data synchronization method, system, electronic device and computer readable medium, which can solve the problem of poor timeliness of batch data synchronization in the prior art, and perform batch data synchronization quickly and accurately without increasing network load.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, a batch data synchronization method is provided, the method including: a big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system; a heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; a data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time so as to realize batch synchronization of the data to be synchronized.
Optionally, the method further comprises: the distributed publish-subscribe message system obtains service related data.
Optionally, before the big data distributed stream data flow engine stores the data to be synchronized to the first preset position of the distributed file system, the method includes: the big data distributed stream data flow engine acquires the service related data in a consumption mode; and acquires the data to be synchronized based on the service related data.
Optionally, the storing, by the big data distributed stream data flow engine, of the data to be synchronized to a first preset position of the distributed file system includes: the big data distributed stream data flow engine acquires configuration information of task metadata; parses the table to be synchronized based on the service attribute of the task metadata; and stores the parsed table to be synchronized to a first preset position of the distributed file system.
Optionally, storing the parsed table to be synchronized to a first preset position of the distributed file system includes: performing sink parsing on the table to be synchronized to generate the data to be synchronized; and storing the table to be synchronized to a first preset position of the Hadoop Distributed File System.
Optionally, the storing, by the heterogeneous data source offline synchronization framework, of the historical data of the data to be synchronized to a second preset position of the distributed file system includes: the DataX framework stores the historical data of the data to be synchronized to a second preset position of the Hadoop Distributed File System.
Optionally, the storing, by the data warehouse tool, of the data to be synchronized in partitions according to preset rules includes: Hive stores the data to be synchronized in partitions by time interval to generate a plurality of time partitions.
Optionally, the batch merging the to-be-synchronized data stored in the partition at a preset time to realize batch synchronization of the to-be-synchronized data includes: extracting data to be synchronized in the plurality of time partitions at preset time; merging the data to be synchronized in the plurality of time partitions to generate a full table so as to realize batch synchronization of the data to be synchronized.
Optionally, the acquiring, by the distributed publish-subscribe message system, of service related data includes: the service database enables a binlog function; the service related data is parsed based on the binlog function; and the parsed service related data is pushed to the distributed publish-subscribe message system.
Optionally, parsing the service related data based on the binlog function includes: parsing the service related data based on the binlog function and the MySQL database incremental log parsing mode.
According to an aspect of the present disclosure, there is provided a batch data synchronization system, the system including: a big data distributed stream data flow engine, used for storing data to be synchronized to a first preset position of a distributed file system; a heterogeneous data source offline synchronization framework, used for storing the historical data of the data to be synchronized to a second preset position of the distributed file system; and a data warehouse tool, used for storing the data to be synchronized in partitions according to preset rules, and for merging the data to be synchronized stored in the partitions in batches at a preset time so as to realize batch synchronization of the data to be synchronized.
Optionally, the system further comprises: a distributed publish-subscribe message system, used for acquiring the service related data.
Optionally, the big data distributed stream data flow engine is further configured to obtain the service related data in a consumption manner, and to acquire the data to be synchronized based on the service related data.
Optionally, the big data distributed stream data flow engine is further configured to obtain configuration information of the task metadata; parse the table to be synchronized based on the service attribute of the task metadata; and store the parsed table to be synchronized to a first preset position of the distributed file system.
Optionally, the big data distributed stream data flow engine is further configured to perform sink parsing on the table to be synchronized to generate the data to be synchronized, and to store the table to be synchronized to a first preset position of the Hadoop Distributed File System.
Optionally, the heterogeneous data source offline synchronization framework is further configured to store the historical data of the data to be synchronized to a second preset position of the Hadoop Distributed File System.
Optionally, the data warehouse tool is further configured to store the data to be synchronized in partitions by time interval, generating a plurality of time partitions.
Optionally, the data warehouse tool is further configured to extract the data to be synchronized from the plurality of time partitions at a preset time, and to merge the data to be synchronized in the plurality of time partitions to generate a full table so as to realize batch synchronization of the data to be synchronized.
Optionally, the distributed publish-subscribe message system is further configured to have the binlog function enabled in the service database; to have the service related data parsed based on the binlog function; and to receive the parsed service related data pushed to it.
Optionally, the distributed publish-subscribe message system is further configured to have the service related data parsed based on the binlog function and the incremental log parsing mode of the MySQL database.
According to an aspect of the present disclosure, an electronic device is provided, the electronic device including: one or more processors; and storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as above.
According to an aspect of the disclosure, a computer-readable medium is proposed, on which a computer program is stored, which program, when being executed by a processor, carries out the method as above.
According to the batch data synchronization method, system, electronic device and computer readable medium of the present disclosure, the big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system; the heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; the data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time to realize batch synchronization of the data to be synchronized. This approach can solve the problem of poor timeliness of batch data synchronization in the prior art, and can synchronize batch data quickly and accurately without increasing the network burden.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings. The drawings described below are merely some embodiments of the present disclosure, and other drawings may be derived from those drawings by those of ordinary skill in the art without inventive effort.
FIG. 1 is a system diagram illustrating a batch data synchronization system in accordance with an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of batch data synchronization in accordance with an exemplary embodiment.
FIG. 3 is a flowchart illustrating a method of batch data synchronization according to another exemplary embodiment.
FIG. 4 is a schematic diagram illustrating a method of batch data synchronization in accordance with another exemplary embodiment.
FIG. 5 is a block diagram illustrating a batch data synchronization system in accordance with an exemplary embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
FIG. 7 is a block diagram illustrating a computer-readable medium in accordance with an example embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, systems, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various components, these components should not be limited by these terms. These terms are used to distinguish one element from another. Thus, a first component discussed below may be termed a second component without departing from the teachings of the disclosed concept. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It is to be understood by those skilled in the art that the drawings are merely schematic representations of exemplary embodiments, and that the blocks or processes shown in the drawings are not necessarily required to practice the present disclosure and are, therefore, not intended to limit the scope of the present disclosure.
FIG. 1 is a system diagram illustrating a batch data synchronization system in accordance with an exemplary embodiment.
As shown in fig. 1, the system architecture 10 may include a big data distributed stream data flow engine 101, a heterogeneous data source offline synchronization framework 102, a data warehouse tool 103, a business server 104, a distributed publish-subscribe message system 105, and a distributed file system 106. A network serves as the medium providing communication links between the big data distributed stream data flow engine 101, the heterogeneous data source offline synchronization framework 102, the data warehouse tool 103, the business server 104, and the distributed publish-subscribe message system 105. The network may include various connection types, such as wired or wireless communication links or fiber optic cables.
The user can use the business server 104 to perform business processing and generate business data. The business server 104 may have various communication client applications installed thereon, such as a financial services application, a shopping application, a web browser application, an instant messaging tool, a mailbox client, social platform software, and the like.
The big data distributed stream data flow engine 101, the heterogeneous data source offline synchronization framework 102, the data warehouse tool 103, and the distributed publish-subscribe message system 105 may be servers that provide various services, such as background servers that provide data synchronization for data generated by the business server 104. The big data distributed stream data flow engine 101, the heterogeneous data source offline synchronization framework 102, the data warehouse tool 103, and the distributed publish-subscribe message system 105 may analyze and process the received business data, and perform data synchronization.
The distributed publish-subscribe message system 105 may, for example, obtain business related data; the big data distributed stream data flow engine 101 may, for example, obtain the business related data by consumption, and acquire the data to be synchronized based on the business related data.
The big data distributed stream data flow engine 101 may, for example, store data to be synchronized to a first preset location of the distributed file system; the heterogeneous data source offline synchronization framework 102 may, for example, store historical data of the data to be synchronized to a second preset location of the distributed file system; the data warehouse tool 103 may, for example, store the data to be synchronized in partitions according to preset rules, and merge the data to be synchronized stored in the partitions in batches at a preset time so as to realize batch synchronization of the data to be synchronized.
The big data distributed stream data flow engine 101, the heterogeneous data source offline synchronization framework 102, the data warehouse tool 103, the business server 104, and the distributed publish-subscribe message system 105 may each be a single physical server, or may each be composed of a plurality of servers. It should be noted that the batch data synchronization method provided by the embodiment of the present disclosure may be executed jointly by the big data distributed stream data flow engine 101, the heterogeneous data source offline synchronization framework 102, the data warehouse tool 103, and the distributed publish-subscribe message system 105.
FIG. 2 is a flow chart illustrating a method of batch data synchronization in accordance with an exemplary embodiment. The batch data synchronization method 20 includes at least steps S202 to S208.
As shown in fig. 2, in S202, the big data distributed stream data flow engine stores data to be synchronized to a first preset location of the distributed file system. The big data distributed stream data flow engine (Flink) may, for example, obtain configuration information of the task metadata, parse the table to be synchronized based on the service attribute of the task metadata, and store the parsed table to be synchronized to the first preset location of the distributed file system.
In one embodiment, a Flink program is mapped to a streaming dataflow when it is executed. Each Flink dataflow starts with one or more sources (data inputs) and ends with one or more sinks (data outputs), and Flink may perform any number of transformations on the stream.
Storing the parsed table to be synchronized to the first preset location of the distributed file system includes: performing sink parsing on the table to be synchronized to generate the data to be synchronized; and storing the table to be synchronized to a first preset location of the HDFS (Hadoop Distributed File System).
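The filtering and routing implied by S202 can be sketched as follows. This is a minimal illustration in plain Python rather than Flink code; the `TaskMetadata` fields, the event layout and the `/data/incr` path are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskMetadata:
    """Hypothetical task configuration: one database and the tables to synchronize."""
    database: str
    tables: frozenset
    base_path: str  # hypothetical root of the "first preset location"


def route_events(events, meta):
    """Keep only events for configured tables and group them by target directory."""
    routed = {}
    for event in events:
        if event["database"] != meta.database or event["table"] not in meta.tables:
            continue  # data stream filtering: drop tables that are not synchronized
        target = f"{meta.base_path}/{event['database']}/{event['table']}"
        routed.setdefault(target, []).append(event)
    return routed


meta = TaskMetadata("orders_db", frozenset({"orders", "payments"}), "/data/incr")
events = [
    {"database": "orders_db", "table": "orders", "type": "insert", "data": {"id": 1}},
    {"database": "orders_db", "table": "audit_log", "type": "insert", "data": {"id": 9}},
    {"database": "orders_db", "table": "payments", "type": "update", "data": {"id": 4}},
]
routed = route_events(events, meta)
```

In this sketch the `audit_log` event is dropped because the table is not listed in the task metadata, and each remaining event is grouped under the directory of its table.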
The HDFS may adopt a Master-Slave (Master/Slave) structure model, and the HDFS cluster may be composed of one NameNode and a plurality of DataNodes. The NameNode is used as a main server and used for managing the naming space of the file system and the access operation of a client to the file; the DataNode in the cluster manages the stored data.
In S204, the heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset location of the distributed file system. For example, the DataX framework stores the historical data of the data to be synchronized to a second preset location of the Hadoop Distributed File System.
DataX is a data synchronization framework that abstracts the synchronization of different data sources into Reader plug-ins, which read data from the source, and Writer plug-ins, which write data to the target; in theory, the DataX framework can therefore support data synchronization for any type of data source. In addition, the DataX plug-in system forms an ecosystem: each newly added data source can exchange data with the existing data sources.
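The Reader/Writer abstraction described above can be illustrated with a small sketch. It is written in plain Python and is not the DataX plug-in API; the class names `ListReader` and `ListWriter` and the function `run_job` are hypothetical names chosen for illustration.

```python
from typing import Iterable, Iterator, Protocol


class Reader(Protocol):
    def read(self) -> Iterator[dict]: ...


class Writer(Protocol):
    def write(self, records: Iterable[dict]) -> int: ...


class ListReader:
    """Toy source: yields rows from an in-memory list."""

    def __init__(self, rows):
        self._rows = rows

    def read(self):
        yield from self._rows


class ListWriter:
    """Toy target: appends rows to an in-memory list and reports how many were written."""

    def __init__(self):
        self.rows = []

    def write(self, records):
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before


def run_job(reader: Reader, writer: Writer) -> int:
    """The framework only connects a reader to a writer; it knows neither data source."""
    return writer.write(reader.read())


target = ListWriter()
copied = run_job(ListReader([{"id": 1}, {"id": 2}]), target)
```

Because `run_job` depends only on the two interfaces, adding a new source or target means writing one new reader or writer, which can then be paired with every existing plug-in.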
In S206, the data warehouse tool stores the data to be synchronized in partitions according to a preset rule. For example, Hive stores the data to be synchronized in partitions by time interval, generating a plurality of time partitions.
Hive is a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it can store, query and analyze large-scale data stored in Hadoop. Hive can map structured data files to database tables, provides an SQL query function, and can convert SQL statements into MapReduce tasks for execution. Hive is built on top of Hadoop, which is a static batch processing system.
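Partitioning by time interval can be sketched as below, assuming hourly partitions keyed by `dt` and `hour`. The partition column names, the table name `orders_incr`, the path layout and the use of UTC are assumptions made for illustration; the generated statement uses the standard HiveQL `ALTER TABLE ... ADD IF NOT EXISTS PARTITION ... LOCATION` form.

```python
from datetime import datetime, timezone


def hour_partition(ts_ms):
    """Return (dt, hour) for an event timestamp in milliseconds; UTC is an assumption."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return t.strftime("%Y-%m-%d"), t.strftime("%H")


def add_partition_statement(table, ts_ms, base_path):
    """Build a HiveQL statement that maps one hourly directory to a partition."""
    dt, hour = hour_partition(ts_ms)
    location = f"{base_path}/{table}/dt={dt}/hour={hour}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}', hour='{hour}') LOCATION '{location}'"
    )


event_time = int(datetime(2021, 2, 4, 10, 30, tzinfo=timezone.utc).timestamp() * 1000)
statement = add_partition_statement("orders_incr", event_time, "/data/incr/orders_db")
```

A scheduler could run such a statement once per hour so that the files written during that hour become queryable as one partition.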
In S208, the data to be synchronized stored in the partitions is merged in batches at a preset time to realize batch synchronization of the data to be synchronized. Hive extracts the data to be synchronized from the plurality of time partitions at the preset time, and merges the data to be synchronized in the plurality of time partitions to generate a full table, so as to realize batch synchronization of the data to be synchronized.
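The merge in S208 can be sketched in plain Python. The disclosure does not specify the merge rule, so this sketch assumes that rows are identified by a primary key, that later change events override earlier ones, and that a delete event removes the row; the field names `id`, `ts`, `type` and `data` are hypothetical.

```python
def merge_to_full_table(history, hourly_partitions, key="id", ts="ts"):
    """Merge a historical snapshot with hourly change events into one full table.

    Later events win and a "delete" event removes the row (assumed merge rule).
    """
    full = {row[key]: row for row in history}
    changes = [event for part in hourly_partitions for event in part]
    for event in sorted(changes, key=lambda e: e[ts]):  # replay in time order
        if event["type"] == "delete":
            full.pop(event["data"][key], None)
        else:  # insert or update: keep the newest row image
            full[event["data"][key]] = event["data"]
    return sorted(full.values(), key=lambda r: r[key])


history = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
hour_00 = [
    {"type": "update", "ts": 100, "data": {"id": 1, "amount": 15}},
    {"type": "insert", "ts": 110, "data": {"id": 3, "amount": 30}},
]
hour_01 = [
    {"type": "delete", "ts": 200, "data": {"id": 2}},
    {"type": "update", "ts": 210, "data": {"id": 1, "amount": 18}},
]
full_table = merge_to_full_table(history, [hour_00, hour_01])
```

Here the historical snapshot plays the role of the data stored at the second preset location, and the hourly lists play the role of the time partitions.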
According to the batch data synchronization method of the present disclosure, the big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system; the heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; the data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time to realize batch synchronization of the data to be synchronized. This approach can solve the problem of poor timeliness of batch data synchronization in the prior art, and can synchronize batch data quickly and accurately without increasing the network burden.
According to the batch data synchronization method of the present disclosure, data can be synchronized in batches: a single application can synchronize the required tables of one database, which occupies fewer resources, and a data stream filtering scheme is used, which improves data synchronization efficiency.
It should be clearly understood that this disclosure describes how to make and use particular examples, but the principles of this disclosure are not limited to any details of these examples. Rather, these principles can be applied to many other embodiments based on the teachings of the present disclosure.
FIG. 3 is a flowchart illustrating a method of batch data synchronization according to another exemplary embodiment. The flow 30 shown in fig. 3 is a detailed description of the "acquiring service related data by the distributed publish-subscribe message system".
As shown in fig. 3, in S302, the service database turns on the binlog function. The business database may be a MySQL database. Wherein binlog is a binary log file for recording MySQL data updates or potential updates, and all operations recorded by binlog actually have corresponding event types.
In S304, the service related data is parsed based on the binlog function. More specifically, the service related data can be parsed based on the binlog function and the MySQL database incremental log parsing mode.
In S306, the parsed service related data is pushed to the distributed publish-subscribe message system. For example, a customized Canal/Maxwell application can parse the service data and assemble it into json-formatted data, which is pushed to the distributed publish-subscribe message system.
Canal and Maxwell are based on database incremental log parsing, provide incremental data subscription and consumption, and currently mainly support MySQL.
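Assembling a parsed row change into json-formatted data might look like the following sketch. The disclosure does not give the message layout, so the field names and the one-topic-per-database naming rule are hypothetical; the actual publishing call to the message system is omitted.

```python
import json


def to_message(database, table, event_type, row, ts_ms):
    """Assemble one parsed row change into a JSON string (field names are hypothetical)."""
    message = {
        "database": database,
        "table": table,
        "type": event_type,  # e.g. "insert", "update" or "delete"
        "ts": ts_ms,
        "data": row,
    }
    return json.dumps(message, sort_keys=True, ensure_ascii=False)


def topic_for(database):
    """Hypothetical naming rule: one topic per source database."""
    return f"binlog.{database}"


payload = to_message("orders_db", "orders", "insert", {"id": 1, "amount": 10}, 1612434600000)
topic = topic_for("orders_db")
```

Carrying the database and table names inside every message is what lets a downstream consumer filter a shared topic down to the tables it is configured to synchronize.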
In S308, the big data distributed stream data flow engine acquires the service related data in a consumption manner.
The Kafka consumer in Flink is implemented as a stateful operator that integrates with Flink's checkpoint mechanism; its state is the read offsets of all Kafka partitions. When a checkpoint is triggered, the offset of each partition is stored in the checkpoint. Flink's checkpoint mechanism ensures that the stored state of all operator tasks is consistent. A checkpoint is considered complete when all operator tasks have successfully stored their state.
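The idea of keeping per-partition read offsets as checkpointed state can be shown with a toy model. This is not the Flink or Kafka API; `OffsetTrackingConsumer` and its methods are hypothetical names, and the partitions are simple in-memory lists.

```python
class OffsetTrackingConsumer:
    """Toy model of a stateful consumer whose state is one read offset per partition."""

    def __init__(self, partitions):
        self.log = partitions                      # partition id -> list of records
        self.offsets = {p: 0 for p in partitions}  # next offset to read per partition

    def poll(self, partition):
        record = self.log[partition][self.offsets[partition]]
        self.offsets[partition] += 1
        return record

    def snapshot(self):
        """Called when a checkpoint is triggered: copy the offsets into the checkpoint."""
        return dict(self.offsets)

    def restore(self, checkpoint):
        """Called on recovery: resume reading from the checkpointed offsets."""
        self.offsets = dict(checkpoint)


consumer = OffsetTrackingConsumer({0: ["a", "b", "c"], 1: ["x", "y"]})
consumer.poll(0)
consumer.poll(1)
checkpoint = consumer.snapshot()
consumer.poll(0)               # reads "b" after the checkpoint was taken
consumer.restore(checkpoint)   # simulate a failure followed by recovery
replayed = consumer.poll(0)    # "b" is read again from the restored offset
```

Restoring the offsets replays every record read after the last completed checkpoint, which is why the downstream merge step benefits from deduplication.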
In S310, the data to be synchronized is obtained based on the service-related data.
FIG. 4 is a schematic diagram illustrating a method of batch data synchronization in accordance with another exemplary embodiment. First, the binlog function is enabled in the business MySQL server; a customized Canal/Maxwell application parses the binlog of the business MySQL and assembles the result into data in a designated json format, which is pushed to Kafka;
second, the Flink program acquires the MySQL configuration of the task metadata, parses the data of the corresponding Kafka topic, and, every hour, parses and sinks the tables of the same library that need synchronization to the specified position in HDFS;
then, the DataX program acquires the MySQL configuration of the task metadata, and synchronizes the historical data of the table to be synchronized to the specified position of the HDFS where the corresponding table is located;
then, the Hive adding partition task adds the data mapping of the Flink parsing sink into the Hive hour partition every hour;
finally, in the early morning of each day, the corresponding Hive hourly tables can be merged to generate a full table.
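The hourly steps of this flow can be sketched in Python as follows. The directory layout (`dt=.../hour=...`), the `_hourly` table suffix, and the function name `hour_partition` are assumptions made for illustration; the disclosure only states that data is sunk to a specified HDFS position and mapped into a Hive hour partition every hour.

```python
from datetime import datetime


def hour_partition(base_path, table, event_time):
    """Return (hdfs_dir, add_partition_sql) for the hour containing event_time."""
    dt = event_time.strftime("%Y%m%d")
    hour = event_time.strftime("%H")
    # Hypothetical directory into which the stream engine sinks this hour's data.
    location = f"{base_path}/{table}/dt={dt}/hour={hour}"
    # Hive statement mapping that directory into the matching hour partition.
    sql = (
        f"ALTER TABLE {table}_hourly ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}', hour='{hour}') LOCATION '{location}'"
    )
    return location, sql


location, sql = hour_partition(
    "/warehouse/sync", "orders", datetime(2021, 2, 4, 9, 30)
)
```

Deriving both the sink directory and the partition statement from the same timestamp keeps the two hourly tasks aligned without any shared state between them.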
The batch data synchronization method of the present invention can greatly reduce the access control list resources required in the safe3 isolation environment: in the embodiment of the invention, one MySQL library needs only 1 access control list resource, whereas Sqoop synchronization in the prior art needs 14. The method can also ensure the timeliness of data synchronization: the incremental data is spread across each hour, so only operations such as deduplication and merging need to be carried out on the data, and the whole process can be completed within 1 hour. In the prior art, the same processing through Sqoop tasks takes 2 to 3 hours, so the efficiency is at least doubled.
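The deduplication and merging mentioned above might look like the following Python sketch, which combines a historical snapshot with hourly change batches to build a full table. The record layout (`type`, `ts`, `data`), the primary-key column `id`, and the function name `merge_full_table` are assumptions for illustration, not details taken from the disclosure.

```python
def merge_full_table(history, hourly_batches, pk="id"):
    """Merge a historical snapshot with hourly change batches; newest change wins."""
    # Start from the historical data, indexed by primary key.
    full = {row[pk]: row for row in history}
    changes = [change for batch in hourly_batches for change in batch]
    # Apply changes in timestamp order so that later events overwrite earlier ones;
    # replayed duplicates simply overwrite the row with the same content.
    for change in sorted(changes, key=lambda c: c["ts"]):
        row_key = change["data"][pk]
        if change["type"] == "delete":
            full.pop(row_key, None)
        else:
            full[row_key] = change["data"]
    return sorted(full.values(), key=lambda r: r[pk])


history = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
hourly = [
    [{"type": "update", "ts": 10, "data": {"id": 1, "status": "paid"}}],
    [
        {"type": "insert", "ts": 20, "data": {"id": 3, "status": "new"}},
        {"type": "delete", "ts": 25, "data": {"id": 2, "status": "new"}},
    ],
]
full_table = merge_full_table(history, hourly)
```

Since the increments are already split by hour, the daily job only has to apply this keyed merge rather than re-extract whole tables from the source database.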
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be implemented as computer programs executed by a CPU. When executed by the CPU, the program performs the functions defined by the above-described methods provided by the present disclosure. The program may be stored in a computer readable storage medium, which may be a read-only memory, a magnetic or optical disk, or the like.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
FIG. 5 is a block diagram illustrating a batch data synchronization system in accordance with an exemplary embodiment. As shown in FIG. 5, the batch data synchronization system 50 includes: a big data distributed stream data flow engine 502, a heterogeneous data source offline synchronization framework 504, a data warehouse tool 506, and a distributed publish-subscribe message system 508.
The big data distributed stream data flow engine 502 is used for storing data to be synchronized to a first preset position of the distributed file system. The big data distributed stream data flow engine 502 is further configured to obtain the service related data in a consumption manner, and to acquire the data to be synchronized based on the service related data. The big data distributed stream data flow engine 502 is further configured to obtain configuration information of the task metadata; to parse the table to be synchronized based on the service attribute of the task metadata; and to store the parsed table to be synchronized to a first preset position of the distributed file system. The big data distributed stream data flow engine 502 is further configured to perform sink parsing on the table to be synchronized to generate the data to be synchronized, and to store the table to be synchronized to a first preset position of the Hadoop Distributed File System.
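One way to picture how the engine uses the task metadata is the Python sketch below: messages are kept only if their table appears in the configuration, and are grouped by the target directory configured for that table. The configuration shape (a mapping from `(database, table)` to a directory), the message fields, and the function name `route_by_task_config` are hypothetical.

```python
from collections import defaultdict


def route_by_task_config(messages, task_config):
    """Group row data by target directory, keeping only tables listed in the config."""
    routed = defaultdict(list)
    for msg in messages:
        # Look up the preset position configured for this table, if any.
        target = task_config.get((msg["database"], msg["table"]))
        if target is not None:
            routed[target].append(msg["data"])
    return dict(routed)


# Hypothetical task metadata: only the orders table is to be synchronized.
task_config = {("orders_db", "orders"): "/warehouse/sync/orders"}
messages = [
    {"database": "orders_db", "table": "orders", "data": {"id": 1}},
    {"database": "orders_db", "table": "audit_log", "data": {"id": 9}},
]
routed = route_by_task_config(messages, task_config)
```

Tables that are absent from the configuration are dropped, so adding a table to the synchronization only requires a change to the task metadata.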
The heterogeneous data source offline synchronization framework 504 is configured to store historical data of the data to be synchronized to a second preset location of the distributed file system; the heterogeneous data source offline synchronization framework 504 is further configured to store the historical data of the data to be synchronized to a second preset position of the Hadoop Distributed File System.
The data warehouse tool 506 is used for storing the data to be synchronized in a partitioned mode according to preset rules; and carrying out batch combination on the data to be synchronized stored in the partitions at preset time so as to realize batch synchronization of the data to be synchronized. The data warehouse tool 506 is further configured to store the data to be synchronized in time interval partitions, generating a plurality of time partitions. The data warehouse tool 506 is further configured to extract data to be synchronized in the plurality of time partitions at preset times; merging the data to be synchronized in the plurality of time partitions to generate a full table so as to realize batch synchronization of the data to be synchronized.
The distributed publish-subscribe message system 508 is used to obtain service related data. The distributed publish-subscribe message system 508 is also configured to enable binlog functionality for the service database; analyzing the service related data based on the binlog function; and pushing the analyzed service related data to the distributed publishing and subscribing message system. The distributed publish-subscribe message system 508 is further configured to parse the service-related data based on the binlog function and the MySQL database incremental log parsing manner.
According to the batch data synchronization system of the present disclosure, the big data distributed stream data flow engine stores data to be synchronized to a first preset position of the distributed file system; the heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; the data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time so as to realize batch synchronization of the data to be synchronized. This approach can solve the problem of poor timeliness of batch data synchronization in the prior art, can synchronize batch data quickly and accurately, and does not increase the network burden.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
An electronic device 600 according to this embodiment of the disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 that connects the various system components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs steps in accordance with various exemplary embodiments of the present disclosure in the present specification. For example, the processing unit 610 may perform the steps shown in fig. 2 and 3.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 600' (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, as shown in fig. 7, the technical solution according to the embodiment of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, or a network device, etc.) to execute the above method according to the embodiment of the present disclosure.
The software product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following functions: the big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system; the heterogeneous data source offline synchronization framework stores the historical data of the data to be synchronized to a second preset position of the distributed file system; the data warehouse tool stores the data to be synchronized in partitions according to preset rules; and the data to be synchronized stored in the partitions is merged in batches at a preset time so as to realize batch synchronization of the data to be synchronized.
Those skilled in the art will appreciate that the modules described above may be distributed in the apparatus according to the description of the embodiments, or may be located, with corresponding changes, in one or more apparatuses different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that the present disclosure is not limited to the precise arrangements or instrumentalities described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A method for synchronizing batch data, comprising:
the big data distributed stream data flow engine stores data to be synchronized to a first preset position of a distributed file system;
storing the historical data of the data to be synchronized to a second preset position of the distributed file system by the heterogeneous data source offline synchronization framework;
the data warehouse tool stores the data to be synchronized in a partition mode according to preset rules;
and carrying out batch combination on the data to be synchronized stored in the partitions at preset time so as to realize batch synchronization of the data to be synchronized.
2. The method of claim 1, further comprising:
the distributed publish-subscribe message system obtains service related data.
3. The method of any of claims 1-2, wherein, before the big data distributed stream data flow engine stores the data to be synchronized to the first preset position of the distributed file system, the method further comprises:
the big data distributed stream data flow engine acquires the service related data in a consumption mode;
and acquiring the data to be synchronized based on the service related data.
4. The method of any of claims 1-3, wherein the big data distributed stream data flow engine storing the data to be synchronized to the first preset position of the distributed file system comprises:
a big data distributed stream data flow engine acquires configuration information of task metadata;
analyzing the table to be synchronized based on the service attribute of the task metadata;
and storing the analyzed table to be synchronized to a first preset position of the distributed file system.
5. The method of any one of claims 1-4, wherein storing the parsed table to be synchronized to a first preset location of a distributed file system comprises:
performing sink analysis on the table to be synchronized to generate the data to be synchronized;
and storing the table to be synchronized to a first preset position of the Hadoop Distributed File System.
6. The method of any one of claims 1-5, wherein storing, by a heterogeneous data source offline synchronization framework, historical data of the data to be synchronized to a second preset location of the distributed file system comprises:
and the DataX framework stores the historical data of the data to be synchronized to a second preset position of the Hadoop Distributed File System.
7. The method of any one of claims 1-6, wherein the data warehouse tool stores the data to be synchronized in a partitioned manner according to preset rules, comprising:
and Hive stores the data to be synchronized according to time interval partitions to generate a plurality of time partitions.
8. A batch data synchronization system, comprising:
the big data distributed stream data flow engine is used for storing data to be synchronized to a first preset position of the distributed file system;
the heterogeneous data source offline synchronization framework is used for storing the historical data of the data to be synchronized to a second preset position of the distributed file system;
the data warehouse tool is used for storing the data to be synchronized in a partitioning manner according to preset rules; and carrying out batch combination on the data to be synchronized stored in the partitions at preset time so as to realize batch synchronization of the data to be synchronized.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202110156442.5A 2021-02-04 2021-02-04 Batch data synchronization method and system and electronic equipment Pending CN112948486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110156442.5A CN112948486A (en) 2021-02-04 2021-02-04 Batch data synchronization method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN112948486A true CN112948486A (en) 2021-06-11

Family

ID=76244045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110156442.5A Pending CN112948486A (en) 2021-02-04 2021-02-04 Batch data synchronization method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN112948486A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934760A (en) * 2021-10-15 2022-01-14 珠海百丰网络科技有限公司 Financial data identification and transmission system and method based on artificial intelligence model
CN114710481A (en) * 2021-12-13 2022-07-05 越亮传奇科技股份有限公司 Traffic ticket analysis method, device, equipment and storage medium based on big data
CN114996319A (en) * 2022-08-01 2022-09-02 税友软件集团股份有限公司 Data processing method, device and equipment based on rule engine and storage medium
CN116016089A (en) * 2021-10-20 2023-04-25 北京京诚鼎宇管理系统有限公司 Metallurgical equipment data processing method, device and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108599992A (en) * 2018-03-21 2018-09-28 四川斐讯信息技术有限公司 A kind of data processing system and method
CN110990435A (en) * 2019-12-03 2020-04-10 秒针信息技术有限公司 Data synchronization method, device and computer readable storage medium
CN111680017A (en) * 2020-06-30 2020-09-18 深圳前海微众银行股份有限公司 Data synchronization method and device
CN112100147A (en) * 2020-07-27 2020-12-18 杭州玳数科技有限公司 Method and system for realizing real-time acquisition from Bilog to HIVE based on Flink
CN112286941A (en) * 2020-12-23 2021-01-29 武汉物易云通网络科技有限公司 Big data synchronization method and device based on Binlog + HBase + Hive

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113934760A (en) * 2021-10-15 2022-01-14 珠海百丰网络科技有限公司 Financial data identification and transmission system and method based on artificial intelligence model
CN113934760B (en) * 2021-10-15 2022-06-17 珠海百丰网络科技有限公司 Financial data identification and transmission system and method based on artificial intelligence model
CN116016089A (en) * 2021-10-20 2023-04-25 北京京诚鼎宇管理系统有限公司 Metallurgical equipment data processing method, device and system
CN114710481A (en) * 2021-12-13 2022-07-05 越亮传奇科技股份有限公司 Traffic ticket analysis method, device, equipment and storage medium based on big data
CN114996319A (en) * 2022-08-01 2022-09-02 税友软件集团股份有限公司 Data processing method, device and equipment based on rule engine and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination