CN112507020A - Data synchronization method and device, computer equipment and storage medium - Google Patents

Data synchronization method and device, computer equipment and storage medium

Info

Publication number
CN112507020A
Authority
CN
China
Prior art keywords
data
synchronized
synchronization
source table
configuration file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011314160.5A
Other languages
Chinese (zh)
Inventor
刘薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202011314160.5A
Publication of CN112507020A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals

Abstract

The embodiments of the application belong to the field of big data and relate to a data synchronization method, which comprises: acquiring a synchronization configuration file according to a triggered data synchronization instruction; configuring Spark based on the synchronization configuration file; accessing a data source table through the synchronization configuration file, and performing data segmentation on data to be synchronized in the data source table; distributing the segmented data to be synchronized to each process in Spark; and synchronizing the segmented data to be synchronized through the processes. The application also provides a data synchronization device, computer equipment and a storage medium. In addition, the application relates to blockchain technology, and the synchronization configuration file may be stored in a blockchain. The data synchronization method and device improve the efficiency of data synchronization.

Description

Data synchronization method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data synchronization method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology, the application of big data is more and more extensive. Data synchronization is an indispensable link in big data technology, and some data synchronization tools appear in order to better synchronize mass data.
However, existing data synchronization tools have shortcomings in use. For example, some tools are prone to producing duplicate data, so a deduplication pass is required after synchronization; when the data volume is large, deduplication is time-consuming and the synchronization efficiency is low. Other tools can only run on a single machine, which comes under heavy load when processing big data, so synchronization speed has to be sacrificed and the efficiency is again low.
Disclosure of Invention
An embodiment of the present application provides a data synchronization method, an apparatus, a computer device, and a storage medium, so as to solve the problem of low data synchronization efficiency.
In order to solve the above technical problem, an embodiment of the present application provides a data synchronization method, which adopts the following technical solutions:
acquiring a synchronization configuration file according to a triggered data synchronization instruction;
configuring Spark based on the synchronization configuration file;
accessing a data source table through the synchronization configuration file, and performing data segmentation on data to be synchronized in the data source table;
distributing the segmented data to be synchronized to each process in Spark;
and synchronizing the segmented data to be synchronized through the processes.
Further, before the acquiring of the synchronization configuration file according to the triggered data synchronization instruction, the method further includes:
when a configuration instruction sent by a terminal is received, displaying a synchronization configuration page through the terminal;
acquiring task configuration information input in the synchronization configuration page through the terminal;
and generating the synchronization configuration file according to the task configuration information.
Further, the configuring of Spark based on the synchronization configuration file includes:
extracting a data source address and a data source table name from the synchronization configuration file;
accessing a data source table corresponding to the data source table name according to the data source address;
acquiring the data volume of the data to be synchronized in the data source table;
and performing process setting and memory setting on Spark according to the acquired data volume.
Further, the accessing of the data source table through the synchronization configuration file and the performing of data segmentation on the data to be synchronized in the data source table include:
accessing the data source table through the synchronization configuration file;
querying the data to be synchronized in the data source table;
performing data segmentation on the queried data to be synchronized, wherein the data segmentation mode comprises: pseudo-column segmentation, result pseudo-column segmentation, time segmentation, or random field segmentation.
Further, the querying of the data to be synchronized in the data source table includes:
reading a preset data segmentation mode from the synchronization configuration file;
when the data segmentation mode is result pseudo-column segmentation, acquiring a preset creation deadline and a preset query condition;
and querying the data to be synchronized in the data source table through the creation deadline and the query condition.
Further, the distributing of the segmented data to be synchronized to each process in Spark comprises:
distributing the segmented data to be synchronized to each process, and persisting the distributed data to be synchronized to a preset disk of Spark through each process.
Further, the synchronizing of the segmented data to be synchronized through the processes includes:
acquiring a target storage directory from the synchronization configuration file;
and synchronizing the data to be synchronized in the preset disk to the target storage directory through the processes, and deleting the data to be synchronized in the preset disk.
In order to solve the above technical problem, an embodiment of the present application further provides a data synchronization apparatus, which adopts the following technical solutions:
a file acquisition module, used for acquiring a synchronization configuration file according to a triggered data synchronization instruction;
a setting module, used for configuring Spark based on the synchronization configuration file;
a source table access module, used for accessing a data source table through the synchronization configuration file and performing data segmentation on data to be synchronized in the data source table;
a data distribution module, used for distributing the segmented data to be synchronized to each process in Spark;
and a data synchronization module, used for synchronizing the segmented data to be synchronized through each process.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
acquiring a synchronization configuration file according to a triggered data synchronization instruction;
configuring Spark based on the synchronization configuration file;
accessing a data source table through the synchronization configuration file, and performing data segmentation on data to be synchronized in the data source table;
distributing the segmented data to be synchronized to each process in Spark;
and synchronizing the segmented data to be synchronized through the processes.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
acquiring a synchronization configuration file according to a triggered data synchronization instruction;
configuring Spark based on the synchronization configuration file;
accessing a data source table through the synchronization configuration file, and performing data segmentation on data to be synchronized in the data source table;
distributing the segmented data to be synchronized to each process in Spark;
and synchronizing the segmented data to be synchronized through the processes.
Compared with the prior art, the embodiments of the application mainly have the following beneficial effects: when data synchronization is performed, a synchronization configuration file is acquired first and Spark is configured according to it, so that the hardware resources used while Spark runs match the data to be synchronized, which safeguards the efficiency of data synchronization; performing data synchronization on Spark also relieves the load of single-machine operation. The data source table is accessed according to the synchronization configuration file, the data to be synchronized in the data source table is segmented, and each process is assigned one portion of the data to be synchronized, so the data source table can be synchronized concurrently and the data synchronization efficiency is improved.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a data synchronization method according to the present application;
FIG. 3 is a schematic block diagram of one embodiment of a data synchronization apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a network 102, a server 103, cluster servers 104 and 105, and a data storage server 106. The network 102 is the medium that provides communication links between the terminal device 101, the server 103, the cluster servers 104 and 105, and the data storage server 106. The network 102 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal device 101.
The terminal device 101 may be any of various electronic devices having a display screen and supporting web browsing, including but not limited to a smart phone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like.
The server 103 may be a server providing various services, for example, a background server providing support for a page displayed on the terminal device 101; in the present application, the server 103 may provide a control service for data synchronization. The cluster servers 104 and 105 may be servers in a Spark cluster that implement the Spark functions, and the server 103 may also be a cluster server. The data storage server 106 may be a server that stores the data source table.
It should be noted that, the data synchronization method provided in the embodiments of the present application is generally executed by a server, and accordingly, the data synchronization apparatus is generally disposed in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a data synchronization method in accordance with the present application is shown. The data synchronization method comprises the following steps:
Step S201, acquiring a synchronization configuration file according to the triggered data synchronization instruction.
In this embodiment, the electronic device (for example, the server shown in fig. 1) on which the data synchronization method runs may communicate through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.
Specifically, the data synchronization instruction may be triggered by a user at a terminal, which then sends the instruction to the server, or it may be triggered automatically by the server, for example, by a scheduled task. The server accesses a preset synchronization configuration file according to the data synchronization instruction. The synchronization configuration file records the task configuration information of the data synchronization task, including a data source address, a data source table name, a preset data volume of the data source table, a target table name, a data segmentation mode, and the like. The synchronization configuration file instructs the server how to perform data synchronization, and it can describe both incremental synchronization tasks and full synchronization tasks.
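As an illustration of the information such a file might carry, the following is a minimal sketch that assumes the file is stored as JSON. The source does not specify a file format, and every field name, the `SyncConfig` class, the `load_sync_config` helper, and the example address are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for the segmentation modes described in the text.
VALID_SPLIT_MODES = {"rowid", "rownum", "time", "random_field"}


@dataclass
class SyncConfig:
    """Task configuration for one synchronization job (field names are assumed)."""
    source_address: str                 # data source address
    source_table: str                   # data source table name
    preset_data_volume: int             # estimated volume of data to synchronize
    target_table: str                   # target table name
    split_mode: Optional[str] = None    # data segmentation mode, may be omitted


def load_sync_config(text: str) -> SyncConfig:
    """Parse a JSON document into a SyncConfig and validate the split mode."""
    config = SyncConfig(**json.loads(text))
    if config.split_mode is not None and config.split_mode not in VALID_SPLIT_MODES:
        raise ValueError(f"unknown split mode: {config.split_mode}")
    return config


example = """
{
  "source_address": "jdbc:oracle:thin:@//db.example.com:1521/ORCL",
  "source_table": "ORDERS",
  "preset_data_volume": 8000000,
  "target_table": "ods.orders",
  "split_mode": "rownum"
}
"""
config = load_sync_config(example)
print(config.source_table, config.split_mode)
```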
It is emphasized that the synchronization configuration file may also be stored in a node of a blockchain in order to further ensure its privacy and security.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S202, configuring Spark based on the synchronization configuration file.
In particular, data synchronization may be implemented on a Spark cluster. The server reads a preset data volume from the synchronization configuration file, where the preset data volume may be an estimate of the volume of data to be synchronized in the data source table. The server configures Spark according to the preset data volume and allocates hardware resources to it; specifically, it may perform process (Executor) setting and memory setting on the Spark cluster.
The running architecture of Spark includes a cluster resource manager (Cluster Manager), worker nodes (Worker Node) that run job tasks, a task control node (Driver) for each application, and an execution process (Executor) on each worker node that is responsible for specific tasks.
After reading the preset data volume, the server obtains a pre-established resource configuration table, looks up the resource setting information that matches the preset data volume in that table, and allocates hardware resources to Spark according to the resource setting information.
For example, when the preset data volume is less than 6 million, the number of processes is set to 6 and the memory to 8G; when the data volume is more than 6 million, one process is allocated per million of data volume, and 1G of memory is added per hundred million of data volume. To reduce the pressure on the central processing unit (CPU) of the data storage server where the source database is located and on the Spark cluster, upper limits may be set for the number of processes and the memory; for example, the maximum number of processes may be set to 10 and the maximum memory to 20G. The parallelism with which spark-jdbc reads the data source table can also be set according to the number of processes, so that the parallelism equals the number of processes and the Spark cluster is fully used for data synchronization.
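The sizing rule in this example can be sketched as a small function. This is one reading of the rule, not the patent's implementation: it assumes the data volume is a record count, that exactly 6 million falls into the lower tier, and that memory grows from the 8G baseline by 1G per full hundred million records. The names `SparkResources` and `size_resources` are hypothetical.

```python
from dataclasses import dataclass

MILLION = 1_000_000
HUNDRED_MILLION = 100_000_000


@dataclass(frozen=True)
class SparkResources:
    num_processes: int   # number of Spark execution processes
    memory_gb: int       # memory setting in gigabytes

    @property
    def jdbc_parallelism(self) -> int:
        # The text sets the JDBC read parallelism equal to the process count.
        return self.num_processes


def size_resources(data_volume: int,
                   max_processes: int = 10,
                   max_memory_gb: int = 20) -> SparkResources:
    """Map a data volume to process and memory settings with upper limits."""
    if data_volume <= 6 * MILLION:
        # Lower tier: fixed 6 processes and 8G (boundary handling is assumed).
        return SparkResources(num_processes=6, memory_gb=8)
    # Upper tier: one process per million, 1G extra per hundred million.
    processes = min(data_volume // MILLION, max_processes)
    memory = min(8 + data_volume // HUNDRED_MILLION, max_memory_gb)
    return SparkResources(num_processes=processes, memory_gb=memory)


for volume in (2 * MILLION, 7 * MILLION, 250 * MILLION):
    print(volume, size_resources(volume))
```

With the default limits, any volume above 10 million is capped at 10 processes, so in this reading the upper limit is what bounds the load placed on the source database.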
Step S203, accessing the data source table through the synchronization configuration file, and performing data segmentation on the data to be synchronized in the data source table.
The data source table is located in the data storage server and used for storing data.
Specifically, the synchronization configuration file may record an address of the data source table, and record which data in the data source table is to be synchronized, that is, record related information of the data to be synchronized. And the server reads the data source table according to the address of the data source table and determines the data to be synchronized in the data source table.
In order to improve the efficiency of data synchronization, the server may perform data segmentation on the data to be synchronized, and the number of resulting portions may equal the configured number of processes. The data to be synchronized can be divided evenly so that every portion has the same data volume, which avoids the problem of data skew.
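A minimal sketch of such an even division follows, assuming the data to be synchronized can be addressed by a record index from 0 to the total count. The `split_evenly` helper is hypothetical; it produces half-open ranges whose sizes differ by at most one record.

```python
from typing import List, Tuple


def split_evenly(total: int, parts: int) -> List[Tuple[int, int]]:
    """Split the index range [0, total) into `parts` half-open ranges.

    The first `total % parts` ranges receive one extra record, so no range
    is more than one record larger than any other.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(total, parts)
    ranges = []
    start = 0
    for index in range(parts):
        size = base + (1 if index < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(split_evenly(10, 3))
```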
Step S204, distributing the segmented data to be synchronized to each process in Spark.
Specifically, the server distributes the segmented data to be synchronized to the processes in Spark, and each process is assigned one portion of the data to be synchronized. The data segmentation itself can be carried out by the processes: each process marks a part of the data to be synchronized, the parts marked by different processes do not overlap, and the data marked by a process is assigned to that process for handling.
Step S205, the divided data to be synchronized is synchronized through each process.
Specifically, data synchronization is carried out on the Spark cluster. When the Spark cluster works, all processes perform data synchronization simultaneously: they read the data to be synchronized in parallel and store it in a preset target storage directory.
The present application supports synchronizing data from various relational databases; for example, data in Oracle, MySQL, and PostgreSQL databases can be synchronized to Hive or HDFS.
In one embodiment, the flow of the data synchronization method comprises: filling in task configuration information on a synchronization configuration page; generating a task shell script and a synchronization configuration file; sending the shell script to a scheduling platform (for example, Azkaban, a workflow control engine that can resolve the dependencies among multiple Spark tasks); and having the scheduling platform execute the script, which controls Spark to perform data synchronization according to the synchronization configuration file.
In this embodiment, when data synchronization is performed, a synchronization configuration file is acquired first and Spark is configured according to it, so that the hardware resources used while Spark runs match the data to be synchronized, which safeguards the efficiency of data synchronization; performing data synchronization on Spark also relieves the load of single-machine operation. The data source table is accessed according to the synchronization configuration file, the data to be synchronized in the data source table is segmented, and each process is assigned one portion of the data to be synchronized, so the data source table can be synchronized concurrently and the data synchronization efficiency is improved.
Further, before step S201, the method may further include: when a configuration instruction sent by a terminal is received, a synchronous configuration page is displayed through the terminal; acquiring task configuration information input in a synchronous configuration page through a terminal; and generating a synchronous configuration file according to the task configuration information.
The configuration instruction may be an instruction requesting to configure the data synchronization task.
Specifically, a data synchronization platform is built on the server, and the platform provides a visual way of configuring tasks. When a user accesses a user page of the data synchronization platform at the terminal, a configuration instruction can be triggered in the user page, and the terminal sends the configuration instruction to the server. After receiving the configuration instruction, the server displays the synchronization configuration page through the terminal.
The synchronization configuration page is used for configuring the data synchronization task in a what-you-see-is-what-you-get manner, so the user does not need to write a complex script; this simplifies the configuration of data synchronization and improves its efficiency. The user can input various task configuration information in the synchronization configuration page, including a data source address, a data source table name, a preset data volume of the data source table, a synchronization condition, a data segmentation mode, a target storage directory, and the like. The server generates a synchronization configuration file according to the task configuration information and, when performing data synchronization, carries out the synchronization task according to the task configuration information in that file.
In this embodiment, the synchronization configuration file is generated automatically once the required task configuration information has been entered in the synchronization configuration page, so the operation is simple and convenient, adjustment is flexible, and the data synchronization efficiency is improved.
Further, the step S202 may include: extracting a data source address and a data source table name from the synchronous configuration file; accessing a data source table corresponding to the data source table name according to the data source address; acquiring the data volume of data to be synchronized in a data source table; and carrying out process setting and memory setting on spark according to the acquired data volume.
Specifically, a data source address and a data source table name may be recorded in the synchronization configuration file, and the server may read the data source table corresponding to the data source table name according to the data source address, and obtain the data volume of the data to be synchronized in the data source table through the data volume query instruction. The server can perform process setting and memory setting on spark according to the actually inquired data volume so as to enable hardware resources of the spark cluster to be better adapted to the actual data volume and guarantee the efficiency of data synchronization.
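The data volume query can be sketched as a `COUNT(*)` statement against the source table. The example below uses an in-memory SQLite database purely as a stand-in for the data storage server; the `count_rows_to_sync` helper, the `orders` table, and the condition are hypothetical, and the source does not state which query instruction is actually used.

```python
import sqlite3


def count_rows_to_sync(conn: sqlite3.Connection, table: str,
                       condition: str = "1=1") -> int:
    """Return the number of rows in `table` that match `condition`.

    The table name and condition are interpolated directly because, in this
    sketch, they come from the trusted synchronization configuration file;
    untrusted input must not be passed here.
    """
    cursor = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {condition}")
    return cursor.fetchone()[0]


# Stand-in data source table for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("paid",), ("paid",), ("new",)])

actual_volume = count_rows_to_sync(conn, "orders", "status = 'paid'")
print(actual_volume)
```

The returned count would then take the place of the preset estimate when the process and memory settings are derived.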
In this embodiment, the actual data volume of the data to be synchronized in the data source table is obtained after the data source table is accessed, and spark is set according to the actual data volume, so that the efficiency of data synchronization is ensured.
Further, the step S203 may include: accessing a data source table through the synchronization configuration file; querying the data to be synchronized in the data source table; and performing data segmentation on the queried data to be synchronized, wherein the data segmentation mode comprises: pseudo-column segmentation, result pseudo-column segmentation, time segmentation, or random field segmentation.
Specifically, the server accesses the data source table according to the synchronization configuration file. The synchronization configuration file may record synchronization conditions, such as which fields are synchronized and which conditions the data query uses; the queried data is taken as the data to be synchronized. The synchronization configuration file may also record a data segmentation mode, which is related to the synchronization conditions and to the characteristics of the data source table. Alternatively, the data segmentation mode may be left out of the synchronization configuration file, in which case the server determines it from the synchronization conditions and the data source table.
The server queries the data to be synchronized in the data source table according to the synchronization conditions and segments it according to the corresponding data segmentation mode. The data segmentation modes include pseudo-column (rowid) segmentation, result pseudo-column (rownum) segmentation, time segmentation, and random field segmentation.
The pseudo-column (rowid) is specific to the Oracle database. When a table is created, the Oracle database provides a rowid column for it; the rowid identifies each record in the database and holds the storage position of the record, but it is not stored in the original data table, which is why it is called a pseudo-column. Therefore, when the data source table comes from an Oracle database, pseudo-column segmentation can be adopted; with pseudo-column segmentation the data to be synchronized is divided evenly and the query efficiency is high. It will be appreciated that data to be synchronized from an Oracle database is not limited to pseudo-column segmentation as its only data segmentation mode.
rownum is also a pseudo-column of the Oracle database. It is added to the query result set, that is, it is a column appended after the result set has been retrieved, and it gives the serial number of each result that satisfies the query condition.
When time segmentation is adopted, the segmentation can be based on a creation time field or a modification time field. In an incremental synchronization task, time segmentation can be selected preferentially.
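Time segmentation can be sketched by cutting a time window into equal intervals and turning each interval into a filter on the time field. The helpers `split_time_range` and `time_predicates`, the `create_time` column, and the plain quoted timestamp literal are assumptions; the literal syntax in particular varies between databases.

```python
from datetime import datetime
from typing import List, Tuple


def split_time_range(start: datetime, end: datetime,
                     parts: int) -> List[Tuple[datetime, datetime]]:
    """Split [start, end) into `parts` consecutive half-open intervals."""
    if parts <= 0 or end <= start:
        raise ValueError("need parts > 0 and end > start")
    step = (end - start) / parts
    # The final bound is set to `end` explicitly to avoid rounding drift.
    bounds = [start + step * i for i in range(parts)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))


def time_predicates(column: str, start: datetime, end: datetime,
                    parts: int) -> List[str]:
    """Build one SQL filter per interval; each process would use one of them."""
    return [
        f"{column} >= '{lo.isoformat(sep=' ')}' AND {column} < '{hi.isoformat(sep=' ')}'"
        for lo, hi in split_time_range(start, end, parts)
    ]


predicates = time_predicates("create_time",
                             datetime(2020, 11, 1), datetime(2020, 11, 5), 4)
for predicate in predicates:
    print(predicate)
```

Because the intervals are half-open and adjacent, every record in the window matches exactly one predicate. Equal time intervals do not guarantee equal record counts, so this sketch balances load only when records are spread roughly evenly over time.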
Each record in the data source table can have a plurality of fields, so random field segmentation can be performed, that is, one field is selected at will and converted into a byte stream, segmentation is performed according to the numerical range of the byte stream, and the records of the byte stream in a certain interval are synchronized by a certain process. In the random field division, a UUID (Universally Unique Identifier) type field may be preferable.
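For a UUID field, the byte-range idea can be sketched as follows: the 16 bytes of the UUID are read as one integer, and the integer's position in the full 128-bit range decides which process handles the record. The `bucket_for_uuid` helper and the use of the whole 128-bit value (rather than a prefix) are assumptions made for illustration.

```python
import uuid
from collections import Counter

UUID_SPACE = 1 << 128  # number of distinct 128-bit values


def bucket_for_uuid(value: str, parts: int) -> int:
    """Map a UUID string to a bucket index in [0, parts) by numeric range."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    # Interpret the UUID's 16 bytes as a big-endian integer.
    number = int.from_bytes(uuid.UUID(value).bytes, "big")
    # Scale the integer into `parts` equal-width intervals.
    return number * parts // UUID_SPACE


# Distribute some random UUIDs over 4 processes and show the bucket sizes.
counts = Counter(bucket_for_uuid(str(uuid.uuid4()), 4) for _ in range(10_000))
print(sorted(counts.items()))
```

Randomly generated UUIDs are spread almost uniformly over the value range, which is one plausible reason a UUID field is preferred here: equal-width intervals then hold roughly equal numbers of records.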
When performing data segmentation, one of pseudo-column segmentation, result pseudo-column segmentation, time segmentation, and random field segmentation may be selected.
In one embodiment, the querying the data to be synchronized in the data source table may include: reading a preset data segmentation mode from the synchronous configuration file; when the data segmentation mode is result pseudo-column segmentation, acquiring preset creation deadline and query conditions; and querying the data to be synchronized in the data source table by creating the deadline and the query condition.
Specifically, each process reads the data source table in parallel, queries the data source table, and divides the data to be synchronized. Because the result is a column added to the query result, in the query process, if new data enters the data source table, the query results of the processes are different due to the same query condition, and different rownum may point to the same data, resulting in the problem of data duplication.
Therefore, the preset data segmentation mode can be read from the synchronous configuration file, and when the data segmentation mode is result fake segmentation, the preset creation deadline and the query condition are acquired. The creation deadline is used as a limiting condition, and the creation time of the inquired data is required to be before a certain time. Therefore, when multi-concurrent query is carried out according to the creation deadline and the query condition, the risk of data duplication is avoided.
The server can use the creation deadline as a WHERE condition in the query SQL statement, so that all queried data was generated before a fixed point in time. This ensures that every process reads the same snapshot and avoids data duplication.
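The query construction described above can be sketched as follows. This is only an illustration under several assumptions: the column name `create_time`, the bind variable `:deadline`, the explicit `ORDER BY` column, and the helper name `build_rownum_queries` are all hypothetical, and the table name and condition are assumed to come from trusted configuration rather than user input.

```python
import math


def build_rownum_queries(table, condition, order_column, total_rows, parts):
    """Build one Oracle query per process, each covering one rownum slice.

    Every query shares the same creation-deadline filter, so all processes
    see the same set of rows even if new rows arrive while they run.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    chunk = math.ceil(total_rows / parts) if total_rows else 0
    # A stable ORDER BY keeps rownum assignments identical across processes
    # (an assumption of this sketch; the source text does not mention it).
    inner = (
        f"SELECT * FROM {table} WHERE {condition} "
        f"AND create_time < :deadline ORDER BY {order_column}"
    )
    queries = []
    for i in range(parts):
        low = min(i * chunk, total_rows)
        high = min((i + 1) * chunk, total_rows)
        queries.append(
            f"SELECT * FROM (SELECT t.*, ROWNUM rn FROM ({inner}) t "
            f"WHERE ROWNUM <= {high}) WHERE rn > {low}"
        )
    return queries


if __name__ == "__main__":
    for sql in build_rownum_queries("orders", "status = 'PAID'", "id", 10, 3):
        print(sql)
```

The deadline is passed as a bind variable so that every process can be given exactly the same timestamp value at execution time.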
In this embodiment, when the data segmentation mode is result pseudo-column segmentation, the query is limited by the creation deadline, which avoids the risk of data duplication under multiple concurrent queries and ensures the accuracy of data synchronization.
In this embodiment, the data to be synchronized is split by the multiple processes, so that each process handles one portion of the data to be synchronized, which improves the efficiency of data synchronization.
Further, step S204 may include: distributing the segmented data to be synchronized to each process, and persisting, through each process, the data allocated to it to a preset disk of Spark.
Specifically, each process performs data segmentation on data to be synchronized, and during data segmentation, each process takes a part of the data to be synchronized as data to be processed by the process.
The server persists the data to be synchronized to a preset disk of the Spark cluster through the processes, with each process persisting the data allocated to it. If the data were not persisted, the upstream data source table would have to be accessed continuously during data synchronization, which is slow. Once the data to be synchronized in the data source table has been persisted to the local preset disk of the Spark cluster, reading it from that disk is much faster. In one embodiment, the persistence operation may be performed through a DataFrame.
In this embodiment, each process persists its allocated data to be synchronized to the Spark preset disk and reads the data from the preset disk when performing data synchronization, which improves the efficiency of data synchronization.
Further, step S205 may include: acquiring a target storage directory from the synchronization configuration file; and synchronizing, through each process, the data to be synchronized in the preset disk to the target storage directory, and deleting the data to be synchronized from the preset disk.
Specifically, the preset disk is not the final storage location of the data to be synchronized. The server acquires the target storage directory from the synchronization configuration file and instructs each process to synchronize the data to be synchronized in the preset disk to the target storage directory, which completes the data synchronization. After the synchronization is complete, the data to be synchronized is deleted from the preset disk.
In this embodiment, the data to be synchronized in the preset disk is synchronized to the target storage directory, so that the data to be synchronized is stored in the final storage area, and data synchronization is completed.
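The two-stage flow (persist to a staging disk, then synchronize to the target storage directory and delete the staged copy) can be illustrated with a plain-Python stand-in. This sketch does not use Spark; the file naming scheme and the helper names `stage_partition`, `publish_and_clean`, and `run_demo` are assumptions made purely for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def stage_partition(rows, staging_dir, worker_id):
    """Persist one process's share of the data to the staging directory."""
    path = Path(staging_dir) / f"part-{worker_id:05d}.csv"
    path.write_text("\n".join(rows), encoding="utf-8")
    return path


def publish_and_clean(staging_dir, target_dir):
    """Copy staged parts to the target directory, then delete the staged copies."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    published = []
    for part in sorted(Path(staging_dir).glob("part-*.csv")):
        destination = target / part.name
        shutil.copy2(part, destination)
        part.unlink()  # the staging disk is not the final storage location
        published.append(destination)
    return published


def run_demo():
    """Stage two partitions, publish them, and report what ended up where."""
    with tempfile.TemporaryDirectory() as root:
        staging = Path(root) / "staging"
        staging.mkdir()
        target = Path(root) / "target"
        stage_partition(["1,a", "2,b"], staging, 0)
        stage_partition(["3,c"], staging, 1)
        published = publish_and_clean(staging, target)
        return [p.name for p in published], sorted(p.name for p in staging.iterdir())


if __name__ == "__main__":
    print(run_demo())
```

Deleting each staged file only after its copy has succeeded means a failed run leaves the staged data in place for a retry.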
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), or it may be a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data synchronization apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 3, the data synchronization apparatus 300 according to the present embodiment includes: a file acquisition module 301, a setting module 302, a source table access module 303, a data distribution module 304, and a data synchronization module 305.
Wherein:
The file acquisition module 301 is configured to acquire a synchronization configuration file according to a triggered data synchronization instruction.
The setting module 302 is configured to set Spark based on the synchronization configuration file.
The source table access module 303 is configured to access the data source table through the synchronization configuration file and to perform data segmentation on the data to be synchronized in the data source table.
The data distribution module 304 is configured to distribute the segmented data to be synchronized to each process in Spark.
The data synchronization module 305 is configured to synchronize the segmented data to be synchronized through each process.
In this embodiment, when data synchronization is performed, a synchronization configuration file is acquired first and Spark is set according to it, so that the hardware resources used by Spark at run time match the data to be synchronized, which ensures the efficiency of data synchronization; because the synchronization runs on Spark, the load of single-machine operation is also reduced. The data source table is accessed according to the synchronization configuration file and the data to be synchronized in it is segmented, each process is allocated one portion of the data to be synchronized, and the data source table can therefore be synchronized concurrently, which improves the efficiency of data synchronization.
In some optional implementations of this embodiment, the data synchronization apparatus 300 further includes a page display module, an information acquisition module, and a file generation module, wherein:
The page display module is configured to display a synchronization configuration page through a terminal when a configuration instruction sent by the terminal is received.
The information acquisition module is configured to acquire, through the terminal, the task configuration information entered in the synchronization configuration page.
The file generation module is configured to generate the synchronization configuration file according to the task configuration information.
In this embodiment, the synchronization configuration file is generated automatically once the task configuration information has been entered as required in the synchronization configuration page; the operation is simple, adjustments are flexible, and the efficiency of data synchronization is improved.
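A file generation step of this kind might be sketched as below. The application does not define the file format or the field names, so the JSON layout, the keys (`source_address`, `source_table`, `split_mode`, `target_dir`, `creation_deadline`), and the mode labels are all hypothetical choices for this example.

```python
import json

# Hypothetical field names and mode labels chosen for this sketch.
REQUIRED_KEYS = ("source_address", "source_table", "split_mode", "target_dir")
SPLIT_MODES = ("rowid", "rownum", "time", "random_field")


def build_sync_config(task_info: dict) -> str:
    """Validate task configuration information and render it as JSON text."""
    missing = [key for key in REQUIRED_KEYS if not task_info.get(key)]
    if missing:
        raise ValueError(f"missing task configuration fields: {missing}")
    if task_info["split_mode"] not in SPLIT_MODES:
        raise ValueError(f"unknown split mode: {task_info['split_mode']}")
    # Result pseudo-column (rownum) segmentation relies on a creation deadline.
    if task_info["split_mode"] == "rownum" and not task_info.get("creation_deadline"):
        raise ValueError("rownum segmentation requires a creation_deadline")
    return json.dumps(task_info, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == "__main__":
    print(build_sync_config({
        "source_address": "jdbc:oracle:thin:@//db.example.com:1521/ORCL",
        "source_table": "ORDERS",
        "split_mode": "rownum",
        "creation_deadline": "2020-11-20 00:00:00",
        "target_dir": "/data/sync/orders",
    }))
```

Validating at generation time means a task with an incomplete configuration is rejected before any synchronization is triggered.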
In some optional implementations of this embodiment, the setting module 302 may include an extraction submodule, an access submodule, an acquisition submodule, and a setting submodule, wherein:
The extraction submodule is configured to extract a data source address and a data source table name from the synchronization configuration file.
The access submodule is configured to access, according to the data source address, the data source table corresponding to the data source table name.
The acquisition submodule is configured to acquire the data volume of the data to be synchronized in the data source table.
The setting submodule is configured to perform process setting and memory setting on Spark according to the acquired data volume.
In this embodiment, the actual data volume of the data to be synchronized in the data source table is obtained after the data source table is accessed, and spark is set according to the actual data volume, so that the efficiency of data synchronization is ensured.
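How a data volume might be turned into process and memory settings is sketched below. `spark.executor.instances` and `spark.executor.memory` are standard Spark configuration properties, but the thresholds, the rows-per-executor ratio, and the function name `spark_settings_for` are assumptions of this example; the application does not state a sizing rule.

```python
import math


def spark_settings_for(row_count: int,
                       rows_per_executor: int = 1_000_000,
                       max_executors: int = 20) -> dict:
    """Derive illustrative Spark settings from the number of rows to synchronize."""
    if row_count < 0:
        raise ValueError("row_count must not be negative")
    # At least one executor, at most `max_executors` (hypothetical cap).
    executors = max(1, min(max_executors, math.ceil(row_count / rows_per_executor)))
    # Hypothetical rule: larger jobs get more memory per executor.
    memory_gb = 2 if row_count < 5_000_000 else 4
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{memory_gb}g",
    }


if __name__ == "__main__":
    for rows in (0, 3_500_000, 1_000_000_000):
        print(rows, spark_settings_for(rows))
```

The resulting dictionary could, for example, be passed as `--conf key=value` pairs when the job is submitted.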
In some optional implementations of this embodiment, the source table access module 303 may include a source table access submodule, a data query submodule, and a data segmentation submodule, wherein:
The source table access submodule is configured to access the data source table through the synchronization configuration file.
The data query submodule is configured to query the data to be synchronized in the data source table.
The data segmentation submodule is configured to perform data segmentation on the queried data to be synchronized, wherein the data segmentation mode includes pseudo-column segmentation, result pseudo-column segmentation, time segmentation, or random field segmentation.
In this embodiment, the data to be synchronized is split by the multiple processes, so that each process handles one portion of the data to be synchronized, which improves the efficiency of data synchronization.
In some optional implementations of this embodiment, the data query submodule may include a mode reading unit, an acquisition unit, and a data query unit, wherein:
The mode reading unit is configured to read a preset data segmentation mode from the synchronization configuration file.
The acquisition unit is configured to acquire a preset creation deadline and a preset query condition when the data segmentation mode is result pseudo-column segmentation.
The data query unit is configured to query the data to be synchronized in the data source table according to the creation deadline and the query condition.
In this embodiment, when the data segmentation mode is result pseudo-column segmentation, the query is limited by the creation deadline, which avoids the risk of data duplication under multiple concurrent queries and ensures the accuracy of data synchronization.
In some optional implementations of this embodiment, the data distribution module 304 is further configured to distribute the segmented data to be synchronized to each process, and to persist, through each process, the data allocated to it to a preset disk of Spark.
In this embodiment, each process persists its allocated data to be synchronized to the Spark preset disk and reads the data from the preset disk when performing data synchronization, which improves the efficiency of data synchronization.
In some optional implementations of this embodiment, the data synchronization module 305 may include a directory acquisition submodule and a data synchronization submodule, wherein:
The directory acquisition submodule is configured to acquire the target storage directory from the synchronization configuration file.
The data synchronization submodule is configured to synchronize, through each process, the data to be synchronized in the preset disk to the target storage directory, and to delete the data to be synchronized from the preset disk.
In this embodiment, the data to be synchronized in the preset disk is synchronized to the target storage directory, so that the data to be synchronized is stored in the final storage area, and data synchronization is completed.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 that are communicatively connected to each other via a system bus. It is noted that only the computer device 4 having components 41-43 is shown, but it is understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various types of application software, such as computer readable instructions of a data synchronization method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions of the data synchronization method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
The computer device provided in this embodiment may execute the data synchronization method described above. The data synchronization method here may be the data synchronization method of the above-described respective embodiments.
In this embodiment, when data synchronization is performed, a synchronization configuration file is acquired first and Spark is set according to it, so that the hardware resources used by Spark at run time match the data to be synchronized, which ensures the efficiency of data synchronization; because the synchronization runs on Spark, the load of single-machine operation is also reduced. The data source table is accessed according to the synchronization configuration file and the data to be synchronized in it is segmented, each process is allocated one portion of the data to be synchronized, and the data source table can therefore be synchronized concurrently, which improves the efficiency of data synchronization.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the data synchronization method as described above.
In this embodiment, when data synchronization is performed, a synchronization configuration file is acquired first and Spark is set according to it, so that the hardware resources used by Spark at run time match the data to be synchronized, which ensures the efficiency of data synchronization; because the synchronization runs on Spark, the load of single-machine operation is also reduced. The data source table is accessed according to the synchronization configuration file and the data to be synchronized in it is segmented, each process is allocated one portion of the data to be synchronized, and the data source table can therefore be synchronized concurrently, which improves the efficiency of data synchronization.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application; the appended drawings show preferred embodiments of the application and do not limit its scope. The application can be embodied in many different forms; these embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. All equivalent structures made by using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A data synchronization method, comprising the steps of:
acquiring a synchronization configuration file according to a triggered data synchronization instruction;
setting Spark based on the synchronization configuration file;
accessing a data source table through the synchronization configuration file, and performing data segmentation on data to be synchronized in the data source table;
distributing the segmented data to be synchronized to each process in the Spark;
and synchronizing the segmented data to be synchronized through each process.
2. The data synchronization method according to claim 1, further comprising, before the acquiring a synchronization configuration file according to a triggered data synchronization instruction:
when a configuration instruction sent by a terminal is received, displaying a synchronization configuration page through the terminal;
acquiring, through the terminal, task configuration information entered in the synchronization configuration page;
and generating the synchronization configuration file according to the task configuration information.
3. The data synchronization method according to claim 1, wherein the setting Spark based on the synchronization configuration file comprises:
extracting a data source address and a data source table name from the synchronization configuration file;
accessing a data source table corresponding to the data source table name according to the data source address;
acquiring the data volume of the data to be synchronized in the data source table;
and performing process setting and memory setting on the Spark according to the acquired data volume.
4. The data synchronization method according to claim 1, wherein the accessing a data source table through the synchronization configuration file and performing data segmentation on the data to be synchronized in the data source table comprises:
accessing the data source table through the synchronization configuration file;
querying the data to be synchronized in the data source table;
performing data segmentation on the queried data to be synchronized, wherein the data segmentation mode comprises: pseudo-column segmentation, result pseudo-column segmentation, time segmentation, or random field segmentation.
5. The data synchronization method according to claim 4, wherein the querying the data to be synchronized in the data source table comprises:
reading a preset data segmentation mode from the synchronization configuration file;
when the data segmentation mode is result pseudo-column segmentation, acquiring a preset creation deadline and a preset query condition;
and querying the data to be synchronized in the data source table according to the creation deadline and the query condition.
6. The data synchronization method according to claim 1, wherein the distributing the segmented data to be synchronized to each process in the Spark comprises:
distributing the segmented data to be synchronized to each process, and persisting, through each process, the distributed data to be synchronized to a preset disk of the Spark.
7. The data synchronization method according to claim 6, wherein the synchronizing the segmented data to be synchronized through each process comprises:
acquiring a target storage directory from the synchronization configuration file;
and synchronizing, through each process, the data to be synchronized in the preset disk to the target storage directory, and deleting the data to be synchronized from the preset disk.
8. A data synchronization apparatus, comprising:
a file acquisition module, configured to acquire a synchronization configuration file according to a triggered data synchronization instruction;
a setting module, configured to set Spark based on the synchronization configuration file;
a source table access module, configured to access a data source table through the synchronization configuration file and to perform data segmentation on data to be synchronized in the data source table;
a data distribution module, configured to distribute the segmented data to be synchronized to each process in the Spark;
and a data synchronization module, configured to synchronize the segmented data to be synchronized through each process.
9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions, wherein the processor, when executing the computer-readable instructions, implements the data synchronization method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon, wherein the computer-readable instructions, when executed by a processor, implement the data synchronization method according to any one of claims 1 to 7.
CN202011314160.5A 2020-11-20 2020-11-20 Data synchronization method and device, computer equipment and storage medium Pending CN112507020A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011314160.5A CN112507020A (en) 2020-11-20 2020-11-20 Data synchronization method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112507020A true CN112507020A (en) 2021-03-16

Family

ID=74958139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011314160.5A Pending CN112507020A (en) 2020-11-20 2020-11-20 Data synchronization method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112507020A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination