CN109857803B - Data synchronization method, device, equipment, system and computer readable storage medium - Google Patents

Publication number
CN109857803B
Authority
CN
China
Prior art keywords
data
source data
partition
source
hbase
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811527177.1A
Other languages
Chinese (zh)
Other versions
CN109857803A (en)
Inventor
章海怒
周一帆
郑艳涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dt Dream Technology Co Ltd
Original Assignee
Hangzhou Dt Dream Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dt Dream Technology Co Ltd
Priority to CN201811527177.1A
Publication of CN109857803A
Application granted
Publication of CN109857803B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data synchronization method, which comprises the following steps: Spark acquires source data at a source data end according to a data acquisition instruction; labeling processing is performed on the source data to obtain a partition template; after the source data are partitioned based on the partition template, target synchronous data are extracted from the partitioned source data according to a preset rule; and the target synchronous data are written into a target data terminal. The data synchronization method can effectively improve the data synchronization efficiency between the two synchronizing parties and reduce the complexity of data synchronization while reducing traffic waste and performance loss. The application also discloses a data synchronization device, equipment, a system and a computer readable storage medium, which have the same beneficial effects.

Description

Data synchronization method, device, equipment, system and computer readable storage medium
Technical Field
The present application relates to the field of big data technologies, and in particular, to a data synchronization method, and further, to a data synchronization apparatus, device, system, and computer-readable storage medium.
Background
In the prior art, data synchronization between two data ends is generally achieved in one of two ways: data import and export based on an external tool, or data import and export based on a cache. The approach based on a third-party tool serializes the data into content readable by the tool and then deserializes that content back into the data format of the data end to be synchronized; the data must cross the network once during serialization, and may also be written to disk. The approach based on a cache sacrifices extra disk resources, and the data undergoes multiple disk writes and serializations, resulting in a large performance loss. In addition, in both approaches the import and export task is completed by first moving the data out of the cluster and then moving it back into the cluster, which wastes traffic and makes data synchronization extremely inefficient.
Therefore, how to effectively improve the data synchronization efficiency between two data synchronization parties while reducing the traffic waste and the performance loss is a problem to be solved by those skilled in the art.
Disclosure of Invention
An object of the present application is to provide a data synchronization method that can effectively improve the data synchronization efficiency between two synchronizing parties while reducing traffic waste and performance loss; another object of the present application is to provide a data synchronization device, an apparatus, a system, and a computer-readable storage medium, which also have the above-mentioned advantages.
In order to solve the above technical problem, the present application provides a data synchronization method, where the data synchronization method includes:
Spark acquires source data at a source data end according to a data acquisition instruction;
performing labeling processing on the source data to obtain a partition template;
after the source data are partitioned based on the partition template, extracting target synchronous data from the partitioned source data according to a preset rule;
and writing the target synchronous data into a target data terminal.
Preferably, the source data end is Hive, and the target data end is HBase.
Preferably, the performing labeling processing on the source data to obtain a partition template includes:
performing data type matching on the source data to obtain source data after type matching;
performing column description analysis on the source data after the type matching to obtain a partition column and a partition value;
and generating the partition template according to the partition columns and the partition values.
Preferably, the extracting the target synchronization data from the partitioned source data according to the preset rule includes:
and extracting source data corresponding to the HBase primary key from the partitioned source data as the target synchronous data.
Preferably, the data synchronization method further includes:
creating a data mapping table on the HBase; wherein, the data mapping table comprises a predetermined number of HBase primary keys.
Preferably, the data synchronization method further includes:
reading HBase data in the HBase;
and writing the HBase data into the Hive.
In order to solve the above technical problem, the present application provides a data synchronization apparatus, including:
the data acquisition module is used for acquiring source data at a source data end according to the data acquisition instruction;
the data processing module is used for performing labeling processing on the source data to obtain a partition template;
the data extraction module is used for extracting target synchronous data from the partitioned source data according to a preset rule after the source data are partitioned based on the partition template;
and the data synchronization module is used for writing the target synchronization data into a target data end.
In order to solve the above technical problem, the present application provides a data synchronization device, including:
a memory for storing a computer program;
a processor for implementing the steps of any of the above data synchronization methods when executing the computer program.
In order to solve the above technical problem, the present application provides a data synchronization system, including:
the data synchronization device, configured to obtain source data at a source data end according to a data acquisition instruction; perform labeling processing on the source data to obtain a partition template; after the source data are partitioned based on the partition template, extract target synchronous data from the partitioned source data according to a preset rule; and write the target synchronous data into a target data terminal;
the source data terminal is used for providing the source data;
and the target data terminal is used for storing the source data.
To solve the above technical problem, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above data synchronization methods.
In the data synchronization method provided by the application, Spark acquires source data at a source data end according to a data acquisition instruction; labeling processing is performed on the source data to obtain a partition template; after the source data are partitioned based on the partition template, target synchronous data are extracted from the partitioned source data according to a preset rule; and the target synchronous data are written into a target data terminal.
Therefore, in the data synchronization method provided by the application, data synchronization between two data ends is realized through Spark. After Spark obtains the source data at the source data end, it first performs labeling processing on the source data according to related information of the source data end to obtain a corresponding partition template, so that a technician can partition the source data directly from the partition template. In other words, this implementation directly gives a corresponding resource configuration indication, so that automatic partitioning of the source data and automatic configuration of the related format at the target data end are realized, and data synchronization is achieved in turn. Compared with the traditional data synchronization mode, the method requires neither a manual decision on the partition type nor manual partitioning, which improves the partitioning speed and accuracy and therefore the data synchronization efficiency. In addition, because Spark computes on data in memory, the data to be synchronized can flow directly through Spark memory without being written to disk, so no additional disk resources need to be introduced, hardware resources are saved, and performance loss is reduced to a certain extent. Meanwhile, the source data can be read directly in the format of the source data end and written directly into the target data terminal, which avoids serializing the data.
The data synchronization device, the data synchronization system, and the computer-readable storage medium provided by the present application all have the above beneficial effects, and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flow chart of a data synchronization method provided in the present application;
Fig. 2 is a schematic flow chart of another data synchronization method provided in the present application;
Fig. 3 is a schematic flow chart of a partition template obtaining method provided in the present application;
Fig. 4 is a flowchart of Hive data analysis provided herein;
Fig. 5 is a schematic structural diagram of a data synchronization apparatus provided in the present application;
Fig. 6 is a schematic structural diagram of a data synchronization device provided in the present application;
Fig. 7 is a schematic structural diagram of a data synchronization system provided in the present application.
Detailed Description
The core of the application is to provide a data synchronization method that can effectively improve the data synchronization efficiency between two synchronizing parties while reducing traffic waste and performance loss; another core of the present application is to provide a data synchronization apparatus, a device, a system and a computer-readable storage medium, which also have the above-mentioned advantages.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the prior art, the data synchronization process relies on an external tool or a cache and involves multiple disk writes and serializations of the data, so performance loss and traffic waste are large and data synchronization efficiency is extremely low. To solve these problems, the present application provides a data synchronization method that analyzes the acquired source data with Spark and provides a corresponding resource configuration indication, so as to implement automatic partitioning of the source data and automatic configuration of the related format at the target data end, thereby implementing data synchronization and effectively improving data synchronization efficiency while reducing traffic waste and performance loss.
Referring to fig. 1, fig. 1 is a schematic flow chart of a data synchronization method provided in the present application, where the data synchronization method may include:
S101: Spark acquires source data at a source data end according to a data acquisition instruction;
Apache Spark is a fast, general-purpose cluster computing system. Because it computes in memory, it has great performance advantages and strong adaptability, which makes it convenient for synchronizing data among multiple types of data ends. Specifically, when data synchronization between two data ends is required, Spark may obtain source data from the data provider, that is, the source data end, and forward it so that the source data is imported to the data end that needs the data, that is, the target data end. The source data can be acquired in response to a corresponding data acquisition instruction, which a technician can issue from a client.
Because Spark is widely applicable, the application does not limit the specific types of the two synchronizing parties, namely the source data end and the target data end, provided that they satisfy the data formats supported by Spark.
S102: performing labeling processing on the source data to obtain a partition template;
This step labels the source data so as to obtain data that meets the storage conditions of the target data terminal, which facilitates the subsequent data synchronization. In the prior art, the source data is analyzed manually to decide the partition type and is then partitioned manually; in the present application, the source data is labeled by Spark, a partition template for the source data is obtained, and a related resource configuration indication is given. A user can therefore partition the source data directly according to the partition template without manual analysis, which improves partition accuracy, saves partitioning time, improves partitioning efficiency, and further improves data synchronization efficiency.
S103: after partitioning of the source data is completed based on the partitioning template, extracting target synchronous data from the partitioned source data according to a preset rule;
Specifically, after the partition template is obtained in S102, the user may partition the source data directly according to the indications in the partition template. The partitioned source data corresponds to the related format of the target data end, so the source data that satisfies the data format of the target data end can be extracted directly from the partitioned source data according to the preset rule.
The preset rule is set in advance based on the partition template, and different types of target data terminals correspond to different preset rules.
S104: and writing the target synchronous data into the target data terminal.
Specifically, this step writes the target synchronous data extracted in S103 directly into the target data terminal, thereby completing data synchronization between the source data end and the target data end.
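Steps S101 to S104 can be sketched in plain Python as follows. This is only an illustrative sketch, not the implementation of the application: in-memory lists and dictionaries stand in for Spark and the two data ends, and every function name, the "non-empty key" preset rule and the sample columns are assumptions made for the example.

```python
from typing import Dict, List

Row = Dict[str, str]


def acquire(source: List[Row]) -> List[Row]:
    # S101: stand-in for Spark reading the source data end.
    return list(source)


def build_partition_template(rows: List[Row], partition_column: str) -> dict:
    # S102: record the partition column and the distinct values found in it.
    values = sorted({row[partition_column] for row in rows})
    return {"partition_column": partition_column, "partition_values": values}


def partition(rows: List[Row], template: dict) -> Dict[str, List[Row]]:
    # Group the rows by the partition column named in the template.
    column = template["partition_column"]
    parts: Dict[str, List[Row]] = {v: [] for v in template["partition_values"]}
    for row in rows:
        parts[row[column]].append(row)
    return parts


def extract(parts: Dict[str, List[Row]], key_column: str) -> List[Row]:
    # S103: assumed preset rule, keep only rows that carry a non-empty key.
    return [row for value in sorted(parts) for row in parts[value]
            if row.get(key_column)]


def write(target: Dict[str, Row], rows: List[Row], key_column: str) -> None:
    # S104: stand-in for writing into the target data terminal.
    for row in rows:
        target[row[key_column]] = row


def synchronize(source, target, partition_column, key_column):
    rows = acquire(source)
    template = build_partition_template(rows, partition_column)
    parts = partition(rows, template)
    write(target, extract(parts, key_column), key_column)
    return template


source_rows = [
    {"id": "1", "city": "hz"},
    {"id": "2", "city": "bj"},
    {"id": "", "city": "hz"},  # dropped by the assumed preset rule
]
target_store: Dict[str, Row] = {}
template = synchronize(source_rows, target_store, "city", "id")
```

Keeping each step as a separate function mirrors the four-step structure of the method, so any step can be swapped for a Spark-backed version without touching the others.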
In the data synchronization method provided by the application, data synchronization between two data ends is realized through Spark. After Spark obtains the source data at the source data end, it first performs labeling processing on the source data according to related information of the source data end to obtain a corresponding partition template, so that a technician can partition the source data directly from the partition template. In other words, this implementation directly gives a corresponding resource configuration indication, so that automatic partitioning of the source data and automatic configuration of the related format at the target data end are realized, and data synchronization is achieved in turn. Compared with the traditional data synchronization mode, the method requires neither a manual decision on the partition type nor manual partitioning, which improves the partitioning speed and accuracy and therefore the data synchronization efficiency. In addition, because Spark computes on data in memory, the data to be synchronized can flow directly through Spark memory without being written to disk, so no additional disk resources need to be introduced, hardware resources are saved, and performance loss is reduced to a certain extent. Meanwhile, the source data can be read directly in the format of the source data end and written directly into the target data terminal, which avoids serializing the data.
In addition to the above embodiments, as a preferred embodiment, the source data side is Hive, and the target data side is HBase.
In the existing process of data synchronization between Hive and HBase, whether data import and export is based on an external tool or on a cache, the import and export must be set up manually on the HBase side, and the person performing the export on the Hive side must specify the size and format of the files on HDFS. The user therefore needs a working knowledge of both HBase and Hive, which places professional requirements on the user and lowers applicability. In addition, before data import and export, the partition type must be decided manually, the Hive partitions must be created manually, and the number of regions in HBase must be specified manually according to the data size and the service model.
Therefore, the application provides a data synchronization method for a specific application scenario, in which Spark is applied to data synchronization between Hive and HBase. Apache Spark provides a Hive on Spark mode, so that computation on the data stored in Hive, namely Hive data, can be carried out with Spark. However, Hive on Spark SQL does not offer efficient query and data access, whereas HBase provides high-performance data writing and querying. The capability of reading and writing HBase clusters can therefore be introduced into a Spark Application to realize data intercommunication between Hive and HBase, that is, a result set computed and cleaned by Spark is imported into HBase for users to access.
Referring to fig. 2, fig. 2 is a schematic flow chart of another data synchronization method provided in the present application.
S201: Spark acquires Hive data in Hive according to the data acquisition instruction;
Specifically, when data synchronization between Hive and HBase is required, Spark may obtain the Hive data, that is, the source data, from Hive based on the received data acquisition instruction and then forward it, so that the source data is imported into HBase, thereby implementing data synchronization between Hive and HBase.
S202: labeling the Hive data to obtain a partition template;
Specifically, Hive is a Hadoop-based data warehouse tool, and all of its data is stored in HDFS. Hive can map structured data files to database tables, but it neither prescribes a dedicated data storage format nor provides indexing for the data. A user can therefore organize the tables in Hive freely, and only needs to specify the column separator and row separator of the Hive data when creating a table. HBase, by contrast, is a distributed, column-oriented open-source database with a fixed data storage format, suited to storing unstructured data. Therefore, before the Hive data is written into HBase, it needs to be partitioned to meet the data storage conditions of HBase.
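To illustrate what "column separators and row separators" mean for Hive text data, the following minimal sketch splits a delimited text block into rows. The function name and the sample columns are hypothetical; the defaults follow Hive's default text format, where fields are separated by Ctrl-A (`\x01`) and rows by a newline.

```python
from typing import Dict, List


def parse_hive_text(data: str, columns: List[str],
                    field_sep: str = "\x01",
                    row_sep: str = "\n") -> List[Dict[str, str]]:
    # Split the text into rows, then each row into named fields.
    rows = []
    for line in data.split(row_sep):
        if line:  # skip the empty piece after a trailing row separator
            rows.append(dict(zip(columns, line.split(field_sep))))
    return rows


sample = "1\x01hz\n2\x01bj\n"
parsed = parse_hive_text(sample, ["id", "city"])
```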
Preferably, referring to fig. 3, fig. 3 is a schematic flowchart of a partition template obtaining method provided in the present application, where the performing tagging processing on the source data to obtain the partition template may include:
S301: performing data type matching on the source data to obtain source data after type matching;
S302: performing column description analysis on the source data after type matching to obtain a partition column and a partition value;
S303: and generating a partition template according to the partition columns and the partition values.
Specifically, the present application provides a more specific method for obtaining the partition template, again taking the Hive data as an example. First, data type matching can be performed on the Hive data, that is, each item of data is labeled to determine its type, which is equivalent to obtaining a type mapping table. This process mainly aims to reduce the number of data types that need to be handled; for example, fields such as char and varchar can be uniformly classified as String. Of course, the number and specific kinds of the resulting data types are not specifically limited in the present application; for example, in the present application five basic data types are obtained by fuzzy matching of the Hive data.
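The type mapping table described above can be sketched as a small rule list. The source only says that char and varchar collapse to String and that five basic types result; the other four type names, the regular expressions and the String fallback below are assumptions chosen for illustration.

```python
import re
from typing import Dict

# Assumed five basic types; only the char/varchar -> String rule comes
# from the text, the rest are illustrative.
BASIC_TYPE_RULES = [
    (re.compile(r"^(char|varchar|string)\b"), "String"),
    (re.compile(r"^(tinyint|smallint|int|bigint)\b"), "Integer"),
    (re.compile(r"^(float|double|decimal)\b"), "Decimal"),
    (re.compile(r"^boolean\b"), "Boolean"),
    (re.compile(r"^(date|timestamp)\b"), "DateTime"),
]


def match_basic_type(hive_type: str) -> str:
    # Fuzzy match: ignore case, whitespace and length/precision arguments.
    normalized = hive_type.strip().lower()
    for pattern, basic_type in BASIC_TYPE_RULES:
        if pattern.match(normalized):
            return basic_type
    return "String"  # assumed fallback for unrecognized types


def build_type_mapping(schema: Dict[str, str]) -> Dict[str, str]:
    # Produce the type mapping table: column name -> basic type.
    return {column: match_basic_type(t) for column, t in schema.items()}


mapping = build_type_mapping(
    {"id": "bigint", "name": "varchar(20)", "created": "timestamp"}
)
```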
Further, after the data type matching is completed, column description analysis can be performed on the type-matched Hive data. That is, the description information recorded for each Hive data column is analyzed, and fields with a fixed format that are suitable as Hive partition fields, such as location, type and generator, are captured, so that the corresponding partition columns and partition values can be obtained. Finally, the partition template is generated automatically from the partition columns and partition values, so that the user can partition the Hive data according to the partition template.
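A minimal sketch of column description analysis and template generation is shown below. It assumes that each column carries a free-text description and that a simple keyword match decides whether a column is suitable as a partition field; the keyword list, the dictionary layout of the template and all names are hypothetical.

```python
from typing import Dict, List

# Assumed keywords that mark a column as a fixed-format partition candidate.
PARTITION_HINTS = ("location", "type")


def generate_partition_template(columns: List[Dict[str, str]],
                                rows: List[Dict[str, str]]) -> Dict[str, List[str]]:
    # Map each detected partition column to its distinct partition values.
    template: Dict[str, List[str]] = {}
    for column in columns:
        text = (column["name"] + " " + column.get("comment", "")).lower()
        if any(hint in text for hint in PARTITION_HINTS):
            name = column["name"]
            template[name] = sorted({row[name] for row in rows})
    return template


columns = [
    {"name": "id", "comment": "self-increment id"},
    {"name": "city", "comment": "location of the order"},
]
rows = [
    {"id": "1", "city": "hz"},
    {"id": "2", "city": "bj"},
    {"id": "3", "city": "hz"},
]
partition_template = generate_partition_template(columns, rows)
```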
Preferably, the step of performing data type matching on the Hive data to obtain the Hive data after type matching may include performing data type matching on the Hive data through a data classification algorithm to obtain the Hive data after type matching.
Specifically, the data type matching process is similar to data classification and can therefore be implemented with a corresponding data classification algorithm. The specific data classification algorithm can be chosen in advance according to the actual situation and is not limited in the present application; for example, a decision tree or a Bayesian classifier may be used.
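As a rough illustration of classification-based type matching, the sketch below uses a hand-written chain of decisions on sample values and a majority vote per column. It is a stand-in for, not an instance of, a trained decision tree or Bayesian classifier; the type names and the voting rule are assumptions.

```python
from collections import Counter
from typing import List


def classify_value(value: str) -> str:
    # A fixed sequence of tests, ordered from most to least specific.
    text = value.strip().lower()
    if text in ("true", "false"):
        return "Boolean"
    try:
        int(text)
        return "Integer"
    except ValueError:
        pass
    try:
        float(text)
        return "Decimal"
    except ValueError:
        return "String"


def classify_column(samples: List[str]) -> str:
    # Majority vote over the sampled values; empty input defaults to String.
    if not samples:
        return "String"
    votes = Counter(classify_value(sample) for sample in samples)
    return votes.most_common(1)[0][0]


column_type = classify_column(["1", "2", "x"])
```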
S203: after partitioning of Hive data is completed based on the partitioning template, extracting data corresponding to HBase primary keys from the partitioned Hive data;
Specifically, after the partition template is obtained in S202, the user may partition the Hive data directly according to the indications in the partition template. The partitioned Hive data corresponds to the HBase primary key, so the data corresponding to the HBase primary key can be extracted directly from the partitioned Hive data.
The HBase primary key is set in advance based on the partition template. Some fields in the Hive data are suitable as partition fields, and some can serve as HBase primary keys; the HBase primary key can therefore be obtained from the partition template generated by the labeling processing and then set on HBase.
Preferably, the data synchronization method may further include creating a data mapping table on the HBase; wherein, the data mapping table comprises a predetermined number of HBase primary keys.
Specifically, the HBase primary key is not necessarily unique: several fields, such as a self-increment column or a hash column, may each serve as one. When it is determined that multiple HBase primary keys are to be created in HBase, all of them may be stored in list form, which yields the data mapping table.
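The data mapping table can be pictured as an ordered list of candidate primary keys capped at a predetermined number. The sketch below assumes that candidates arrive as column names and that duplicates should be dropped; the function name and sample names are hypothetical.

```python
from typing import List


def create_data_mapping_table(candidates: List[str], limit: int) -> List[str]:
    # Drop duplicates while preserving order, then keep at most `limit` keys.
    unique = list(dict.fromkeys(candidates))
    return unique[:limit]


mapping_table = create_data_mapping_table(
    ["order_id", "user_hash", "order_id", "seq"], limit=2
)
```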
S204: data corresponding to the primary key of HBase is written into HBase.
Specifically, this step is intended to write the data corresponding to the primary key of the HBase extracted in S203 directly into the HBase, thereby completing data synchronization between Hive and HBase.
Preferably, the writing of the data corresponding to the primary key of the HBase to the HBase may include writing the data corresponding to the primary key of the HBase to the HBase through Spark RDD.
Specifically, the data corresponding to the HBase primary key may be written by means of Spark RDDs (Resilient Distributed Datasets). An RDD provides a highly restricted form of shared memory: it is a read-only, partitioned collection with strong fault tolerance that allows developers to perform in-memory computations on large clusters. The Hive data therefore flows through Spark RDDs without being written to disk, which saves hardware resources, avoids the serialization process, and improves data synchronization efficiency.
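The per-partition write can be sketched without a cluster as follows. A nested dictionary stands in for the HBase table, and the column family name `cf` and all function names are assumptions. In a real Spark job a function of this shape could be passed to `RDD.foreachPartition`; the HBase client call itself is not shown because the source does not specify it.

```python
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[str, str, str, str]  # (rowkey, column family, qualifier, value)


def row_to_cells(row: Dict[str, object], key_column: str,
                 family: str = "cf") -> List[Cell]:
    # Every non-key field becomes one cell under the row's primary key.
    rowkey = str(row[key_column])
    return [(rowkey, family, qualifier, str(value))
            for qualifier, value in row.items() if qualifier != key_column]


def write_partition(rows: Iterable[Dict[str, object]],
                    table: Dict[str, Dict[str, str]],
                    key_column: str) -> None:
    # Process one partition of rows entirely in memory, cell by cell.
    for row in rows:
        for rowkey, family, qualifier, value in row_to_cells(row, key_column):
            table.setdefault(rowkey, {})[f"{family}:{qualifier}"] = value


hbase_table: Dict[str, Dict[str, str]] = {}
write_partition([{"id": "1", "city": "hz", "amount": 3}], hbase_table, "id")
```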
As a preferred embodiment, the data synchronization method may further include reading HBase data in the HBase; and writing the HBase data into Hive.
Specifically, the above method embodiment realizes data synchronization from Hive to HBase; data synchronization from HBase to Hive can be realized in the same way. Because data storage in Hive has no fixed format whereas HBase stores data in a fixed format, data does not need to be partitioned first when it is synchronized from HBase to Hive and can be read and written directly. Specifically, Spark may read the HBase data in HBase directly and write it into Hive. Any subsequent partitioning can also be carried out through SQL.
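The reverse direction can be sketched as a pivot from HBase-style cells back to flat rows. As before, a nested dictionary stands in for the HBase table, and the `family:qualifier` column naming and the function name are assumptions made for the example.

```python
from typing import Dict, List


def cells_to_rows(table: Dict[str, Dict[str, str]],
                  key_column: str) -> List[Dict[str, str]]:
    # Turn each rowkey and its cells into one flat row for the Hive side.
    rows = []
    for rowkey in sorted(table):
        row = {key_column: rowkey}
        for column, value in table[rowkey].items():
            _, qualifier = column.split(":", 1)  # drop the column family
            row[qualifier] = value
        rows.append(row)
    return rows


hive_rows = cells_to_rows(
    {"2": {"cf:city": "bj"}, "1": {"cf:city": "hz"}}, key_column="id"
)
```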
In the method for synchronizing data between Hive and HBase provided by the application, data synchronization between Hive and HBase is achieved through Spark. After Spark obtains the Hive data in Hive, it performs labeling processing on the Hive data according to related information of Hive and obtains a corresponding partition template. Compared with the traditional data synchronization mode, the method requires neither a manual decision on the partition type nor manual partitioning, which improves the partitioning speed and accuracy and therefore the data synchronization efficiency. In addition, because Spark computes on data in memory, the data to be synchronized can flow directly through Spark memory without being written to disk, so no additional disk resources need to be introduced, hardware resources are saved, and performance loss is reduced to a certain extent. Meanwhile, the Hive data can be read directly in the Hive format and written directly into HBase, which avoids serializing the data.
On the basis of the various embodiments, the application provides a more specific implementation mode, which is applied to Hive and HBase.
First, the configuration management function of the open source component can be utilized to manage the relevant configuration of the HBase and Hive to obtain the partition template.
Specifically, modeling of Hive data consists in selecting partition columns and partition values, while modeling of HBase consists in selecting the rowkey, deciding whether to add a salt, and selecting split points. Therefore, in the case of autonomous modeling, before data is exported from the data source, that is, the Hive data, data modeling analysis, namely the labeling processing, may be performed in a Spark Application. The specific process is shown in fig. 4, which is a flowchart of Hive data analysis provided by this application.
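The HBase side of this modeling (rowkey, salt, split points) can be illustrated with a common salting scheme. The source does not describe how the salt is computed, so the MD5-based bucket prefix, the two-digit format and the function names below are assumptions.

```python
import hashlib
from typing import List


def salted_rowkey(key: str, buckets: int) -> str:
    # Prefix the key with a deterministic bucket number so that
    # sequential keys spread across regions.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % buckets
    return f"{salt:02d}|{key}"


def split_points(buckets: int) -> List[str]:
    # One region boundary per salt prefix, excluding the first bucket.
    return [f"{bucket:02d}" for bucket in range(1, buckets)]


rowkey = salted_rowkey("1001", buckets=4)
boundaries = split_points(4)
```

Because the salt is derived from the key itself, the same key always maps to the same prefix, so a reader can recompute the salted rowkey for point lookups.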
After the Hive data is obtained, data type analysis, namely data type matching, can be performed on it, so that Hive data of several basic types is obtained through fuzzy matching. Column description analysis is then performed on the type-matched Hive data to exclude fields suitable as HBase primary keys, obtain fixed-format fields suitable as Hive partitions, and recommend them as Hive partition values. Next, fields with similar data arrangement attributes, such as a hash column or a self-increment column, can be extracted from the type-matched Hive data to serve as the HBase primary key. The HBase primary key can also be derived on the basis of the completed Hive partition values; for example, for a column with a hash attribute the hash can be used as the HBase primary key, and for a column with a self-increment attribute a salted value can be used as the HBase primary key. The Hive partition values and the HBase primary key are thus obtained, and model matching can then be carried out on them. When no matching model is obtained, the Hive data can be analyzed again in repeated cycles, for example by combining related parameter combinations, types and field attributes; if no matching model is obtained after a preset number of cycles, a result indicating that no adaptable model exists is output. If a matching model is obtained, a corresponding configuration suggestion is output, so that the user can partition the Hive data and create the HBase primary key according to the configuration suggestion.
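The cyclic analysis with a preset cycle limit can be sketched as a loop over alternative analysis strategies. The two strategies, the column attributes they inspect and the shape of the returned suggestion are all hypothetical; the source only states that analysis is repeated up to a preset number of cycles and that either a configuration suggestion or a "no adaptable model" result is output.

```python
from typing import Callable, Dict, List, Tuple

Column = Dict[str, str]
Analysis = Callable[[List[Column]], Tuple[List[str], List[str]]]


def by_description(columns: List[Column]) -> Tuple[List[str], List[str]]:
    # Strategy 1: look only at the column descriptions.
    parts = [c["name"] for c in columns if "location" in c.get("comment", "")]
    keys = [c["name"] for c in columns if "primary" in c.get("comment", "")]
    return parts, keys


def by_attribute(columns: List[Column]) -> Tuple[List[str], List[str]]:
    # Strategy 2: look at the field attributes instead.
    parts = [c["name"] for c in columns if c.get("attribute") == "fixed-format"]
    keys = [c["name"] for c in columns
            if c.get("attribute") in ("hash", "self-increment")]
    return parts, keys


def recommend_configuration(columns: List[Column], analyses: List[Analysis],
                            max_cycles: int) -> dict:
    # Try one analysis per cycle until both sides of the model are found.
    for cycle, analyze in enumerate(analyses[:max_cycles], start=1):
        partition_columns, primary_keys = analyze(columns)
        if partition_columns and primary_keys:
            return {"status": "matched", "cycle": cycle,
                    "hive_partition_columns": partition_columns,
                    "hbase_primary_keys": primary_keys}
    return {"status": "no adaptable model"}


columns = [
    {"name": "city", "attribute": "fixed-format"},
    {"name": "id", "attribute": "self-increment"},
]
suggestion = recommend_configuration(columns, [by_description, by_attribute], 3)
```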
Secondly, after the partitioning of the Hive data and the creation of the HBase primary key are completed, the corresponding AppHBase Spark plug-in can be loaded to read and write the HBase data.
Finally, the Hive data is synchronized between Hive and HBase through Spark RDD.
Data synchronization from Hive to HBase is thus realized based on the above process. For the reverse synchronization process, the HBase data is read directly by the Spark Application and written into Hive.
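The two synchronization directions can be sketched with in-memory stand-ins. The example below does not use Spark, Hive or HBase APIs: a list of dictionaries stands in for Hive rows, a dictionary keyed by row key stands in for an HBase table, and the function names `hive_to_hbase` and `hbase_to_hive` are hypothetical.

```python
from typing import Dict, List

Row = Dict[str, str]


def hive_to_hbase(rows: List[Row], key_column: str) -> Dict[str, Row]:
    """Forward direction: store each source row under the value of its key column."""
    table: Dict[str, Row] = {}
    for row in rows:
        key = row[key_column]
        # The remaining columns become the cells stored under that key.
        table[key] = {c: v for c, v in row.items() if c != key_column}
    return table


def hbase_to_hive(table: Dict[str, Row], key_column: str) -> List[Row]:
    """Reverse direction: read every stored row and emit it as a flat record."""
    # Rows are emitted in key order, mirroring the sorted row-key order of an HBase scan.
    return [{key_column: key, **cells} for key, cells in sorted(table.items())]
```

In this simplified model the reverse direction needs no partition analysis, which matches the description: the stored data is read as-is and written back as flat records.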
The method for synchronizing data between Hive and HBase provided by this embodiment of the application analyzes the Hive data based on Spark and provides corresponding resource configuration instructions, thereby realizing automatic partitioning of the Hive data and automatic region configuration of HBase. Data synchronization is achieved on this basis, which effectively improves the efficiency of data synchronization between Hive and HBase while reducing wasted traffic and performance loss.
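To make the relationship between salted primary keys and region configuration concrete, the following Python sketch builds a salted key for an auto-increment id and derives matching region split points. The key layout (a two-digit salt, an underscore, a zero-padded id), the bucket count, and the function names are assumptions for illustration; the source does not specify a key format.

```python
from typing import List


def salted_key(row_id: int, buckets: int) -> str:
    """Prefix an auto-increment id with a salt so consecutive ids land in different buckets."""
    # Assumes buckets <= 100 so the salt fits in two digits.
    salt = row_id % buckets
    return f"{salt:02d}_{row_id:010d}"


def split_points(buckets: int) -> List[str]:
    """Region boundaries that place each salt prefix in its own key range."""
    return [f"{b:02d}_" for b in range(1, buckets)]
```

With four buckets, ids 7 and 8 map to the prefixes `03_` and `00_`, so consecutive writes fall into different key ranges instead of all targeting the region that holds the highest keys.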
To solve the above problem, please refer to fig. 5, which is a schematic structural diagram of a data synchronization apparatus provided in the present application. The data synchronization apparatus may include:
the data acquisition module 1 is used for acquiring source data at a source data end according to a data acquisition instruction;
the data processing module 2 is used for performing labeling processing on the source data to obtain a partition template;
the data extraction module 3 is used for extracting target synchronous data from the partitioned source data according to a preset rule after the source data are partitioned based on the partition template;
and the data synchronization module 4 is used for writing the target synchronization data into the target data terminal.
As a preferred embodiment, the data processing module 2 may include:
the data matching unit is used for carrying out data type matching on the source data to obtain source data after the type matching;
the data analysis unit is used for performing column description analysis on the source data after the type matching to obtain a partition column and a partition value;
and the template generating unit is used for generating a partition template according to the partition columns and the partition values.
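A minimal Python sketch of the template generating unit and of partitioning by the resulting template is given below. The `PartitionTemplate` class and the function names are hypothetical; the `column=value` rendering follows the naming Hive uses for partition directories, and rows are modeled as plain dictionaries.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]


@dataclass(frozen=True)
class PartitionTemplate:
    column: str        # partition column found by column description analysis
    values: List[str]  # distinct partition values observed in the source data


def build_template(rows: List[Row], partition_column: str) -> PartitionTemplate:
    """Generate a partition template from the partition column and its values."""
    values = sorted({row[partition_column] for row in rows})
    return PartitionTemplate(partition_column, values)


def partition_specs(template: PartitionTemplate) -> List[str]:
    """Render the template as Hive-style `column=value` partition specs."""
    return [f"{template.column}={v}" for v in template.values]


def apply_template(rows: List[Row], template: PartitionTemplate) -> Dict[str, List[Row]]:
    """Group the source rows into the partitions named by the template."""
    parts: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        parts[f"{template.column}={row[template.column]}"].append(row)
    return dict(parts)
```

The data extraction module would then operate on the groups returned by `apply_template` rather than on the unpartitioned source data.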
As a preferred embodiment, the data extraction module 3 may be specifically configured to extract, from the partitioned source data, the source data corresponding to the HBase primary key as the target synchronization data.
As a preferred embodiment, the data synchronization apparatus may further include:
the preprocessing module is used for creating a data mapping table on the HBase; wherein, the data mapping table comprises a predetermined number of HBase primary keys.
As a preferred embodiment, the data synchronization apparatus may further include:
the reverse synchronization module is used for reading HBase data in HBase; and writing the HBase data into Hive.
For details of the apparatus provided in the present application, please refer to the above method embodiments; they are not repeated here.
To solve the above problem, please refer to fig. 6, which is a schematic structural diagram of a data synchronization device provided in the present application. The data synchronization device may include:
a memory 11 for storing a computer program;
a processor 12 for implementing the following steps when executing the computer program:
acquiring source data at a source data end according to the data acquisition instruction; performing labeling processing on the source data to obtain a partition template; after partitioning of the source data is completed based on the partitioning template, extracting target synchronous data from the partitioned source data according to a preset rule; and writing the target synchronous data into the target data terminal.
For details of the device provided in the present application, please refer to the above method embodiments; they are not repeated here.
To solve the above problem, please refer to fig. 7, which is a schematic structural diagram of a data synchronization system provided in the present application. The data synchronization system may include:
the data synchronization device 10 as above, configured to obtain source data at the source data end 20 according to the data obtaining instruction; performing labeling processing on the source data to obtain a partition template; after partitioning of the source data is completed based on the partitioning template, extracting target synchronous data from the partitioned source data according to a preset rule; writing the target synchronous data into the target data terminal 30;
a source data terminal 20 for providing source data;
and the target data terminal 30 is used for storing the source data.
It should be noted that the above description takes the process of importing source data from the source data terminal 20 into the target data terminal 30 as an example, but the same applies to importing data from the target data terminal 30 into the source data terminal 20. When data needs to be imported from the target data terminal 30 into the source data terminal 20, the target data terminal 30 serves as the end that provides the source data, and the source data terminal 20 serves as the end that stores the source data.
For details of the system provided in the present application, please refer to the above method embodiments; they are not repeated here.
To solve the above problem, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program when executed by a processor can implement the following steps:
acquiring source data at a source data end according to the data acquisition instruction; performing labeling processing on the source data to obtain a partition template; after partitioning of the source data is completed based on the partitioning template, extracting target synchronous data from the partitioned source data according to a preset rule; and writing the target synchronous data into the target data terminal.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
For details of the computer-readable storage medium provided in the present application, please refer to the above method embodiments; they are not repeated here.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be cross-referenced. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), internal memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The data synchronization method, apparatus, device, system and computer-readable storage medium provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims (9)

1. A method of data synchronization, comprising:
Spark acquires source data at a source data end according to a data acquisition instruction;
performing data type matching on the source data to obtain source data after type matching;
performing column description analysis on the source data after the type matching to obtain a partition column and a partition value;
generating a partition template according to the partition columns and the partition values;
after the source data are partitioned based on the partition template, extracting target synchronous data from the partitioned source data according to a preset rule;
and writing the target synchronous data into a target data terminal.
2. The data synchronization method of claim 1, wherein the source data side is Hive and the target data side is HBase.
3. The data synchronization method according to claim 2, wherein the extracting the target synchronization data from the partitioned source data according to the preset rule comprises:
and extracting source data corresponding to the HBase primary key from the partitioned source data as the target synchronous data.
4. The data synchronization method of claim 3, further comprising:
creating a data mapping table on the HBase; wherein, the data mapping table comprises a predetermined number of HBase primary keys.
5. The data synchronization method of claim 2, further comprising:
reading HBase data in the HBase;
writing the HBase data into the Hive.
6. A data synchronization apparatus, comprising:
the data acquisition module is used for acquiring source data at a source data end according to the data acquisition instruction;
the data processing module is used for carrying out data type matching on the source data to obtain source data after type matching; performing column description analysis on the source data after the type matching to obtain a partition column and a partition value; generating a partition template according to the partition columns and the partition values;
the data extraction module is used for extracting target synchronous data from the partitioned source data according to a preset rule after the source data are partitioned based on the partition template;
and the data synchronization module is used for writing the target synchronization data into a target data end.
7. A data synchronization device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the data synchronization method according to any one of claims 1 to 5 when executing said computer program.
8. A data synchronization system, comprising:
the data synchronization device of claim 7, configured to obtain source data at a source data end according to a data obtaining instruction; performing data type matching on the source data to obtain source data after type matching; performing column description analysis on the source data after the type matching to obtain a partition column and a partition value; generating a partition template according to the partition columns and the partition values; after the source data are partitioned based on the partition template, extracting target synchronous data from the partitioned source data according to a preset rule; writing the target synchronous data into a target data terminal;
the source data terminal is used for providing the source data;
and the target data terminal is used for storing the source data.
9. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the data synchronization method according to any one of claims 1 to 5.
CN201811527177.1A 2018-12-13 2018-12-13 Data synchronization method, device, equipment, system and computer readable storage medium Active CN109857803B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811527177.1A CN109857803B (en) 2018-12-13 2018-12-13 Data synchronization method, device, equipment, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811527177.1A CN109857803B (en) 2018-12-13 2018-12-13 Data synchronization method, device, equipment, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109857803A CN109857803A (en) 2019-06-07
CN109857803B true CN109857803B (en) 2020-09-08

Family

ID=66891032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811527177.1A Active CN109857803B (en) 2018-12-13 2018-12-13 Data synchronization method, device, equipment, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109857803B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737683A (en) * 2019-10-18 2020-01-31 成都四方伟业软件股份有限公司 Automatic partitioning method and device for extraction-based business intelligent analysis platforms
CN111008235A (en) * 2019-12-03 2020-04-14 成都四方伟业软件股份有限公司 Spark-based small file merging method and system
CN111651466B (en) * 2020-05-09 2023-07-25 杭州数梦工场科技有限公司 Data sampling method and device
CN111597245B (en) * 2020-05-20 2023-09-29 政采云有限公司 Data extraction method and device and related equipment
CN111538754A (en) * 2020-06-22 2020-08-14 杭州城市大数据运营有限公司 Data collection management system, method, device, equipment and storage medium
CN111813779A (en) * 2020-07-09 2020-10-23 携程旅游网络技术(上海)有限公司 Data query method, system, device and medium based on data interface configuration
CN112966015B (en) * 2021-02-01 2023-08-15 杭州博联智能科技股份有限公司 Big data analysis processing and storing method, device, equipment and medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417246A (en) * 2018-03-26 2018-08-17 大连大学 Use the monitoring method of Internet of Things smart cloud intelligent ECG monitoring and data processing system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10305967B2 (en) * 2016-03-14 2019-05-28 Business Objects Software Ltd. Unified client for distributed processing platform
CN106651633B (en) * 2016-10-09 2021-02-02 国网浙江省电力公司信息通信分公司 Power utilization information acquisition system based on big data technology and acquisition method thereof
CN106502964B (en) * 2016-12-06 2019-03-26 中国矿业大学 A kind of extreme learning machine parallelization calculation method based on Spark
CN108804697A (en) * 2018-06-15 2018-11-13 中国平安人寿保险股份有限公司 Method of data synchronization, device, computer equipment based on Spark and storage medium

Also Published As

Publication number Publication date
CN109857803A (en) 2019-06-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant