CN112860792A - Method and system for synchronizing mongodb cluster data and hive cluster data

Info

Publication number: CN112860792A
Application number: CN202110129698.7A
Authority: CN
Inventors: 李佳喜; 刘跃红
Original assignee: Yinsheng Payment Service Co Ltd
Current assignee: Yinsheng Payment Service Co Ltd
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-05-28

Abstract

The invention discloses a method and a system for synchronizing data between a mongodb cluster and a hive cluster, which belong to the technical field of system data analysis. The system comprises a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are connected in sequence. Preprocessing is performed in the hive cluster, and the local server executes a program so that data is synchronized from the mongodb cluster to the hive cluster; preprocessing is performed in the mongodb cluster, and the local server executes the program so that data is synchronized from the hive cluster to the mongodb cluster.

Description

Method and system for synchronizing mongodb cluster data and hive cluster data

Technical Field

The invention relates to the technical field of system data analysis, in particular to a method and a system for synchronizing mongodb cluster data and hive cluster data.

Background

At present, as many enterprises grow in scale, their branch companies are distributed across different cities for business management, and a parent company can advance its business processes only by analyzing the data of its online systems.

However, business requirements dictate that a large amount of data be analyzed along multiple dimensions every day and that reports be delivered to the business departments of the various subsidiaries, which places demands on the accuracy and timeliness of the data. The tools previously used for data synchronization took a long time, could not synchronize large amounts of data, and ran into memory problems, which seriously affected business processes.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a method for synchronizing data between a mongodb cluster and a hive cluster, and a synchronization system thereof, which solve the problem of synchronizing data between the mongodb cluster and the hive cluster.

The technical scheme adopted by the invention to solve the technical problem is as follows: a method for synchronizing data between a mongodb cluster and a hive cluster, the improvement comprising:

Preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;

preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.

As a further improvement of the above technical solution, the process of preprocessing in the hive cluster and executing the program through the local server so that the data can be synchronized from the mongodb cluster to the hive cluster includes the following steps:

step S1, installing a mongodb cluster client on one machine of the hive cluster;

step S2, exporting a json file from the mongodb cluster and storing the json file in a local server;

step S3, the hive cluster loading data in json file format, the json file of the local server being loaded into a database of the hive cluster;

step S4, performing data compression on the database in the hive cluster.

As a further improvement of the above technical solution, in step S2, the mongodb cluster calls its mongoexport command to export a json file.

As a further improvement of the above technical solution, before the json file is exported, commas and line feeds within character strings are replaced with an empty value.
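The replacement of commas and line feeds can be illustrated with a minimal Python sketch. The patent does not state where or with which tool the replacement is performed, so the function name `strip_delimiters`, the recursive handling of nested values, and the sample document below are all assumptions made for illustration only.

```python
from typing import Any


def strip_delimiters(value: Any) -> Any:
    """Recursively remove commas and line breaks from every string value.

    Hypothetical helper: the source only says that commas and line feeds
    in character strings are replaced with an empty value.
    """
    if isinstance(value, str):
        return value.replace(",", "").replace("\r", "").replace("\n", "")
    if isinstance(value, dict):
        return {key: strip_delimiters(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_delimiters(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


# Sample document (invented for the example).
doc = {"name": "Li, Jia", "note": "line1\nline2", "tags": ["a,b"], "amount": 12}
cleaned = strip_delimiters(doc)
```

Only string values are touched, so numeric and boolean fields keep their types when the cleaned document is later serialized.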

As a further improvement of the above technical solution, the process of preprocessing in the mongodb cluster and executing the program by the local server so that the data can be synchronized from the hive cluster to the mongodb cluster includes the following steps:

step K1, exporting csv files and column headers in the hive cluster, and storing the exported csv files in a local server;

step K2, calling a linux command to modify the csv file of the local server;

step K3, calling linux commands to add fields and types corresponding to the mongodb cluster to the modified csv file in the local server;

step K4, the client of the mongodb cluster calls the mongoimport command and imports the csv file, which has undergone the above modifications, into the mongodb cluster.

As a further improvement of the above technical solution, in step K1, before the exported csv file is stored in the local server, the column header needs to be handled: the export is configured to output only the content, without the column names.

A synchronization system comprises a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are sequentially connected;

The invention has the beneficial effects that: the invention solves the problem of mutual synchronization of a large amount of data between the mongodb cluster and the hive cluster, can accelerate the progress of the service, improves the efficiency and saves the cost.

Drawings

Fig. 1 is a structural frame diagram of the present invention.

Detailed Description

The invention is further illustrated with reference to the following figures and examples.

The conception, specific structure, and technical effects of the present invention are described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that its objects, features, and effects can be fully understood. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention; other embodiments obtained by those skilled in the art without inventive effort on the basis of these embodiments all fall within the protection scope of the present invention. In addition, the coupling/connection relations referred to in this patent do not mean that the components are directly connected, but that a better connection structure can be formed by adding or removing auxiliary connecting components according to the specific implementation. The technical features of the invention can be combined with one another provided they do not conflict.

Referring to fig. 1, the present invention discloses a method for synchronizing a mongodb cluster and a hive cluster with each other, which includes preprocessing in the hive cluster, executing a program through a local server so that data can be synchronized from the mongodb cluster to the hive cluster, and preprocessing in the mongodb cluster, executing the program through the local server so that data can be synchronized from the hive cluster to the mongodb cluster.

In the above embodiment, to synchronize data between the mongodb cluster and the hive cluster, both clusters are first preprocessed: whether data is synchronized from the mongodb cluster to the hive cluster or from the hive cluster to the mongodb cluster, it cannot be used directly by the receiving party and must first be processed by a program.

Further, the process of preprocessing in the hive cluster and executing the program through the local server so that the data can be synchronized from the mongodb cluster to the hive cluster comprises the following steps:

step S1, installing mongodb client on one machine of the hive cluster;

In step S2, the mongodb cluster calls its mongoexport command to export the json file. Commas and line feeds within strings are replaced with an empty value before the json file is exported.
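The export in step S2 might be driven from the local server roughly as follows. This is a sketch under stated assumptions: the patent names the export command but gives no parameters, so the host, port, database, collection and output path below are hypothetical, and the function `build_mongoexport_argv` is introduced only for illustration. The flags used (`--host`, `--port`, `--db`, `--collection`, `--type`, `--out`, `--username`, `--password`) are standard mongoexport options.

```python
from typing import List, Optional


def build_mongoexport_argv(
    host: str,
    port: int,
    db: str,
    collection: str,
    out_path: str,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a mongoexport call that writes JSON."""
    argv = [
        "mongoexport",
        "--host", host,
        "--port", str(port),
        "--db", db,
        "--collection", collection,
        "--type=json",      # one JSON document per line
        "--out", out_path,  # file on the local server
    ]
    # Credentials are optional; add them only when supplied.
    if username is not None:
        argv += ["--username", username]
    if password is not None:
        argv += ["--password", password]
    return argv


# Hypothetical connection details, for illustration only.
export_argv = build_mongoexport_argv(
    "10.0.0.5", 27017, "pay", "orders", "/data/sync/orders.json"
)
# To actually run it: subprocess.run(export_argv, check=True)
```

Building the command as a list rather than a shell string avoids quoting problems when a path or password contains spaces.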

In the above embodiment, in order to synchronize a large amount of data from the mongodb cluster to the hive cluster, the present invention first installs the mongodb client on a machine of the hive cluster, then exports a json file from the mongodb cluster and stores it in the local server; the hive cluster then loads the data in json file format, the json file of the local server being loaded into the database of the hive cluster, which completes the synchronization of data from the mongodb cluster to the hive cluster. 10 million records can be synchronized in less than 1 hour, which addresses the data transmission speed problem (faster than processing by a program).

In step S3, the json file of the local server is first loaded into a temporary table in the database of the hive cluster. Because the data is loaded directly, without compression, it occupies a large amount of disk space, so it is not cost-effective to keep much data in the temporary table. The data of the temporary table is therefore inserted, by means of hive sql, into a partition of the target table of the database, and the target table is compressed, which saves a large amount of space.
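The load-then-insert pattern of step S3 can be sketched as two HiveQL statements generated in Python. The patent does not give table names, the partition column, or the compression format, so everything concrete here is an assumption: the helper `staging_to_partition_sql`, the tables `tmp_orders` and `orders`, and the partition column `dt` are hypothetical. The sketch further assumes that the temporary table was created with a JSON SerDe and that the target table was created in a compressed storage format, so that the insert itself produces compressed data.

```python
from typing import List


def staging_to_partition_sql(
    json_path: str,
    staging_table: str,
    target_table: str,
    partition_col: str,
    partition_value: str,
    columns: List[str],
) -> List[str]:
    """Return the two HiveQL statements for the load-then-insert step.

    Inputs are interpolated as-is, so they must come from trusted
    configuration, not from user input.
    """
    col_list = ", ".join(columns)
    # 1) Load the uncompressed JSON file into the temporary (staging) table.
    load_stmt = (
        f"LOAD DATA LOCAL INPATH '{json_path}' "
        f"OVERWRITE INTO TABLE {staging_table}"
    )
    # 2) Copy the rows into one partition of the compressed target table.
    insert_stmt = (
        f"INSERT OVERWRITE TABLE {target_table} "
        f"PARTITION ({partition_col}='{partition_value}') "
        f"SELECT {col_list} FROM {staging_table}"
    )
    return [load_stmt, insert_stmt]


statements = staging_to_partition_sql(
    "/data/sync/orders.json", "tmp_orders", "orders", "dt", "2021-01-29",
    ["order_id", "amount"],
)
```

Using `OVERWRITE` on the load keeps the temporary table small, since each run replaces the previous batch instead of accumulating uncompressed data.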

Still further, the process of preprocessing in the mongodb cluster and executing the program through the local server, so that data can be synchronized from the hive cluster to the mongodb cluster, includes the following steps:

step K2, calling a linux command to modify the csv file of the local server;

In step K1, before the exported csv file is stored in the local server, the column header needs to be handled: only the content is output, and the column names are not output.
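One way to export content without column names is to switch off header printing in the hive command-line client. The sketch below assumes the export is done with the `hive` CLI and its `hive.cli.print.header` setting, which the patent does not name; the helper `build_hive_export_argv`, the query, and the output path are hypothetical.

```python
import subprocess
from typing import List


def build_hive_export_argv(query: str) -> List[str]:
    """Build a hive CLI call that prints query rows without a header line."""
    return [
        "hive",
        "--hiveconf", "hive.cli.print.header=false",  # content only, no column names
        "-e", query,
    ]


def export_to_file(argv: List[str], out_path: str) -> None:
    """Run the command and write its standard output to a local file."""
    with open(out_path, "w", encoding="utf-8") as handle:
        subprocess.run(argv, stdout=handle, check=True)


# Hypothetical query, for illustration only.
hive_argv = build_hive_export_argv("SELECT order_id, amount FROM orders")
# To actually run it: export_to_file(hive_argv, "/data/sync/orders.csv")
```

Note that the hive CLI separates columns with tabs by default, so a later step would still have to convert the delimiter if a comma-separated file is required.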

In the above embodiment, in order to synchronize data from the hive cluster to the mongodb cluster, the csv file and the column header are exported from the hive cluster and stored in the local server, the csv file of the local server is modified by using a linux command, and a linux command is then called to add the fields and types corresponding to the mongodb cluster to the modified csv file. The client of the mongodb cluster then calls its mongoimport command to import the csv file that has undergone these modifications into the mongodb cluster, which completes the synchronization of data from the hive cluster to the mongodb cluster. 10 million records can be synchronized in less than 1 hour, which addresses the data transmission speed problem (faster than processing by a program).
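Adding the mongodb fields and types to the csv file can be pictured as prepending a typed header line. The patent performs this with linux commands it does not spell out; the Python sketch below is a stand-in for that step, not the patented command. It assumes the header follows the `name.type()` form that mongoimport accepts together with its `--columnsHaveTypes` option; the helper names and the sample fields are hypothetical.

```python
from typing import List, Tuple


def typed_header(fields: List[Tuple[str, str]]) -> str:
    """Render a header such as 'order_id.string(),amount.double()'."""
    return ",".join(f"{name}.{mongo_type}()" for name, mongo_type in fields)


def prepend_header(csv_text: str, header: str) -> str:
    """Put the header on the first line, ahead of the exported rows."""
    return header + "\n" + csv_text


# Sample fields and rows (invented for the example).
header = typed_header([("order_id", "string"), ("amount", "double")])
rows = "A001,12.5\nA002,7.0\n"
csv_with_header = prepend_header(rows, header)
```

Declaring the types in the header lets the import store `amount` as a number rather than leaving the choice to automatic type detection.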

According to the method, when the import command shipped with the client of the mongodb cluster is used, the following are specified as input: the connection IP, the port, the user name, the password, the database name, the collection name, the file type, that the first line holds the column names, that empty fields in csv and tsv files are ignored, the type of the file to import, and other such parameters. In addition, when data goes from the hive cluster to the mongodb cluster, commas, line breaks and the like need to be replaced with an empty value before the csv file is exported, which improves data accuracy. During the export, the field names of the csv file must be consistent with the field names in the hive cluster; otherwise the field values will not match. As for data accuracy, as long as the fields contain no commas, line feeds or other special characters, the data can be one hundred percent consistent; where special characters have been replaced, the fields differ only in those replaced characters.
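The parameters listed above map naturally onto mongoimport options, as the following sketch suggests. The mapping is an interpretation rather than something the patent states: connection IP and port to `--host` and `--port`, user name and password to `--username` and `--password`, database and collection to `--db` and `--collection`, file type to `--type`, first line as column names to `--headerline`, and ignoring empty fields to `--ignoreBlanks`. The helper `build_mongoimport_argv` and all sample values are hypothetical.

```python
from typing import List


def build_mongoimport_argv(
    host: str,
    port: int,
    username: str,
    password: str,
    db: str,
    collection: str,
    csv_path: str,
    columns_have_types: bool = False,
) -> List[str]:
    """Build the argument list for importing a csv file with mongoimport."""
    argv = [
        "mongoimport",
        "--host", host,
        "--port", str(port),
        "--username", username,
        "--password", password,
        "--db", db,
        "--collection", collection,
        "--type=csv",
        "--headerline",    # first line of the file holds the column names
        "--ignoreBlanks",  # skip empty fields in csv and tsv input
        "--file", csv_path,
    ]
    if columns_have_types:
        # Header entries then carry types, e.g. amount.double()
        argv.append("--columnsHaveTypes")
    return argv


# Hypothetical connection details, for illustration only.
import_argv = build_mongoimport_argv(
    "10.0.0.5", 27017, "sync_user", "secret", "pay", "orders",
    "/data/sync/orders.csv", columns_have_types=True,
)
# To actually run it: subprocess.run(import_argv, check=True)
```

In practice the password would be read from a protected configuration source instead of being passed on the command line, where it is visible in the process list.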

While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for synchronizing mongodb cluster data and hive cluster data mutually is characterized by comprising the following steps

2. The method for synchronizing the data of the mongodb cluster and the hive cluster with each other according to claim 1, wherein the preprocessing is performed in the hive cluster, and the local server executes the program, so that the data can be synchronized from the mongodb cluster to the hive cluster, and the method comprises the following steps:

3. The method of claim 2, wherein in step S2, the mongodb cluster calls its mongoexport command to export a json file.

4. The method for synchronizing mongodb cluster data and hive cluster data with each other according to claim 3, wherein commas or line breaks in character strings are replaced with null before json files are exported.

5. The method for the mutual synchronization of the mongodb cluster and the hive cluster data according to claim 1, wherein the preprocessing is performed in the mongodb cluster, and the local server executes the program, so that the data can be synchronized from the hive cluster to the mongodb cluster, and the method comprises the following steps:

step K2, calling a linux command to modify the csv file of the local server;

6. The method of claim 5, wherein in step K1, before storing the exported csv file in the local server, the column header is processed, and only the exported content is set without exporting the column name.

7. A synchronization system is characterized by comprising a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are sequentially connected;