CN112860792A - Method and system for synchronizing mongodb cluster data and hive cluster data - Google Patents

Method and system for synchronizing mongodb cluster data and hive cluster data

Info

Publication number
CN112860792A
Authority
CN
China
Prior art keywords
cluster
hive
mongodb
data
local server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110129698.7A
Other languages
Chinese (zh)
Inventor
李佳喜
刘跃红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinsheng Payment Service Co Ltd
Original Assignee
Yinsheng Payment Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yinsheng Payment Service Co Ltd filed Critical Yinsheng Payment Service Co Ltd
Priority to CN202110129698.7A priority Critical patent/CN112860792A/en
Publication of CN112860792A publication Critical patent/CN112860792A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/275Synchronous replication

Abstract

The invention discloses a method and a system for synchronizing data between a mongodb cluster and a hive cluster, belonging to the technical field of system data analysis. The system comprises a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are connected in sequence. Preprocessing is performed in the hive cluster and the local server executes a program so that data is synchronized from the mongodb cluster to the hive cluster; preprocessing is performed in the mongodb cluster and the local server executes the program so that data is synchronized from the hive cluster to the mongodb cluster.

Description

Method and system for synchronizing mongodb cluster data and hive cluster data
Technical Field
The invention relates to the technical field of system data analysis, in particular to a method and a system for synchronizing mongodb cluster data and hive cluster data.
Background
At present, many enterprises are growing in scale, with branch companies distributed across different cities for business management, and the parent company can advance its business processes only by analyzing data from its online systems.
However, business requirements call for multi-dimensional analysis of a large amount of data every day, with reports delivered to the business departments of the various subsidiaries, which places requirements on the accuracy and timeliness of the data. The tool previously used to synchronize the data took a long time, could not synchronize large amounts of data, and ran into memory problems, which seriously affected business processes.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for synchronizing data between a mongodb cluster and a hive cluster, and a corresponding synchronization system, which solve the problem of synchronizing data between the two clusters.
The technical scheme adopted by the invention to solve the technical problem is as follows: a method for synchronizing mongodb cluster data and hive cluster data with each other, the improvement comprising:
Preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;
preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.
As a further improvement of the above technical solution, the process of preprocessing in the hive cluster and executing the program through the local server, so that the data can be synchronized from the mongodb cluster to the hive cluster, includes the following steps:
step S1, installing a mongodb cluster client on one machine of the hive cluster;
step S2, exporting a json file from the mongodb cluster and storing the json file on the local server;
step S3, the hive cluster loading data in json file format, whereby the json file on the local server is loaded into a database of the hive cluster;
step S4, performing data compression processing on the database in the hive cluster.
As a further improvement of the above technical solution, in step S2, the mongodb cluster calls its mongoexport instruction to export the json file.
As a further improvement of the above technical solution, before the json file is exported, commas and line breaks within character strings are replaced with null.
As a further improvement of the above technical solution, the process of preprocessing in the mongodb cluster and executing the program through the local server, so that the data can be synchronized from the hive cluster to the mongodb cluster, includes the following steps:
step K1, exporting the csv file and column headers from the hive cluster, and storing the exported csv file on the local server;
step K2, calling a linux command to modify the csv file on the local server;
step K3, calling a linux command to add the fields and types corresponding to the mongodb cluster to the modified csv file on the local server;
step K4, the client of the mongodb cluster calling the mongoimport command to import the csv file processed in the preceding steps into the mongodb cluster.
As a further improvement of the above technical solution, in step K1, before the exported csv file is stored on the local server, the column header needs to be processed: the export is set to output only the content, without the column names.
A synchronization system comprises a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are sequentially connected;
preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;
preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.
The invention has the beneficial effects that: the invention solves the problem of mutual synchronization of a large amount of data between the mongodb cluster and the hive cluster, can accelerate the progress of the service, improves the efficiency and saves the cost.
Drawings
Fig. 1 is a structural frame diagram of the present invention.
Detailed Description
The invention is further illustrated with reference to the following figures and examples.
The conception, the specific structure, and the technical effects of the present invention will be described clearly and completely below in conjunction with the embodiments and the accompanying drawings, so that the objects, features, and effects of the present invention can be fully understood. The described embodiments are only a part of the embodiments of the present invention, not all of them; other embodiments that those skilled in the art can obtain from them without inventive effort all fall within the protection scope of the present invention. In addition, the connection relations referred to in this patent do not mean that the components are directly connected; rather, a better connection structure can be formed by adding or removing auxiliary connecting components according to the specific implementation. All technical features of the invention can be combined with one another provided they do not conflict.
Referring to fig. 1, the present invention discloses a method for synchronizing a mongodb cluster and a hive cluster with each other, which includes preprocessing in the hive cluster, executing a program through a local server so that data can be synchronized from the mongodb cluster to the hive cluster, and preprocessing in the mongodb cluster, executing the program through the local server so that data can be synchronized from the hive cluster to the mongodb cluster.
In the above embodiment, to synchronize data between the mongodb cluster and the hive cluster, both clusters are first preprocessed. Whether data moves from the mongodb cluster to the hive cluster or from the hive cluster to the mongodb cluster, the receiving side cannot use the exported data directly, so the data must first be processed by a program.
Further, the process of preprocessing in the hive cluster and executing the program through the local server so that the data can be synchronized from the mongodb cluster to the hive cluster comprises the following steps:
step S1, installing mongodb client on one machine of the hive cluster;
step S2, exporting a json file from the mongodb cluster and storing the json file on the local server;
step S3, the hive cluster loading data in json file format, whereby the json file on the local server is loaded into a database of the hive cluster;
step S4, performing data compression processing on the database in the hive cluster.
In step S2, the mongodb cluster calls its mongoexport instruction to export the json file. Commas and line breaks within character strings are replaced with null before the json file is exported.
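Step S2 can be sketched roughly as follows. The patent names only the export instruction and the comma/line-break replacement; the helper names, the choice to do the replacement in Python, and the host, database and path values are illustrative assumptions, not the patent's program.

```python
from typing import Any, List


def strip_delimiters(value: Any) -> Any:
    """Recursively remove commas and line breaks from every string in a document."""
    if isinstance(value, str):
        return value.replace(",", "").replace("\r", "").replace("\n", "")
    if isinstance(value, dict):
        return {key: strip_delimiters(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_delimiters(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


def build_mongoexport_cmd(host: str, port: int, db: str,
                          collection: str, out_path: str) -> List[str]:
    """Build an argument list for mongoexport that writes a json file.

    All parameter values are hypothetical placeholders supplied by the caller.
    """
    return [
        "mongoexport",
        "--host", host,
        "--port", str(port),
        "--db", db,
        "--collection", collection,
        "--type", "json",
        "--out", out_path,
    ]
```

The argument list could be handed to `subprocess.run` on the machine where the mongodb client was installed in step S1; building it as a list avoids shell quoting issues.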
In the above embodiment, in order to synchronize a large amount of data from the mongodb cluster to the hive cluster, the invention first installs the mongodb client on a machine of the hive cluster, then exports the json file from the mongodb cluster and stores it on the local server. The hive cluster then loads the data in json file format, loading the json file on the local server into the database of the hive cluster, which completes the synchronization of data from the mongodb cluster to the hive cluster. 10 million records can be synchronized in less than 1 hour, which addresses the data transfer speed problem (faster than processing by a program).
In step S3, the json file on the local server is first loaded into a temporary table in the database of the hive cluster. Because the data is loaded directly without compression, it occupies a large amount of disk space, so keeping much data in the temporary table is not cost-effective. The data of the temporary table is therefore inserted into a partition of the target table of the database using hive sql, and the target table is compressed, which saves a large amount of space.
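The temporary-table-then-compressed-partition flow of steps S3 and S4 might look like the following sketch, which only generates HiveQL text. The patent does not name a SerDe, storage format or compression codec; the JsonSerDe class, ORC with SNAPPY, and all table, column and partition names here are assumptions chosen for illustration.

```python
from typing import Dict, List


def build_hive_load_statements(json_path: str, tmp_table: str, target_table: str,
                               columns: Dict[str, str], partition_col: str,
                               partition_value: str) -> List[str]:
    """Return HiveQL that loads a local json file and compacts it into a partition."""
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns.items())
    col_names = ", ".join(columns)
    return [
        # Uncompressed staging table that reads one json document per line (assumed SerDe).
        f"CREATE TABLE IF NOT EXISTS {tmp_table} ({col_defs}) "
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE",
        # Compressed, partitioned target table (ORC + SNAPPY is an assumption).
        f"CREATE TABLE IF NOT EXISTS {target_table} ({col_defs}) "
        f"PARTITIONED BY ({partition_col} STRING) "
        "STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY')",
        # Step S3: load the json file from the local server into the staging table.
        f"LOAD DATA LOCAL INPATH '{json_path}' OVERWRITE INTO TABLE {tmp_table}",
        # Step S4: rewrite the rows into one partition of the compressed target table.
        f"INSERT OVERWRITE TABLE {target_table} "
        f"PARTITION ({partition_col}='{partition_value}') "
        f"SELECT {col_names} FROM {tmp_table}",
    ]
```

Overwriting the staging table on each load keeps it small, in line with the observation that uncompressed data should not accumulate there.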
Still further, the pre-processing in the mongodb cluster, and the program executed by the local server, so that the data can be synchronized from the hive cluster to the mongodb cluster, includes the following steps:
step K1, exporting csv files and column headers in the hive cluster, and storing the exported csv files in a local server;
step K2, calling a linux command to modify the csv file of the local server;
step K3, calling linux commands to add fields and types corresponding to the mongodb cluster to the modified csv file in the local server;
step K4, the client of the mongodb cluster calling the mongoimport command to import the csv file processed in the preceding steps into the mongodb cluster.
In step K1, before the exported csv file is stored on the local server, the column header needs to be processed: the export is set to output only the content, without the column names.
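One way step K1 could be realized is sketched below. It assumes the export is done through the hive command-line client, whose `hive.cli.print.header` setting controls whether column names are printed and whose output is tab-separated; the patent does not say which export mechanism is used, and the function names and query are hypothetical.

```python
import csv
import io
from typing import List


def build_hive_export_cmd(query: str) -> List[str]:
    """Build a hive CLI call that prints query results without a header row."""
    return ["hive", "-e", f"SET hive.cli.print.header=false; {query}"]


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated hive CLI output into comma-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tsv_text.splitlines():
        writer.writerow(line.split("\t"))
    return out.getvalue()
```

The captured standard output of the hive command would be passed through `tsv_to_csv` and written to a file on the local server.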
In the above embodiment, in order to synchronize data from the hive cluster to the mongodb cluster, the csv file and the column header are exported from the hive cluster and stored on the local server, the csv file on the local server is modified using a linux command, and a linux command is then called to add the fields and types corresponding to the mongodb cluster to the modified csv file. The client of the mongodb cluster then calls its mongoimport command to import the processed csv file into the mongodb cluster, which completes the synchronization of data from the hive cluster to the mongodb cluster. 10 million records can be synchronized in less than 1 hour, which addresses the data transfer speed problem (faster than processing by a program).
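Step K3 adds the mongodb field names and types to the header-less csv file. The patent does this with linux commands; the sketch below stands in for them with a small Python function and assumes the typed-header form `name.type()` that mongoimport accepts together with its `--columnsHaveTypes` option. The field names and types shown in the usage are hypothetical.

```python
from typing import List, Tuple


def add_typed_header(csv_body: str, fields: List[Tuple[str, str]]) -> str:
    """Prepend a typed header line such as 'id.int32(),name.string()' to csv text."""
    header = ",".join(f"{name}.{mongo_type}()" for name, mongo_type in fields)
    return header + "\n" + csv_body


# Example: two hypothetical columns exported from hive.
typed_csv = add_typed_header("1,alice\n2,bob\n", [("id", "int32"), ("name", "string")])
```

The order of `fields` has to match the column order of the hive export, since the header is the only place the csv file records which value belongs to which field.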
In this method, when the command that ships with the mongodb cluster client is invoked, the following parameters are specified: the connection IP, port, user name, password, database name, collection name and file type, the option that the first line holds the column names, the option that empty fields in csv and tsv exports are ignored, and the type of the file to import. In addition, when data moves from the hive cluster to the mongodb cluster, commas, line breaks and similar characters need to be replaced with null before the csv file is exported, which improves data accuracy. During the export, the field names of the csv file must be consistent with the field names in the hive cluster; otherwise the field values will not match. Regarding data accuracy, as long as a field contains no commas, line breaks or other special characters, the synchronized data is 100 percent consistent; if such special characters do occur, they can be handled by replacement.
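The parameters listed above map naturally onto mongoimport options, as in the following sketch. The function name and all argument values are hypothetical; the option names are standard mongoimport flags, with `--headerline` treating the first line as column names and `--ignoreBlanks` skipping empty fields in csv and tsv input.

```python
from typing import List


def build_mongoimport_cmd(host: str, port: int, username: str, password: str,
                          db: str, collection: str, csv_path: str,
                          typed_header: bool = False) -> List[str]:
    """Build an argument list for importing a csv file with mongoimport."""
    cmd = [
        "mongoimport",
        "--host", host,
        "--port", str(port),
        "--username", username,
        "--password", password,
        "--db", db,
        "--collection", collection,
        "--type", "csv",
        "--headerline",    # first line of the file holds the column names
        "--ignoreBlanks",  # do not create fields for empty csv values
        "--file", csv_path,
    ]
    if typed_header:
        # Needed only when the header carries types, e.g. 'id.int32()'.
        cmd.append("--columnsHaveTypes")
    return cmd
```

Passing the password on the command line exposes it to other users of the machine through the process list, so a real deployment would likely prefer a config file or an interactive prompt.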
A synchronization system comprises a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are sequentially connected;
preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;
preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.
The invention has the beneficial effects that: the invention solves the problem of mutual synchronization of a large amount of data between the mongodb cluster and the hive cluster, can accelerate the progress of the service, improves the efficiency and saves the cost.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method for synchronizing mongodb cluster data and hive cluster data with each other, characterized by comprising the following steps:
Preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;
preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.
2. The method for synchronizing the data of the mongodb cluster and the hive cluster with each other according to claim 1, wherein the preprocessing is performed in the hive cluster, and the local server executes the program, so that the data can be synchronized from the mongodb cluster to the hive cluster, and the method comprises the following steps:
step S1, installing a mongodb cluster client on one machine of the hive cluster;
step S2, exporting a json file from the mongodb cluster and storing the json file on the local server;
step S3, the hive cluster loading data in json file format, whereby the json file on the local server is loaded into a database of the hive cluster;
step S4, performing data compression processing on the database in the hive cluster.
3. The method of claim 2, wherein in step S2, the mongodb cluster calls its mongoexport instruction to export the json file.
4. The method for synchronizing mongodb cluster data and hive cluster data with each other according to claim 3, wherein commas and line breaks within character strings are replaced with null before the json file is exported.
5. The method for the mutual synchronization of the mongodb cluster and the hive cluster data according to claim 1, wherein the preprocessing is performed in the mongodb cluster, and the local server executes the program, so that the data can be synchronized from the hive cluster to the mongodb cluster, and the method comprises the following steps:
step K1, exporting csv files and column headers in the hive cluster, and storing the exported csv files in a local server;
step K2, calling a linux command to modify the csv file of the local server;
step K3, calling linux commands to add fields and types corresponding to the mongodb cluster to the modified csv file in the local server;
step K4, the client of the mongodb cluster calling the mongoimport command to import the csv file processed in the preceding steps into the mongodb cluster.
6. The method of claim 5, wherein in step K1, before the exported csv file is stored on the local server, the column header is processed so that only the content is exported, without the column names.
7. A synchronization system is characterized by comprising a hive cluster, a mongodb cluster and a local server, wherein the hive cluster, the local server and the mongodb cluster are sequentially connected;
preprocessing is carried out in the hive cluster, and a program is executed through a local server, so that data can be synchronized from the mongodb cluster to the hive cluster;
preprocessing is performed in the mongodb cluster, and a program is executed through a local server, so that data can be synchronized from the hive cluster to the mongodb cluster.
CN202110129698.7A 2021-01-29 2021-01-29 Method and system for synchronizing mongodb cluster data and hive cluster data Pending CN112860792A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110129698.7A CN112860792A (en) 2021-01-29 2021-01-29 Method and system for synchronizing mongodb cluster data and hive cluster data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110129698.7A CN112860792A (en) 2021-01-29 2021-01-29 Method and system for synchronizing mongodb cluster data and hive cluster data

Publications (1)

Publication Number Publication Date
CN112860792A true CN112860792A (en) 2021-05-28

Family

ID=75987084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110129698.7A Pending CN112860792A (en) 2021-01-29 2021-01-29 Method and system for synchronizing mongodb cluster data and hive cluster data

Country Status (1)

Country Link
CN (1) CN112860792A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222122A (en) * 2019-07-26 2019-09-10 深圳市元征科技股份有限公司 A kind of method of data synchronization and relevant device of MongoDB
CN110674135A (en) * 2019-09-17 2020-01-10 上海易点时空网络有限公司 MongoDB-based answer data statistical method and device and storage medium
CN110795418A (en) * 2019-09-23 2020-02-14 紫光云(南京)数字技术有限公司 Json-based data extraction method from mongoDB to mysql
CN110674113A (en) * 2019-09-24 2020-01-10 咪咕音乐有限公司 One-key migration method and device for data, electronic equipment and storage medium
CN110704400A (en) * 2019-09-29 2020-01-17 上海易点时空网络有限公司 Real-time data synchronization method and device and server

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GEZAILUSHANG: "Hive导出数据到本地CSV", 《CSDN: HTTPS://BLOG.CSDN.NET/GEZAILUSHANG/ARTICLE/DETAILS/83583621》 *
GUICAIZHOU: "Mongodb数据同步到Hive", 《CSDN:HTTPS://BLOG.CSDN.NET/GUICAIZHOU/ARTICLE/DETAILS/83861633》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210528)