CN114238479A - Data import method, device, equipment and medium based on Yellowbrick database - Google Patents


Info

Publication number
CN114238479A
CN114238479A
Authority
CN
China
Prior art keywords
data
target
database
source
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111559424.8A
Other languages
Chinese (zh)
Inventor
施健健
Current Assignee
Ping An Pension Insurance Corp
Original Assignee
Ping An Pension Insurance Corp
Priority date
Filing date
Publication date
Application filed by Ping An Pension Insurance Corp filed Critical Ping An Pension Insurance Corp
Priority claimed from application CN202111559424.8A
Publication of CN114238479A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to big data technology, and provides a data import method, device, equipment and medium based on a Yellowbrick database. The server acts as a transfer station between the source database and the target database: the source data is converted in the server to obtain intermediate data, and the intermediate data is then transmitted to the target database. Because the server only needs to read the table information from the management node of the target database once before loading the data into the target database in batches, the resources of the management node are saved and the target database is prevented from crashing.

Description

Data import method, device, equipment and medium based on Yellowbrick database
Technical Field
The invention relates to the technical field of big data processing, and in particular to a data import method and apparatus based on a Yellowbrick database, a computer device, and a storage medium.
Background
At present, when source data is imported from a source database (generally a relational database) into a target database, ETL tools such as Kettle may be used for synchronization (ETL is an abbreviation of Extract-Transform-Load, describing the process of extracting, transforming, and loading data from a source end to a destination end). Each piece of data synchronized by such an ETL tool first passes through a management node in the target database and only then lands on a data node; once the data volume is large, the resources of the management node are occupied and downtime becomes likely.
Disclosure of Invention
The embodiments of the invention provide a data import method, apparatus, computer device and storage medium based on a Yellowbrick database, and aim to solve the prior-art problem that, when source data is synchronized from a source database to a target database using an ETL (Extract-Transform-Load) tool, each piece of data passes through a management node in the target database before landing on a data node, so that once the data volume is large, the management node's resources are occupied and downtime is likely.
In a first aspect, an embodiment of the present invention provides a data import method based on a Yellowbrick database, including:
responding to a data synchronization instruction, acquiring a first synchronization script corresponding to the data synchronization instruction, acquiring source data from a source database according to the first synchronization script, and converting the source data into intermediate data through the first synchronization script and storing the intermediate data;
importing the intermediate data into a target Hadoop client determined according to Hadoop cluster information; and
obtaining pre-stored export plug-in data, loading the export plug-in data through the target Hadoop client, and importing the intermediate data into a target database through the loaded export plug-in data.
In a second aspect, an embodiment of the present invention provides a data import apparatus based on a Yellowbrick database, including:
the synchronous script generating unit is used for responding to a data synchronous instruction, acquiring a first synchronous script corresponding to the data synchronous instruction, acquiring source data from a source database according to the first synchronous script, and converting the source data into intermediate data through the first synchronous script and storing the intermediate data;
the intermediate data import unit is used for importing the intermediate data into a target Hadoop client determined according to Hadoop cluster information; and
the target library importing unit, used for acquiring pre-stored export plug-in data, loading the export plug-in data through the target Hadoop client, and importing the intermediate data into a target database through the loaded export plug-in data.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the above-mentioned data import method based on the Yellowbrick database according to the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the Yellowbrick database-based data import method according to the first aspect.
The embodiments of the invention provide a data import method, apparatus, computer device and storage medium based on a Yellowbrick database. The server acts as a transfer station between the source database and the target database: the source data is converted in the server to obtain intermediate data, and the intermediate data is then transmitted to the target database. Because the server only needs to read the table information from the management node of the target database once before loading the data into the target database in batches, the resources of the management node are saved and the target database is prevented from crashing.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is an application scenario diagram of a data import method based on a Yellowbrick database according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a data import method based on a Yellowbrick database according to an embodiment of the present invention;
fig. 3 is a schematic block diagram of a data import apparatus based on a Yellowbrick database according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of a data import method based on a Yellowbrick database according to an embodiment of the present invention, and fig. 2 is a schematic flowchart of that method. The data import method based on the Yellowbrick database is applied to a server and executed by application software installed in the server.
As shown in fig. 2, the method includes steps S101 to S103.
S101, responding to a data synchronization instruction, obtaining a first synchronization script corresponding to the data synchronization instruction, obtaining source data from a source database according to the first synchronization script, and converting the source data into intermediate data through the first synchronization script and storing the intermediate data.
To make the technical solution of the present application clearer, the terminals involved are described in detail below. The technical solution is described with the server as the execution subject.
A source database, which may also be understood as a core database, is a relational database. Because of the limited computing performance of relational databases, the data in the source database is generally better suited to storage than to computation. Therefore, to improve data utilization efficiency, the source data needs to be synchronized from the source database to the target database.
The server can be understood as an intermediate server cluster. It first acquires the source data from the source database, converts and processes it to obtain intermediate data, and then imports the intermediate data into the target database through an import tool, thereby avoiding the heavy occupation of the target database's management-node resources that directly importing the source data from the source database into the target database would cause.
And the target database is used for receiving and storing the intermediate data transmitted from the server, and then performing various computing processes by using the stored data to fully utilize the computing capacity of the target database.
In order to extract the source data from the source database to the server more quickly, the server may send a pre-stored first synchronization script for extracting the data to the source database; after the first synchronization script is run against the source database, the source data is converted into intermediate data and stored in the server. Specifically, the first synchronization script is a Sqoop script. In a specific implementation, the source database is generally a relational database such as Oracle or MySQL, the target database is the Yellowbrick database of the present application, and the server is a Hadoop cluster. The source data can be extracted from the source database to the server cluster through the Sqoop script. The Hadoop cluster, acting as the server, thus serves as the intermediate receiving and processing object of the source data, and the data it extracts and processes is imported into the target database, which ensures data import efficiency while reducing the consumption of management-node resources in the target database.
The core design of the Hadoop cluster comprises HDFS and MapReduce. HDFS is the Hadoop distributed file system and provides storage for massive data. MapReduce can be understood as a programming model with Map and Reduce capabilities that provides computation over massive data.
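The Map and Reduce phases mentioned above can be illustrated with a minimal sketch; this is a stand-in for the programming model in plain Python, not Hadoop code, and the sample records are hypothetical:

```python
from functools import reduce

# Map phase: emit a (key, 1) pair for every input record.
records = ["t1", "t2", "t1", "t1"]
mapped = [(r, 1) for r in records]

# Reduce phase: sum the counts per key.
def reducer(acc, pair):
    key, n = pair
    acc[key] = acc.get(key, 0) + n
    return acc

counts = reduce(reducer, mapped, {})
```

In the real cluster, the Map tasks run in parallel over the data blocks stored in HDFS, and the framework shuffles the pairs to the Reduce tasks by key.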
In the source database, the source data is stored in the form of a data table, and in the server, the source data is stored in the form of a data text. Moreover, since a Hadoop Distributed File System (HDFS) is adopted in the server, data needs to be stored in a data block manner.
In an embodiment, the acquiring a first synchronization script corresponding to the data synchronization instruction in step S101 includes:
and filling the source database connection address, the source database user name and the source database target data table name corresponding to the data synchronization instruction into a pre-stored script template to obtain a first synchronization script.
In the present embodiment, for example, the script templates stored in advance are as follows:
$ sqoop import \
  --connect XX \
  --username XX \
  --table XX
Then, after the source database connection address, user name and target data table name are obtained, the XX after --connect is replaced by the source database connection address (e.g. jdbc:oracle://localhost/userdb), the XX after --username is replaced by the source database user name (e.g. root), and the XX after --table is replaced by the name of the target data table in the source database (e.g. table1, optionally followed by -m 1 to run a single mapper). Through this processing the first synchronization script is obtained quickly.
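The template filling described above can be sketched as follows. This is an illustrative sketch, not the patent's actual implementation; the template string and function name are hypothetical:

```python
# Pre-stored Sqoop script template with XX placeholders, mirroring the
# template shown above.
SCRIPT_TEMPLATE = (
    "sqoop import \\\n"
    "  --connect {connect} \\\n"
    "  --username {username} \\\n"
    "  --table {table}"
)

def build_sync_script(connect: str, username: str, table: str) -> str:
    """Replace each placeholder with the values carried by the data
    synchronization instruction to obtain the first synchronization script."""
    return SCRIPT_TEMPLATE.format(connect=connect, username=username, table=table)
```

For example, `build_sync_script("jdbc:oracle://localhost/userdb", "root", "table1")` yields a ready-to-run Sqoop import command.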
In one embodiment, converting and storing the source data into intermediate data by the first synchronization script includes:
converting the source data into intermediate data by the first synchronization script;
acquiring a source data volume value of the source data, and if the source data volume value is determined not to exceed a preset data volume threshold value, correspondingly creating a target data block set according to the source data volume value and a preset first partition strategy;
storing the intermediate data to the set of target data blocks.
In this embodiment, storing the intermediate data in the server in data blocks shields the concept of a file and simplifies the design of the storage system. Data blocks are also well suited to backup, which improves fault tolerance and availability. If the block size is set too small, an ordinary file is divided into many data blocks and many block addresses must be accessed during a read, which is inefficient and consumes the NameNode's memory heavily; if the block size is set too large, parallelism suffers, and if the system must reload data on restart, the larger the block, the longer recovery takes. Therefore, to realize block-type storage of the data more quickly, the source data volume value of the source data (e.g. 2 GB) is obtained first; given that the preset data volume threshold is 5 GB, a preset first partition policy may be invoked to partition the intermediate data into a plurality of packets (e.g. the first partition policy divides the intermediate data evenly into 10 packets), and each packet is stored as one data block. In this way, when big data processing is performed based on HDFS, 10 parallel threads can process the computing tasks, realizing distributed computing and improving processing efficiency.
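A minimal sketch of the first partition policy follows, using the 5 GB threshold and the 10-block example from the text; the helper function and its signature are hypothetical, not the claimed implementation:

```python
def create_target_blocks(source_size_gb: float,
                         threshold_gb: float = 5.0,
                         num_blocks: int = 10):
    """If the source data volume does not exceed the preset threshold,
    create a set of num_blocks target data blocks of equal capacity
    (represented here simply by per-block size bookkeeping)."""
    if source_size_gb > threshold_gb:
        return None  # a different policy would apply; not covered by this sketch
    per_block = source_size_gb / num_blocks
    return [per_block] * num_blocks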
In one embodiment, the storing the intermediate data to the set of target data blocks includes:
dividing the intermediate data according to the total number of the target data block sets to obtain split intermediate data sets, and storing each split intermediate data included in the split intermediate data sets into corresponding target data blocks in the target data block sets.
In this embodiment, continuing the example above, the source data volume value of the source data (e.g. 2 GB) is determined, and the preset data volume threshold is known to be 5 GB. Since no data volume is lost when the source data is converted into intermediate data by the first synchronization script, the data volume value of the intermediate data can still be considered equal to that of the source data. Because the total number of target data blocks in the set is already known, the intermediate data can be divided according to that total number to obtain split intermediate data sets, and each split intermediate datum is stored into its corresponding target data block, so the intermediate data is split and stored rapidly.
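The division by the total number of target data blocks can be sketched as below. The patent does not prescribe a split algorithm, so the round-robin assignment here is an assumption for illustration:

```python
def split_intermediate(data: list, total_blocks: int) -> list:
    """Divide the intermediate data according to the total number of target
    data blocks, assigning records round-robin so each split lands in its
    corresponding target data block."""
    blocks = [[] for _ in range(total_blocks)]
    for i, row in enumerate(data):
        blocks[i % total_blocks].append(row)
    return blocks
```

Every record ends up in exactly one block, and the blocks differ in size by at most one record, which keeps the later parallel processing balanced.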
And S102, importing the intermediate data into a target Hadoop client determined according to the Hadoop cluster information.
In this embodiment, since the server is a Hadoop cluster, it necessarily includes the Hadoop distributed file system (HDFS). HDFS adopts a Master-Slave structure and mainly comprises a NameNode, a Secondary NameNode and DataNodes. The NameNode manages the HDFS namespace, stores the metadata of the data-block mapping information, and maps files to data blocks. The Secondary NameNode assists the NameNode, shares its workload, and can help recover the NameNode in an emergency. The DataNodes are the slave nodes; they actually store the data, perform reads and writes of data blocks, and report storage information to the NameNode. By obtaining the Hadoop cluster information, the server can therefore know which Hadoop client is the NameNode, which Hadoop clients are Secondary NameNodes, and which Hadoop clients are DataNodes.
After the information about the NameNode, Secondary NameNode and DataNodes is acquired from the Hadoop cluster information, either any Hadoop client among the DataNodes can be selected as the target Hadoop client to carry out the subsequent data export work, or the Hadoop clients on which the target data blocks of the previous intermediate data are located can be selected as the target Hadoop clients for that work.
In an embodiment, as a first specific embodiment of step S102, step S102 includes:
and obtaining slave node information according to the Hadoop cluster information, and randomly selecting one slave node from the slave node information as a target Hadoop client.
In this embodiment, as a first specific way of selecting the target Hadoop client, one slave node may be directly and randomly selected from the plurality of slave nodes as the target Hadoop client, without considering on which slave nodes the target data block set of the previous intermediate data is distributed. The data of each target data block in the set is then gathered at the target Hadoop client, which exports the intermediate data to the target database. In this way the target Hadoop client that undertakes the data export work can be determined quickly.
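The first embodiment's selection step amounts to a single random draw over the slave-node list; a hedged sketch, with hypothetical node names and an optional seed for reproducibility:

```python
import random

def pick_target_client(slave_nodes: list, seed=None) -> str:
    """Randomly select one slave node from the Hadoop cluster's slave-node
    information as the target Hadoop client (first specific embodiment)."""
    rng = random.Random(seed)
    return rng.choice(slave_nodes)
```

The returned node then becomes the gathering point for all target data blocks before export.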
In an embodiment, as a second specific embodiment of step S102, step S102 includes:
and acquiring slave node information according to the Hadoop cluster information, acquiring slave nodes corresponding to all target data blocks in the target data block set respectively, and taking the slave nodes corresponding to all the target data blocks as target Hadoop clients.
In this embodiment, as a second specific way of selecting the target Hadoop client, it is considered on which slave nodes the target data block sets of the previous intermediate data are distributed; all the slave nodes on which the target data blocks reside are then used as target Hadoop clients, which export the intermediate data to the target database. In this way the target data blocks do not need to be gathered on a single slave node, and the target Hadoop clients that undertake the data export work can still be determined quickly.
When the target Hadoop client is determined according to the Hadoop cluster information, the intermediate data needs to be formally concentrated at the target Hadoop client; the target data blocks therefore serve as the concentration points from which the data is sent.
In one embodiment, step S102 further includes:
and importing the source data or target data block set address of the intermediate data into the target Hadoop client.
In this embodiment, for example, one slave node may be selected in step S102 as the target Hadoop client, and the data of each target data block in the set corresponding to the intermediate data is then gathered at that client to achieve formal concentration. Alternatively, a plurality of slave nodes may be selected as target Hadoop clients, namely the slave nodes over which the previous target data block set is distributed; when a data export task is known to exist in the server, these target Hadoop clients establish pre-connections in advance based on their respective target data block set addresses to notify one another that their target data blocks are to be exported to the target database, which likewise achieves formal concentration.
S103, pre-stored export plug-in data is obtained, the export plug-in data is loaded through the target Hadoop client, and the intermediate data is imported into a target database through the loaded export plug-in data.
In this embodiment, the data in each target data block of the set exists in the form of data text, with the texts separated by delimiters; that is, the data to be imported into the target database from the target Hadoop client is imported as text data. When the target database is determined to be a Yellowbrick database, the corresponding export plug-in data can be called (specifically, the plug-in data corresponding to ybtools), and the intermediate data is then imported into the target database by the corresponding export plug-in within that plug-in data (specifically, the ybload plug-in, which provides the function of importing data from HDFS into the Yellowbrick database). When the intermediate data is imported into the target database based on the export plug-in data, the table information only needs to be read once from the management node of the target database before the data is loaded into the target database in batches, saving the management node's resources.
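The loading pattern this enables, one metadata read followed by batched loads, can be sketched abstractly as follows. The callables are hypothetical stand-ins for ybtools/ybload behaviour, not their real API:

```python
def batch_load(rows, read_table_info, load_batch, batch_size=1000):
    """Read the table information from the management node exactly once,
    then load the data into the target database in batches, so each row
    no longer passes through the management node individually."""
    table_info = read_table_info()  # single management-node read
    batches_sent = 0
    for start in range(0, len(rows), batch_size):
        load_batch(table_info, rows[start:start + batch_size])
        batches_sent += 1
    return batches_sent
```

Compared with the per-row ETL path described in the background, the management node is consulted once per table instead of once per record, which is what prevents its resources from being exhausted.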
The embodiment of the application can acquire and process data in a related source database, server or target database based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The method thus uses the server as a transfer station between the source database and the target database: the source data is converted in the server to obtain intermediate data, which is then transmitted to the target database. The server only needs to read the table information from the management node of the target database once before loading the data in batches, so the management node's resources are saved and the target database is prevented from crashing.
An embodiment of the present invention further provides a data import apparatus based on a Yellowbrick database, configured to execute any embodiment of the foregoing data import method based on the Yellowbrick database. Specifically, referring to fig. 3, fig. 3 is a schematic block diagram of a data import apparatus 100 based on a Yellowbrick database according to an embodiment of the present invention.
As shown in fig. 3, the data import apparatus 100 based on the Yellowbrick database includes a synchronization script generating unit 101, an intermediate data importing unit 102, and a target library importing unit 103.
The synchronization script generating unit 101 is configured to, in response to a data synchronization instruction, acquire a first synchronization script corresponding to the data synchronization instruction, acquire source data from a source database according to the first synchronization script, and convert the source data into intermediate data through the first synchronization script and store the intermediate data.
In this embodiment, in order to extract the source data from the source database to the server more quickly, the server may send a pre-stored first synchronization script for extracting the data to the source database; after the first synchronization script is run against the source database, the source data is converted into intermediate data and stored in the server. Specifically, the first synchronization script is a Sqoop script. In a specific implementation, the source database is generally a relational database such as Oracle or MySQL, the target database is the Yellowbrick database of the present application, and the server is a Hadoop cluster. The source data can be extracted from the source database to the server cluster through the Sqoop script. The Hadoop cluster, acting as the server, thus serves as the intermediate receiving and processing object of the source data, and the data it extracts and processes is imported into the target database, which ensures data import efficiency while reducing the consumption of management-node resources in the target database.
The core design of the Hadoop cluster comprises an HDFS and a MapReduce, wherein the HDFS is a Hadoop distributed file system and is used for providing storage for massive data. MapReduce may be understood to be a programming model and have the capability of Map and Reduce for providing computation for massive amounts of data.
In the source database, the source data is stored in the form of a data table, and in the server, the source data is stored in the form of a data text. Moreover, since a Hadoop Distributed File System (HDFS) is adopted in the server, data needs to be stored in a data block manner.
In an embodiment, the synchronization script generating unit 101 is specifically configured to:
and filling the source database connection address, the source database user name and the source database target data table name corresponding to the data synchronization instruction into a pre-stored script template to obtain a first synchronization script.
In the present embodiment, for example, the script templates stored in advance are as follows:
$ sqoop import \
  --connect XX \
  --username XX \
  --table XX
Then, after the source database connection address, user name and target data table name are obtained, the XX after --connect is replaced by the source database connection address (e.g. jdbc:oracle://localhost/userdb), the XX after --username is replaced by the source database user name (e.g. root), and the XX after --table is replaced by the name of the target data table in the source database (e.g. table1, optionally followed by -m 1 to run a single mapper). Through this processing the first synchronization script is obtained quickly.
In an embodiment, the synchronization script generating unit 101 is further specifically configured to:
converting the source data into intermediate data by the first synchronization script;
acquiring a source data volume value of the source data, and if the source data volume value is determined not to exceed a preset data volume threshold value, correspondingly creating a target data block set according to the source data volume value and a preset first partition strategy;
storing the intermediate data to the set of target data blocks.
In this embodiment, the intermediate data is stored in the server in the form of data blocks, which shields the concept of files and simplifies the design of the storage system. Data blocks are also well suited to data backup, improving fault tolerance and data availability. If the data block size is set too small, an ordinary file is divided into many data blocks, and many block addresses must be accessed when reading it, which is inefficient and consumes NameNode memory heavily; if the data block size is set too large, parallelism suffers, and if the system needs to reload data after a restart, the larger the block, the longer recovery takes. Therefore, to achieve block-based storage of the data more quickly, the source data volume value of the source data (e.g., 2 GB) is first obtained; given that the preset data volume threshold is 5 GB, a preset first partition policy may be invoked to divide the intermediate data into a plurality of packets (e.g., the first partition policy divides the intermediate data evenly into 10 packets), and each packet is stored as one data block. In this way, when big data processing is performed based on HDFS, ten parallel threads can run the computing tasks, realizing distributed computing and improving processing efficiency.
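The creation of the target data block set described above can be sketched as follows. The threshold check and the fixed block count of 10 come from the example in the text; the data structure for a "block" is an assumption for illustration.

```python
def create_target_block_set(source_size_gb: float,
                            threshold_gb: float = 5.0,
                            num_blocks: int = 10):
    """First partition policy (illustrative): if the source data volume does not
    exceed the preset threshold, create an evenly sized set of target data blocks."""
    if source_size_gb > threshold_gb:
        raise ValueError("data volume exceeds threshold; a different policy applies")
    per_block_gb = source_size_gb / num_blocks
    return [{"block_id": i, "capacity_gb": per_block_gb} for i in range(num_blocks)]

# Example from the text: 2 GB of source data against a 5 GB threshold.
blocks = create_target_block_set(2.0)
```

Each of the 10 resulting blocks can then back one parallel processing thread, which is the distributed-computing benefit the paragraph describes.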
In one embodiment, the storing the intermediate data to the set of target data blocks includes:
dividing the intermediate data according to the total number of the target data block sets to obtain split intermediate data sets, and storing each split intermediate data included in the split intermediate data sets into corresponding target data blocks in the target data block sets.
In this embodiment, still referring to the above example, suppose the source data volume value of the source data is 2 GB and the preset data volume threshold is 5 GB. Since no data volume is lost when the first synchronization script converts the source data into intermediate data, the data volume of the intermediate data can still be regarded as equal to the source data volume. Because the total number of target data blocks in the target data block set is already known, the intermediate data can be divided according to that total number to obtain a split intermediate data set, and each piece of split intermediate data in the set can be stored in the corresponding target data block, so that the intermediate data is split and stored quickly.
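The split-and-store step can be sketched as below. The round-robin assignment is one possible division policy, assumed here for illustration; the patent only requires dividing by the total number of target data blocks.

```python
def split_intermediate_data(rows, total_blocks: int):
    """Divide the intermediate data rows across total_blocks target data blocks."""
    buckets = [[] for _ in range(total_blocks)]
    for i, row in enumerate(rows):
        buckets[i % total_blocks].append(row)  # round-robin: row i -> block i mod N
    return buckets

# 25 illustrative rows split over the 10 target data blocks from the earlier example.
rows = list(range(25))
buckets = split_intermediate_data(rows, 10)
```

Every row lands in exactly one bucket, so the split intermediate data set covers the intermediate data with no loss, matching the "no data volume is lost" property assumed in the text.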
And the intermediate data importing unit 102 is configured to import the intermediate data to a target Hadoop client determined according to the Hadoop cluster information.
In this embodiment, if the server is regarded as a Hadoop cluster, it necessarily includes a Hadoop Distributed File System (HDFS). HDFS adopts a Master/Slave architecture and mainly comprises a NameNode, a Secondary NameNode and DataNodes. The NameNode manages the HDFS namespace and stores metadata such as the mapping from files to data blocks; the Secondary NameNode assists the NameNode, shares part of its work, and can help recover the NameNode in an emergency; a DataNode is a Slave node that actually stores the data, performs reads and writes of data blocks, and reports storage information to the NameNode. Therefore, by acquiring the Hadoop cluster information, the server can know which Hadoop client is the NameNode, which is the Secondary NameNode, and which are the DataNodes.
After the NameNode, Secondary NameNode and DataNode information is acquired from the Hadoop cluster information, either any Hadoop client among the DataNodes may be selected as the target Hadoop client to carry out the subsequent data export work, or the Hadoop clients on which the target data blocks of the preceding intermediate data are located may be selected as the target Hadoop clients for that work.
In an embodiment, as a first specific embodiment of the intermediate data importing unit 102, the intermediate data importing unit 102 is specifically configured to:
and obtaining slave node information according to the Hadoop cluster information, and randomly selecting one slave node from the slave node information as a target Hadoop client.
In this embodiment, as a first specific embodiment of selecting a target Hadoop client, one slave node may be randomly selected from the plurality of slave nodes as the target Hadoop client, without considering on which slave nodes the target data block set corresponding to the preceding intermediate data is distributed. The data of each target data block in the target data block set corresponding to the intermediate data is then summarized on the target Hadoop client, and the target Hadoop client exports the intermediate data to the target database. In this way, the target Hadoop client undertaking the data export work can be determined quickly.
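The random selection in this first embodiment can be sketched as follows; the shape of the cluster-information dictionary is an assumption for illustration.

```python
import random

def pick_target_client(cluster_info: dict) -> str:
    """First embodiment: randomly choose one slave node (DataNode) as the target
    Hadoop client, regardless of where the target data blocks are distributed."""
    slaves = cluster_info["datanodes"]  # illustrative key for the slave node list
    return random.choice(slaves)

# Illustrative cluster information derived from the roles described in the text.
cluster = {"namenode": "nn1",
           "secondary_namenode": "snn1",
           "datanodes": ["dn1", "dn2", "dn3"]}
target = pick_target_client(cluster)
```

Because any DataNode can serve, the selection requires no knowledge of block placement, which is what makes this embodiment fast to decide.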
In an embodiment, as a second specific embodiment of the intermediate data importing unit 102, the intermediate data importing unit 102 is further specifically configured to:
and acquiring slave node information according to the Hadoop cluster information, acquiring slave nodes corresponding to all target data blocks in the target data block set respectively, and taking the slave nodes corresponding to all the target data blocks as target Hadoop clients.
In this embodiment, as a second specific embodiment of selecting a target Hadoop client, it is considered on which slave nodes the target data block set corresponding to the preceding intermediate data is distributed; every slave node holding one of those target data blocks is then taken as a target Hadoop client, and the target Hadoop clients export the intermediate data to the target database. In this way, the target data blocks do not need to be summarized on a single slave node, and the target Hadoop clients undertaking the data export work can be determined quickly.
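The second embodiment can be sketched as a block-location lookup; the mapping structure from block to hosting slave node is an assumption for illustration.

```python
def pick_target_clients(block_locations: dict) -> list:
    """Second embodiment: every slave node that hosts a target data block becomes
    a target Hadoop client, so no block needs to move to a single collection node."""
    # block_locations maps block id -> slave node hosting it (illustrative structure)
    return sorted(set(block_locations.values()))

# Four target data blocks spread over three slave nodes.
locations = {0: "dn1", 1: "dn2", 2: "dn1", 3: "dn3"}
targets = pick_target_clients(locations)
```

Deduplicating the hosting nodes yields the full set of target Hadoop clients directly from the block placement, avoiding the cross-node summarizing step of the first embodiment.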
When the target Hadoop client is determined according to the Hadoop cluster information, the intermediate data needs to be formally concentrated in the target Hadoop client, so the target data blocks serve as data concentration points from which the data is sent.
In an embodiment, the intermediate data importing unit 102 is further specifically configured to:
and importing the source data or target data block set address of the intermediate data into the target Hadoop client.
In this embodiment, for example, the intermediate data importing unit 102 may select one slave node as the target Hadoop client and then summarize the data of each target data block of the target data block set corresponding to the intermediate data on that client, thereby achieving formal concentration. As another example, the intermediate data importing unit 102 may select a plurality of slave nodes as target Hadoop clients, namely the slave nodes over which the preceding target data block set is distributed; when a data export task is known to exist in the server, the target Hadoop clients establish pre-connections in advance based on their respective target data block set addresses and notify each other that their respective target data blocks are to be exported to the target database, which likewise achieves formal concentration.
And the target library importing unit 103 is configured to acquire pre-stored export plug-in data, load the export plug-in data through the target Hadoop client, and import the intermediate data into the target database through the loaded export plug-in data.
In this embodiment, the data in each target data block of the target data block set exists in the form of data text, and the data texts are separated by delimiters; that is, the data to be imported into the target database from the target Hadoop client is imported as text data. If it is determined that the target database adopts the yellowbreak database, the corresponding export plug-in data (specifically, plug-in data corresponding to ybtools) can be invoked, and the corresponding export plug-in in the export plug-in data (specifically, the ybload plug-in, which provides the function of importing data from HDFS into the yellowbreak database) then obtains the intermediate data and imports it into the target database. When the intermediate data is imported into the target database based on the export plug-in data, the table information only needs to be read once from the management node of the target database, after which the data can be loaded into the target database in batches, saving the resources of the management node of the target database.
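Assembling the export invocation for the loaded plug-in could be sketched as below. This is a heavily hedged sketch: the flag names and the HDFS source path are assumptions made for illustration only and should be checked against the actual ybtools/ybload documentation, which defines the real options.

```python
def build_export_command(host: str, database: str, table: str,
                         source_path: str, delimiter: str = "|") -> list:
    """Assemble an argument list for a bulk-load invocation.
    All flag names below are illustrative assumptions, not verified ybload options."""
    return ["ybload",
            "-h", host,              # assumed flag: target database host
            "-d", database,          # assumed flag: target database name
            "-t", table,             # assumed flag: target table
            "--delimiter", delimiter,  # text data is delimiter-separated per the text
            source_path]

# Hypothetical invocation: load delimiter-separated text from HDFS into the target table.
cmd = build_export_command("yb-host", "targetdb", "target_table",
                           "hdfs:///data/intermediate/part-0")
```

Building the command once per batch matches the text's point that table information is read a single time from the management node before batched loading.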
The device thus uses the server as a transfer station between the source database and the target database: the source data is converted into intermediate data in the server, and the intermediate data is transmitted to the target database. The server only needs to read the table information once from the management node of the target database to load the data in batches, which saves the resources of the management node of the target database and prevents the target database from crashing.
The above-mentioned data importing apparatus based on the yellowbreak database may be implemented in the form of a computer program, which may be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present invention. The computer device 500 may be a server or a server cluster. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Referring to fig. 4, the computer apparatus 500 includes a processor 502, a memory, which may include a storage medium 503 and an internal memory 504, and a network interface 505 connected by a device bus 501.
The storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to perform a yellowbreak database-based data import method.
The processor 502 is used to provide computing and control capabilities that support the operation of the overall computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 in the storage medium 503, and when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to execute a yellowbreak database-based data import method.
The network interface 505 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with aspects of the present invention and is not intended to limit the computer device 500 to which aspects of the present invention may be applied, and that a particular computer device 500 may include more or fewer components than those shown, or may combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the yellowblow database-based data importing method disclosed in the embodiment of the present invention.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that, in the embodiment of the present invention, the Processor 502 may be a Central Processing Unit (CPU), and the Processor 502 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the invention, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the yellowbreak database-based data importing method disclosed by the embodiment of the invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only a logical division, and there may be other divisions when the actual implementation is performed, or units having the same function may be grouped into one unit, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a background server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A data import method based on a yellowbreak database is characterized by comprising the following steps:
responding to a data synchronization instruction, acquiring a first synchronization script corresponding to the data synchronization instruction, acquiring source data from a source database according to the first synchronization script, and converting the source data into intermediate data through the first synchronization script and storing the intermediate data;
importing the intermediate data into a target Hadoop client determined according to Hadoop cluster information; and
and obtaining pre-stored export plug-in data, loading the export plug-in data through the target Hadoop client, and importing the intermediate data into a target database through the loaded export plug-in data.
2. The yellowbreak database-based data import method according to claim 1, wherein the obtaining of the first synchronization script corresponding to the data synchronization instruction comprises:
and filling the source database connection address, the source database user name and the source database target data table name corresponding to the data synchronization instruction into a pre-stored script template to obtain a first synchronization script.
3. The yellowbreak database-based data import method according to claim 1, wherein the converting the source data into intermediate data and storing the intermediate data through the first synchronization script comprises:
converting the source data into intermediate data by the first synchronization script;
acquiring a source data volume value of the source data, and if the source data volume value is determined not to exceed a preset data volume threshold value, correspondingly creating a target data block set according to the source data volume value and a preset first partition strategy;
storing the intermediate data to the set of target data blocks.
4. The yellowbreak database-based data import method according to claim 1, wherein the importing the intermediate data to a target Hadoop client determined according to Hadoop cluster information comprises:
and obtaining slave node information according to the Hadoop cluster information, and randomly selecting one slave node from the slave node information as a target Hadoop client.
5. The yellowbreak database-based data import method according to claim 1, wherein the importing the intermediate data to a target Hadoop client determined according to Hadoop cluster information comprises:
and acquiring slave node information according to the Hadoop cluster information, acquiring slave nodes corresponding to all target data blocks in the target data block set respectively, and taking the slave nodes corresponding to all the target data blocks as target Hadoop clients.
6. The yellowbreak database-based data import method according to claim 3, wherein the storing the intermediate data to the target data block set comprises:
dividing the intermediate data according to the total number of the target data block sets to obtain split intermediate data sets, and storing each split intermediate data included in the split intermediate data sets into corresponding target data blocks in the target data block sets.
7. The yellowbreak database-based data import method according to claim 1, wherein the importing the intermediate data to a target Hadoop client determined according to Hadoop cluster information comprises:
and importing the source data or target data block set address of the intermediate data into the target Hadoop client.
8. A data import device based on a yellowbreak database, comprising:
the synchronous script generating unit is used for responding to a data synchronous instruction, acquiring a first synchronous script corresponding to the data synchronous instruction, acquiring source data from a source database according to the first synchronous script, and converting the source data into intermediate data through the first synchronous script and storing the intermediate data;
the intermediate data import unit is used for importing the intermediate data into a target Hadoop client determined according to Hadoop cluster information; and
and the target library importing unit is used for acquiring pre-stored export plug-in data, loading the export plug-in data through the target Hadoop client, and importing the intermediate data into a target database through the loaded export plug-in data.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the yellowbreak database-based data import method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the yellowbreak database-based data import method according to any one of claims 1 to 7.
CN202111559424.8A 2021-12-20 2021-12-20 Data import method, device, equipment and medium based on yellowbreak database Pending CN114238479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111559424.8A CN114238479A (en) 2021-12-20 2021-12-20 Data import method, device, equipment and medium based on yellowbreak database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111559424.8A CN114238479A (en) 2021-12-20 2021-12-20 Data import method, device, equipment and medium based on yellowbreak database

Publications (1)

Publication Number Publication Date
CN114238479A true CN114238479A (en) 2022-03-25

Family

ID=80758840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111559424.8A Pending CN114238479A (en) 2021-12-20 2021-12-20 Data import method, device, equipment and medium based on yellowbreak database

Country Status (1)

Country Link
CN (1) CN114238479A (en)

Similar Documents

Publication Publication Date Title
US11809726B2 (en) Distributed storage method and device
US10831562B2 (en) Method and system for operating a data center by reducing an amount of data to be processed
CN112286503B (en) Method, device, equipment and medium for unified management of microservices of multiple registries
US10394792B1 (en) Data storage in a graph processing system
JP7360395B2 (en) Input and output schema mapping
US8224825B2 (en) Graph-processing techniques for a MapReduce engine
WO2019042312A1 (en) Distributed computing system, data transmission method and device in distributed computing system
CN108304473A (en) Data transmission method between data source and system
CN113687964B (en) Data processing method, device, electronic equipment, storage medium and program product
CN110837535A (en) Data synchronization method, device, equipment and medium
CN113900810A (en) Distributed graph processing method, system and storage medium
CN112860412B (en) Service data processing method and device, electronic equipment and storage medium
CN108153896B (en) Processing method and device for input data and output data
KR102031589B1 (en) Methods and systems for processing relationship chains, and storage media
CN114238479A (en) Data import method, device, equipment and medium based on yellowbreak database
CN115455006A (en) Data processing method, data processing device, electronic device, and storage medium
WO2022121387A1 (en) Data storage method and apparatus, server, and medium
CN114401239A (en) Metadata transmission method and device, computer equipment and storage medium
US9916372B1 (en) Folded-hashtable synchronization mechanism
US10511656B1 (en) Log information transmission integrity
CN112637288A (en) Streaming data distribution method and system
CN107332679B (en) Centerless information synchronization method and device
WO2014176954A1 (en) Processing method, device and system for data of distributed storage system
CN116743589B (en) Cloud host migration method and device and electronic equipment
CN117319474A (en) Multi-line program columnar data transmission processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination