CN106790489B - Parallel data loading method and system - Google Patents

Parallel data loading method and system

Info

Publication number
CN106790489B
Authority
CN
China
Prior art keywords
data
loading
node
nodes
loaded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611150991.7A
Other languages
Chinese (zh)
Other versions
CN106790489A (en)
Inventor
杨卓慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Huawei Technology Co Ltd
Original Assignee
Chengdu Huawei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Huawei Technology Co Ltd
Priority to CN201611150991.7A
Publication of CN106790489A
Application granted
Publication of CN106790489B
Legal status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses

Abstract

An embodiment of the invention provides a parallel data loading method and system. The data to be loaded is stored on an FTP server, and each data node downloads the data block corresponding to its loading file information from the FTP server. Because the data is downloaded as files from the FTP server, file download efficiency improves, and with it the efficiency of parallel data loading. The master node sends loading indication information to the data nodes so that they load the data stored on the FTP server in parallel. Because the data nodes actively request task allocation from the master node, data nodes with greater processing capacity load more data blocks; loading tasks are thus distributed on demand, which further improves the efficiency of parallel data loading.

Description

Parallel data loading method and system
Technical Field
The embodiment of the invention relates to computer technology, in particular to a parallel data loading method and system.
Background
With the rapid development of computer technology, the application of databases is more and more extensive, and the loading efficiency of data directly affects the overall performance of databases.
In the prior art, data is loaded by connecting an application to a database through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) and issuing standard SQL statements. SQL Server, Oracle, PostgreSQL, and similar systems load data in this way.
However, the prior art data loading is inefficient.
Disclosure of Invention
The embodiment of the invention provides a data loading method and system, which aim to improve the efficiency of parallel data loading.
One aspect of the embodiments of the present invention provides a parallel data loading method applied to a parallel data loading system. The system includes M master nodes, N data nodes, and R FTP servers, where M is an integer greater than or equal to 1, N is an integer greater than or equal to 2, and R is an integer greater than or equal to 1. The M master nodes are in communication connection with the N data nodes and the R FTP servers, and the N data nodes are in communication connection with the R FTP servers.
The method comprises the following steps: the master node sends loading indication information to at least two data nodes, the loading indication information instructing the at least two data nodes to load data stored on the FTP server; the master node receives task allocation request information sent by the at least two data nodes, the task allocation request information requesting the master node to allocate loading file information to the data nodes; the master node sends each data node its corresponding loading file information; and the data node downloads the data block corresponding to the loading file information from the FTP server and loads it. Because the master node sends the loading indication information to the data nodes, the data nodes load the data stored on the FTP server in parallel. Because the data nodes actively request task allocation, data nodes with greater processing capacity load more data blocks, so loading tasks are distributed on demand and the efficiency of parallel data loading improves further.
Optionally, before the master node sends each data node its corresponding loading file information, the method further includes:
the master node determines the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send task allocation request information. Loading files are thereby allocated according to the actual processing capacity of the data nodes, which raises the utilization of processing resources and improves parallel loading efficiency.
Optionally, the method further comprises:
if the master node determines that all files to be loaded have been loaded, the master node sends loading completion indication information to the at least two data nodes, so that the at least two data nodes stop sending task allocation request information to the master node.
Optionally, before the master node sends the loading indication information to the at least two data nodes, the method further includes:
the master node receives loading indication information sent by the client, where the loading indication information includes information about the file to be loaded.
Optionally, the method further comprises:
if the master node determines that all files to be loaded have been loaded, the master node sends loading completion indication information to the client, so that the user can learn the loading result through the client.
Optionally, before the master node determines the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send task allocation request information, the method further includes:
the master node divides the file to be loaded into a plurality of data blocks, each data block corresponding to one piece of loading file information.
Another aspect of the embodiments of the present invention provides a parallel data loading system, including:
M master nodes, N data nodes, and R FTP servers, where M is an integer greater than or equal to 1, N is an integer greater than or equal to 2, and R is an integer greater than or equal to 1; the M master nodes are in communication connection with the N data nodes and the R FTP servers, and the N data nodes are in communication connection with the R FTP servers.
The FTP server is used for storing files to be loaded;
the master node is used for sending loading indication information to at least two data nodes, the loading indication information instructing the at least two data nodes to load data stored on the FTP server;
the master node is also used for receiving task allocation request information sent by the at least two data nodes, the task allocation request information requesting the master node to allocate loading file information to the data nodes;
the master node is also used for sending each data node its corresponding loading file information;
and the data node is used for downloading the data block corresponding to the loading file information from the FTP server and loading the data block.
Optionally, the master node is further configured to determine the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send the task allocation request information.
Optionally, the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the at least two data nodes.
Optionally, the master node is further configured to receive loading indication information sent by the client, where the loading indication information includes information about the file to be loaded.
Optionally, the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the client.
Optionally, the master node is further configured to divide the file to be loaded into a plurality of data blocks, each data block corresponding to one piece of loading file information.
Drawings
FIG. 1 is a diagram of a parallel data loading system architecture according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a parallel data loading method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a parallel data loading interaction according to the present invention.
Detailed Description
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is an architecture diagram of a parallel data loading system according to an embodiment of the present invention. As shown in Fig. 1, the system includes M master nodes, N data nodes, and an external data source. The external data source includes R File Transfer Protocol (FTP) servers, where M is an integer greater than or equal to 1, N is an integer greater than or equal to 2, and R is an integer greater than or equal to 1. Fig. 1 illustrates the case where M is 2, N is 3, and R is 2. The system of Fig. 1 may further include a client, which may be part of the system or independent of it; the embodiment of the present invention does not limit this.
The client is in communication connection with the master node. The master node is in communication connection with the FTP server and the data node. The data node is in communication connection with the FTP server. The FTP server stores the file to be loaded. The data node loads the file to be loaded that is stored on the FTP server. The master node controls the data node's loading of that file.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 2 is a schematic flow diagram of an embodiment of the parallel data loading method of the present invention. The method of this embodiment is applied to the parallel data loading system shown in Fig. 1 and proceeds as follows:
S200: The master node creates a foreign table.
The master node may create the foreign table through an SQL statement.
An example SQL statement:
CREATE FOREIGN TABLE foreign_table (index int, description varchar(100))
SERVER file_server OPTIONS (format 'text', ftpserver 'ftp://10.123.1.100/data', delimiter '|', null '');
S201: The client sends loading indication information to the master node.
The loading indication information includes information about the file to be loaded. The file to be loaded is stored on the FTP server. It may be stored on one FTP server or on multiple FTP servers; the embodiment of the present invention does not limit this.
Optionally, the client may send the loading indication information to the master node through an SQL statement.
An example SQL statement:
INSERT INTO table2 SELECT * FROM foreign_table;
After receiving the loading indication information sent by the client, the master node executes S202.
S202: The master node sends loading indication information to at least two data nodes.
The loading indication information instructs the at least two data nodes to load the data stored on the FTP server.
That is, the master node sends the loading indication information to the at least two data nodes at the same time, so that they load the data stored on the FTP server in parallel.
Each data node parses the loading indication information and, if it indicates parallel loading of data from the FTP server, executes S203.
S203: The data node sends task allocation request information to the master node.
The task allocation request information requests the master node to allocate loading file information to the data node.
S204: The data node receives its corresponding loading file information from the master node.
S205: The data node downloads the data block corresponding to the loading file information from the FTP server and loads the data block.
After receiving the loading indication information, the data node executes S203 to S205. After loading the data block corresponding to the loading file information, it executes S203 to S205 again, and repeats this until it receives the loading completion indication information from the master node, at which point it stops sending task allocation request information. Because the data nodes actively request task allocation from the master node, data nodes with greater processing capacity load more data blocks; tasks are allocated on demand, and parallel data loading efficiency improves.
Fig. 3 is a diagram illustrating a parallel data loading interaction according to the present invention. The system shown in Fig. 3 includes two FTP servers (FTP server 1 and FTP server 2) and three data nodes (data node 1, data node 2, and data node 3); the squares in the FTP servers represent data blocks. The left diagram shows the state before loading, and the right diagram shows the state during loading.
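The request loop of S203 to S205 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `run_pull_loading`, the in-process queue standing in for the master node, and the `time.sleep` call standing in for the FTP download and database load are all hypothetical and do not come from the patent. An empty queue plays the role of the loading completion indication.

```python
import queue
import threading
import time


def run_pull_loading(blocks, node_delays):
    """Simulate data nodes that pull blocks until none remain.

    blocks: identifiers of the data blocks to load.
    node_delays: maps a node name to the seconds it needs per block
                 (a stand-in for download plus load time).
    Returns a dict mapping each node name to the blocks it loaded.
    """
    tasks = queue.Queue()
    for block in blocks:
        tasks.put(block)
    loaded = {name: [] for name in node_delays}

    def data_node(name, delay):
        while True:
            try:
                # S203/S204: request a task and receive loading file information.
                block = tasks.get_nowait()
            except queue.Empty:
                # Stand-in for the loading completion indication.
                return
            time.sleep(delay)  # S205: stand-in for download and load
            loaded[name].append(block)

    threads = [
        threading.Thread(target=data_node, args=(name, delay))
        for name, delay in node_delays.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return loaded
```

Calling `run_pull_loading(list(range(40)), {"fast": 0.001, "slow": 0.05})` assigns every block to exactly one node, and the node with the shorter per-block delay ends up with most of the blocks, which is the on-demand effect described above.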
Before the master node sends loading file information to the data nodes, it recursively enumerates the files on the FTP server and divides the files to be loaded into a plurality of data blocks in advance. Each data block must preserve row integrity. The data blocks may be of the same size or of different sizes; the embodiment of the present invention does not limit this.
Each data block corresponds to one piece of loading file information, which may be represented by the name of the loading file and an offset.
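One way to divide a file into blocks that keep rows intact is sketched below. The names `LoadFileInfo` and `split_into_blocks` are hypothetical, the `length` field is an added assumption (the text names only the file name and offset), and the sketch assumes newline-terminated rows held in memory, whereas a real master node would work on files enumerated from the FTP server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadFileInfo:
    name: str    # name of the loading file
    offset: int  # byte offset at which the block starts
    length: int  # block length in bytes (assumed field, not in the source)


def split_into_blocks(name, data, target_size):
    """Cut `data` into blocks of about `target_size` bytes ending on row boundaries."""
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    blocks = []
    start = 0
    while start < len(data):
        end = min(start + target_size, len(data))
        if end < len(data):
            # Extend the block to the end of the row that contains byte end - 1.
            newline = data.find(b"\n", end - 1)
            end = len(data) if newline == -1 else newline + 1
        blocks.append(LoadFileInfo(name, start, end - start))
        start = end
    return blocks
```

Because each block is extended to the next row terminator, block sizes vary slightly around `target_size`, which is consistent with the statement that blocks may have different sizes.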
When the master node allocates loading file information to the data nodes, it may optionally determine the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send task allocation request information. A data node that sends task allocation request information more frequently has greater processing capacity, so a large data block can be allocated to it to make full use of its processing resources. A data node that sends task allocation request information less frequently has weaker processing capacity, so a small data block can be allocated to it. Allocation thus follows the actual processing capacity of the data nodes, which raises the utilization of processing resources and improves parallel loading efficiency.
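The frequency-based sizing can be sketched as follows. The patent does not say how frequency is measured or which sizes are used, so the sliding time window, the request-count threshold, the two fixed sizes, and the class name `RequestFrequencySizer` are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class RequestFrequencySizer:
    """Pick a block size per node from how often the node has asked for work."""

    def __init__(self, small_size, large_size, window, threshold):
        self.small_size = small_size  # size for nodes that request rarely
        self.large_size = large_size  # size for nodes that request often
        self.window = window          # seconds of request history to consider
        self.threshold = threshold    # requests within the window that count as "frequent"
        self._requests = defaultdict(deque)

    def on_request(self, node, now):
        """Record one task allocation request and return the block size to allocate."""
        history = self._requests[node]
        history.append(now)
        # Drop requests that fall outside the sliding window.
        while history and now - history[0] > self.window:
            history.popleft()
        if len(history) >= self.threshold:
            return self.large_size
        return self.small_size
```

With `RequestFrequencySizer(64, 256, window=10.0, threshold=3)`, a node that requests at times 0, 1, and 2 receives the large size on its third request, while a node that requests every 20 seconds keeps receiving the small size.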
When the master node allocates loading file information to different data nodes, it can allocate loading file information corresponding to data blocks on different FTP servers, so that the network bandwidth of each FTP server is fully used and parallel loading efficiency improves further.
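A minimal sketch of spreading allocations across FTP servers follows. The source states only that different data nodes can be given blocks on different FTP servers; choosing the server with the fewest active downloads is an assumed policy, and `pick_block` and its arguments are hypothetical names.

```python
def pick_block(pending_by_server, active_by_server):
    """Return (server, block) from the least-busy server that still has blocks.

    pending_by_server: maps a server name to the list of blocks not yet allocated.
    active_by_server: maps a server name to its number of downloads in progress;
                      updated in place when a block is handed out.
    Returns None when no server has pending blocks.
    """
    candidates = [server for server, blocks in pending_by_server.items() if blocks]
    if not candidates:
        return None
    server = min(candidates, key=lambda s: active_by_server.get(s, 0))
    block = pending_by_server[server].pop(0)
    active_by_server[server] = active_by_server.get(server, 0) + 1
    return server, block
```

The caller would decrement the counter in `active_by_server` when a data node reports that its download has finished, so that the counts keep reflecting the current load on each server.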
S206: If the master node determines that all files to be loaded have been loaded, it sends loading completion indication information to the at least two data nodes.
The loading completion indication information causes the at least two data nodes to stop sending task allocation request information to the master node.
S207: If the master node determines that all files to be loaded have been loaded, it sends loading completion indication information to the client.
After a data node finishes loading, it returns a loading result to the master node, and the master node returns the loading result to the client so that the user can learn the loading result through the client.
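How the master node might decide that all files have been loaded can be sketched as simple bookkeeping. The patent does not describe this mechanism; the class `LoadTracker` and its methods are hypothetical, and the sketch assumes each data node reports a result for every block it loads.

```python
class LoadTracker:
    """Master-side bookkeeping for deciding when loading is complete (S206/S207)."""

    def __init__(self, blocks):
        self.pending = list(blocks)  # blocks not yet allocated to any data node
        self.in_flight = {}          # block -> data node currently loading it

    def allocate(self, node):
        """Answer a task allocation request; None means nothing is left to allocate."""
        if not self.pending:
            return None
        block = self.pending.pop(0)
        self.in_flight[block] = node
        return block

    def report_loaded(self, block):
        """Record the loading result that a data node returns for a block."""
        del self.in_flight[block]

    def all_loaded(self):
        """True only when every block has been allocated and reported as loaded."""
        return not self.pending and not self.in_flight
```

Tracking in-flight blocks separately from pending ones distinguishes "every block has been allocated" from "every block has been loaded"; only the latter should trigger the loading completion indication.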
In this embodiment, the data to be loaded is stored on the FTP server, and the data node downloads the data block corresponding to the loading file information from the FTP server. Because the data is downloaded as files from the FTP server, file download efficiency improves, and with it parallel data loading efficiency. The master node sends loading indication information to the data nodes so that they load the data stored on the FTP server in parallel. Because the data nodes actively request task allocation from the master node, data nodes with greater processing capacity load more data blocks, so loading tasks are distributed on demand and parallel data loading efficiency improves further. The master node determines the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send task allocation request information, so allocation follows the actual processing capacity of the data nodes, which raises the utilization of processing resources and improves parallel loading efficiency. The master node allocates data on different FTP servers to different data nodes, so the network bandwidth of each FTP server is fully used and parallel loading efficiency improves further. When network I/O is the bottleneck of parallel data loading, adding FTP servers can eliminate that bottleneck and further improve parallel loading efficiency.
The present invention also provides an embodiment of a parallel data loading system. As shown in Fig. 1, the system includes M master nodes, N data nodes, and R FTP servers, where M is an integer greater than or equal to 1, N is an integer greater than or equal to 2, and R is an integer greater than or equal to 1. The M master nodes are in communication connection with the N data nodes and the R FTP servers, and the N data nodes are in communication connection with the R FTP servers.
The FTP server is used for storing files to be loaded.
The master node is used for sending loading indication information to at least two data nodes, the loading indication information instructing the at least two data nodes to load the data stored on the FTP server.
The master node is further used for receiving task allocation request information sent by the at least two data nodes, the task allocation request information requesting the master node to allocate loading file information to the data nodes.
The master node is also used for sending each data node its corresponding loading file information.
The data node is used for downloading the data block corresponding to the loading file information from the FTP server and loading the data block.
Optionally, the master node is further configured to determine the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send the task allocation request information.
Optionally, the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the at least two data nodes.
Optionally, the master node is further configured to receive loading indication information sent by the client, where the loading indication information includes information about the file to be loaded.
Optionally, the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the client.
Optionally, the master node is further configured to divide the file to be loaded into a plurality of data blocks, each data block corresponding to one piece of loading file information.
The above system embodiment may be used to implement the technical solution of the method embodiment shown in Fig. 2; its implementation principle and technical effect are similar and are not described again here.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The storage medium includes any medium that can store program code, such as ROM, RAM, and magnetic or optical disks.

Claims (9)

1. A parallel data loading method, applied to a parallel data loading system, wherein the system comprises M master nodes, N data nodes, and R FTP servers, M being an integer greater than or equal to 1, N being an integer greater than or equal to 2, and R being an integer greater than or equal to 1; the M master nodes are in communication connection with the N data nodes and the R FTP servers, and the N data nodes are in communication connection with the R FTP servers;
the method comprises the following steps:
the master node sends loading indication information to at least two data nodes, wherein the loading indication information is used for instructing the at least two data nodes to load the data stored on the FTP server;
the master node receives task allocation request information sent by the at least two data nodes, wherein the task allocation request information is used for requesting the master node to allocate loading file information to the data nodes;
the master node sends loading file information corresponding to the data node to each data node;
the data node downloads a data block corresponding to the loading file information from the FTP server according to the loading file information and loads the data block;
before the master node sends the loading file information corresponding to the data node to each data node, the method further comprises:
the master node determines the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send the task allocation request information.
2. The method of claim 1, further comprising:
if the master node determines that all files to be loaded have been loaded, sending loading completion indication information to the at least two data nodes.
3. The method of claim 2, further comprising:
if the master node determines that all files to be loaded have been loaded, sending loading completion indication information to the client.
4. The method according to any one of claims 1 to 3, wherein before the master node determines the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send the task allocation request information, the method further comprises:
the master node divides a file to be loaded into a plurality of data blocks, each data block corresponding to one piece of loading file information.
5. A parallel data loading system, comprising:
M master nodes, N data nodes, and R FTP servers, M being an integer greater than or equal to 1, N being an integer greater than or equal to 2, and R being an integer greater than or equal to 1; the M master nodes are in communication connection with the N data nodes and the R FTP servers, and the N data nodes are in communication connection with the R FTP servers;
wherein:
the FTP server is used for storing files to be loaded;
the master node is used for sending loading indication information to at least two data nodes, the loading indication information being used for instructing the at least two data nodes to load the data stored on the FTP server;
the master node is further configured to receive task allocation request information sent by the at least two data nodes, where the task allocation request information is used to request the master node to allocate loading file information to the data nodes;
the master node is further used for sending loading file information corresponding to the data node to each data node;
the data node is used for downloading the data block corresponding to the loading file information from the FTP server according to the loading file information and loading the data block;
and the master node is further used for determining the size of the loading file allocated to each data node according to the frequency with which the at least two data nodes send the task allocation request information.
6. The system of claim 5, wherein
the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the at least two data nodes.
7. The system of claim 6, wherein
the master node is further configured to receive loading indication information sent by a client, where the loading indication information includes information about the file to be loaded.
8. The system of claim 7, wherein
the master node is further configured to determine that all files to be loaded have been loaded and to send loading completion indication information to the client.
9. The system according to any one of claims 5 to 8, wherein
the master node is further configured to divide the file to be loaded into a plurality of data blocks, each data block corresponding to one piece of the loading file information.
CN201611150991.7A 2016-12-14 2016-12-14 Parallel data loading method and system Active CN106790489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611150991.7A CN106790489B (en) 2016-12-14 2016-12-14 Parallel data loading method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611150991.7A CN106790489B (en) 2016-12-14 2016-12-14 Parallel data loading method and system

Publications (2)

Publication Number Publication Date
CN106790489A CN106790489A (en) 2017-05-31
CN106790489B true CN106790489B (en) 2020-12-22

Family

ID=58887772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611150991.7A Active CN106790489B (en) 2016-12-14 2016-12-14 Parallel data loading method and system

Country Status (1)

Country Link
CN (1) CN106790489B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885460A (en) * 2017-10-12 2018-04-06 北京人大金仓信息技术股份有限公司 A kind of data access method of cluster
CN109547253B (en) * 2018-11-28 2021-12-07 广东海格怡创科技有限公司 File downloading method and device, computer equipment and storage medium
CN109902065A (en) * 2019-02-18 2019-06-18 国家计算机网络与信息安全管理中心 Access distributed type assemblies external data method and device
CN109815295A (en) * 2019-02-18 2019-05-28 国家计算机网络与信息安全管理中心 Distributed type assemblies data lead-in method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1939036A (en) * 2004-06-08 2007-03-28 国际商业机器公司 Optimized concurrent data download within a grid computing environment
CN103544285A (en) * 2013-10-28 2014-01-29 华为技术有限公司 Data loading method and device

Also Published As

Publication number Publication date
CN106790489A (en) 2017-05-31

Similar Documents

Publication Publication Date Title
CN106790489B (en) Parallel data loading method and system
CN109343963B (en) Application access method and device for container cluster and related equipment
AU2014212780B2 (en) Data stream splitting for low-latency data access
CN103379138B (en) Realize method that the method and system of load balancing and gray scale issue and device
US20140195482A1 (en) Data synchronization in a storage network
CN104954468A (en) Resource allocation method and resource allocation device
CN107807815B (en) Method and device for processing tasks in distributed mode
CN105760184B (en) A kind of method and apparatus of charging assembly
JP2021513694A (en) Dark Roch Realization Method, Equipment, Computational Nodes and Systems
CN108196787B (en) Quota management method of cluster storage system and cluster storage system
CN109510852B (en) Method and device for gray scale publishing
US20190228009A1 (en) Information processing system and information processing method
US20160019090A1 (en) Data processing control method, computer-readable recording medium, and data processing control device
CN108875042A (en) A kind of mixing on-line analysing processing system and data query method
CN108200211B (en) Method, node and query server for downloading mirror image files in cluster
CN110399368B (en) Method for customizing data table, data operation method and device
CN109873839A (en) Method, server and the distributed system of data access
CN101727503A (en) Method for establishing disk file system
CN103369002A (en) A resource downloading method and system
US9619495B2 (en) Surrogate key generation
CN105446824B (en) Table increment acquisition methods and long-distance data backup method
US10048991B2 (en) System and method for parallel processing data blocks containing sequential label ranges of series data
CN113127477A (en) Method and device for accessing database, computer equipment and storage medium
CN111026397B (en) Rpm packet distributed compiling method and device
CN105978744A (en) Resource allocation method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant