CN113840000A

CN113840000A - Distributed network downloading method and device for massive large files

Info

Publication number: CN113840000A
Application number: CN202111109211.5A
Authority: CN
Inventors: 邱江飞; 娄伟贞; 朱梅; 吴敬超
Original assignee: Shandong EHualu Information Technology Co ltd
Current assignee: Shandong EHualu Information Technology Co ltd
Priority date: 2021-06-30
Filing date: 2021-09-22
Publication date: 2021-12-24

Abstract

The invention discloses a distributed network downloading method and device for massive large files. The method comprises the following steps: submitting a download address at the control node; monitoring the task state according to the download address; performing multithreaded downloading according to the task state to obtain downloaded data; and judging whether the downloaded data needs to be downloaded again. The invention addresses the following problems in the prior art: although a download program can run continuously on a server, it is limited by the network, storage and other resources of a single server, so the download period is long when a great number of data files must be downloaded; and commands such as curl and wget do not support multithreading, so a single large file downloads slowly and inefficiently, and they do not support remote RPC calls, which hinders remote monitoring and download task scheduling.

Description

Distributed network downloading method and device for massive large files

Technical Field

The invention relates to the field of data downloading, in particular to a distributed network downloading method and device for massive large files.

Background

With the continuous development of intelligent technology, people increasingly use intelligent devices in daily life, work and study. Such technology has improved the quality of life and made study and work more efficient.

In many scientific research fields, such as atmospheric science, hydrology, oceanography, environmental simulation and geophysics, data sets are usually published on the internet as files in specific formats for shared access; common data formats include netCDF, HDF and GRIB. Data in these fields are usually organized by data type and by time: some data sets store one day of data per file, others one month or one year. The files cover several years, several decades, or even hundreds of years, and contain different observation indexes. As a result, the number of data files is very large, a single file can reach tens of gigabytes, and the total volume is very large.

To acquire these files, an automated script is generally run on a server, and the required files are fetched one after another with network commands such as curl and wget until all files have been downloaded. Although the download program can run continuously on the server, it is limited by the network, storage and other resources of a single server, so the download period is long when a large number of data files must be downloaded. Commands such as curl and wget do not support multithreading, so a single large file downloads slowly and inefficiently; they also do not support remote RPC calls, which hinders remote monitoring and download task scheduling. In addition, for security reasons, websites often limit or block IP addresses that crawl the site frequently or whose download traffic exceeds a threshold. Downloading a large amount of data from a single point can easily get the IP address blocked, after which the data can no longer be downloaded.

The invention aims to improve the efficiency of downloading massive large files by means of a distributed network and multithreaded downloading. Allocating the many download tasks to a plurality of servers that work cooperatively shortens the total download time. Multithreaded downloading increases the download speed of a single file. With a distributed cluster, a task on a node whose IP has been restricted can be migrated to other nodes, avoiding task failure caused by the blocking of a single IP.

In view of the above problems, no effective solution has been proposed.

Disclosure of Invention

The embodiments of the invention provide a distributed network downloading method and device for massive large files, which at least address the following problems in the prior art. Although a download program can run continuously on a server, it is limited by the network, storage and other resources of a single server, so the download period is long when a great number of data files must be downloaded. Commands such as curl and wget do not support multithreading, so a single large file downloads slowly and inefficiently; they also do not support remote RPC calls, which hinders remote monitoring and download task scheduling.

According to an aspect of the embodiments of the present invention, a distributed network downloading method for massive large files is provided, including: submitting a download address at the control node; monitoring the task state according to the download address; performing multithreaded downloading according to the task state to obtain downloaded data; and judging whether the downloaded data needs to be downloaded again.

Optionally, after monitoring the task state according to the download address, the method further includes: acquiring the busy condition according to the task state.

Optionally, the task state includes: busy condition, idle condition.

Optionally, after judging whether the downloaded data needs to be downloaded again, the method further includes: returning to and repeating the step of monitoring the task state according to the download address.

According to another aspect of the embodiments of the present invention, there is also provided a distributed network downloading apparatus for massive large files, including: the download module is used for submitting a download address at the control node; the monitoring module is used for monitoring the task state according to the download address; the multithreading module is used for carrying out multithreading downloading according to the task state to obtain downloading data; and the judging module is used for judging whether the downloaded data needs to be repeatedly downloaded.

According to another aspect of the embodiments of the present invention, a nonvolatile storage medium is further provided, where the nonvolatile storage medium includes a stored program, and the program controls, when running, a device where the nonvolatile storage medium is located to execute a distributed network downloading method for a large amount of large files.

According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a processor and a memory; the memory is stored with computer readable instructions, and the processor is used for executing the computer readable instructions, wherein the computer readable instructions execute a distributed network downloading method for massive large files.

In the embodiments of the invention, a download address is submitted at the control node; the task state is monitored according to the download address; multithreaded downloading is performed according to the task state to obtain downloaded data; and it is judged whether the downloaded data needs to be downloaded again. This addresses the following problems in the prior art. Although a download program can run continuously on a server, it is limited by the network, storage and other resources of a single server, so the download period is long when a great number of data files must be downloaded. Commands such as curl and wget do not support multithreading, so a single large file downloads slowly and inefficiently; they also do not support remote RPC calls, which hinders remote monitoring and download task scheduling.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flowchart of a distributed network downloading method for massive large files according to an embodiment of the present invention;

FIG. 2 is a block diagram of a distributed network downloading apparatus for massive large files according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In accordance with an embodiment of the present invention, a method embodiment of a distributed network downloading method for massive large files is provided. It should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.

Example one

FIG. 1 is a flowchart of a distributed network downloading method for massive large files according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S102: submitting the download address at the control node.

Step S104: monitoring the task state according to the download address.

Step S106: performing multithreaded downloading according to the task state to obtain downloaded data.

Step S108: judging whether the downloaded data needs to be downloaded again.

Optionally, the task state includes: busy condition, idle condition.

Specifically, the embodiment of the present invention is implemented by the following steps.

In this technical scheme, servers are divided into three types according to the tasks they execute: control nodes, monitoring nodes and download nodes.

The control node is responsible for parsing out all the download links contained in the original link and adding the corresponding tasks to the task queue.

The monitoring node is responsible for monitoring the busy state, download progress, task state and the like of each download node, selecting a relatively idle machine, and distributing the tasks in the task queue.

The download node is responsible for performing the actual download tasks. Each download node needs to run the aria2 program. aria2 is a download program that supports multithreaded downloading, resuming interrupted downloads, and downloading a file from multiple sources or protocols, which speeds up the download. The RPC interface built into aria2 makes it easy to check the progress and status of a task.
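As an illustration of how a monitoring node might talk to aria2 on a download node, the following minimal sketch builds JSON-RPC request bodies for aria2's aria2.addUri and aria2.tellStatus methods and converts a status result into a progress fraction. The function names, the default of 8 connections and the directory argument are assumptions made for this example and are not taken from the embodiment; the requests are only constructed here, not sent.

```python
import json


def add_uri_request(url, directory, connections=8, secret=None, request_id="1"):
    """Build the JSON-RPC body that asks aria2 to download one URL.

    aria2 expects option values as strings. "split" and
    "max-connection-per-server" control multi-connection downloading,
    and "continue" asks aria2 to resume a partially downloaded file.
    """
    options = {
        "dir": directory,
        "split": str(connections),
        "max-connection-per-server": str(connections),
        "continue": "true",
    }
    params = [[url], options]
    if secret is not None:
        # When an RPC secret is configured, it goes first as "token:<secret>".
        params.insert(0, f"token:{secret}")
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "aria2.addUri", "params": params}


def tell_status_request(gid, secret=None, request_id="2"):
    """Build the JSON-RPC body that queries the state of one download."""
    keys = ["status", "totalLength", "completedLength"]
    params = [gid, keys]
    if secret is not None:
        params.insert(0, f"token:{secret}")
    return {"jsonrpc": "2.0", "id": request_id,
            "method": "aria2.tellStatus", "params": params}


def progress(status_result):
    """Return the completed fraction from a tellStatus result.

    aria2 reports lengths as decimal strings; a total of 0 means the
    size is not known yet, which is treated as no progress.
    """
    total = int(status_result["totalLength"])
    done = int(status_result["completedLength"])
    return done / total if total else 0.0


if __name__ == "__main__":
    body = add_uri_request("http://example.org/data/a.nc", "/shared/downloads")
    print(json.dumps(body, indent=2))
```

In a deployment, each body would be posted to the aria2 RPC endpoint of the chosen download node; the gid returned by aria2.addUri is what the later aria2.tellStatus call refers to.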

Step one: A download address is submitted at the control node; for example, this address is the root directory of a file server. The control node obtains the download addresses of all files through webpage analysis and recursive calls, and puts them into the task queue.

Step two: The monitoring node periodically collects and monitors the execution state and progress of the tasks in the task queue and how busy each download node is.

Step three: When the monitoring node detects that a new task has been added to the queue, it acquires the busy state of each download node and preferentially distributes the new task to an idle node. The queue follows the first-in-first-out principle. When all nodes are busy, the new task waits in the queue.

Step four: after receiving the task, the download node hands the task to aria2 for multi-threaded download.

Step five: The monitoring node communicates with the download nodes through remote RPC calls, periodically updates the downloaded data and the download state of the data, and stores the node and task state information in ZooKeeper. If the data download is completed, the task is removed from the queue.

Step six: If the data download fails, the current node retries. When the retry limit is exceeded, the monitoring node marks the task status as failed, records the current download callback status and the size and position of the data downloaded so far, and then moves on to the next task.

Step seven: If the monitoring node detects that the IP of the current node has been banned by the target website, it allocates the failed task to another host node with a different IP, which reads the already downloaded content from the shared storage and continues downloading from the last download position, thereby avoiding the resource waste caused by repeated downloading.

Step nine: Repeat steps two through eight until all tasks have been executed.
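The first step above, in which the control node obtains the download addresses of all files through webpage analysis and recursive calls, can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes the file server exposes plain HTML directory listings in which sub-directory links end with "/", and the fetch callable, the LinkParser class and the collect_file_urls function are hypothetical names introduced here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def collect_file_urls(root_url, fetch):
    """Recursively walk directory listings and return a FIFO queue of file URLs.

    fetch(url) must return the HTML text of a listing page (assumption:
    the caller supplies it, e.g. a wrapper around an HTTP client).
    """
    task_queue = deque()
    seen = set()

    def visit(page_url):
        if page_url in seen:
            return
        seen.add(page_url)
        parser = LinkParser()
        parser.feed(fetch(page_url))
        for href in parser.links:
            if href.startswith(("?", "#")):
                continue  # sort links and fragments, not files
            target = urljoin(page_url, href)
            if not target.startswith(root_url):
                continue  # parent directory or external site
            if target.endswith("/"):
                visit(target)  # sub-directory: recurse
            elif target not in seen:
                seen.add(target)
                task_queue.append(target)  # file: enqueue as a task

    visit(root_url)
    return task_queue


if __name__ == "__main__":
    pages = {
        "http://example.org/data/": '<a href="../">up</a><a href="2020/">2020/</a>',
        "http://example.org/data/2020/": '<a href="a.nc">a.nc</a>',
    }
    print(list(collect_file_urls("http://example.org/data/", pages.__getitem__)))
```

Passing the page fetcher in as a parameter keeps the traversal logic separate from the network layer, so the same function can be exercised against an in-memory dictionary as in the usage lines at the bottom.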

This embodiment addresses the following problems in the prior art. Although a download program can run continuously on a server, it is limited by the network, storage and other resources of a single server, so the download period is long when a large number of data files must be downloaded. Commands such as curl and wget do not support multithreading, so a single large file downloads slowly and inefficiently; they also do not support remote RPC calls, which hinders remote monitoring and download task scheduling.
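The scheduling behaviour described in steps three, six and seven (first-in-first-out assignment to idle nodes, a retry limit, and migration of a task away from a node whose IP has been banned) can be illustrated with the following minimal in-memory sketch. All class and function names, the default retry limit of 3, and the use of plain Python objects in place of ZooKeeper and aria2 are assumptions made for this example; resuming from shared storage is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ip: str
    busy: bool = False


@dataclass
class Task:
    url: str
    retries: int = 0
    status: str = "queued"
    banned_ips: set = field(default_factory=set)  # IPs banned for this task


def assign(queue, nodes):
    """Hand queued tasks, in FIFO order, to idle nodes.

    A task is never given to a node whose IP was banned while
    downloading it. Tasks that cannot be placed stay in the queue
    in their original order.
    """
    assignments = []
    waiting = deque()
    while queue:
        task = queue.popleft()
        node = next((n for n in nodes
                     if not n.busy and n.ip not in task.banned_ips), None)
        if node is None:
            waiting.append(task)
            continue
        node.busy = True
        task.status = "downloading"
        assignments.append((task, node))
    queue.extend(waiting)
    return assignments


def on_failure(task, node, queue, max_retries=3, ip_banned=False):
    """React to a failed download and return the action taken."""
    if ip_banned:
        # Step seven: free the node and requeue the task for another IP.
        node.busy = False
        task.banned_ips.add(node.ip)
        task.status = "queued"
        queue.append(task)
        return "migrated"
    task.retries += 1
    if task.retries > max_retries:
        # Step six: give up on this task and free the node for the next one.
        node.busy = False
        task.status = "failed"
        return "failed"
    task.status = "retrying"  # the same node retries the download
    return "retry"
```

For example, with two nodes and three queued tasks, assign places the first two tasks and leaves the third waiting; calling on_failure with ip_banned=True on one of them puts that task back at the tail of the queue, and the next assign call will only give it to a node with a different IP.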

Example two

FIG. 2 is a block diagram of a distributed network downloading apparatus for massive large files according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes:

A download module 20, used for submitting the download address at the control node.

A monitoring module 22, used for monitoring the task state according to the download address.

A multithreading module 24, used for performing multithreaded downloading according to the task state to obtain downloaded data.

A judging module 26, used for judging whether the downloaded data needs to be downloaded again.

Optionally, the task state includes: busy condition, idle condition.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A distributed network downloading method for massive large files is characterized by comprising the following steps:

submitting a download address at the control node;

monitoring the task state according to the download address;

according to the task state, carrying out multithread downloading to obtain downloading data;

and judging whether the downloaded data needs to be downloaded repeatedly.

2. The method of claim 1, wherein after said monitoring task status according to said download address, said method further comprises:

acquiring the busy condition according to the task state.

3. The method of claim 1, wherein the task state comprises: busy condition, idle condition.

4. The method of claim 1, wherein after said determining whether said downloaded data requires repeated downloading, said method further comprises:

returning to and repeating the step of monitoring the task state according to the download address.

5. A distributed network downloading device for massive large files is characterized by comprising:

the download module is used for submitting a download address at the control node;

the monitoring module is used for monitoring the task state according to the download address;

the multithreading module is used for carrying out multithreading downloading according to the task state to obtain downloading data;

and the judging module is used for judging whether the downloaded data needs to be repeatedly downloaded.

6. The apparatus of claim 5, wherein the apparatus further comprises:

and acquiring the busy condition according to the task state.

7. The apparatus of claim 5, wherein the task state comprises: busy condition, idle condition.

8. The apparatus of claim 5, further comprising:

and the repeating module is used for repeating the task state monitoring according to the download address.

9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the non-volatile storage medium is located to perform the method of any one of claims 1 to 4.

10. An electronic device comprising a processor and a memory; the memory has stored therein computer readable instructions for execution by the processor, wherein the computer readable instructions when executed perform the method of any one of claims 1 to 4.