CN115086304A

CN115086304A - Multi-source distributed downloading system based on FTP protocol

Info

Publication number: CN115086304A
Application number: CN202210806656.7A
Authority: CN
Inventors: 陈旭辉; 王遂缠; 张鸿; 刘洋; 高鹏; 徐娟; 孔小怡; 陈晓峰; 许竹霞; 黄芳芳; 付杰; 张春燕; 王旭东
Original assignee: Gansu Meteorological Information And Technology Equipment Support Center
Current assignee: Gansu Meteorological Information And Technology Equipment Support Center
Priority date: 2022-07-08
Filing date: 2022-07-08
Publication date: 2022-09-20
Anticipated expiration: 2042-07-08
Also published as: CN115086304B

Abstract

The invention discloses a multi-source distributed downloading system based on the FTP protocol. First, it proposes an MP2MS (multi-peer to multi-server) downloading technique and, building on it with decentralized distributed scheduling, provides multi-source distributed downloading, in which a plurality of download clients simultaneously download files of the same type from a plurality of heterogeneous FTP servers, improving FTP download performance and reliability. Second, it provides timed downloading, delayed downloading, file name time variable replacement and rule-based renaming, which improve the compatibility and degree of automation of the download system. Third, its distribution technique can push downloaded files to a plurality of servers in real time, meeting the service requirement of actively pushing files to downstream applications. Finally, multi-source distributed download software was developed in bash; its application in a test environment and in an actual production environment shows that multi-source distributed downloading is characterized by high download speed and high reliability.

Description

Multi-source distributed downloading system based on FTP protocol

Technical Field

The invention belongs to the technical field of distributed downloading, and particularly relates to a multi-source distributed downloading system based on an FTP protocol.

Background

Downloading refers to obtaining data from other computers over a network. Three downloading technologies are in common use: P2S, P2P and P2SP. P2P and P2SP offer high download performance, but their security is poor and they suffer from problems such as hotlinking and copyright infringement, so they are seldom used in professional data sharing. P2S adopts the C/S mode and uses the FTP or HTTP protocol. FTP is the earliest file transfer protocol in use; it transmits data reliably, provides user permission management, can encrypt data and offers high security, and it is widely used in professional data sharing. However, P2S also has disadvantages. A client can connect to only one server, and performance is constrained by factors such as network bandwidth, server speed limits, server connection limits and the performance of the server and client computers, so the room for improvement is limited. Moreover, with only one client and one server there is a single point of failure, and reliability is low.

In order to improve download performance, CN10574400B and software such as Thunder and Flash (Ma Xiao, Chayanna, Lijia. Analysis of network downloading technology based on P2SP [J]. Computer Technology and Development, 2014, (6): 187-) support downloading from multiple servers, while patents such as CN11384000A and CN112199442A realize distributed downloading, in which multiple clients can download the same file from 1 server. Although these technologies can improve download performance, the former supports only multiple servers and not multiple clients, requires a special FTP server and an additional resource index server, and has poor compatibility; the latter supports only multiple clients and not multiple servers, and the multiple clients can only download the same file at the same time. Download reliability and performance therefore remain limited.

Disclosure of Invention

In order to solve the problems, the invention provides an MP2MS (Multi-Peer to Multi-Server) downloading technology, and provides a Multi-source distributed downloading system based on an FTP protocol based on the downloading technology, so that a plurality of clients can download files from a plurality of servers in a distributed manner, thereby greatly improving the downloading performance, eliminating single-point faults in the traditional downloading technology and improving the downloading reliability.

The technical solution for realizing the invention is as follows:

A multi-source distributed download system based on the FTP protocol, characterized by comprising a download task management module, a download scheduling module, a file renaming module, a file download module, a file distribution module and a download statistics module;

the download task management module is used for configuration management of download tasks and synchronization of the tasks within the cluster;

the download scheduling module is used for scheduling, through the distributed scheduling subsystem, a plurality of files on the remote servers to different clients for distributed downloading;

the file renaming module is used for renaming source files according to renaming rules;

the file download module is used for calling download software to download files;

the file distribution module is used for secondary distribution of downloaded files: it pushes the downloaded files to other servers or production systems that need them, and supports pushing to a plurality of servers at the same time;

and the download statistics module comprises a download task statistics submodule and a time period statistics submodule, which count the number of files downloaded by each download client and the download volume by download task name and by time period, respectively.

Furthermore, the download tasks are configuration files for storing download information of the same type of files, and are download objects of the multi-source distributed download system, and each download task comprises server information, download parameters and a download queue;

server information including server URL aliases and time zones;

downloading parameters, including downloading command, number of parallel downloading files, downloading mode (bin or text), connection mode (active or passive), maximum number of single file downloading threads;

the download queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed download threshold and a distribution path; a plurality of download queues can be configured in one download task, with one queue per line, and if a file path or file name contains a specific date, a time variable is used in its place.

Furthermore, a time variable is a special character string representing a time format; 33 variables represent 33 time formats. Time variables stand in for specific times in the source file path, file name matching template, renaming rule and distribution path of a task configuration file, and after the download system starts, the time variable replacement module replaces each time variable with the actual time.

Furthermore, the download task management module comprises a task renaming sub-module, a task cloning sub-module and a task synchronization sub-module;

the task renaming submodule is used for renaming the configuration file of the download task, and modifying other configuration or log files related to the configuration file of the download task after renaming, wherein the modification comprises modifying a crontab plan task, modifying the name of the task in the download log file (ensuring the continuity of download statistical results), and modifying the name of a daily download information summary file;

the task cloning submodule is used for generating a new configuration file from an existing task configuration file: if servers A and B are homogeneous (the directory and file name of the same file are identical on both servers), the task configuration for downloading files from server B is quickly generated from the task configuration for downloading files from server A and synchronized to the other hosts in the download cluster;

and the task synchronization submodule is used for quickly creating a downloading cluster, firstly generating a planned task of another host according to the planned task of the current host, setting different starting delays aiming at different tasks, staggering the starting time of a downloading program, and secondly synchronizing a downloading task configuration file which takes effect on the local host to the other downloading host.

Furthermore, the remote multi-source servers may be homogeneous or heterogeneous, wherein homogeneous means that the directory and file name of the same file are identical on a plurality of servers, and heterogeneous means that the path and file name of the same file differ between servers; support for both is realized by the file renaming module, which comprises a prefix add/remove submodule, a suffix add/remove submodule, a case conversion submodule and a unique file name matching submodule;

the prefix add/remove submodule is used for adding a prefix to, or removing a prefix from, the source file name to generate the local file name;

the suffix add/remove submodule is used for adding a suffix to, or removing a suffix from, the source file name to generate the local file name;

the case conversion submodule is used for converting the case of the English letters in the source file name to generate the local file name;

and the unique file name matching submodule is used for mapping a plurality of different source file names to 1 local file name according to the unique keyword in the file names.

Furthermore, the file downloading module supports timed downloading, expired file downloading, delayed downloading and parallel downloading;

timed downloading: each download task is managed by Linux crontab, which starts the download software at scheduled times; a start time window is set for each download task according to the file download requirements, and the window has two attributes, the first being the hours in which files are downloaded and the second the minutes in which files are downloaded, both written in the Linux crontab time format;

expired file downloading: if a file was not uploaded to the server on time, or the download software was not started, so that a download task missed its download time window, the download command can be executed manually with a roll-back time set;

delayed downloading: if the delayed download threshold of the download task is greater than 0, the difference between the download time and the source file creation time (after time zone conversion) is calculated, and the file is downloaded only if the difference is greater than the delayed download threshold and less than 10 days; if two clients download data from two servers at the same time, the server with the smaller delayed download threshold is downloaded from first, so the threshold acts as the server priority, a smaller value meaning a higher priority, and the value is set in the download queue of the download task configuration file;

parallel downloading of multiple files: multiple files in one queue can be downloaded in parallel; the download scheduling module creates a background parallel download pool for each download queue, waits when the pool is full and otherwise adds a file to the pool; the pool size is configurable and defaults to 2;

single file downloading supports resumable transfer, multithreading and the addition of a temporary file suffix.

Furthermore, the distribution module comprises a real-time distribution sub-module and a failed file retransmission sub-module;

the real-time distribution submodule is used for pushing downloaded files to other servers in real time; pushing is asynchronous and can target a plurality of servers in parallel; pushing supports renaming, so the target file can take either the source file name or the renamed file name; a temporary file suffix is added automatically during pushing, and a distribution failure log is recorded when a push fails;

the failed file retransmission submodule is used for redistributing failed files according to the distribution failure log; it ignores expired files, in that files whose failure occurred more than 48 hours ago are not retransmitted, and it short-circuits transmission, in that when retransmission of one file fails, no transmission is attempted for the other files of that data type.
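The retransmission rules above (skip failures older than 48 hours, and stop sending a data type once one of its files fails again) can be sketched as follows. This is a minimal illustration in Python, not the bash of the actual software; the function name, the tuple layout of a failure log entry and the send callback are assumptions made for the example.

```python
from datetime import datetime, timedelta

# From the text: failures older than 48 hours are not retransmitted.
MAX_FAILURE_AGE = timedelta(hours=48)


def resend_failed(entries, send, now):
    """Retry failed pushes.

    entries: list of (failed_at, data_type, path) tuples (hypothetical log layout).
    send:    callable taking a path and returning True on success.
    Returns the list of paths that were resent successfully.
    """
    resent = []
    blocked_types = set()  # data types whose retransmission already failed
    for failed_at, data_type, path in entries:
        if now - failed_at > MAX_FAILURE_AGE:
            continue  # expired file: ignore
        if data_type in blocked_types:
            continue  # short circuit: skip the rest of this data type
        if send(path):
            resent.append(path)
        else:
            blocked_types.add(data_type)
    return resent
```

The short circuit avoids repeated timeouts against a target server that is still unreachable, while other data types continue to be retried.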

Furthermore, the distributed scheduling subsystem adopts a decentralized multi-task scheduling technology, each downloading host runs the same scheduling program, the downloading scheduling is not influenced by the fault of a single host, and the distributed scheduling is realized through a distributed file lock.

Furthermore, the distributed file lock is realized by competing for the first line of a lock file, and the operations on the distributed file lock comprise creation, state detection, destruction and updating;

lock creation is carried out by competing for the first line of the lock file: when an applicant detects that the lock file does not exist, it writes the information of its node to the lock file as 1 line in append mode, reads the first line of the lock file immediately after the write completes and compares it with the information it wrote; if the two are the same, locking succeeds, otherwise the applicant withdraws from the competition for the lock file;

lock state detection is used for checking the state of an existing lock after the applicant detects that the lock exists; if the lock is abnormal, the lock update operation is executed;

lock destruction is used for automatically deleting the download lock of a file after the file has been downloaded; for automatically clearing all download locks created by the process when a process termination signal is caught while the download program is running; and, when the download program starts, for forcibly stopping the existing download process and clearing all locks created by that process;

and lock updating is used for deleting an orphan lock and recreating the lock, and, for a timed-out lock, for remotely ending the locking process on the locking node, clearing the lock and recreating it on the current node.

Furthermore, the download software has a self-healing function: download tasks are started periodically, and each time the download program starts it checks whether a process for the download task already exists; if so, the existing process is forcibly ended and the program restarted, otherwise the program is started directly. This mechanism gives the multi-source distributed download system its self-healing ability, so that faults such as hung or frozen processes are avoided.

Compared with the prior art, the invention has the following beneficial effects:

firstly, the MP2MS downloading technology provided by the invention integrates multi-source downloading and distributed downloading technologies, realizes that a plurality of clients download data from a plurality of servers at the same time, changes a single downloading client into a downloading cluster, solves the problem of single-point failure and performance bottleneck of the traditional P2S downloading system, can greatly improve the downloading performance and reliability, increases the downloading performance of a C/S downloading channel by more than 30 percent, and increases the downloading reliability of 1 server or client by 1 time;

secondly, the multi-source download system designed by the invention supports a plurality of heterogeneous download source servers and is compatible with all FTP servers, with no software to install and no configuration to perform on the servers; different download priorities can be set for different data source servers, and the priorities control the order in which files are downloaded from the multiple servers;

thirdly, the invention provides a file renaming technology based on rules, the downloaded file can be renamed according to the rules, the file with the file name dynamically changing along with time can be automatically downloaded, the automatic operation degree of the downloading system is improved through the time variable replacement of the file name and the renaming according to the rules, and the docking of the downloading system and other service systems is facilitated;

fourthly, the invention provides a file pushing service through a secondary distribution module, and can push the downloaded files to a plurality of FTP servers;

in conclusion, the invention has better application prospect, in the single-server occasion, the downloading speed can be improved through a plurality of downloading clients, the single-point fault of the client can be eliminated, in the multi-server occasion (with a plurality of mirror images or copies), the downloading service capability can be improved through a plurality of servers, and a new copy server can be quickly established by utilizing a plurality of servers.

Drawings

FIGS. 1(a) - (d) are schematic diagrams of download modes of P2S, P2MS, MP2S and MP2MS, respectively;

FIG. 2 is an architecture diagram of a multi-source distributed download system according to the present invention;

FIG. 3 is a topology diagram of a multi-source distributed download system according to the present invention;

FIG. 4 is a flowchart of a file download process of the multi-source distributed download system according to the present invention;

FIG. 5 is a flowchart of a distributed lock file maintenance process according to the present invention;

FIG. 6 is a configuration example of a download task (EC data download task configuration file) according to the present invention;

FIG. 7 is a diagram of part of the scheduled tasks on download cluster node01 in actual service;

FIG. 8 is a diagram of part of the scheduled tasks on download cluster node02 in actual service;

FIG. 9 is a diagram of the number of files of JMA_GMS_12 data downloaded in a download cluster composed of 3 clients in actual service.

Detailed Description

In order to make those skilled in the art better understand the technical solution of the present invention, the following further describes the technical solution of the present invention with reference to the drawings and the embodiments.

One, background on FTP download techniques

According to how FTP clients connect to servers, downloading is divided into four modes: P2S, P2MS (peer to multi-server), MP2S (multi-peer to server) and MP2MS (multi-peer to multi-server), as shown in FIG. 1.

(1) P2S downloads: as shown in fig. 1(a), P2S is a standard FTP download mode, one client can only download files from one server, and although the download speed can be increased by multithreading, the download speed is limited by factors such as client, server performance, network bandwidth, and server speed limit, the increase range is limited, and there is a single point of failure, and the reliability is low.

(2) P2MS downloads: as shown in fig. 1(b), P2MS is a multi-source download, where 1 client can download the same or similar files from multiple servers, the download speed is high, the servers have redundancy, but only 1 client has a single point of failure.

(3) MP2S downloads: as shown in FIG. 1(c), MP2S is distributed downloading, in which multiple clients can download the same or similar files from 1 server. Although there are multiple clients, there is only 1 server, so a performance bottleneck and a single point of failure remain. Current FTP software does not support this download mode.

(4) MP2MS downloads: as shown in FIG. 1(d), the invention refers to the technology in which multiple clients simultaneously connect to multiple servers to download files as MP2MS. In MP2MS downloading, if m clients and n servers participate, n × m independent download channels can theoretically be created, and the theoretical download speed is n × m times that of the traditional download method. The performance of this mode can be scaled dynamically as needed, and both clients and servers are redundant, so reliability is high. It is the most ideal FTP download technology and can greatly improve download performance and reliability in a multi-server environment. No existing download software supports MP2MS.
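The n × m channel count can be illustrated with a short Python sketch that enumerates every client and server pairing. The node and server names are hypothetical, and the n × m speed-up remains the theoretical upper bound stated above, not a measured result.

```python
from itertools import product


def download_channels(clients, servers):
    """Return every (client, server) pair, i.e. every independent C/S channel."""
    return list(product(clients, servers))


# Hypothetical cluster: m = 3 clients, n = 2 servers.
channels = download_channels(["node01", "node02", "node03"], ["ftpA", "ftpB"])
print(len(channels))  # 3 clients x 2 servers
```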

Two, the multi-source distributed download system provided by the invention

The invention is based on the proposed MP2MS technology, and develops a multi-source distributed download system (MSDD for short) which is composed of six modules of download task management, download scheduling, file renaming, file downloading, file distribution and download statistics, and the specific structure is shown in fig. 2-3.

The task management module is used for being responsible for configuration management of the downloading tasks and synchronization of the tasks in the cluster;

the distribution module is used for secondary distribution of the downloaded file, is responsible for pushing the downloaded file to other servers or production systems needing the file, and supports pushing to a plurality of servers at the same time;

and the download counting module is used for counting the number of the downloaded files and the download amount according to the download task type and the time period.

Three, key designs of the invention

(1) Download task design

MSDD packages the download information for files of the same type into a download task and treats the download task as the download object of the system; starting one download task downloads a batch of files. The information encapsulated in a download task falls into 3 types: server information, download parameters and download queues. FIG. 6 is an example of a download task for EC data.

The server information comprises a server URL alias and a time zone. The server URL alias is a short name for the server URL and is configured in URL_alias.

If the protocol is FTP, the server is an FTP server; if the protocol is file, the server is the local machine. The time zone is the time zone used by the server; the delayed download module uses this value to calculate the creation time of files on the server.

The download parameters comprise the download command, the number of files downloaded in parallel, the download mode (bin or text), the connection mode (active or passive) and the maximum number of threads for a single file download. The download command supports 5 commands: cp, ftp, lftp, wget and curl, where cp indicates that the server is the local machine and files are copied with the cp command;

the download queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed download threshold and a distribution path, and one download task can contain a plurality of queues. The file name matching template consists of the characters, digits, wildcards, { }, [ ] and so on supported by the Linux ls command and is used to retrieve a class of files. The file name unique identifier consists of a field separator and a field number; for example, if the source file name is split by the field separator and the 5th field is the string that uniquely identifies the file, the field number is 5. The renaming rule sets how a source file is renamed after it is downloaded: {} means no renaming, {UPPER} converts all letters in the source file name to uppercase, {lower} converts all letters to lowercase, +prefix{} adds the prefix "prefix" in front of the source file name, {}+suffix appends the suffix "suffix" to the source file name, and {}-suffix removes the suffix "suffix" from the end of the source file name. The delayed download threshold configures delayed downloading for the files in the queue and is written as Tn or n, where n is the delay in minutes and 0 means delayed downloading is not enabled. The distribution path specifies the servers to which downloaded files are to be pushed, in the form host URL alias:path, with multiple servers separated by spaces; host URL aliases are configured in URL_alias.

(2) Distributed download scheduling design

Distributed scheduling is responsible for coordinating a plurality of download nodes so that they download files of the same type from a plurality of servers in a balanced manner, with no file downloaded twice and none omitted. Distributed task scheduling is generally implemented with message middleware, and scheduling can be centralized or decentralized. The system adopts decentralized distributed task scheduling: the nodes of the download cluster have no master/slave distinction, and each node runs the same scheduling program.

The core of distributed scheduling is the distributed lock, which is generally implemented on top of middleware such as message queues, databases, caches or files, for example message middleware, Redis cache and hash time locks. The system provides a distributed file lock based on shared storage; it is implemented in bash, needs no third-party software and is simple to implement. The operations on the lock include lock creation, lock state detection, lock destruction and lock updating, and the lock file maintenance flow is shown in FIG. 5.

Lock creation: an ordinary lock method can create a file lock on a single Linux computer, but testing showed that it cannot lock a file across a plurality of computers using shared storage; the file lock created is not unique, and repeated locking occurs. The system therefore implements the file lock by competing for the first line of the lock file, as follows: when an applicant detects that the lock file does not exist, it writes the information of its node to the lock file in append mode (1 line only), reads the first line of the lock file immediately after the write completes and compares it with what it wrote; if the two are the same, locking succeeds, otherwise the applicant withdraws from the competition for the lock file.
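The first-line competition can be sketched as follows. This is a minimal Python illustration, not the bash of the actual software; the function names and the "node:pid" form of the node information are assumptions, and the sketch does not itself guarantee how appends from different hosts are ordered on a given shared file system.

```python
import os


def claim_first_line(lock_path, node_info):
    """Append our node info, then check whether it became the first line."""
    with open(lock_path, "a", encoding="utf-8") as f:
        f.write(node_info + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the line to storage before reading back
    with open(lock_path, "r", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    return first_line == node_info


def try_acquire(lock_path, node_info):
    """Return True if this node obtained the lock, False if it must back off."""
    if os.path.exists(lock_path):
        return False  # lock already exists; lock state detection applies instead
    return claim_first_line(lock_path, node_info)
```

If two nodes both see that the lock file is missing and both append, only the node whose line landed first reads back its own information, so only one of them proceeds.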

Lock state detection: if the applicant detects that the lock already exists, it checks the state of the lock and, if the lock is abnormal, executes the lock update operation. There are two types of abnormal lock, the timed-out lock and the orphan lock. If the time since the lock file was created exceeds the life cycle of the lock, the lock is a timed-out lock. If the locking process of a lock no longer exists, the lock is an orphan lock, that is, a dead lock left behind by an abnormal download node.

Lock destruction: a lock is destroyed in the following cases: (1) after a file has been downloaded, the download lock of that file is deleted automatically; (2) when a process termination signal is caught while the download program is running, all download locks created by the process are cleared automatically; (3) when the download program starts, the existing download process is forcibly stopped and all locks created by that process are cleared.
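Case (2) can be sketched as a signal handler that removes every lock recorded for the process. This is a Python illustration under stated assumptions, not the bash trap of the actual software: the function names are hypothetical, and the lock queue file is assumed to hold one lock path per line. SIGKILL is not registered because it cannot be caught by a process.

```python
import os
import signal
import sys


def release_locks(lock_queue_file):
    """Delete every lock listed in the queue file, then empty the queue file."""
    removed = []
    if os.path.exists(lock_queue_file):
        with open(lock_queue_file, encoding="utf-8") as f:
            lock_paths = [line.strip() for line in f if line.strip()]
        for path in lock_paths:
            if os.path.exists(path):
                os.remove(path)
                removed.append(path)
        open(lock_queue_file, "w").close()  # truncate the queue file
    return removed


def install_traps(lock_queue_file):
    """Clean up locks and exit when a termination signal arrives (POSIX only)."""
    def handler(signum, frame):
        release_locks(lock_queue_file)
        sys.exit(1)

    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGHUP):
        signal.signal(sig, handler)
```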

Lock updating: an orphan lock is deleted directly and then recreated. For a timed-out lock, the locking process on the locking node is ended remotely over ssh, the lock is cleared, and the lock is then recreated on the current node.
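The decision logic of lock state detection and lock updating can be sketched as below. This is a Python illustration only: the function names and step labels are hypothetical, and checking for an orphan lock before a timed-out lock is an assumption, since the text does not state which check comes first.

```python
import time


def classify_lock(created_at, lifetime_s, owner_alive, now=None):
    """Classify an existing lock as 'orphan', 'timeout' or 'valid'."""
    now = time.time() if now is None else now
    if not owner_alive:
        return "orphan"   # the locking process no longer exists
    if now - created_at > lifetime_s:
        return "timeout"  # the lock has outlived its life cycle
    return "valid"


def update_steps(state):
    """List the update actions for an abnormal lock; a valid lock needs none."""
    if state == "orphan":
        return ["delete lock", "create lock"]
    if state == "timeout":
        return ["end locking process via ssh", "delete lock", "create lock"]
    return []
```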

(3) In-queue multi-file parallel download

The scheduling system creates a background parallel download pool for each queue; when the pool is full it waits, and otherwise it adds a file to the pool. The download pool size is configurable, with a default value of 2.
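A bounded download pool of this kind can be sketched with Python's standard thread pool, which stands in here for the background jobs of the bash implementation. The function name and the download_one callback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def download_queue(files, download_one, pool_size=2):
    """Download the files of one queue with at most pool_size running at once.

    The executor holds extra files back until a worker is free, which
    corresponds to the waiting state described above.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(download_one, files))
```

Results come back in the order of the input list, regardless of which download finishes first.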

(4) Filename time variable replacement

Time is one of the important attributes of meteorological data, and the file names of all the data contain date and time information, which complicates automatic downloading. So that the download software can run automatically, time variables are used in place of specific times in file names in the download task configuration file, and they are replaced with the appropriate time when the software downloads. The supported time variables are formed by combining the basic variables yyyy (four-digit year), yy (two-digit year), mm (month), dd (day) and hh (hour), 33 variables in total. A variable begins with "#" for the previous day, "$" for the current day or "%" for the next day; for example, $yyyymmddhh represents the year, month, day and hour of the current day.
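Time variable replacement can be sketched as follows. This Python illustration assumes that a variable is one of the prefixes "#", "$" or "%" followed by any run of the basic elements; the full list of 33 supported variables is not given in the text, so the sketch accepts any such combination. The function name is hypothetical.

```python
import re
from datetime import datetime, timedelta

_DAY_OFFSET = {"#": -1, "$": 0, "%": 1}  # previous day, current day, next day
_FORMAT = {"yyyy": "%Y", "yy": "%y", "mm": "%m", "dd": "%d", "hh": "%H"}
_VARIABLE = re.compile(r"([#$%])((?:yyyy|yy|mm|dd|hh)+)")
_ELEMENT = re.compile(r"yyyy|yy|mm|dd|hh")


def expand_time_variables(text, now):
    """Replace every time variable in text with the time derived from now."""
    def replace(match):
        when = now + timedelta(days=_DAY_OFFSET[match.group(1)])
        elements = _ELEMENT.findall(match.group(2))
        return "".join(when.strftime(_FORMAT[e]) for e in elements)

    return _VARIABLE.sub(replace, text)
```

For example, with a current time of 2021-01-01 00:00, the template "gfs_$yyyymmddhh.bin" expands to "gfs_2021010100.bin".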

(5) Rule-based file renaming

A file renaming rule scheme was designed to meet the renaming requirements of meteorological data, so that files can be renamed as they are downloaded. There are four types of renaming rule: case conversion, prefix addition and removal, suffix addition and removal, and unique file name matching, and several rules can be combined. The first three are simple conversions of the source file name. Unique file name matching is a special rule for the case where the same file is named differently on different servers: a plurality of different file names are mapped to one file through the unique keyword in the file names.
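The three simple rule types can be sketched as below, using the rule notation of the download queue ({}, {UPPER}, {lower}, +prefix{}, {}+suffix, {}-suffix). This is a Python illustration with hypothetical function names; the form -prefix{} for removing a prefix is an assumption made by symmetry, since the text does not show it, and combined rules are assumed to be applied in order.

```python
def apply_rename_rule(name, rule):
    """Apply one renaming rule to a source file name."""
    if rule == "{}":
        return name                      # no renaming
    if rule == "{UPPER}":
        return name.upper()
    if rule == "{lower}":
        return name.lower()
    if rule.startswith("+") and rule.endswith("{}"):
        return rule[1:-2] + name         # add prefix
    if rule.startswith("-") and rule.endswith("{}"):
        prefix = rule[1:-2]              # remove prefix (assumed notation)
        return name[len(prefix):] if prefix and name.startswith(prefix) else name
    if rule.startswith("{}+"):
        return name + rule[3:]           # add suffix
    if rule.startswith("{}-"):
        suffix = rule[3:]                # remove suffix
        return name[:-len(suffix)] if suffix and name.endswith(suffix) else name
    raise ValueError("unsupported renaming rule: " + rule)


def apply_rename_rules(name, rules):
    """Apply several rules one after another."""
    for rule in rules:
        name = apply_rename_rule(name, rule)
    return name
```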

Taking the NCEP weather data of the United States as an example (see Table 1), the file name of the data on the NCEP server is gfs.t00z.pgrb2.1p00.f012; after downloading it, the national weather information center renames it to W_NAFP_C_KWBC_20210101000000_P_gfs.t00z.pgrb2.1p00.f012.bin, adding a data description, a timestamp and a suffix. In actual use, the task that downloads the data from the U.S. server sets a renaming rule that adds the prefix "W_NAFP_C_KWBC_$yyyymmddhh0000_" and the suffix "bin", and the task that downloads the data from the Beijing server sets a unique file name matching rule of "_7", which means that the file name is split using "_" as the separator and the content of the 7th field is the unique key of the file name. A local file matching template is generated from the file matching template and the unique key and used to search for the file; if the file already exists, it does not need to be downloaded.

Table 1 Naming of NCEP data on different servers

(6) Delayed downloading

In a multi-server download environment, different download priorities can be set for different servers: files are downloaded first from the server with the higher priority, and automatically from the other servers if that server is unavailable. The priority is set through the download delay threshold in the download task. If the download delay is n minutes, only files generated within [n minutes, 10 days) before the download time are downloaded; the 10-day upper bound prevents outdated weather data files from being downloaded.
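
The delay window can be expressed as a single comparison. The Python sketch below is illustrative only; the function name is_eligible and its parameters are assumptions.

```python
from datetime import datetime, timedelta

# Upper bound of the window, as stated in the text.
MAX_AGE = timedelta(days=10)


def is_eligible(file_time: datetime, download_time: datetime, delay_minutes: int) -> bool:
    """True if the file's age lies in the half-open window [delay, 10 days)."""
    age = download_time - file_time
    return timedelta(minutes=delay_minutes) <= age < MAX_AGE


if __name__ == "__main__":
    now = datetime(2021, 1, 1, 12, 0)
    fresh = datetime(2021, 1, 1, 11, 50)   # generated 10 minutes ago
    print(is_eligible(fresh, now, 0))      # a task with delay 0 may fetch it
    print(is_eligible(fresh, now, 30))     # a task with delay 30 must wait
```

A task with a delay of 0 sees a new file immediately, whereas a task with a delay of 30 only considers it once it is 30 minutes old. If the first server has delivered the file by then, the second task finds it locally and skips it, which is how the delay acts as a priority.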

(7) Self-healing function

MSDD uses Linux crontab to start download tasks periodically and automatically. Each time the download program starts, if a download process for the same task already exists, that process is forcibly terminated and the task is restarted. This mechanism prevents hung processes, and it also retrieves the list of files to be downloaded from the server again according to the current time, so that new data is downloaded first.
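
The check for an earlier run of the same task can be sketched as below. This is an illustrative Python sketch, not the patent's bash code. The function names, the (pid, command line) input format, and the program and task names in the example are hypothetical.

```python
import os
import signal


def find_stale_pids(processes, program, task, own_pid):
    """Return PIDs of earlier runs of `program` for the same `task`.

    `processes` is an iterable of (pid, command_line) pairs, for example
    parsed from the output of `ps`. The current process is excluded.
    """
    stale = []
    for pid, command_line in processes:
        tokens = command_line.split()
        names = {os.path.basename(token) for token in tokens}
        if pid != own_pid and program in names and task in tokens:
            stale.append(pid)
    return stale


def kill_stale(pids):
    """Forcibly end each process (the equivalent of `kill -9`)."""
    for pid in pids:
        os.kill(pid, signal.SIGKILL)


if __name__ == "__main__":
    # Hypothetical process table; names are for illustration only.
    table = [
        (101, "bash /opt/msdd/download.sh bj_gfs_00"),
        (102, "bash /opt/msdd/download.sh us_gfs_00"),
        (103, "bash /opt/msdd/download.sh bj_gfs_00"),
    ]
    print(find_stale_pids(table, "download.sh", "bj_gfs_00", own_pid=103))
```

Matching on the program name and the task name together, while excluding the current process ID, ensures that only the previous run of the same task is ended.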

Examples

I. Download example

Based on the system provided by the invention, distributed file downloading is implemented by the following steps:

Step 1: Configure the download task configuration file

The task configuration file is named "host alias_data abbreviation[_data time].INI". The data time is the hour, in universal time, at which downloading of the data starts; it may be omitted for data that has no fixed time or is downloaded frequently. The data abbreviation must not contain "_". The content of the configuration file is as described above;

Step 2: Start the download program and read the input parameters

There are 4 parameters: the download task, the download back time, whether to force downloading, and whether to run in test mode. The first parameter is required and the others are optional, with default values of 0, not forced and not test, respectively;

Step 3: Set a trap. When an INT, KILL, TERM or HUP signal is detected, delete the locks created by this process according to the content of the download task's lock queue file in the lck directory (all locks created by the process are recorded in that file), empty the lock queue file, and then exit;

Step 4: Detect whether the download task is already running, using three keys: the download program name, the download task name and the process ID of the current process. If it is running, forcibly end the existing process with kill -9;

Step 5: Generate the download time

If a download back time is given, the download time is the current time minus the download back time. Otherwise the download time is determined from the time parameter in the task configuration file name and the current time: if the hour of the current time is earlier than the data time, the download time is set to 23:59 of the previous day; otherwise the download time is the current system time;

Step 6: Read the download task configuration file, parse the variables in it, and assign default values to any parameters that are not set;

Step 7: Start a loop: take one download queue, parse it, and store its fields in variables;

Step 8: Taking the download time obtained in step 5 as the relative current time, replace the time variables in the download queue using two time points, the current time and 1 hour earlier, to obtain two sets of file path and file name information. If the two sets are identical, only the data for the current time is downloaded; otherwise the data for the previous hour and for the current time are downloaded in turn, so that files from the previous hour are not missed when an hour or date boundary is crossed;

Step 9: Start an inner loop that downloads the files for a single time point;

Step 10: Use curl to test whether the download server is available; if it is not, write a log entry and go to step 7;

Step 11: Search the remote server for files that match the file matching template, return their names, sizes and times, store them in the list directory, and count the matching files;

Step 12: Apply the case conversion, prefix addition or removal, suffix addition or removal, and unique-identifier renaming rules, in that order, to the file matching template to generate a local file search template;

Step 13: Count the local files that match;

Step 14: Write the remote file directory, the file name template, the number of remote files and the number of local files to the daily download information summary file in the last directory;

Step 15: From the file counts in the daily download information summary files of the last 5 days, judge automatically whether the data has been downloaded completely; if so, write a log entry and go to step 7;

Step 16: Write the numbers of remote and local files found in this run to the log file;

Step 17: Start a new loop that processes the returned files one by one;

Step 18: Apply the case conversion, prefix addition or removal, suffix addition or removal, and unique-identifier renaming rules to the remote file name to generate the local file name and the download lock file name;

Step 19: If the local file exists and has the same size as the remote file, no download is needed; return to step 17;

Step 20: If the "test" parameter was given, output the parsed configuration file and the remote and local file information, generate a test log file, and go to step 17. This function only tests whether the task configuration file is correct and whether the server is working normally; no download is performed;

Step 21: If the download lock file for the file does not exist, go to step 24;

Step 22: Judge from the lock file's creation time whether the lock has timed out; if it has not, no download is needed; return to step 17;

Step 23: Delete the timed-out lock

Read the content of the lock file and log in to the download host over ssh, using the download host IP, download process and download task name recorded in it, to check whether the corresponding download process exists. If it does, the lock is a timed-out lock and the process is killed directly; otherwise the lock is an orphan lock and the lock file is deleted directly;

Step 24: Apply for the download lock

Append a line of the form 'locking time, download host IP, download process, download task name' to the lock file. If the write fails, go to step 17. Otherwise read the first line of the lock file and compare it with the content just written; if they differ, go to step 17;

Step 25: Add the lock file name to the lock queue file in the lck directory;

Step 26: If the forced-download parameter was given, go to step 28;

Step 27: Apply time zone conversion to the source file time and calculate the difference between the download time and the source file time; if the difference does not satisfy the download delay condition, go to step 17;

Step 28: If the number of parallel downloads for the download queue has reached its upper limit, repeatedly execute sleep 1, waiting 1 second each time;

Step 29: According to the download command, call the corresponding download function in the background to perform the download;

Step 30: The download function runs asynchronously and automatically adds a tmp suffix while downloading. When the download finishes, it deletes the lock used by the function, then deletes the lock file and logs the relevant information; if the download succeeded, the distribution program is called in the background. The function also sets a trap so that the download lock is cleared if the function exits abnormally;

Step 31: File distribution

File distribution is an independent program that is called by the main program and runs asynchronously in the background;

The file distribution program takes 6 parameters: task name, server information, source file name, local file name, file size and reissue flag. The file size is optional. The program automatically detects whether lftp or ftp is installed on the local machine and prefers lftp for transfers. It automatically adds a tmp suffix while transferring a file and determines whether the transfer succeeded by comparing file sizes. A reissue flag of false indicates a first distribution; if the transfer then fails, the transfer information is recorded in the transmission failure log;

Step 32: File reissue

The file reissue program is an independent program that is started by a scheduled task and runs once every 10 minutes. After starting, it reads the failed-file information line by line from the transmission failure log file and checks whether the first transmission failure of the file is more than 48 hours old (only files that failed within the last 48 hours are resent). If it is, the entry is deleted from the failure log; otherwise the distribution program is called in reissue mode. If the reissue succeeds, the file is removed from the transmission failure log file; if it fails, the reissue failure flag for that type of data is set to true;

Step 33: Reissue short-circuit detection

When files are reissued, short-circuit detection is performed: the program checks whether the reissue failure flag of the data type is true and, if so, skips the file (if one file of a data type fails to be reissued, the server or directory has a problem, so the distribution program is not called for the remaining files of that type);

Step 34: If any source file remains unprocessed, go to step 17;

Step 35: If the current queue has an unprocessed download time, go to step 9;

Step 36: If an unprocessed download queue remains, go to step 7;

Step 37: The download process is complete. The flowchart of the download process is shown in Fig. 4.
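
The lock handling in steps 21 to 25 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the bash implementation described in the patent. The function names are hypothetical, and the staleness check reads the locking time recorded in the first line of the lock file as a stand-in for the lock file's creation time.

```python
import time


def try_acquire(lock_path, host_ip, pid, task):
    """Append our record to the lock file; we hold the lock only if our
    record is the first line (append, then read back and compare)."""
    record = f"{int(time.time())} {host_ip} {pid} {task}\n"
    try:
        with open(lock_path, "a") as lock_file:
            lock_file.write(record)
    except OSError:
        return False  # write failed: treat the file as locked by someone else
    with open(lock_path) as lock_file:
        return lock_file.readline() == record


def lock_is_stale(lock_path, timeout_seconds, now=None):
    """True if the first record in the lock file is older than the timeout."""
    now = time.time() if now is None else now
    with open(lock_path) as lock_file:
        locked_at = int(lock_file.readline().split()[0])
    return now - locked_at > timeout_seconds
```

Because every client appends and then compares the first line, no central coordinator is needed: whichever record lands first on the shared storage wins, and later applicants see a different first line and move on to the next file.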

Step 38: Task cloning is an independent program that quickly generates a new task configuration file from an existing one. Its first parameter is the name of the existing download task and its second parameter is the URL alias of the new download host. The program first reads the existing configuration file, changes the 'host-old_host' item to 'host-new_host', and generates a new download task configuration file that follows the naming specification of step 1;

Step 39: Task synchronization is an independent program for quickly creating a download cluster. First, it reads the timed download tasks loaded in the crontab of the current download host (see Fig. 7) and adds a sleep n command in front of each download command, where n is 'period / number of cluster nodes + m' for tasks whose period is in minutes and 60 + m for other tasks, and m is a random number between 0 and 30. The purpose is to stagger, as far as possible, the time windows in which multiple clients access the servers and the times at which the clients start the download program: this prevents one client from starting a large number of download programs at once and prevents multiple clients from accessing a server at the same moment. The result is then merged with the original scheduled tasks on the target host (stored in crontab1.dat) to generate the crontab timed task table on the target host (see Fig. 8). Second, the enabled download task configuration files on the current host are synchronized to the target host. Finally, if there is no target host IP in the download cluster node configuration file cluster.

Step 40: Cluster download statistics is an independent program that reports the download status of the download cluster; a time parameter can be given. The program first reads the cluster.ini file and parses the host IPs of the download cluster. It then runs ssh in parallel to execute the single-node download statistics command on each host and obtain the download status of each download host, and finally merges the per-node results to obtain the download status of the whole cluster. Fig. 9 shows the number of files of two kinds of data downloaded on 3 download hosts in a certain period.

Step 41: The single-node download statistics program is an independent program. It reads the download log of the host, parses the data type, download time and file size, and classifies and summarizes them by time period and data type to obtain the number of files downloaded and the download volume under the different summary modes.

II. System test

To further illustrate the effectiveness of the present system, the following tests were performed.

1. Download performance testing

In a VMware virtualization environment, a download test environment was built from 2 Linux FTP servers (with vsftpd installed and no speed limit set) and 2 Linux download hosts, with NAS used as shared storage; the structure is shown in Fig. 1(d). Files of about 100M were copied to both servers, 2 download tasks were created on each download host, and 7 download schemes were designed. The schemes and results are shown in Table 2. The acceleration performance in the table is calculated as (time of scheme 1 - time of the scheme) × 100 / time of scheme 1; the larger the value, the less time is used and the more pronounced the acceleration.

(1) Single task download effects

Schemes 2 and 3 in Table 2 are both single-task downloads: scheme 2 downloads one file at a time serially, and scheme 3 downloads 2 files in parallel. The results show that scheme 2 takes 31% longer than traditional downloading (scheme 1), which indicates that distributed scheduling adds download overhead. Scheme 3 takes essentially the same time as scheme 1, which shows that the overhead of distributed scheduling can be offset by downloading multiple files in parallel: 2-file parallel downloading performs the same as traditional serial ftp downloading.

(2) Distributed download effects

Schemes 4-7 in Table 2 are multi-task downloads, of which schemes 5-7 are distributed downloads. The number of files downloaded by each task shows that distributed scheduling distributes the files to different download clients according to the principle of more than one type. In terms of download performance, adding a client or a server increases the download speed: performance improves by about 30% when the number of tasks increases from 1 to 2, and by about 50% when it increases from 2 to 4. The overall performance improves markedly.

(3) Reliability test

When one server or one client is shut down during downloading, the download service is not interrupted and the integrity of the files is not affected.

Table 2 Experimental data for the different download schemes

The MSDD provided by the invention has been deployed on the meteorological big data platform of Gansu Province, where it collects various meteorological data. The operating environment is as follows: there are four download sources, namely the Beijing, Tianshui, Wuwei and US NCEP servers, and the download clients are two Linux virtual servers that form a download cluster and download from the 4 servers simultaneously. Based on the data collection requirements, 101 download tasks are defined for 19 types of data. Compared with the old download system (mainly wget software and the CMAcast complement system), MSDD has the following characteristics:

(1) The degree of automation is high. For the 19 types of data, data collection services such as missing-file detection, downloading, rule-based renaming and secondary distribution are automated and need no manual intervention.

(2) The reliability is high. Traditional downloading has 2 points of failure, the server and the client. In MSDD, adding one client or one server removes 1 point of failure and doubles the reliability, so the failure of a single download client or a single FTP server does not affect downloading.

(3) The downloading efficiency is high. The old system needed about 2 hours to download 66 EC files of about 110M in size from Beijing. With multi-source distributed downloading, the files can be downloaded from Beijing, Zhangye and Tianshui simultaneously, and only 20 minutes are needed.

The MSDD system has good application prospects. In single-server settings it can increase download speed through multiple download clients and eliminate the client as a single point of failure; in multi-server settings it provides redundancy for both clients and servers while increasing download speed.

Those not described in detail in this specification are within the skill of the art. Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and modifications of the invention can be made, and equivalents of some features of the invention can be substituted, and any changes, equivalents, improvements and the like, which fall within the spirit and principle of the invention, are intended to be included within the scope of the invention.

Claims

1. A multi-source distributed download system based on FTP protocol is characterized by comprising a download task management module, a download scheduling module, a file name changing module, a file download module, a file distribution module and a download statistical module;

the download scheduling module is used for scheduling a plurality of files on the remote multi-source server to different clients for distributed downloading through the distributed scheduling subsystem;

the file distribution module is used for secondary distribution of the downloaded file, pushing the downloaded file to other servers needing the downloaded file, and supporting the pushing to a plurality of servers at the same time;

and the download counting module comprises a download task counting submodule and a time period counting submodule and is respectively used for counting the number of the downloaded files and the download amount of the downloaded files of the download client according to the download task name and the time period.

2. The multi-source distributed downloading system based on the FTP protocol as claimed in claim 1, wherein the downloading task is a configuration file for storing downloading information of the same type of file, and is a downloading object of the multi-source distributed downloading system, and each downloading task comprises server information, downloading parameters and a downloading queue;

server information including server URL aliases and time zones;

downloading parameters including downloading command, number of parallel downloading files, downloading mode, connection mode and maximum number of single file downloading threads;

the download queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed download threshold value and a distribution path, wherein a plurality of download queues can be configured in one download task, one download queue occupies one row, and time variables are used for replacing specific dates in the file path and the file name.

3. The multi-source distributed downloading system based on the FTP protocol as claimed in claim 2, wherein the time variable is a special character string representing a time format, and there are a plurality of time variables, each variable represents a time format, and is used to replace a source file path, a file name matching template, a renaming rule, and specific time in a distribution path in the task configuration file, and after the downloading system is started, the time variable replacing module is responsible for replacing the time variable with the actual time.

4. The multi-source distributed downloading system based on the FTP protocol of claim 1, wherein the downloading task management module comprises a task rename sub-module, a task clone sub-module and a task synchronization sub-module;

the task renaming submodule is used for renaming the configuration file of the downloaded task and modifying other configuration or log files related to the configuration file after renaming, wherein the configuration or log files comprise a crontab plan task, a name of the task in the downloaded log file and a daily download information summary file name;

the task cloning submodule is used for generating a new configuration file according to the existing task configuration file, if A, B servers are isomorphic, the task configuration of downloading files from the server B is quickly generated according to the task configuration of downloading files from the server A, and the task configuration is synchronized to other hosts in the downloading cluster;

and the task synchronization submodule is used for quickly creating a downloading cluster, generating a planned task of another host according to the planned task of the current host, setting different starting delays aiming at different tasks, staggering the starting time of a downloading program, and synchronizing the configuration file of the downloading task which takes effect on the local host to the other downloading host.

5. The multi-source distributed download system based on the FTP protocol of claim 1, wherein the file renaming module is used to support both homogeneous and heterogeneous remote multi-source servers, and comprises a prefix increase/decrease submodule, a suffix increase/decrease submodule, a case conversion submodule, and a unique file renaming submodule;

the prefix increasing and decreasing submodule is used for adding or removing prefixes to the source file name and then generating a local file name;

6. The multi-source distributed downloading system based on the FTP protocol as claimed in claim 1, wherein said file downloading module comprises a timed downloading module, an expired file downloading module, a delayed downloading module and a parallel downloading module;

the timed downloading module is used for starting the downloading software to download files at regular time, and the starting time window of each downloading task is set according to the file downloading requirement;

the overdue file downloading module is used for manually executing a downloading command and downloading the overdue file when the downloading task misses the downloading time window;

the delayed downloading module is used for downloading the files from the corresponding servers in sequence according to the delayed downloading threshold value configured by the downloading task;

the parallel downloading module is used for downloading a plurality of files in one queue in parallel, the downloading scheduling module creates a background parallel downloading pool for each downloading queue, and when the downloading pool is full, the downloading scheduling module enters a waiting state.

7. The multi-source distributed download system based on FTP protocol of claim 1, wherein the file distribution module comprises a real-time distribution sub-module, a failed file retransmission sub-module;

the real-time distribution submodule is used for pushing the downloaded file to other servers in real time, the pushing adopts an asynchronous mode, a plurality of servers can push in parallel, the name changing is supported by the pushing, a temporary file suffix is automatically added during the pushing, and a distribution failure log is recorded when the pushing fails;

and the failed file retransmission submodule is used for redistributing failed files according to the distribution failure log, and has the functions of ignoring expired files and retransmission short-circuiting.

8. The multi-source distributed downloading system based on the FTP protocol as claimed in claim 1, wherein the distributed scheduling subsystem employs decentralized multi-task scheduling technique, each downloading host runs the same scheduling program for ensuring that the downloading scheduling is not affected when a single host fails, and the distributed scheduling is implemented by distributed file locks.

9. The multi-source distributed download system based on FTP protocol of claim 8, wherein the operation of the distributed file lock comprises creation, status detection, destruction and update;

lock state detection, which is used for detecting the state of the lock after an applicant detects the existing lock, and if the lock is an abnormal lock, executing lock updating operation;

10. The multi-source distributed downloading system based on the FTP protocol as claimed in claim 1, wherein the downloading software has a self-healing function: the downloading task is started periodically, and at each start the downloading program checks whether a process corresponding to the downloading task already exists; if so, that process is forcibly ended and the task is restarted; otherwise the downloading task is started directly.