CN115086304B

CN115086304B - Multi-source distributed downloading system based on FTP protocol

Info

Publication number: CN115086304B
Application number: CN202210806656.7A
Authority: CN
Inventors: 陈旭辉; 王遂缠; 张鸿; 刘洋; 高鹏; 徐娟; 孔小怡; 陈晓峰; 许竹霞; 黄芳芳; 付杰; 张春燕; 王旭东
Original assignee: Gansu Meteorological Information And Technology Equipment Support Center
Current assignee: Gansu Meteorological Information And Technology Equipment Support Center
Priority date: 2022-07-08
Filing date: 2022-07-08
Publication date: 2024-04-19
Anticipated expiration: 2042-07-08
Also published as: CN115086304A

Abstract

The invention discloses a multi-source distributed downloading system based on the FTP protocol. First, it proposes the MP2MS (multipoint-to-multi-server) downloading technology and, on that basis, uses decentralised distributed scheduling to provide multi-source distributed downloading, so that a plurality of downloading clients can simultaneously download the same type of files from a plurality of heterogeneous FTP servers, improving FTP downloading performance and reliability. Second, it provides timed downloading, delayed downloading, file name time variable replacement, and rule-based renaming, which improve the compatibility and degree of automation of the downloading system. Third, its distribution technology can push downloaded files to a plurality of servers in real time, meeting the service requirement of actively pushing files to downstream applications. Finally, the multi-source distributed download software is developed by using the flash, and its application in a test environment and in the actual production environment shows that multi-source distributed downloading is fast and highly reliable.

Description

Multi-source distributed downloading system based on FTP protocol

Technical Field

The invention belongs to the technical field of distributed downloading, and particularly relates to a multi-source distributed downloading system based on an FTP protocol.

Background

Downloading refers to the act of obtaining data from other computers via a network. There are three common downloading techniques: P2S, P2P, and P2SP. P2P and P2SP offer high downloading performance but poor security, suffer from problems such as hotlinking and copyright infringement, and are seldom used in the professional data sharing field. P2S adopts the C/S mode and uses the FTP or HTTP protocol. FTP is the earliest file transfer protocol in use; its data transmission is reliable, it provides user authority management, its data can be encrypted, and its security is high, so it is widely used in the professional data sharing field. However, P2S has disadvantages. One client can connect to only one server, so performance is limited by many factors such as network bandwidth, server speed limits, server connection number limits, and the performance of the server and client computers, and the room for improvement is limited. In addition, with only one client and one server there is a single point of failure, and reliability is low.

To improve downloading performance, CN10574400B, sunlar, flashget, and others (Ma Xiao, Chai Yanna, Li Jia. P2SP based network downloading technology analysis [J]. Computer technology and development, 2014, (6): 187-191.) and GridFTP (Ying Hong, Liu Fuming, Huang He. GridFTP multiple data transmission mode design [J]. Computer engineering and design, 2008, 29(15): 3895-3897) adopt a multi-source downloading technology, in which 1 client can download the same file from multiple servers; CN11384000A, CN112199442A, and others implement distributed downloading, in which multiple clients can download the same file from 1 server. Both approaches improve downloading performance. However, the former supports multiple servers but not multiple clients, needs special FTP servers and additional resource index servers, and has poor compatibility; the latter supports multiple clients but not multiple servers, and the multiple clients can only download the same file at the same time. Downloading reliability and performance are therefore still limited.

Disclosure of Invention

To address the above problems, the invention proposes the MP2MS (multipoint-to-multi-server) downloading technology and, based on it, provides a multi-source distributed downloading system based on the FTP protocol, so that a plurality of clients can simultaneously download files from a plurality of servers in a distributed manner. Downloading performance is greatly improved, the single points of failure of traditional downloading technology are eliminated, and downloading reliability is improved.

The technical scheme for realizing the invention is as follows:

The multi-source distributed downloading system based on the FTP protocol is characterized by comprising a downloading task management module, a downloading scheduling module, a file renaming module, a file downloading module, a file distributing module and a downloading statistics module;

The task management module is used for configuration management of download tasks and for synchronization of the tasks within the cluster;

the download scheduling module is used for scheduling a plurality of files on the remote server to different clients through the distributed scheduling subsystem for distributed download;

The file renaming module is used for renaming the source file according to a renaming rule;

the file downloading module is used for calling the downloading software to download the file;

The file distribution module is used for secondarily distributing the downloaded file, is responsible for pushing the downloaded file to other servers or production systems needing the file, and supports pushing to a plurality of servers at the same time;

the downloading statistics module comprises a downloading task statistics sub-module and a time period statistics sub-module, and is used for counting the number and downloading amount of the downloaded files of the downloading client according to the downloading task name and the time period respectively.

Further, the download task is a configuration file for storing download information of the same type of file, and is a download object of the multi-source distributed download system, and each download task comprises server information, download parameters and a download queue;

Server information including server URL aliases and time zones;

download parameters including download command, number of parallel download files, download mode (bin or text), connection mode (active or passive), maximum number of threads downloaded by single file;

the downloading queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed downloading threshold, and a distribution path; a plurality of downloading queues can be configured in one downloading task, with one queue per row, and if a specific date appears in a file path or file name, it is replaced by a time variable.

Further, a time variable is a special character string that represents a time format; 33 variables are used to represent 33 time formats. Time variables replace specific times in the source file path, file name matching template, renaming rule, and distribution path in the task configuration file, and after the downloading system starts, the time variable replacing module replaces each time variable with the actual time.

Further, the download task management module comprises a task renaming sub-module, a task cloning sub-module and a task synchronization sub-module;

the task renaming sub-module is used for renaming a download task configuration file and, after renaming, for modifying the other configuration and log files related to it at the same time, including modifying the crontab scheduled task, modifying the name of the task in the download log file (to keep the download statistics continuous), and modifying the name of the daily download information summary file;

the task cloning sub-module is used for generating a new configuration file from an existing task configuration file: if servers A and B are isomorphic (the directories and file names of the same file are the same on both servers), the task configuration for downloading a file from server B is quickly generated from the task configuration for downloading it from server A and is synchronized to the other hosts in the download cluster;

the task synchronization sub-module is used for quickly creating a download cluster: first, the scheduled tasks of another host are generated from the scheduled tasks of the current host, with different start delays set for different tasks so that the start times of the download program are staggered; second, the download task configuration files in effect on the current host are synchronized to the other download hosts.

Further, the remote multi-source servers may be isomorphic or heterogeneous. Isomorphic means that the same file has the same directory and file name on a plurality of servers; heterogeneous means that the paths and file names of the same file differ between servers. Support for heterogeneous servers is realized by the file renaming module, which comprises a prefix adding and removing sub-module, a suffix adding and removing sub-module, a case conversion sub-module, and a unique file renaming sub-module;

The prefix adding and removing sub-module is used for adding or removing the prefix from the source file name to generate a local file name;

The suffix adding and removing sub-module is used for adding or removing the suffix from the source file name to generate a local file name;

The case conversion sub-module is used for converting the case of English letters in the source file name to generate a local file name;

The unique file renaming sub-module is used for mapping a plurality of different source file names to 1 local file name according to the unique keyword in the file names.

Further, the file downloading module supports timed downloading, expired file downloading, delayed downloading and parallel downloading;

Timed downloading: each downloading task is managed by Linux crontab, which starts the downloading software at fixed times. A start time window is set for each downloading task according to the file downloading requirements. The time window has two attributes, hours and minutes: the first sets the hours at which files are downloaded and the second sets the minutes, and both use the Linux crontab time format;

Expired file downloading: if a file is not uploaded to the server on time, or the downloading software is not started, so that the downloading task misses its download time window, the downloading command can be executed manually with a rollback time set. The downloading software rolls the download time back to the specified time and then performs date variable replacement, so that the expired file is downloaded;

Delayed downloading: if the delayed downloading threshold of a downloading task is greater than 0, the difference between the download time and the source file creation time (after time zone conversion) is calculated; the file is downloaded if the difference is greater than the delayed downloading threshold and less than 10 days, and is not downloaded otherwise. If two clients download data from two servers at the same time, the server with the smaller delayed downloading threshold is preferred: the threshold is equivalent to the priority of the server, a smaller value means a higher priority, and the value is set in the downloading queue of the downloading task configuration file;

Multi-file parallel downloading: a plurality of files in one queue can be downloaded in parallel. The download scheduling module creates a background parallel download pool for each downloading queue; when the pool is full, the scheduler waits, and otherwise it adds a file to the pool. The pool size is configurable and defaults to 2;

Single file downloading supports breakpoint resume, multithreading, and the addition of a temporary file suffix.

Further, the distribution module comprises a real-time distribution sub-module and a failure file retransmission sub-module;

The real-time distribution sub-module is used for pushing downloaded files to other servers in real time. Pushing is asynchronous and can target a plurality of servers in parallel; it supports renaming, so the target file can take either the source file name or the renamed file name; a temporary file suffix is added automatically during pushing; and a distribution failure log is recorded when pushing fails;

The failed file resending sub-module is used for resending failed files according to the distribution failure log. Resending ignores expired files: files whose failure time is more than 48 hours old are not resent. The resending sub-module also has a short-circuit function: when the resending of one file fails, the sending operation is not executed for the other files of that data.

Furthermore, the distributed scheduling subsystem adopts a decentralised multi-task scheduling technology, each download host operates the same scheduling program, the download scheduling is not affected by single host faults, and the distributed scheduling is realized through a distributed file lock.

Further, the distributed file lock is realized in a mode of competing for the first line of the file, and the operations of the distributed file lock comprise creation, state detection, destruction and update;

The lock creation operation works by competing for the first line of the lock file. When an applicant detects that the lock file does not exist, it writes the information of its node to the lock file in append mode, reads the first line of the lock file immediately after the write completes, and compares it with the information it wrote. If the two are the same, the download lock is acquired successfully; otherwise the applicant exits the competition for the lock file;

The lock state detection is used for detecting the state of the lock after the applicant detects the existing lock, and if the lock is an abnormal lock, the lock updating operation is executed;

The lock destroying operation is used for automatically deleting the downloading lock of the file after the file downloading is finished; when capturing a signal for terminating a process during execution of a downloaded program, automatically clearing all download locks generated by the process; when the downloading program is started, the existing downloading process is forcedly stopped, and all locks generated by the process are cleared;

The lock updating operation handles abnormal locks: an orphan lock is deleted and then recreated; for a timeout lock, the locking process on the locking node is ended remotely, the lock is cleared, and the lock is then recreated on the current node.

Furthermore, the download software has a self-healing function. Download tasks are started periodically, and each time the download program starts it checks whether a process corresponding to the download task already exists. If so, that process is forcibly ended and the task is restarted; otherwise the task is started directly. This mechanism gives the multi-source distributed download system a self-healing capability, so that faults such as hung or deadlocked processes are avoided.

Compared with the prior art, the invention has the following beneficial effects:

firstly, the MP2MS downloading technology integrates the multi-source downloading technology and the distributed downloading technology, realizes that a plurality of clients simultaneously download data from a plurality of servers, changes a single downloading client into a downloading cluster, solves the problems of single-point faults and performance bottlenecks existing in the traditional P2S downloading system, can greatly improve the downloading performance and reliability, increases the downloading performance of one C/S downloading channel by more than 30 percent, and increases the downloading reliability of 1 server or client by 1 time;

Secondly, the multi-source downloading system designed by the invention supports a plurality of heterogeneous download source servers and is compatible with all FTP servers; no software needs to be installed on the servers and no configuration is needed; different download priorities can be set for different data source servers, and the order in which files are downloaded from the multiple servers is controlled through the priorities;

thirdly, the invention provides a file renaming technology based on rules, downloaded files can be renamed according to rules, files with file names dynamically changing along with time can be automatically downloaded, and the automatic operation degree of a downloading system is improved by replacing file name time variable and renaming according to rules, thereby being beneficial to the docking of the downloading system with other service systems;

Fourth, the invention provides file pushing service through the secondary distribution module, and can push the downloaded files to a plurality of FTP servers;

In summary, the invention has good application prospects. In single-server scenarios, multiple download clients increase download speed and eliminate the client single point of failure. In multi-server scenarios (with a plurality of mirrors or copies), multiple servers increase download service capacity, and a new copy server can be quickly established from the existing servers.

Drawings

FIG. 1 (a)-(d) are schematic diagrams of the P2S, P2MS, MP2S, and MP2MS download modes, respectively;

FIG. 2 is a diagram of a multi-source distributed download system according to the present invention;

FIG. 3 is a topology of a multi-source distributed download system according to the present invention;

FIG. 4 is a file download flow chart of the multi-source distributed download system according to the present invention;

FIG. 5 is a flow chart of the distributed lock file maintenance according to the present invention;

FIG. 6 is a download task configuration example (EC material download task configuration file) according to the present invention;

FIG. 7 is a diagram of a part of timing tasks on the download cluster node01 in actual service;

FIG. 8 is a diagram of a portion of the timing tasks on the download cluster node02 in actual traffic;

Fig. 9 is a diagram showing the number of files in the download cluster of 3 clients in the actual service for downloading jma_gms_12 data.

Detailed Description

In order to enable those skilled in the art to better understand the technical solution of the present invention, the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

1. Background on FTP download techniques

According to the way FTP clients connect to servers, download modes are divided into four types: P2S, P2MS (point-to-multi-server), MP2S (multipoint-to-server), and MP2MS (multipoint-to-multi-server), as shown in FIG. 1.

(1) P2S download: as shown in FIG. 1 (a), P2S is the standard FTP download mode, in which a client can download a file from only one server. Download speed can be improved by multithreading, but the gain is limited by factors such as client and server performance, network bandwidth, and server speed limits; there is also a single point of failure, and reliability is low.

(2) P2MS download: as shown in FIG. 1 (b), P2MS is multi-source download: 1 client can download the same file or the same type of file from multiple servers. Download speed is high and the servers are redundant, but with only 1 client there is still a single point of failure.

(3) MP2S download: as shown in FIG. 1 (c), MP2S is distributed download: multiple clients can download the same file or the same type of file from 1 server. With only 1 server, however, there is still a performance bottleneck and a single point of failure. Current FTP software does not support this download mode.

(4) MP2MS download: as shown in FIG. 1 (d), the invention refers to the technique in which a plurality of clients connect to a plurality of servers simultaneously to download files as MP2MS. In MP2MS downloading, if m clients and n servers take part, n × m independent download channels can theoretically be created, and the theoretical download speed is n × m times that of the traditional download mode. The performance of this mode can be scaled dynamically as required, both clients and servers are redundant, and reliability is high. It is the most ideal FTP download technology and can greatly improve download performance and reliability in a multi-server environment. There is currently no download software that supports MP2MS.

2. The invention provides a multi-source distributed downloading system

The invention discloses a multi-source distributed downloading system (MSDD for short) based on the MP2MS technology, which consists of six modules of downloading task management, downloading scheduling, file renaming, file downloading, file distributing and downloading statistics, and the specific structure is shown in figures 2-3.

The task management module is used for being responsible for configuration management of the download task and synchronization of the task in the cluster;

The distribution module is used for secondarily distributing the downloaded file, is responsible for pushing the downloaded file to other servers or production systems needing the file, and supports pushing to a plurality of servers at the same time;

And the download statistics module is used for counting the number of download files and the download amount according to the download task type and the time period.

3. The key design of the invention

(1) Download task design

MSDD packages the download information of the same type of file as a download task and takes the download task as the download object of the download system; starting one download task downloads a batch of files. The information encapsulated in a download task falls into 3 classes: server information, download parameters, and download queues. FIG. 6 is an example of the download task for EC material.

The server information comprises a server URL alias and a time zone. The server URL alias is a short name for the server URL and is configured in URL_alias.INI in the following format: URL alias = protocol://user:password@ip:/path. If the protocol is ftp, the server is an FTP server; if it is file, the server is local. The time zone is the time zone used by the server; the delayed downloading module uses this value to calculate the creation time of files on the server.
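By way of illustration only, the alias format described above could be parsed as in the following Python sketch. The function name `parse_alias`, the returned field names, and the sample alias lines are hypothetical and are not taken from the MSDD software; the sketch merely assumes one alias per line in the stated format.

```python
import re

# Assumed line format, taken from the description:
#   URL alias = protocol://user:password@ip:/path
ALIAS_RE = re.compile(
    r"^(?P<alias>[^=\s]+)\s*=\s*"
    r"(?P<protocol>[A-Za-z]+)://"
    r"(?:(?P<user>[^:@/]+):(?P<password>[^@]*)@)?"  # credentials are optional here
    r"(?P<host>[^:/]*):?"
    r"(?P<path>/.*)?$"
)


def parse_alias(line):
    """Split one alias line into its parts (hypothetical helper)."""
    match = ALIAS_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised alias line: {line!r}")
    info = match.groupdict()
    # Per the description, the "file" protocol means the server is local.
    info["is_local"] = info["protocol"].lower() == "file"
    return info


if __name__ == "__main__":
    print(parse_alias("ec = ftp://user1:secret@10.0.0.5:/data/ec"))
```

A real configuration reader would also need to decide how to treat comments, blank lines, and passwords containing "@"; the description does not specify these.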

The downloading parameters comprise downloading commands, the number of parallel downloading files, a downloading mode (bin or text), a connecting mode (active or passive), and the maximum thread number of single file downloading, wherein the downloading commands support cp, ftp, lftp, wget, curl commands, cp represents that the server is local, and the file is copied by adopting the cp command;

The downloading queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed downloading threshold, and a distribution path; one downloading task can contain a plurality of queues. The file name matching template is used for searching for files and consists of characters, numbers, wildcards, { }, [ ], and the like supported by the Linux ls command. The file name unique identifier consists of a field separator and a field number; for example, "_5" indicates that the source file name is split on "_" and that the content of the 5th field is the unique identification string of the file. The renaming rule sets how a source file is renamed after downloading: { } means no renaming; {UPPER} converts all letters in the source file name to upper case; {lower} converts all letters to lower case; +prefix{ } adds the prefix "prefix" in front of the source file name; -prefix{ } removes the prefix "prefix" from the front of the source file name; { }+suffix adds the suffix "suffix" to the end of the source file name; and { }-suffix removes the suffix "suffix" from the end of the source file name. The delayed downloading threshold configures the delayed download parameter for files in the queue and is written as Tn or n, where n is the delay in minutes and 0 indicates that delayed downloading is not enabled. The distribution path gives the servers to which the downloaded file is to be pushed, each specified as host URL alias:path, with multiple servers separated by spaces; the host URL aliases are configured in url_ali file.
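The renaming rules listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name `apply_rule` is hypothetical, the way several rules combine into one string (for example `+prefix{}+suffix`) is assumed rather than taken from the source, and case conversion is assumed to be applied before prefixes and suffixes.

```python
import re

# One rule string: optional "+prefix"/"-prefix", then "{}", "{UPPER}" or
# "{lower}", then optional "+suffix"/"-suffix". The combined form is assumed.
RULE_RE = re.compile(
    r"^(?:(?P<pop>[+-])(?P<prefix>[^{]+))?"
    r"\{\s*(?P<case>UPPER|lower)?\s*\}"
    r"(?:(?P<sop>[+-])(?P<suffix>.+))?$"
)


def apply_rule(rule, name):
    """Return the local file name produced by applying `rule` to `name`."""
    m = RULE_RE.match(rule.strip())
    if m is None:
        raise ValueError(f"unsupported renaming rule: {rule!r}")
    if m["case"] == "UPPER":
        name = name.upper()
    elif m["case"] == "lower":
        name = name.lower()
    if m["pop"] == "+":
        name = m["prefix"] + name
    elif m["pop"] == "-" and name.startswith(m["prefix"]):
        name = name[len(m["prefix"]):]
    if m["sop"] == "+":
        name = name + m["suffix"]
    elif m["sop"] == "-" and name.endswith(m["suffix"]):
        name = name[: -len(m["suffix"])]
    return name


if __name__ == "__main__":
    print(apply_rule("+W_{}+.bin", "gfs.t00z.pgrb2.1p00.f012"))
```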

(2) Distributed download scheduling design

Distributed scheduling is responsible for coordinating a plurality of downloading nodes so that they download the same type of files from a plurality of servers in a balanced manner, with no file downloaded twice and none omitted. Distributed task scheduling is generally realized through message middleware, and scheduling may be centralized or decentralized. The system adopts a decentralized distributed task scheduling technology: the nodes of the downloading cluster are not divided into primary and secondary, and each node runs the same scheduling program.

At the heart of distributed scheduling is the distributed lock, which is typically implemented on top of message middleware, databases, caches, files, and the like, for example message middleware, Redis cache, or hash time locks. The system provides a distributed file locking technology based on shared storage, which is realized by utilizing flash, does not need third-party software, and is simple to implement. The lock operations include lock creation, lock destruction, lock update, and the like, and the lock file maintenance flow is shown in FIG. 5.

Lock creation: on a single Linux computer a file lock can be created with the flock method, but testing showed that this method cannot lock a file across a plurality of computers that use shared storage: the file lock created is not unique, and repeated locking occurs. The file lock is therefore realized by competing for the first line of the lock file, as follows: when an applicant detects that the lock file does not exist, it writes the information of its node (only 1 line) to the lock file in append mode, reads the first line of the lock file immediately after the write completes, and compares it with the information it wrote. If the two are the same, locking succeeds; otherwise the applicant exits the competition for the lock file.
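The first-line competition described above may be sketched in Python as follows. The function names and the "node:pid" content written to the lock file are hypothetical. The sketch only mirrors the described steps (existence check, append, read back the first line, compare); whether appends from several hosts are ordered reliably depends on the shared storage in use and is not guaranteed by this code.

```python
import os


def compete_for_first_line(lock_path, node_info):
    """Append our node information, then check whether it became line 1."""
    with open(lock_path, "a", encoding="utf-8") as f:
        f.write(node_info + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the write to storage before reading back
    with open(lock_path, "r", encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    return first_line == node_info


def try_create_lock(lock_path, node_info):
    """Return True if this applicant wins the lock, False otherwise."""
    if os.path.exists(lock_path):
        # An existing lock is handled by lock state detection, not creation.
        return False
    return compete_for_first_line(lock_path, node_info)
```

If two applicants both pass the existence check, both append a line, but only the one whose line comes first sees its own information on reading back, so only that applicant proceeds to download.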

Lock state detection: if the applicant detects that the lock already exists, it checks the state of the lock and, if the lock is abnormal, performs the lock update operation. There are two kinds of abnormal lock: timeout locks and orphan locks. If the time since the lock file was created exceeds the lifetime of the lock, the lock is a timeout lock. If the process that created the lock no longer exists, the lock is an orphan lock, that is, a dead lock left behind by an abnormal download node.
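As a rough illustration of the classification above, the following Python sketch labels a lock as missing, orphan, timeout, or valid. Several details are assumptions made for the sketch: the lock file is taken to hold "node:pid" on its first line, the file modification time stands in for the lock creation time, the caller supplies a `process_alive(node, pid)` callable (for example one that queries the locking node), and the orphan check is performed before the timeout check.

```python
import os
import time


def lock_state(lock_path, lifetime_s, process_alive, now=None):
    """Classify a lock file as 'missing', 'orphan', 'timeout', or 'valid'."""
    if not os.path.exists(lock_path):
        return "missing"
    with open(lock_path, encoding="utf-8") as f:
        node, _, pid = f.readline().strip().partition(":")
    if not process_alive(node, int(pid)):
        return "orphan"  # the locking process no longer exists
    current = time.time() if now is None else now
    age = current - os.path.getmtime(lock_path)
    return "timeout" if age > lifetime_s else "valid"
```

An orphan result would lead to deleting and recreating the lock, and a timeout result to ending the remote locking process first, as described under lock update.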

Lock destruction: the lock is destroyed if: (1) Automatically deleting the download lock of the file after the file is downloaded; (2) When capturing a signal for terminating a process during execution of a downloaded program, automatically clearing all download locks generated by the process; (3) When the downloading program is started, the existing downloading process is forcedly stopped, and all locks generated by the process are cleared.

Lock update: for orphan locks, the lock is deleted directly and then recreated. For timeout locks, the locking process on the locking node is remotely ended with ssh and the lock is cleared, and then the lock is recreated on the node.

(3) In-queue multiple file parallel download

The scheduling system creates a background parallel download pool for each queue, and enters a waiting state when the download pool is full, otherwise, adds a download file to the pool. The download pool size is configurable with a default value of 2.
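The bounded download pool can be approximated in Python with a thread pool, as in the sketch below. This substitutes a standard-library executor for the background pool of the described system; `download_queue` and the `download_one` callable are hypothetical names, and the default pool size of 2 follows the text.

```python
from concurrent.futures import ThreadPoolExecutor


def download_queue(files, download_one, pool_size=2):
    """Download the files of one queue with at most `pool_size` in flight.

    When all workers are busy, the remaining files wait until a worker
    becomes free. Results are returned in the order of `files`.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(download_one, files))


if __name__ == "__main__":
    # Stand-in for a real download command such as lftp or wget.
    print(download_queue(["a.grb", "b.grb", "c.grb"], lambda name: name.upper()))
```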

(4) File name time variable substitution

Time is one of the important attributes of meteorological data, and the file names of all such data contain date and time information, which hinders automatic downloading. To let the download software run automatically, specific times in file names are replaced by time variables in the download task configuration file, and each variable is automatically replaced by the appropriate time at download. The supported time variables are built from 5 basic variables, yyyy (four-digit year), yy (two-digit year), mm (month), dd (day), and hh (hour), for a total of 33 variables. A variable beginning with "#" refers to the previous day, one beginning with "$" to the current day, and one beginning with "%" to the next day; for example, $yyyymmddhh represents the year, month, day, and hour of the current day.
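A minimal Python sketch of time variable replacement is given below. It assumes that a variable is a prefix character followed by any concatenation of the basic variables, that "$" means the current day and "%" the next day as stated, and that "#" means the previous day, which is an assumption of the sketch. The function name `substitute` is hypothetical, and the sketch does not restrict itself to the 33 variables of the described system.

```python
import re
from datetime import datetime, timedelta

DAY_OFFSET = {"#": -1, "$": 0, "%": 1}  # "#" as previous day is assumed
FIELD_FMT = {"yyyy": "%Y", "yy": "%y", "mm": "%m", "dd": "%d", "hh": "%H"}
VAR_RE = re.compile(r"([#$%])((?:yyyy|yy|mm|dd|hh)+)")
FIELD_RE = re.compile(r"yyyy|yy|mm|dd|hh")


def substitute(template, now):
    """Replace every time variable in `template` using the time `now`."""
    def repl(match):
        moment = now + timedelta(days=DAY_OFFSET[match.group(1)])
        fmt = FIELD_RE.sub(lambda f: FIELD_FMT[f.group(0)], match.group(2))
        return moment.strftime(fmt)
    return VAR_RE.sub(repl, template)


if __name__ == "__main__":
    print(substitute("/data/$yyyymmdd/gfs_$yyyymmddhh.grb", datetime(2021, 1, 1, 0)))
```

Passing a rolled-back `now` to the same function corresponds to the expired file download described earlier, in which the download time is moved back before the date variables are replaced.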

(5) Rule-based file renaming

File renaming rules were designed according to the renaming requirements of weather data, so that files can be renamed during downloading. There are four types of renaming rule: case conversion, prefix addition and removal, suffix addition and removal, and unique file name matching; several rules can be combined. The first three are simple and only transform the source file name. Unique file name matching is a special rule for the scenario in which the same file is named differently on different servers: a plurality of different file names are mapped to one file through a unique keyword in the file name.

Take the US NCEP weather data as an example (see Table 1). On the NCEP server the file name is gfs.t00z.pgrb2.1p00.f012; after the National Meteorological Information Center downloads the file, it is renamed W_NAFP_C_KWBC_20210101000000_P_gfs.t00z.pgrb2.1p00.f012.bin, with a data description, a time stamp, and a suffix added. To download the data from the US and Beijing data sources at the same time, the file must be uniquely identified by the core content retained in its name so that repeated downloading is avoided. In practice, the task that downloads from the US server sets a renaming rule that adds the prefix "W_NAFP_C_KWBC_$yyyymmddhh0000_" and the suffix ".bin", and the task that downloads from the Beijing server sets the unique file name matching rule "_7". "_7" indicates that the file name is split on "_" and that the content of the 7th field is the unique identification keyword unique_key. A local file matching template is generated from the file matching template and unique_key and is used to search for the file locally; if the file exists, it does not need to be downloaded.
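The unique matching rule can be illustrated with the following Python sketch. The function names are hypothetical, the rule string is assumed to be a one-character separator followed by a 1-based field number, and the local check is simplified to a substring test rather than the generated matching template of the described system. The sample file names follow the NCEP example in the text.

```python
def unique_key(file_name, rule):
    """Extract the unique keyword, e.g. rule "_7" means field 7 split on "_"."""
    separator, field_number = rule[0], int(rule[1:])
    fields = file_name.split(separator)
    if not 1 <= field_number <= len(fields):
        raise ValueError(f"{file_name!r} has no field {field_number}")
    return fields[field_number - 1]


def already_downloaded(file_name, rule, local_names):
    """True if any local file name contains the unique keyword."""
    key = unique_key(file_name, rule)
    return any(key in name for name in local_names)


if __name__ == "__main__":
    remote = "W_NAFP_C_KWBC_20210101000000_P_gfs.t00z.pgrb2.1p00.f012.bin"
    print(unique_key(remote, "_7"))
```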

Table 1 Naming of NCEP data on different servers

(6) Delayed download

In a multi-server download environment, different download priorities can be set for different servers: files are downloaded first from the server with the higher priority, and automatically from the other servers if that server is unavailable. The priority is set through the download delay threshold in the download task. If the download delay is n minutes, only files generated within [n minutes, 10 days] are downloaded; the 10-day upper limit is added to prevent expired weather data files from being downloaded.
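The delay window can be expressed as a simple age check. The following Python sketch is illustrative only; the function name and the example delays of 0 and 30 minutes are assumptions.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=10)  # upper limit that keeps expired files from being downloaded


def within_delay_window(file_time, download_time, delay_minutes):
    """True if the file's age lies in [delay_minutes, 10 days] at download time."""
    age = download_time - file_time
    return timedelta(minutes=delay_minutes) <= age <= MAX_AGE


if __name__ == "__main__":
    now = datetime(2021, 1, 1, 12, 0)
    fresh = datetime(2021, 1, 1, 11, 50)  # file generated 10 minutes ago
    # A high-priority server with delay 0 takes the file at once; a lower-priority
    # server with delay 30 only takes it if it is still missing 30 minutes later.
    print(within_delay_window(fresh, now, 0))
    print(within_delay_window(fresh, now, 30))
```

A larger delay therefore acts as a lower priority: the server is only used for files that no faster source has delivered in the meantime.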

(7) Self-healing function

MSDD uses Linux crontab to start download tasks automatically at regular intervals. Each time the download program starts, if a download process for the same task already exists, that process is forcibly ended and the task is restarted. This mechanism prevents hung processes and also lets the program search the server again for the files to be downloaded according to the current time, so that new data is downloaded first.
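The check for an already running download process can be sketched as below. This is a hedged illustration, not the actual MSDD code: the program name download.sh, the task names and the (pid, command line) input format are hypothetical, and the process list is assumed to come from a command such as `ps -eo pid,args`.

```python
import os
import signal


def find_stale_pids(processes, program, task, own_pid):
    """Return PIDs of other processes running the same program for the same task.

    processes is an iterable of (pid, command line) pairs.
    """
    stale = []
    for pid, command_line in processes:
        words = command_line.split()
        same_program = any(word.endswith(program) for word in words)
        if pid != own_pid and same_program and task in words:
            stale.append(pid)
    return stale


def kill_stale(pids):
    """Forcibly end the given processes (the equivalent of kill -9)."""
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process has already exited


if __name__ == "__main__":
    running = [
        (100, "/bin/bash /opt/msdd/download.sh beijing_ec_00"),
        (200, "/bin/bash /opt/msdd/download.sh ncep_gfs_00"),
        (300, "/bin/bash /opt/msdd/download.sh beijing_ec_00"),
    ]
    # Process 300 is the newly started instance; only 100 runs the same task.
    print(find_stale_pids(running, "download.sh", "beijing_ec_00", own_pid=300))
```

Excluding the current process ID is what keeps a freshly started instance from ending itself.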

Examples

1. Downloading instances

Based on the system provided by the invention, the steps of implementing the distributed file downloading comprise:

Step 1: Configure the download task configuration file

The task configuration file is named "host alias_data name[_data time].ini". The data time is the hour, in UTC, at which downloading of the data begins; for data whose time is not fixed or which is downloaded several times a day, the "_data time" part may be omitted. The content of the configuration file is as described above;

Step 2: Start the download program and read the input parameters

The first parameter is required and the other parameters are optional; their default values are, respectively, 0, not forced and not test;

Step 3: Install a trap. When an INT, KILL, TERM or HUP signal is detected, delete the locks created by this process according to the lock queue file of the download task under the lck directory (all locks created by the process are recorded in this file), empty the lock queue file, and then exit;

Step 4: Detect whether the download task is already running, using three keys: the download program name, the download task name and the process number of the current process. If it is, forcibly end the existing process with kill -9;

Step 5: Generate the download time

If a download look-back time is input, the download time is the current system time minus the look-back time. Otherwise the download time is determined from the time parameter in the task configuration file name and the current time: if the hour of the current time is earlier than the data time, the download time is set to 23:59 of the previous day; otherwise the download time is the current system time;

Step 6: Read the download task configuration file, parse the variables in it, and assign default values to any parameters left empty;

Step 7: Start the loop over download queues: fetch a queue, parse it, and store its fields in variables;

Step 8: Taking the download time obtained in step 5 as the relative current time, replace the time variables in the download queue using both the current time and the time 1 hour earlier, which yields the download file path and file name information for two time points. If the two results are identical, download only the data of the current time point; otherwise loop over the previous time point and the current time point, so that files of the previous hour are not missed when the hour or the date rolls over;

Step 9: Start a second loop that downloads the files of a single time point;

Step 10: Use curl to test whether the download server is available; if it is not, write a log entry and go to step 7;

Step 11: Retrieve the files on the remote server that match the file matching template, returning the file names, sizes and times, store the result under the list directory, and count the matching files;

Step 12: Apply the case conversion, prefix addition/removal, suffix addition/removal and unique-identifier renaming rules in turn to the file matching template to generate the local file retrieval template;

Step 13: Obtain the number of matching local files;

Step 14: Write the remote file directory, file name template, number of remote files and number of local files to the daily download information summary file under the last directory;

Step 15: Determine automatically, from the file counts in the daily download information summary files of the last 5 days, whether the data has been completely downloaded; if so, write a log entry and go to step 7;

Step 16: Write the numbers of remote and local files retrieved this time to the log file;

Step 17: Start a new loop that processes the returned files one by one;

Step 18: Apply the case conversion, prefix addition/removal, suffix addition/removal and unique-identifier renaming rules to the remote file name to generate the local file name and the download lock file name;

Step 19: If the local file exists and has the same size as the remote file, no download is needed; return to step 17;

Step 20: If the test parameter is input, output the parsed configuration file and the remote and local file information, generate a test log file, and go to step 17. This function tests whether the task configuration file is correct and whether the server is working normally; it only tests and does not download;

Step 21: If no download lock file exists for the file to be downloaded, go to step 24;

Step 22: Determine from the creation time of the lock file whether the lock has timed out; if it has not, do not download and return to step 17;

Step 23: Delete the timed-out lock

Read the content of the lock file and, using the download host IP, download process and download task name recorded in it, log in to that download host with ssh and check whether the corresponding download process exists. If it exists, the lock is a timed-out lock and the process is killed directly; otherwise the lock is an orphan lock and the lock file is deleted directly;

Step 24: Apply for a download lock

Append a record consisting of the lock time, download host IP, download process and download task name to the lock file. If the write fails, go to step 17; otherwise read the first line of the lock file and compare it with the written content, and if they differ, go to step 17;

Step 25: Add the lock file name to the lock queue file under the lck directory;

Step 26: If the forced download parameter is input, go to step 28;

Step 27: Apply time zone conversion to the source file time and calculate the difference between the download time and the source file time; if the value does not satisfy the download delay condition, go to step 17;

Step 28: If the number of parallel downloads of the download queue has reached its upper limit, execute sleep 1 in a loop, waiting 1 second each time;

Step 29: According to the download command, call the corresponding download function in the background to perform the download;

Step 30: The download function runs asynchronously and automatically adds a tmp suffix while downloading. When the download finishes, it deletes the lock used by the function, then deletes the lock file, and records the relevant information in the log; if the download succeeded, it calls the distribution program in the background to distribute the file. The function installs a trap so that the download lock is cleared if the function exits abnormally;

Step 31: File distribution

File distribution is a separate program that is called by the main program and executed asynchronously in the background;

The file distribution program takes 6 parameters: task name, server information, source file name, local file name, file size and resend flag, of which the file size is optional. The program automatically detects whether lftp or ftp is installed on the local machine and prefers lftp for sending. It automatically adds a tmp suffix while sending and confirms that the file was sent successfully by comparing file sizes. A resend flag of false indicates a first distribution; if sending fails, the sending information is recorded in the sending failure log;

Step 32: File resending

The file resend program is an independent program that is started periodically by a scheduled task and runs every 10 minutes. After starting, it reads the failed file information line by line from the sending failure log file and checks whether more than 48 hours have passed since the first sending failure of the file (only files that failed within the last 48 hours are resent). If so, the file is deleted from the failure log; otherwise the distribution program is called in resend mode. If the resend succeeds, the file is removed from the sending failure log file; if it fails, the resend failure flag for that type of data is set to true;

Step 33: Resend short-circuit detection

During file resending, short-circuit detection is performed: check whether the resend failure flag for the file's type of data is true, and if so, skip the file (if one file of a given data type fails to resend, the server or directory is presumed to have a problem, so the distribution program is not called again for the remaining files of that type);

Step 34: If unprocessed source files remain, go to step 17;

Step 35: If the current queue has an unprocessed download time point, go to step 9;

Step 36: If an unprocessed download queue remains, go to step 7;

Step 37: The download is finished. The flow chart of the download program is shown in fig. 4.
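The lock application of step 24 and the timeout test of step 22 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the record fields are separated by spaces, the function names are hypothetical, and the timeout is judged from the lock time stored in the first record rather than from the creation time of the lock file as the steps describe. Whether an append is atomic on shared storage depends on the file system, which the sketch does not address.

```python
import time


def lock_record(host_ip, pid, task, now=None):
    """Build one lock record: lock time, download host IP, download process, task name."""
    stamp = int(time.time() if now is None else now)
    return f"{stamp} {host_ip} {pid} {task}\n"


def try_acquire(lock_path, record):
    """Append the record, then confirm ownership by re-reading the first line."""
    try:
        with open(lock_path, "a") as lock_file:
            lock_file.write(record)
    except OSError:
        return False  # write failed: skip this file
    with open(lock_path) as lock_file:
        # Only the writer whose record is the first line holds the lock.
        return lock_file.readline() == record


def is_timed_out(lock_path, timeout_seconds, now=None):
    """Compare the lock time of the first record with the current time."""
    with open(lock_path) as lock_file:
        locked_at = int(lock_file.readline().split()[0])
    current = time.time() if now is None else now
    return current - locked_at > timeout_seconds
```

With two clients racing for the same file, both append a record, but only the client whose record landed first sees its own content on the first line; the other moves on to the next file, which is how files are spread across clients without a central scheduler.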

Step 38: Task cloning is a stand-alone program that quickly generates a new task configuration file from an existing one; its first parameter is the name of the existing download task and its second parameter is the URL alias of the new download host. The program first reads the existing configuration file, changes the "host=old_host" item to "host=new_host", and generates a new download task configuration file following the naming convention of step 1;

Step 39: Task synchronization is an independent program for quickly creating a download cluster. First, it reads the timed download tasks loaded in crontab on the current download host (see fig. 7) and adds a sleep n command in front of each download command: for tasks whose period is in minutes, n is the period divided by the cluster number, plus m; for other types of tasks, n is 60 + m, where m is a random number between 0 and 30. The purpose is to stagger, as far as possible, the time windows in which multiple clients access the servers and the times at which the clients start the download program, which avoids on the one hand one client starting a large number of download programs at once and on the other hand several clients accessing one server at the same time. The result is then merged with the original scheduled tasks on the target host (stored in crontab1.dat) to generate the crontab timed task list on the target host (see fig. 8). Second, the enabled download task configuration files on the current host are synchronized to the target host. Finally, if the IP of the target host is not in the cluster node configuration file cluster.ini, it is added to that file.

Step 40: Cluster download statistics is an independent program that reports the download status of the download cluster and accepts a time parameter. The program first reads the cluster file and parses the IP addresses of the hosts in the download cluster, then runs ssh in parallel to execute the single-node download statistics command on each host and obtain the download status of each download host, and finally merges the results of all nodes to obtain the download status of the whole cluster. Fig. 9 shows the number of files of two kinds of data downloaded on 3 download hosts in a certain period.

Step 41: The single-node download statistics program is an independent program. It first reads the download log of the host, parses the data type, download time and file size, and then summarizes them by time period and data type to obtain the number of files downloaded and the download volume under the different summary modes.
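The single-node summary and the cluster-level merge described in the statistics steps can be sketched as below. The log line format ("date time data_type size_in_bytes"), the hourly grouping and the function names are assumptions made for illustration; the real log layout is not given in this description.

```python
from collections import defaultdict


def summarize(log_lines):
    """Summarize one node: {(hour, data type): [file count, total bytes]}."""
    totals = defaultdict(lambda: [0, 0])
    for line in log_lines:
        date, clock, data_type, size = line.split()
        key = (f"{date} {clock[:2]}", data_type)  # group by hour and data type
        totals[key][0] += 1
        totals[key][1] += int(size)
    return dict(totals)


def merge(node_summaries):
    """Combine the per-node summaries into one summary for the whole cluster."""
    cluster = defaultdict(lambda: [0, 0])
    for summary in node_summaries:
        for key, (count, size) in summary.items():
            cluster[key][0] += count
            cluster[key][1] += size
    return dict(cluster)


if __name__ == "__main__":
    node1 = summarize(["2021-01-01 00:05:10 EC 100", "2021-01-01 00:20:00 EC 50"])
    node2 = summarize(["2021-01-01 00:07:00 EC 30", "2021-01-01 01:00:00 GFS 10"])
    print(merge([node1, node2]))
```

Because each node only reports counts and sizes, the cluster program needs no access to the node logs themselves and can collect the summaries over ssh in parallel.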

2. System testing

To further illustrate the effect of the present system, the following tests were performed.

1. Download Performance test

In a VMware virtualization environment, a download test environment was built from 2 Linux FTP servers (running vsftpd with no rate limit) and 2 Linux download computers, with NAS as the shared storage; the structure is shown in (d) of fig. 1. 100 files of about 100 MB each were copied to both servers, 2 download tasks were created on each download computer, and 7 download schemes were designed in total. The schemes and results are shown in Table 2. The acceleration performance in the table is calculated with the formula ((scheme 1 time - scheme time) × 100 / scheme 1 time); the larger the value, the shorter the download time and the more pronounced the acceleration.
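As a worked illustration of the acceleration formula, the Python sketch below uses made-up times of 100, 131 and 50 seconds; they are not values from Table 2.

```python
def acceleration(scheme1_seconds, scheme_seconds):
    """Acceleration in percent relative to scheme 1; negative means slower than scheme 1."""
    return (scheme1_seconds - scheme_seconds) * 100 / scheme1_seconds


if __name__ == "__main__":
    print(acceleration(100, 131))  # a scheme that takes 31% more time than scheme 1
    print(acceleration(100, 50))   # a scheme that halves the download time
```

A scheme that needs 31% more time than scheme 1 therefore scores -31, while a scheme that halves the time scores 50.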

(1) Single task download effect

Schemes 2 and 3 in Table 2 are both single-task downloads: scheme 2 downloads one file at a time serially and scheme 3 downloads 2 files in parallel. The results show that scheme 2 takes 31% more time than the traditional download (scheme 1), which indicates that distributed scheduling adds download overhead. Scheme 3 takes essentially the same time as scheme 1, which means that multi-file parallel downloading can offset the overhead of distributed scheduling: the performance of 2-file parallel downloading matches that of conventional serial FTP downloading.

(2) Distributed download effect

Schemes 4-7 in Table 2 are multi-task downloads, of which schemes 5-7 are distributed downloads. The number of files downloaded by each task shows that distributed scheduling assigns files to the download clients on the principle that the more capable client does more work. In terms of download performance, adding a client or a server increases the download speed: performance improves by about 30% when the number of tasks increases from 1 to 2, and by about 50% when it increases from 2 to 4. The overall performance improvement is significant.

(3) Reliability test

When a server or a client was shut down during downloading, the download service was not interrupted and the integrity of the files was not affected.

Table 2 experimental data for different download schemes

The MSDD provided by the invention has been applied to a large meteorological data platform in Gansu Province, where it collects various kinds of meteorological data. The operating environment is as follows: there are four data download sources, namely the Beijing, Tianshui, Wuwei and US NCEP servers, and the download clients are two Linux virtual servers that form a download cluster and download from the 4 servers simultaneously. According to the data collection requirements, 101 download tasks are defined for 19 classes of data. Compared with the old download system (mainly wget software and the CMAcast supplementary system), it has the following characteristics:

(1) The degree of automation is high. The system realizes the automation of data collection services such as missing file detection, downloading, renaming according to rules, secondary distribution and the like for 19 types of data, and does not need manual intervention.

(2) The reliability is high. In conventional downloading, the server and the client are each a point of failure, 2 in total. In MSDD, adding one client or server removes 1 point of failure and doubles the reliability, and the failure of one download client or one FTP server does not affect downloading.

(3) The downloading efficiency is high. The old system takes about 2 hours to download 66 EC files of about 110 MB each from Beijing. With multi-source distributed downloading, the files can be downloaded simultaneously from Beijing, Zhangye and Tianshui, and only about 20 minutes are needed.

It follows that the MSDD system provided by the invention has good application prospects. In a single-server setting it can increase the download speed through multiple download clients and eliminate the client as a single point of failure; in a multi-server setting it can provide redundancy of both clients and servers while increasing the download speed.

What is not described in detail in this specification is prior art known to those skilled in the art. Although the present invention has been described with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements and changes may be made without departing from the spirit and principles of the present invention.

Claims

1. The multi-source distributed downloading system based on the FTP protocol is characterized by comprising a downloading task management module, a downloading scheduling module, a file renaming module, a file downloading module, a file distributing module and a downloading statistics module;

The download scheduling module is used for scheduling a plurality of files on the remote multi-source server to different clients through the distributed scheduling subsystem for distributed download;

The file distribution module is used for secondary distribution of the downloaded file, is used for pushing the downloaded file to other servers needing the downloaded file, and supports simultaneous pushing to a plurality of servers;

the downloading statistics module comprises a downloading task statistics sub-module and a time period statistics sub-module, which are respectively used for counting the number and the downloading amount of the downloaded files of the downloading client according to the name and the time period of the downloading task;

The downloading task is a configuration file for storing the downloading information of the same type of file, is a downloading object of the multi-source distributed downloading system, and comprises server information, downloading parameters and a downloading queue;

Server information including server URL aliases and time zones;

Download parameters including download command, number of parallel download files, download mode, connection mode, maximum number of threads downloaded by single file;

The downloading queue comprises a source file path, a file name matching template, a file name unique identifier, a renaming rule, a delayed downloading threshold value and a distributing path, wherein a plurality of downloading queues can be configured in one downloading task, one downloading queue occupies one row, and if a specific date exists in the file path and the file name, the downloading queue is replaced by a time variable;

The distributed scheduling subsystem adopts a decentralised multi-task scheduling technology, each download host operates the same scheduling program, the download scheduling is not influenced when a single host fails, and the distributed scheduling is realized through a distributed file lock;

the operation of the distributed file lock comprises creation, state detection, destruction and updating;

2. A multi-source distributed download system based on FTP protocol as in claim 1, wherein the time variable is a special string representing time format and is provided in plurality, each variable representing a time format for replacing a specific time in the source file path, file name matching template, renaming rule, distribution path in the task configuration file, and the download system is responsible for replacing it with an actual time by the time variable replacement module after being started.

3. A multi-source distributed download system based on the FTP protocol as in claim 1, wherein the download task management module comprises a task renaming sub-module, a task cloning sub-module, and a task synchronization sub-module;

The task renaming sub-module is used for renaming the downloaded task configuration file, and after renaming, modifying other configuration or log files related to the downloaded task configuration file at the same time, including modifying a crontab planning task, modifying the name of the task in the downloaded log file and modifying the name of a daily download information summary file;

a task cloning sub-module, configured to generate a new configuration file according to an existing task configuration file, if A, B servers are isomorphic, quickly generate a task configuration of downloading files from server B according to a task configuration of downloading files from server a, and synchronize to other hosts in the download cluster;

The task synchronization sub-module is used for quickly creating a download cluster, generating a planning task of another host according to the planning task of the current host, setting different starting time delays for different tasks, staggering the starting time of a download program, and synchronizing a download task configuration file which takes effect on the host to the other download host.

4. The FTP protocol-based multi-source distributed download system according to claim 1, wherein the file renaming module is configured to support isomorphism and isomerism of a remote multi-source server, and comprises a prefix adding/removing sub-module, a suffix adding/removing sub-module, a case-to-case conversion sub-module, and a unique file renaming sub-module;

5. The FTP protocol based multi-source distributed download system according to claim 1, wherein the file download module supports a timed download module, an expired file download module, a delayed download module, a parallel download module;

The timed downloading module is used for starting the downloaded software to download files at fixed time, and the starting time window of each downloading task is set according to the downloading requirement of the files;

the expiration file downloading module is used for manually executing a downloading command to download the expiration file when the downloading task misses the downloading time window;

The delay downloading module is used for downloading files from the corresponding servers in sequence according to the delay downloading threshold configured by the downloading task;

And the parallel downloading module is used for downloading a plurality of files in one queue in parallel, and the downloading scheduling module creates a background parallel downloading pool for each downloading queue and enters a waiting state when the downloading pool is full.

6. The FTP protocol-based multi-source distributed download system according to claim 1, wherein the file distribution module comprises a real-time distribution sub-module and a failed file retransmission sub-module;

The real-time distribution sub-module is used for pushing the downloaded files to other servers in real time, wherein the pushing adopts an asynchronous mode, a plurality of servers can be pushed in parallel, pushing support renames, temporary file suffixes are automatically added during pushing, and a distribution failure log is recorded when the pushing fails;

And the failure file resending sub-module is used for resending the failed file according to the sending failure log and has the functions of ignoring the expired file and sending short circuit.

7. The FTP-protocol-based multi-source distributed download system according to claim 1, wherein the download software has a self-healing function: the download task is started periodically, and at each start the download program checks whether a process corresponding to the download task already exists; if so, that process is forcibly ended and the task is restarted; otherwise the download task is started directly.