CN115242782A - Large file fragment transmission method and transmission architecture between super-computing centers - Google Patents

Large file fragment transmission method and transmission architecture between super-computing centers

Info

Publication number
CN115242782A
CN115242782A CN202211148476.0A
Authority
CN
China
Prior art keywords
size
super
fragment
sender
receiver
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211148476.0A
Other languages
Chinese (zh)
Other versions
CN115242782B (en)
Inventor
余冬冬
俞圣亮
方启明
秦亦
孔丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202211148476.0A priority Critical patent/CN115242782B/en
Publication of CN115242782A publication Critical patent/CN115242782A/en
Application granted granted Critical
Publication of CN115242782B publication Critical patent/CN115242782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of supercomputer resource management, and discloses a large-file fragment transmission method and transmission architecture between supercomputing centers. The method comprises the following steps: step one, collect packet-transmission file data between supercomputing centers and initialize the supercomputing center serving as the sender; step two, after the initialization work is finished, the sender's supercomputing center obtains an initial state, dynamically adjusts the fragment size of the file data using a reinforcement learning algorithm, then fragments the file data according to that size and transmits it to the receiver's supercomputing center; step three, the receiver's supercomputing center sends transmission feedback to the sender's supercomputing center according to its receiving state; step four, update and check the size of the remaining file data to judge whether the file data has been completely transmitted, and if not, repeat steps two to four until the whole file data has been transmitted. The invention can effectively reduce the waste of system resources and improve overall system efficiency.

Description

Large file fragment transmission method and transmission architecture between super-computing centers
Technical Field
The invention relates to the technical field of resource management of supercomputers, in particular to a large file fragment transmission method and a transmission framework between supercomputer centers.
Background
Files transmitted between supercomputing centers are generally large files, and fragment transmission must be supported so that a transmission interruption does not force retransmission of the whole file. If the large-file fragment transmission mechanism can be well optimized, the overall performance of data communication between supercomputing centers will undoubtedly improve.
At present, the common file-fragmentation strategy is fixed-size fragmentation: the sender and receiver agree on the fragment size in advance, the sender then fragments the file and transmits the fragments one by one, and if a fragment transmission fails because of network instability, the sender retransmits that fragment. The fixed fragmentation strategy has a significant problem: the fragment size cannot be dynamically adjusted according to actual conditions. When the fragment is too large, once its transmission fails, subsequent retransmissions of the fragment may still fail, and the cost of retransmitting it becomes very high, wasting system network resources.
Therefore, a large file fragmentation strategy is needed to solve the problems in the above technical solutions.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a large file fragment transmission method and a transmission architecture between supercomputing centers, which are used for improving the transmission efficiency of large files between supercomputing centers and reducing the consumption of network bandwidth of the supercomputing centers, and the specific technical scheme is as follows:
A large file fragment transmission method between supercomputing centers comprises the following steps:
step one, collecting the file data of packets transmitted among supercomputing centers, statistically calculating the average packet-sending rate, and initializing the supercomputing center serving as the sender;
step two, after the initialization work is finished, the sender's supercomputing center obtains an initial state, dynamically adjusts the fragment size of the file data using a reinforcement learning algorithm, and then fragments the file data according to that size and transmits it to the supercomputing center serving as the receiver;
step three, the receiver's supercomputing center sends transmission feedback to the sender's supercomputing center according to its own receiving state, and calculates the average packet-sending rate between the supercomputing centers according to the transmission result;
and step four, updating and checking the size of the remaining file data to judge whether the file data has been completely transmitted, and if not, repeating steps two to four until the whole file data has been transmitted.
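Purely as an illustrative sketch (not part of the claimed method), the four-step loop can be expressed with hypothetical callbacks `choose_shard`, `send_shard` and `learn` standing in for the reinforcement-learning, transmission and feedback steps:

```python
def transmit_file(total_size, choose_shard, send_shard, learn):
    """Sketch of the four-step loop: pick a fragment size, send it,
    feed the result back, and update the remaining size until done."""
    remaining = total_size
    while remaining > 0:                    # step four's completion test
        c = choose_shard(remaining)         # step two: pick the fragment size
        ok = send_shard(min(c, remaining))  # step two: fragment and transmit
        learn(ok)                           # step three: transmission feedback
        if ok:
            remaining -= c                  # step four: T_{t+1} = T_t - c_t
    return remaining                        # <= 0 once the file is fully sent
```

With a fixed 4-unit fragment on a lossless link, for example, an 8-unit file completes in two rounds.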
Further, the specific content of the initialization work is as follows: initializing a state set {S}, a shard set {C}, an expected reward Q(S, C), a reward mechanism R, a discount factor γ, a model learning rate α, and a sampling threshold ε.
The states of the state set {S} include: network running state, data residual size, sender resource load and receiver resource load;
the shard set {C} is the set of shard sizes that can be adopted, and supports various sharding strategies;
the expected reward Q(S, C) refers to the expected reward after fragmentation is carried out with fragment size C in each state S;
the reward mechanism R is set as follows: let r be the instant reward feedback; if a slicing strategy is adopted in the current state and the resulting average packet-sending rate is v_t, then r = v_t − v̄, where v̄ is the average packet-sending rate obtained by statistical calculation in step one;
the discount factor γ is used to weaken the reward feedback of future states on the current state, i.e. the larger the time interval between a future state and the current state, the smaller the reward feedback for the same reward value;
the model learning rate α is initially set such that 0 < α < 1;
the sampling threshold ε is a greedy-policy threshold, initially set such that 0 < ε < 1.
Further, the plurality of slicing strategies include:
arithmetic arrangement: the user specifies the maximum fragment size, the minimum fragment size and the interval between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large at the specified interval;
geometric arrangement: the user specifies the maximum fragment size, the minimum fragment size and the growth ratio between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large according to the growth ratio;
user-defined function: the user specifies the maximum fragment size, the minimum fragment size and a user-defined function, and the system generates integer values between the minimum and maximum sizes as the fragment set according to that function;
user-defined fragment sizes: the user manually enters each fragment size.
Further, the second step specifically includes the following sub-steps:
Step 2.1, after finishing the initialization work, the supercomputing center acting as the sender acquires the current state s_t, obtaining the remaining file size T_t, the sender resource load, the network running state and the receiver resource load;
Step 2.2, the fragment size of the file data is dynamically adjusted with the Q-learning reinforcement learning algorithm, specifically: the system model of the sender's supercomputing center randomly generates a number x in the range [0, 1];
if x > ε, the fragment size that yields the best expected reward is obtained according to argmax_c Q(s_t, c) and used as the size of the next fragment; if several fragment sizes are equally optimal, one of them is selected at random as the next fragment size c_t, where Q(s_t, c_t) denotes the expected reward received after adopting fragment size c_t in state s_t, and max_c Q(s_{t+1}, c) denotes the maximum expected reward obtainable in the state s_{t+1} reached from state s_t after adopting fragment size c;
if x ≤ ε, one fragment size is selected at random from the fragment set {C} as the size of the next fragment c_t;
Step 2.3, the sender's supercomputing center fragments the file data according to the fragment size c_t and transmits the fragments to the receiver's supercomputing center.
Further, the third step is specifically: if the receiver's supercomputing center receives the data successfully, it feeds back a reception success to the sender's supercomputing center; if the receiver's supercomputing center detects a data check error, it feeds back a reception failure to the sender's supercomputing center; if the sender's supercomputing center receives no feedback for a long time, i.e. a timeout error occurs, the result is handled as a reception failure fed back by the receiver's supercomputing center. When the transmission fails, the current average packet-sending rate is v_t = 0; when the transmission succeeds, the transmission start time t_{n1} and the transmission end time t_{n2} are recorded, and the average packet-sending rate is calculated as v_t = c_t / (t_{n2} − t_{n1}).
Further, the fourth step specifically includes:
if the receiver's supercomputing center receives the data successfully, i.e. the transmission succeeded, then T_{t+1} = T_t − c_t; otherwise, if the transmission failed, T_{t+1} = T_t, where T_i denotes the size of the file data remaining after i transmissions;
then judging whether the file data has been completely transmitted:
if T_{t+1} ≤ 0, the file data has been completely transmitted and the packet-sending process ends; if T_{t+1} > 0, transmission is not complete, so the current average packet-sending rate v_{t+1} is compared with the average packet-sending rate v̄ and the instant feedback reward is calculated as r_{t+1} = v_{t+1} − v̄;
the expected reward is then updated according to the reward mechanism:
Q(s_t, c_t) ← Q(s_t, c_t) + α[r_{t+1} + γ·max_c Q(s_{t+1}, c) − Q(s_t, c_t)]
where the left-hand Q(s_t, c_t) denotes the expected reward after adopting fragment size c_t in state s_t; the current state s_{t+1}, including the remaining file size T_{t+1}, is then acquired and updated, and steps two to four are repeated until the whole file has been transmitted successfully.
A transmission architecture of a large file fragmentation transmission method between super-computation centers comprises the super-computation center of a sender and the super-computation center of a receiver, wherein the super-computation center of the sender comprises a learner, a fragmenter and a transmitter, and the super-computation center of the receiver comprises a receiver and a feedback device;
the system comprises a slicer, a transmitter, a receiver, a learner, a feedback value acquisition module, a feedback value feedback module and a feedback module, wherein the slicer is used for acquiring the size of a file, cutting file data and outputting the sliced data to the transmitter; the learner is used for updating the state parameters according to the feedback provided by the transmitter, calculating the size of the next fragment and transmitting the value to the fragment device; the receiver is used for receiving the fragment data of the sender, verifying the data and sending a verification result to the feedback device; the feedback device receives the feedback state of the receiver and then sends the value to the sender of the sender.
A device for transmitting large files among supercomputing centers in a segmented mode comprises one or more processors and is used for achieving the method for transmitting the large files among the supercomputing centers in the segmented mode.
A computer readable storage medium, on which a program is stored, which when executed by a processor, implements the method for fragmented transmission of large files between supercomputing centers.
The invention has the advantages and beneficial effects that:
compared with the traditional fixed fragmentation mode, the method has the sensing capability on the external environment, can adapt to the change of the external environment better, accurately regulates and controls the size of fragments, and effectively reduces the waste of system resources; when the network is unstable and continuous repeated retransmission fails, the method can reduce the size of the fragments, quickly locate the size of the fragments which can be successfully transmitted, and avoid resource waste caused by repeated retransmission; when the network condition is stable, the method can increase the size of the fragments properly, quickly position the maximum size of the fragments which can ensure the successful transmission, and improve the overall system efficiency.
Drawings
Fig. 1 is a schematic block diagram of a large file fragmentation transmission architecture between supercomputing centers according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a file fragment transmission flow of an overall module architecture to which the method of the embodiment of the present invention is applied;
FIG. 3 is a schematic flow chart of a method for transmitting large files in a fragmented manner between supercomputing centers according to the present invention;
fig. 4 is a schematic structural diagram of a large file fragmentation transmission device between supercomputing centers according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 1, the architecture for transmitting large file fragments between supercomputing centers of the present invention mainly considers 2 nodes of a sender and a receiver, wherein the supercomputing center as the sender includes a learner, a fragmenter, and a transmitter, and the supercomputing center as the receiver includes a receiver and a feedback device.
The slicer is used for acquiring the file size, cutting the file data and outputting the sliced data to the transmitter; the transmitter is used for receiving the sliced data output by the slicer and sending it to the receiver of the receiving side, and also acquires the data-sending feedback and transmits the feedback value to the learner; the learner is used for updating the state parameters according to the feedback provided by the transmitter, calculating the size of the next fragment and transmitting that value to the slicer; the receiver is used for receiving the fragment data from the sender, verifying the data and sending the verification result to the feedback device; the feedback device receives the reception state from the receiver and then sends the value to the transmitter of the sending side.
As shown in fig. 2 and fig. 3, a method for transmitting large files in a fragmented manner between supercomputing centers according to an embodiment of the present invention includes the following steps:
step one, collecting file data of a package transmitted among super-computation centers, counting and computing average package sending rate, and initializing the super-computation center serving as a sending party.
The method for collecting packet-transmission data between supercomputing centers and calculating the average packet-sending rate is specifically as follows: with supercomputing center A as the sender and supercomputing center B as the receiver, a large amount of packet-transmission data between center A and center B is collected, and the average packet-sending rate v̄ is calculated statistically.
The initialization work is carried out on the supercomputing center acting as the sender, and specifically comprises: initializing the state set {S}, the shard set {C}, the expected reward Q(S, C) (by default, the expected reward of every action in every state is 0), the reward mechanism R, the discount factor γ, the model learning rate α and the sampling threshold ε.
The states S of the set of states { S } include: network running state, data residual size, sender resource load and receiver resource load.
The shard set {C} is the set of shard sizes that can be adopted, and a plurality of sharding strategies are supported, including:
arithmetic arrangement: the user specifies the maximum fragment size, the minimum fragment size and the interval between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large at the specified interval;
geometric arrangement: the user specifies the maximum fragment size, the minimum fragment size and the growth ratio between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large according to the growth ratio;
user-defined function: the user specifies the maximum fragment size, the minimum fragment size and a user-defined function, and the system generates integer values between the minimum and maximum sizes as the fragment set according to that function;
user-defined fragment sizes: the user manually enters each fragment size.
The expected reward Q (S, C) refers to an expected reward after slicing according to the slicing size C in each state S.
The reward mechanism R is set as follows: let r be the immediate reward feedback; if the slicing strategy c_t is adopted in the current state s_t and the measured average packet-sending rate is v_t, then r = v_t − v̄. The reward mechanism R can guide the sender's supercomputing center to obtain an optimal slicing scheme according to the task completion state fed back by the receiver's supercomputing center.
The discount factor γ is used to weaken the reward feedback of the future state to the current state, i.e. if the time interval between the future state and the current state is larger, the reward feedback should be smaller under the same reward value.
The model learning rate α is initially set to: alpha is more than 0 and less than 1.
The sampling threshold ε is a greedy-policy threshold, initially set such that 0 < ε < 1.
Step two, after the initialization work is completed, the sender's supercomputing center obtains an initial state, dynamically adjusts the fragment size of the file data using the Q-learning reinforcement learning algorithm, and then fragments the file data according to that size and transmits it to the supercomputing center serving as the receiver; this specifically comprises the following sub-steps:
Step 2.1, after the initialization is finished, the supercomputing center acting as the sender acquires the current state s_t, obtaining the remaining file size T_t, the sender resource load, the network running state and the receiver resource load.
Step 2.2, the fragment size of the file data is dynamically adjusted with the Q-learning reinforcement learning algorithm, specifically: the system model of the sender's supercomputing center randomly generates a number x in the range [0, 1];
if x > ε, the fragment size that yields the best expected reward is obtained according to argmax_c Q(s_t, c) and used as the size of the next fragment; if several fragment sizes are equally optimal, one of them is selected at random as the next fragment size c_t, where Q(s_t, c_t) denotes the expected reward received after adopting fragment size c_t in state s_t, and max_c Q(s_{t+1}, c) denotes the maximum expected reward obtainable in the state s_{t+1} reached from state s_t after adopting fragment size c;
if x ≤ ε, one fragment size is selected at random from the fragment set {C} as the size of the next fragment.
Step 2.3, the super-computation center of the sender fragments the file data with corresponding size according to the fragment size and transmits the file data to the super-computation center of the receiver;
specifically, the supercomputing center of the sender cuts the current file once according to the size of the fragment to form data fragments with corresponding sizes, and transmits the data fragments to the supercomputing center of the receiver.
And step three, the super-computation center of the receiver sends transmission feedback to the super-computation center of the sender according to the receiving state of the super-computation center, and calculates the average packet sending rate among the super-computation centers according to the transmission result.
Specifically, transmission feedback is obtained as follows: if the receiver's supercomputing center receives the data successfully, it feeds back a reception success to the sender's supercomputing center through the feedback device; if the receiver detects a data check error, the receiver's supercomputing center feeds back a reception failure to the sender's supercomputing center; if the sender's supercomputing center receives no feedback for a long time, i.e. a timeout error occurs, the result is handled as a reception failure fed back by the receiver's supercomputing center.
If the transmission fails, the current average packet-sending rate is v_t = 0, meaning the receiving side could not receive the fragment data.
If the transmission succeeds, the transmission start time t_{n1} and the transmission end time t_{n2} are recorded, and the average packet-sending rate is calculated as v_t = c_t / (t_{n2} − t_{n1}), where c_t is the size of the fragment in the t-th transmission. For example, if sending 10 kB of data takes 2 s, the average packet-sending rate is v_t = 10 kB / 2 s = 5 kB/s; the larger the average packet-sending rate, the faster the data transmission.
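A minimal sketch of this rate computation, consistent with the 10 kB / 2 s = 5 kB/s example (the function name is illustrative):

```python
def average_send_rate(shard_size, t_start, t_end, success):
    """Average packet-sending rate v_t for one fragment transmission:
    0 on failure, bytes per second on success."""
    if not success or t_end <= t_start:
        return 0.0
    return shard_size / (t_end - t_start)
```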
And step four, updating and judging the size of the residual file data so as to judge whether the file data is completely transmitted, and if not, repeating the step two to the step four until the whole file data is completely transmitted.
Specifically, the size of the remaining file is updated as follows: if the receiver's supercomputing center receives the data successfully, i.e. the transmission succeeded, then T_{t+1} = T_t − c_t; otherwise, if the transmission failed, T_{t+1} = T_t, where T_i denotes the size of the file data remaining after i transmissions.
Then judge whether the file data has been completely transmitted:
if T_{t+1} ≤ 0, the file data has been completely transmitted and the packet-sending process ends;
if T_{t+1} > 0, the file data has not been completely transmitted, so the size of a new fragment must be calculated and transmitted: the current average packet-sending rate v_{t+1} is compared with the average packet-sending rate v̄, and the instant feedback reward is calculated as r_{t+1} = v_{t+1} − v̄.
The expected reward is then updated according to the reward mechanism:
Q(s_t, c_t) ← Q(s_t, c_t) + α[r_{t+1} + γ·max_c Q(s_{t+1}, c) − Q(s_t, c_t)]
where the left-hand Q(s_t, c_t) denotes the expected reward after adopting fragment size c_t in state s_t; the current state s_{t+1}, including the remaining file size T_{t+1}, the network running state, the sender resource load and the receiver resource load, is then acquired and updated, and the process repeats until the whole file has been transmitted successfully.
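The expected-reward update is the standard Q-learning rule; a sketch with a dict-backed Q table (names are illustrative, not from the patent):

```python
def q_update(Q, s, c, r, s_next, shard_set, alpha, gamma):
    """One Q-learning step:
    Q(s,c) <- Q(s,c) + alpha * (r + gamma * max_c' Q(s',c') - Q(s,c))."""
    best_next = max(Q.get((s_next, cc), 0.0) for cc in shard_set)
    old = Q.get((s, c), 0.0)
    Q[(s, c)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, c)]
```

For instance, starting from an all-zero table, a reward of 1.0 with α = 0.5 moves Q(s_t, c_t) to 0.5, halfway toward the observed return.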
According to the invention, the Q-learning reinforcement learning algorithm is utilized to determine the size of the current file fragment according to the network average time delay of the last fragment transmission, then the reward value is generated according to the network average time delay of the current transmission, the model is dynamically adjusted, and then the size of the next fragment is more accurately and reasonably adjusted, so that the network transmission efficiency of the large file is improved, the fragment sending success rate is improved, and the network bandwidth utilization rate is further improved.
Corresponding to the embodiment of the method for transmitting the large file fragments among the supercomputing centers, the invention also provides an embodiment of a device for transmitting the large file fragments among the supercomputing centers.
Referring to fig. 4, an embodiment of the present invention provides a large-file fragment transmission apparatus between supercomputing centers, including one or more processors, configured to implement a large-file fragment transmission method between supercomputing centers in the foregoing embodiment.
The embodiment of the large file fragmentation transmission device between super computing centers can be applied to any equipment with data processing capability, such as computers and other equipment or devices. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. In terms of hardware, as shown in fig. 4, the present invention is a hardware structure diagram of an arbitrary device with data processing capability where a large file fragmentation transmission device between super computing centers is located, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 4, in an embodiment, the arbitrary device with data processing capability where the device is located may also include other hardware according to the actual function of the arbitrary device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, which stores a program, and when the program is executed by a processor, the method for transmitting the large file segments among the supercomputing centers is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way. Although the practice of the invention has been described in detail above, those skilled in the art may still modify the technical solutions described in the foregoing examples or substitute equivalents for certain features. All changes, equivalents and modifications that come within the spirit and scope of the invention are intended to be protected.

Claims (9)

1. A method for fragmented transmission of large files between super-computing centers, characterized by comprising the following steps:
step one, collecting the file data of packets transmitted between super-computing centers, statistically calculating the average packet-sending rate, and performing initialization work on the super-computing center acting as the sender;
step two, after the initialization work is completed, the sender's super-computing center obtains an initial state, dynamically adjusts the fragment size of the file data using a reinforcement-learning algorithm, then fragments the file data according to that size and transmits it to the receiver's super-computing center;
step three, the receiver's super-computing center sends transmission feedback to the sender's super-computing center according to its reception state, and the average packet-sending rate between the super-computing centers is calculated from the transmission result;
step four, updating and judging the size of the remaining file data to determine whether the file data has been completely transmitted; if not, repeating steps two to four until the whole file has been transmitted.
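The four-step loop above can be sketched in a few lines (a minimal illustration only, not the patented implementation; all function names are hypothetical placeholders supplied by the caller):

```python
def transmit_file(total_size, choose_fragment_size, send_fragment, update_model):
    """Sketch of the claimed four-step loop: choose a fragment size,
    send, learn from the feedback, and update the remaining size."""
    remaining = total_size
    avg_rate = 0.0  # statistical average packet-sending rate
    while remaining > 0:
        # step two: the RL model picks the next fragment size
        c = min(choose_fragment_size(remaining, avg_rate), remaining)
        # steps two/three: send the fragment, receive (ok, rate) feedback
        ok, rate = send_fragment(c)
        avg_rate = update_model(ok, rate, c)  # step three: learn
        if ok:
            remaining -= c                    # step four: update remainder
    return remaining  # 0 once the whole file has been transmitted
```

The three callables stand in for the learner, the network transfer, and the model update described in the later claims.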
2. The method for fragmented transmission of large files between super-computing centers according to claim 1, wherein the initialization work specifically comprises: initializing the state set {S}, the fragment set {C}, the expected reward Q(S, C), the reward mechanism R, the discount factor γ, the model learning rate α, and the sampling threshold ϵ;
the states of the state set {S} include: the network running state, the remaining data size, the sender resource load and the receiver resource load;
the fragment set {C} is the set of fragment sizes that can be adopted, and supports a plurality of fragmentation strategies;
the expected reward Q(s, c) is the expected reward obtained after fragmenting with fragment size c in state s;
the reward mechanism R is set as follows: let r be the instant reward feedback; if fragmenting under the current state yields a calculated average packet-sending rate ρ_t, then r = ρ_t − ρ_avg, where ρ_avg is the average packet-sending rate obtained by statistical calculation;
the discount factor γ weakens the reward feedback of future states on the current state, that is, for the same reward value, the larger the time interval between a future state and the current state, the smaller the feedback;
the model learning rate α is initialized so that 0 < α < 1;
the sampling threshold ϵ is the greedy-policy threshold and is initialized so that 0 < ϵ < 1.
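The initialization work of claim 2 can be sketched as follows (a hedged illustration: the table layout and the default parameter values are assumptions, not taken from the patent):

```python
def init_q_learning(fragment_sizes, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Initialize the quantities listed in claim 2: the fragment set {C},
    an (empty) expected-reward table Q, the learning rate alpha with
    0 < alpha < 1, the discount factor gamma, and the greedy sampling
    threshold epsilon with 0 < epsilon < 1."""
    assert 0 < alpha < 1 and 0 < epsilon < 1
    return {
        "Q": {},                    # (state, fragment size) -> expected reward
        "C": list(fragment_sizes),  # candidate fragment sizes
        "alpha": alpha,
        "gamma": gamma,
        "epsilon": epsilon,
    }
```

An untrained table defaults every Q(s, c) to zero, which matches the initial state before any transmission feedback has been observed.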
3. The method for fragmented transmission of large files between super-computing centers according to claim 2, wherein the plurality of fragmentation strategies comprise:
arithmetic progression: the user specifies the maximum fragment size, the minimum fragment size and the step between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large at that step;
geometric progression: the user specifies the maximum fragment size, the minimum fragment size and the growth ratio between fragments, and the system automatically generates a fragment set between the minimum and maximum sizes, arranged from small to large at that ratio;
user-defined function: the user specifies the maximum fragment size, the minimum fragment size and a custom function, and the system generates integer values between the minimum and maximum sizes as the fragment set according to that function;
user-defined fragment sizes: the user manually enters each fragment size.
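The fragment-set generation strategies above can be illustrated with small helpers (hypothetical functions; the patent does not prescribe an implementation):

```python
def arithmetic_set(min_size, max_size, step):
    """Arithmetic progression: min_size, min_size+step, ... up to max_size."""
    return list(range(min_size, max_size + 1, step))

def geometric_set(min_size, max_size, ratio):
    """Geometric progression: each size grows by `ratio` up to max_size."""
    sizes, size = [], min_size
    while size <= max_size:
        sizes.append(int(size))
        size *= ratio
    return sizes

def custom_set(min_size, max_size, fn, n):
    """User-defined: integer values of fn(0..n-1), clamped into range."""
    values = {min(max(int(fn(i)), min_size), max_size) for i in range(n)}
    return sorted(values)
```

The fourth strategy (manual entry) is simply a user-supplied list and needs no generator.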
4. The method for fragmented transmission of large files between super-computing centers according to claim 2, wherein the second step specifically comprises the following substeps:
step 2.1, after completing the initialization work, the super-computing center acting as the sender acquires the current state s_t, comprising the remaining file size T_t, the sender resource load, the network running state and the receiver resource load;
step 2.2, a Q-learning reinforcement-learning algorithm is adopted to dynamically adjust the fragment size of the file data, specifically: the system model of the sender's super-computing center randomly generates a number x in the range [0, 1];
if x > ϵ, the fragment size that yields the best expected reward is obtained according to c_t = argmax_{c∈{C}} Q(s_t, c) and used as the size of the next fragment; if a plurality of fragment sizes are equally optimal, one of them is randomly selected as the next fragment size c_t; here Q(s_t, c_t) denotes the expected reward obtained after fragmenting with size c_t in state s_t, and max_c Q(s_{t+1}, c) denotes the maximum expected reward obtainable in state s_{t+1}, the state reached from s_t after fragmenting with size c;
if x ≤ ϵ, a fragment size is randomly selected from the fragment set {C} as the next fragment size c_t;
step 2.3, the sender's super-computing center fragments the file data according to the fragment size c_t and transmits it to the receiver's super-computing center.
5. The method for fragmented transmission of large files between super-computing centers according to claim 4, wherein the third step is specifically: if the receiver's super-computing center successfully receives the data, it feeds back reception success to the sender's super-computing center; if the receiver's super-computing center detects a data check error, it feeds back reception failure to the sender's super-computing center; if the sender's super-computing center receives no feedback for a long time, that is, a timeout error occurs, it is handled in the same way as a reception-failure feedback from the receiver's super-computing center; when the transmission fails, the current average packet-sending rate is ρ_t = 0, meaning the receiver cannot receive the fragmented data; when the transmission succeeds, the transmission start time t_{n1} and the transmission end time t_{n2} are recorded, and the average packet-sending rate is calculated as ρ_t = c_t/(t_{n2} − t_{n1}), where c_t is the size of the fragment sent the t-th time.
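The feedback rule of claim 5 reduces to a small function (a sketch only; the rate is taken as size divided by elapsed time, so a larger value means faster sending):

```python
def packet_rate(ok, c_t, t_start=None, t_end=None):
    """Feedback rule of claim 5: a failed (or timed-out) transfer yields
    rate 0; a successful transfer of c_t bytes between t_start and t_end
    yields c_t / (t_end - t_start), i.e. size per unit time."""
    if not ok:
        return 0.0  # receiver could not receive the fragment data
    return c_t / (t_end - t_start)
```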
6. The method for fragmented transmission of large files between super-computing centers according to claim 5, wherein the fourth step is specifically:
if the receiver's super-computing center successfully receives the data, that is, the transmission succeeds, then T_{t+1} = T_t − c_t; otherwise, if the transmission fails, T_{t+1} = T_t, where T_i denotes the size of the file data remaining after i transmissions;
judging whether the file data has been completely transmitted:
if T_{t+1} ≤ 0, the file data has been completely transmitted and the packet-sending process ends; if T_{t+1} > 0, the transmission is not complete, so the current average packet-sending rate ρ_{t+1} is compared with the statistical average packet-sending rate ρ_avg, and the instant feedback reward is calculated as r_{t+1} = ρ_{t+1} − ρ_avg;
the expected reward is then updated according to the reward mechanism:
Q(s_t, c_t) ← Q(s_t, c_t) + α[r_{t+1} + γ·max_c Q(s_{t+1}, c) − Q(s_t, c_t)],
where Q(s_t, c_t) on the left of the equation denotes the expected reward after using fragment size c_t in state s_t; the current state s_{t+1}, including the remaining file size T_{t+1}, is then obtained by updating, and steps two to four are repeated until the whole file has been transmitted successfully.
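The update in claim 6 is the standard Q-learning rule; a minimal sketch (function and variable names are hypothetical):

```python
def q_update(q_table, state, c_t, next_state, fragment_sizes,
             rate, avg_rate, alpha, gamma):
    """One expected-reward update per claim 6:
    r = rate - avg_rate, then
    Q(s, c) <- Q(s, c) + alpha * (r + gamma * max_c' Q(s', c') - Q(s, c))."""
    r = rate - avg_rate  # instant feedback reward
    best_next = max(q_table.get((next_state, c), 0.0) for c in fragment_sizes)
    old = q_table.get((state, c_t), 0.0)
    q_table[(state, c_t)] = old + alpha * (r + gamma * best_next - old)
    return q_table[(state, c_t)]
```

A fragment size that sent faster than the running average earns a positive reward, nudging the learner toward sizes that sustain higher packet-sending rates.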
7. A transmission architecture adopting the method for fragmented transmission of large files between super-computing centers according to any one of claims 1 to 6, comprising a sender's super-computing center and a receiver's super-computing center, wherein the sender's super-computing center comprises a learner, a fragmenter and a transmitter, and the receiver's super-computing center comprises a receiver and a feedback device;
the fragmenter acquires the file size, cuts the file data, and outputs the fragmented data to the transmitter; the transmitter receives the fragmented data output by the fragmenter, sends it to the receiver on the receiving side, acquires the data-sending feedback and passes the feedback value to the learner; the learner updates the state parameters according to the feedback provided by the transmitter, calculates the size of the next fragment, and passes that value to the fragmenter; the receiver receives the sender's fragment data, verifies it and sends the verification result to the feedback device; the feedback device receives the receiver's state and sends the value to the transmitter on the sending side.
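The component flow of claim 7 can be mimicked in-process (a toy sketch: the class names and the modular-sum checksum are illustrative assumptions, not the patent's verification scheme):

```python
class Fragmenter:
    """Cuts file data into fragments of the size chosen by the learner."""
    def cut(self, data, size):
        return [data[i:i + size] for i in range(0, len(data), size)]

class Receiver:
    """Receives a fragment, verifies it, and reports the result."""
    def receive(self, fragment, checksum):
        return sum(fragment) % 256 == checksum  # toy verification only

def transmit(fragmenter, receiver, data, size):
    """Transmitter: send each fragment and collect the feedback values
    that would be forwarded to the learner via the feedback device."""
    feedback = []
    for frag in fragmenter.cut(data, size):
        ok = receiver.receive(frag, sum(frag) % 256)
        feedback.append(ok)
    return feedback
```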
8. An apparatus for fragmented transmission of large files between super-computing centers, comprising one or more processors configured to implement the method for fragmented transmission of large files between super-computing centers according to any one of claims 1 to 6.
9. A computer-readable storage medium on which a program is stored, wherein the program, when executed by a processor, implements the method for fragmented transmission of large files between super-computing centers according to any one of claims 1 to 6.
CN202211148476.0A 2022-09-21 2022-09-21 Large file fragment transmission method and transmission architecture between super-computing centers Active CN115242782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211148476.0A CN115242782B (en) 2022-09-21 2022-09-21 Large file fragment transmission method and transmission architecture between super-computing centers

Publications (2)

Publication Number Publication Date
CN115242782A true CN115242782A (en) 2022-10-25
CN115242782B CN115242782B (en) 2023-01-03

Family

ID=83680885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211148476.0A Active CN115242782B (en) 2022-09-21 2022-09-21 Large file fragment transmission method and transmission architecture between super-computing centers

Country Status (1)

Country Link
CN (1) CN115242782B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104022852A (en) * 2014-06-25 2014-09-03 北京奇艺世纪科技有限公司 Document transmission method and device
CN104168081A (en) * 2013-05-20 2014-11-26 腾讯科技(深圳)有限公司 Document transmission method and device
CN106302589A (en) * 2015-05-27 2017-01-04 腾讯科技(深圳)有限公司 Document transmission method and terminal
CN111314022A (en) * 2020-02-12 2020-06-19 四川大学 Screen updating transmission method based on reinforcement learning and fountain codes
EP3929745A1 (en) * 2020-06-27 2021-12-29 INTEL Corporation Apparatus and method for a closed-loop dynamic resource allocation control framework
US20220067063A1 (en) * 2020-08-27 2022-03-03 Industry-Academic Cooperation Foundation, Yonsei University Apparatus and method for adaptively managing sharded blockchain network based on deep q network
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
US11373062B1 (en) * 2021-01-29 2022-06-28 EMC IP Holding Company LLC Model training method, data processing method, electronic device, and program product
US20220261295A1 (en) * 2021-02-05 2022-08-18 Qpicloud Technologies Private Limited Method and system for ai based automated capacity planning in data center

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN YAO: "Adaptive rate decision algorithm for DASH based on deep reinforcement learning", 2020 International Conference on Robots & Intelligent System (ICRIS) *
FENG Suliu et al.: "Research on DASH adaptive bitrate decision algorithms based on reinforcement learning", Journal of Communication University of China (Science and Technology) *

Also Published As

Publication number Publication date
CN115242782B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
Marx et al. Same standards, different decisions: A study of QUIC and HTTP/3 implementation diversity
US8583977B2 (en) Method and system for reliable data transfer
US8085781B2 (en) Bulk data transfer
CN107257329B (en) A kind of data sectional unloading sending method
CN113489570B (en) Data transmission method, device and equipment of PCIe link
US20230015180A1 (en) Error Correction in Network Packets Using Soft Information
JP2023090883A (en) Optimizing network parameter for enabling network coding
CN115276920A (en) Audio data processing method and device, electronic equipment and storage medium
EP1531577B1 (en) Method for transmitting and processing command and data
US20090010157A1 (en) Flow control in a variable latency system
CN115242782B (en) Large file fragment transmission method and transmission architecture between super-computing centers
CN112995329B (en) File transmission method and system
CN112202896A (en) Edge calculation method, frame, terminal and storage medium
WO2015100932A1 (en) Network data transmission method, device and system
CN115499173A (en) Credible communication method and system based on UDP protocol
CN110912969B (en) High-speed file transmission source node, destination node device and system
AU2014200413B2 (en) Bulk data transfer
Henze A Machine-Learning Packet-Classification Tool for Processing Corrupted Packets on End Hosts
CN109005011B (en) Data transmission method and system for underwater acoustic network and readable storage medium
CN111698176A (en) Data transmission method and device, electronic equipment and computer readable storage medium
US8065374B2 (en) Application-level lossless compression
CN112953686B (en) Data retransmission method, device, equipment and storage medium
US11973744B2 (en) Systems and methods for establishing consensus in distributed communications
US20240031402A1 (en) System and method for suppressing transmissions from a wireless device
US6981194B1 (en) Method and apparatus for encoding error correction data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant