WO2022110861A1

WO2022110861A1 - Method and apparatus for data set caching in network training, device, and storage medium

Info

Publication number: WO2022110861A1
Application number: PCT/CN2021/109237
Authority: WO
Inventors: 赵仁明; 陈培
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2020-11-27
Filing date: 2021-07-29
Publication date: 2022-06-02
Also published as: CN112446490A

Abstract

A method and apparatus for data set caching in network training, a device, and a storage medium. The method comprises the steps of: collecting statistics about disk performance overhead required by training nodes, which have not cached a data set to be trained, in a network training cluster during a process of caching a data set to be trained; monitoring current performance parameters of the training nodes that have not cached a data set to be trained; selecting, on the basis of the current performance parameters, from the network training cluster a target node meeting the disk performance overhead requirement; and using the target node to cache a data set to be trained transmitted from a source node, so as to perform network training on said data set on the basis of the target node. The present method can ensure the reliability of the training node caching the data set to be trained, thereby ensuring the overall reliability of the network training. The apparatus for data set caching, the device, and the storage medium also have the beneficial effects as described above.

Description

A data set caching method, device, device and storage medium for network training

This application claims the priority of the Chinese patent application filed on November 27, 2020, with the application number of 202011357904.1 and the invention titled "A method, device, equipment and storage medium for data set caching for network training", which The entire contents of this application are incorporated by reference.

technical field

The present application relates to the field of deep learning, and in particular, to a data set caching method, apparatus, device and storage medium for network training.

Background technique

Deep learning has been widely used at present. Deep learning refers to the feature training of neural networks through a large amount of data to generate network models with the ability to identify corresponding data.

Since the number of sample data sets used in the process of neural network training directly affects the effect of deep learning, currently, multiple training nodes are often used in a cluster mode that includes multiple training nodes and data set storage nodes to jointly utilize data. The neural network is trained on the dataset in the set storage node. During the training process, the data sets cached in different training nodes may be different, and there are often situations where multiple training nodes need to be used for neural network training based on the same data set, and the overall reliability of neural network training is also in the current field. focus of attention.

It can be seen that it is a problem to be solved by those skilled in the art to provide a data set caching method for network training to ensure the reliability of training nodes to cache the data set to be trained, thereby ensuring the overall reliability of network training.

SUMMARY OF THE INVENTION

The purpose of this application is to provide a data set caching method, device, device and storage medium for network training, so as to ensure the reliability of training nodes to cache the data set to be trained, thereby ensuring the overall reliability of network training.

In order to solve the above technical problems, the present application provides a data set caching method for network training, including:

Calculate the disk performance overhead required by the training nodes in the network training cluster that do not cache the data set to be trained in the process of caching the data set to be trained;

Monitor the current performance parameters of training nodes that do not cache the data set to be trained;

Select the destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameters;

The destination node is used to cache the to-be-trained data set passed in by the source node, so as to perform network training based on the destination node's to-be-trained data set.

Preferably, before using the destination node to cache the to-be-trained data set passed in by the source node, the method further includes:

Determine whether there is a source training node that caches the data set to be trained in the network training cluster;

If there is a source training node that caches the data set to be trained in the network training cluster, select the target source training node with the largest idle network bandwidth between the source training node and the destination node;

Correspondingly, use the destination node to cache the data set to be trained from the source node, including:

Use the destination node to cache the data set to be trained passed in by the target source training node.

Preferably, when there is no source training node that caches the data set to be trained in the network training cluster, the destination node is used to cache the data set to be trained passed in by the source node, including:

Use the destination node to cache the data set to be trained that is passed in by the data set storage node in the network training cluster.

Preferably, the disk performance overhead required by the training nodes in the network training cluster that do not cache the to-be-trained data set in the process of caching the to-be-trained data set includes:

Based on the hardware performance parameters of the training nodes that do not cache the data set to be trained and the data attribute parameters of the data set to be trained, the disk performance overhead is obtained by statistics.

Preferably, the hardware performance parameters include the disk rotation speed, the disk average patrol time and the disk maximum transfer rate;

The data attribute parameters of the data set to be trained include an average file size; the average file size is calculated based on the total data volume of the data set to be trained and the total number of files.

Determine whether there is free space in the cache queue of the destination node;

If so, execute the step of using the destination node to cache the data set to be trained passed in by the source node;

Otherwise, delete the target data set to be trained with the least number of executions in the cache queue, and perform the step of using the destination node to cache the data set to be trained from the source node.

Preferably, the current performance parameter includes I/O queue length, and the disk performance overhead includes IOPS overhead;

Based on the current performance parameters, select the destination node that meets the disk performance overhead in the network training cluster, including:

Select the destination node whose I/O queue length is less than the IOPS cost in the network cluster.

In addition, the present application also provides a data set cache device for network training, including:

The cost statistics module is used to count the disk performance overhead required by the training nodes in the network training cluster that do not cache the data set to be trained in the process of caching the data set to be trained;

The parameter monitoring module is used to monitor the current performance parameters of the training nodes that do not cache the data set to be trained;

The node selection module is used to select the destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameters;

The node cache module is used for using the destination node to cache the to-be-trained data set passed in by the source node, so as to perform network training based on the destination node's to-be-trained data set.

In addition, this application also provides a data set caching device for network training, including:

memory for storing computer programs;

The processor is configured to implement the steps of the above-mentioned data set caching method for network training when executing the computer program.

In addition, the present application also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned network training data set caching method are implemented.

The data set caching method for network training provided by this application firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, the disk performance overhead required in the process of caching the data set to be trained, and then monitors the uncached data sets to be trained. The current performance parameters of the training node of the training data set, and based on the current performance parameters, the destination node that satisfies the disk performance overhead is selected in the network training cluster, and then the destination node is used to cache the data set to be trained from the source node. The node performs network training on the training dataset. Since the method obtains the disk performance overhead required for caching the data set to be trained according to the estimation, select the destination node whose current performance parameters meet the disk performance overhead in the training nodes of the network training cluster to cache the data set to be trained input from the source node , which relatively ensures the reliability of the training node to cache the data set to be trained, thereby ensuring the overall reliability of the network training. In addition, the present application also provides a data set cache device, equipment and storage medium for network training, the beneficial effects are the same as above.

Description of drawings

In order to describe the embodiments of the present application more clearly, the following will briefly introduce the drawings that are used in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application, which are not relevant to ordinary skills in the art. As far as personnel are concerned, other drawings can also be obtained from these drawings on the premise of no creative work.

FIG. 1 is a flowchart of a method for caching data sets for network training disclosed in an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a data set cache device for network training disclosed in an embodiment of the present application.

Detailed ways

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application without creative work fall within the protection scope of the present application.

To this end, the core of this application is to provide a data set caching method for network training, so as to ensure the reliability of the training node's caching of the data set to be trained, thereby ensuring the overall reliability of network training.

In order to make those skilled in the art better understand the solution of the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

Referring to FIG. 1, an embodiment of the present application discloses a data set caching method for network training, including:

Step S10: Count the disk performance overhead required by the training nodes in the network training cluster that do not cache the to-be-trained data set in the process of caching the to-be-trained data set.

It should be noted that the execution body of this embodiment may be any node with computing capability in the network training cluster. This step firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, and the disk performance overhead required in the process of caching the corresponding data set to be trained, where the data set to be trained refers to the data set used in the network training process. A collection of sample data, which counts the disk performance overhead corresponding to the training nodes that do not cache the data set to be trained. In essence, it estimates the overhead of communication and operating resources occupied by the training nodes in the process of caching the corresponding data set to be trained. In addition, different The disk performance overhead when a training node caches the same dataset to be trained can vary depending on the hardware parameters of the corresponding training nodes.

Step S11: Monitor the current performance parameters of the training nodes that do not cache the data set to be trained.

After the disk performance overhead required by the training node in the process of caching the data set to be trained is obtained by statistics, this step further monitors the current performance parameters of the training node that does not cache the data set to be trained, and the purpose is to perform the following steps according to the data set of the training node. The current performance parameter selects the destination node that can carry the disk performance overhead corresponding to the data set to be trained.

Step S12: Based on the current performance parameters, a destination node that satisfies the disk performance overhead is selected in the network training cluster.

After monitoring the current performance parameters of the training nodes that do not cache the data set to be trained, this step further selects a destination node that satisfies the disk performance overhead in the network cluster based on the current performance parameters. training dataset.

Step S13: Use the destination node to cache the data set to be trained transmitted by the source node, so as to perform network training based on the data set to be trained by the destination node.

After selecting the destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameters, this step further uses the destination node to cache the data set to be trained from the source node, so as to perform network training on the training data set based on the destination node , the purpose is to ensure that when the data set to be trained is cached to the destination node, the destination node can reliably store the data set to be trained.

The data set caching method for network training provided by this application firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, the disk performance overhead required in the process of caching the data set to be trained, and then monitors the uncached data sets to be trained. The current performance parameters of the training node of the training data set, and based on the current performance parameters, the destination node that satisfies the disk performance overhead is selected in the network training cluster, and then the destination node is used to cache the data set to be trained from the source node. The node performs network training on the training dataset. Since the method obtains the disk performance overhead required for caching the data set to be trained according to the estimation, select the destination node whose current performance parameters meet the disk performance overhead in the training nodes of the network training cluster to cache the data set to be trained input from the source node , which relatively ensures the reliability of the training node to cache the data set to be trained, thereby ensuring the overall reliability of the network training.

On the basis of the above embodiment, as a preferred implementation, before using the destination node to cache the data set to be trained transmitted by the source node, the method further includes:

It should be noted that, in this embodiment, when it is determined that there is a source training node that caches the data set to be trained in the network training cluster, that is, the training node that has stored the data set to be trained in the network training cluster, in this case, use Before the destination node caches the data set to be trained from the source node, it first selects the target source training node with the largest idle network bandwidth between the source training node and the destination node, and then uses the destination node to cache the data set passed in by the source node. When the data set is to be trained, the target node caches the data set to be trained from the source training node. Since the bandwidth between the target source training node and the destination node is relatively high, the distance between the target source training node and the destination node is relatively high. The network transmission efficiency is relatively high, and this embodiment further improves the overall efficiency of sharing the data set to be trained among the training nodes.

Further, as a preferred embodiment, when there is no source training node that caches the data set to be trained in the network training cluster, the destination node is used to cache the data set to be trained passed in by the source node, including:

It should be noted that, in this embodiment, the network training cluster includes a data set storage node, and the data set storage node is used to store the to-be-trained data set. When it is determined that there is no source training node that caches the to-be-trained data set in the network training cluster When using the destination node to cache the data set to be trained passed in from the source node, the destination node is specifically used to cache the data set to be trained passed in by the data set storage node in the network training cluster, so as to further ensure that the destination node has The reliability of the acquisition of the dataset to be trained.

On the basis of the above embodiment, as a preferred implementation, statistics of the disk performance overhead required by the training nodes in the network training cluster that do not cache the to-be-trained data set in the process of caching the to-be-trained data set include:

It should be noted that, in this embodiment, when calculating the disk performance overhead required by the training nodes that do not cache the data set to be trained in the network training cluster in the process of caching the data set to be trained, it is specifically based on the uncached data set to be trained. The hardware performance parameters of the training node and the data attribute parameters of the data set to be trained are counted to obtain the performance disk overhead. Data feature parameters. Since the hardware performance parameters can accurately represent the efficiency of the training node for data caching, and the data attribute parameters can accurately represent the data magnitude and data type of the data set to be trained, this embodiment is based on the hardware performance parameters of the training node and the data to be trained. The data attribute parameters of the training data set are calculated to obtain the disk performance overhead, which can further improve the accuracy of the calculated disk performance overhead.

Further, as a preferred embodiment, the hardware performance parameters include the disk rotation speed, the disk average patrol time, and the disk maximum transfer rate;

In this implementation manner, the hardware performance parameters include the rotational speed of the disk, the average track time of the disk, and the maximum transfer rate of the disk. Among them, the disk rotation speed refers to the maximum number of revolutions that the disk platter can complete in one minute. It is one of the key factors determining the internal transfer rate of the disk and directly affects the speed of the disk to a large extent; Track time refers to the average time taken by the magnetic head to move from the beginning to the track where the data is located after the disk receives the system command. It reflects the ability of the disk to read data to a certain extent and affects the internal data transmission of the disk. The maximum transfer rate of the disk refers to the speed at which data is transferred from the head of the disk to the cache, which affects the overall efficiency of data caching in the disk.

In addition, in this embodiment, the data attribute parameter of the data set to be trained includes the average file size, and the average file size is calculated based on the total amount of data in the data set to be trained and the total amount of files, that is, the data of the data set to be trained is calculated. The total, that is, the total size of the data and the total number of files, that is, the number of files, is obtained by quotient operation, and the average file size refers to the average disk storage space occupied by each file in the data set to be trained.

In this implementation manner, the hardware performance parameters of the training nodes and the data attribute parameters of the data set to be trained are refined, thereby further improving the accuracy of the statistical obtained disk performance overhead.

It should be noted that, in this embodiment, before using the destination node to cache the data set to be trained from the source node, it is further determined whether there is free space in the cache queue of the destination node, that is, it is determined whether the cache queue can work normally. The data set to be trained is stored, and when there is free space in the cache queue of the destination node, the step of using the destination node to cache the data set to be trained from the source node is further performed. On the contrary, when there is no free space in the cache queue of the destination node When there is no space, delete the target data set to be trained with the least number of executions in the cache queue, and perform the step of using the destination node to cache the data set to be trained from the source node, so as to ensure that the data set to be trained can be stored normally to the destination The node's cache queue further ensures the overall reliability of the data set caching process.

Based on the above series of implementations, as a preferred implementation, the current performance parameter includes the I/O queue length, and the disk performance overhead includes the IOPS overhead;

It should be noted that, in this implementation manner, the current performance parameters include I/O (abbreviation for Input/Output, that is, input and output ports) queue length, the disk performance overhead includes IOPS overhead, IOPS (Input/Output, Operations Per Second) It is a measurement used for performance testing of computer storage devices such as magnetic disks (HDDs), solid-state disks (SSDs), or storage area networks (SANs), and can be considered as the number of reads and writes per second. When selecting the destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameters, specifically select the destination node whose I/O queue length is less than the IOPS overhead in the network cluster, in order to ensure that the data set to be trained can be reliably cached to The selected destination node further ensures the overall reliability of the data set caching process.

In order to further deepen the understanding of the above embodiments, the present application also provides a scenario embodiment under a specific application scenario for further description.

For each training node in the cluster, the following parameters of the disk used for data set caching can be obtained: disk rotation speed (rpm), disk average seek time (avgSeekTime), and disk maximum transfer rate (maxTransRate). Through the above 4 parameters, the average file size of the dataset (avgSize) and the following formula, calculate the single IO time of the disk in the cache of the dataset, that is, IOTime:

According to the above formula, the approximate value of the time consuming of each IO operation under this data set can be obtained.

For the maximum IOPS capability of the data set when the current node is cached, it can be approximated by the following formula:

At the same time, for each training node, the wrqm/s value, rrqm/s and avgqu-sz value of the data set cache disk can be obtained by monitoring. Where wrqm/s represents the current number of writes per second of the disk after merging data (merge), rrqm/s represents the current number of reads per second after merging data (merge) of the disk, and avgqu-sz represents the I/ O queue length. Considering that the increase of IOTime is not a completely linear relationship with the increase of IOPS, in fact, after reaching a certain IOPS value, IOTime will increase significantly with the increase of IOPS. Therefore, in this method, the node with avgqu-sz<IOPS*70% is used as the destination node of this data set cache, which is used to cache the data set to be trained.

Referring to FIG. 2, an embodiment of the present application provides a data set cache device for network training, including:

The overhead statistics module 10 is used to count the disk performance overhead required by the training nodes in the network training cluster that do not cache the to-be-trained data set in the process of caching the to-be-trained data set;

The parameter monitoring module 11 is used to monitor the current performance parameters of the training nodes that do not cache the data set to be trained;

The node selection module 12 is used to select the destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameter;

The node caching module 13 is configured to use the destination node to cache the to-be-trained data set passed in by the source node, so as to perform network training based on the destination node's to-be-trained data set.

The data set caching device for network training provided by the present application firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, the disk performance overhead required in the process of caching the data set to be trained, and then monitors the uncached data set to be trained. The current performance parameters of the training node of the training data set, and based on the current performance parameters, the destination node that satisfies the disk performance overhead is selected in the network training cluster, and then the destination node is used to cache the data set to be trained from the source node. The node performs network training on the training dataset. Since the device obtains the disk performance overhead required for caching the data set to be trained according to the estimation, selects the destination node whose current performance parameters meet the disk performance overhead in the training nodes of the network training cluster to cache the data set to be trained imported from the source node , which relatively ensures the reliability of the training node to cache the data set to be trained, thereby ensuring the overall reliability of the network training.

memory for storing computer programs;

The data set caching device for network training provided by this application firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, the disk performance overhead required in the process of caching the data set to be trained, and then monitors the uncached data set to be trained. The current performance parameters of the training node of the training data set, and based on the current performance parameters, the destination node that satisfies the disk performance overhead is selected in the network training cluster, and then the destination node is used to cache the data set to be trained from the source node. The node performs network training on the training dataset. Since the device obtains the disk performance overhead required to cache the data set to be trained according to the estimation, select the destination node whose current performance parameters meet the disk performance overhead in the training nodes of the network training cluster to cache the data set to be trained input from the source node , which relatively ensures the reliability of the training node to cache the data set to be trained, thereby ensuring the overall reliability of the network training.

The computer-readable storage medium provided by this application firstly counts the training nodes in the network training cluster that do not cache the data set to be trained, the disk performance overhead required in the process of caching the data set to be trained, and then monitors the uncached data to be trained. The current performance parameters of the training nodes in the set, and based on the current performance parameters, the destination node that satisfies the disk performance overhead is selected in the network training cluster, and then the destination node is used to cache the data set to be trained from the source node. The training dataset performs network training. Since the computer-readable storage medium obtains the disk performance overhead required for caching the data set to be trained according to the estimation, the destination node whose current performance parameters meet the disk performance overhead is selected from the training nodes of the network training cluster to the source node to be trained. The data set is cached, which relatively ensures the reliability of the training node to cache the data set to be trained, thereby ensuring the overall reliability of network training.

The data set caching method, device, device and storage medium for network training provided by the present application have been described in detail above. The various embodiments in the specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same and similar parts between the various embodiments can be referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method. It should be pointed out that for those of ordinary skill in the art, without departing from the principles of the present application, several improvements and modifications can also be made to the present application, and these improvements and modifications also fall within the protection scope of the claims of the present application.

It should also be noted that, in this specification, relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply these entities or operations. There is no such actual relationship or sequence between operations. Moreover, the terms "comprising", "comprising" or any other variation thereof are intended to encompass a non-exclusive inclusion such that a process, method, article or device comprising a list of elements includes not only those elements, but also includes not explicitly listed or other elements inherent to such a process, method, article or apparatus. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

Claims

A data set caching method for network training, characterized in that it includes:

Counting the disk performance overhead required by the training nodes that do not cache the data set to be trained in the process of caching the data set to be trained in the network training cluster;

monitoring the current performance parameters of the training nodes that do not cache the data set to be trained;

Selecting a destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameter;

The destination node is used to cache the to-be-trained data set passed in by the source node, so as to perform network training on the to-be-trained data set based on the destination node.
The data set caching method for network training according to claim 1, characterized in that, before using the destination node to cache the to-be-trained data set passed in by the source node, the method further comprises:

Judging whether there is a source training node that caches the data set to be trained in the network training cluster;

If there is a source training node that caches the data set to be trained in the network training cluster, select a target source training node with the largest idle network bandwidth between the source training nodes and the destination node;

Correspondingly, using the destination node to cache the to-be-trained data set passed in by the source node includes:

The to-be-trained data set passed in by the target source training node is cached by the destination node.
The data set caching method for network training according to claim 2, wherein when there is no source training node that caches the to-be-trained data set in the network training cluster, the cache using the destination node The to-be-trained data set passed in by the source node includes:

The destination node is used to cache the to-be-trained data set passed in by the data set storage node in the network training cluster.
The data set caching method for network training according to claim 1, characterized in that, a training node in the network training cluster that does not cache the data set to be trained needs a disk in the process of caching the data set to be trained. Performance overhead, including:

The disk performance overhead is statistically obtained based on hardware performance parameters of the training nodes that do not cache the to-be-trained data set and data attribute parameters of the to-be-trained data set.
The data set caching method for network training according to claim 4, wherein the hardware performance parameters include disk rotation speed, disk average patrol time and disk maximum transfer rate;

The data attribute parameter of the data set to be trained includes an average file size; the average file size is calculated based on the total amount of data and the total amount of files in the data set to be trained.
The data set caching method for network training according to claim 1, characterized in that, before using the destination node to cache the to-be-trained data set passed in by the source node, the method further comprises:

Determine whether there is free space in the cache queue of the destination node;

If so, execute the step of using the destination node to cache the data set to be trained passed in by the source node;

Otherwise, delete the target data set to be trained with the least number of executions in the cache queue, and execute the step of using the destination node to cache the data set to be trained transmitted by the source node.
The data set caching method for network training according to any one of claims 1 to 6, wherein the current performance parameter includes an I/O queue length, and the disk performance overhead includes an IOPS overhead;

The selecting a destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameter includes:

The destination node whose I/O queue length is less than the IOPS overhead is selected in the network cluster.
A data set cache device for network training, comprising:

An overhead statistics module, used to count the disk performance overhead required by the training nodes in the network training cluster that do not cache the to-be-trained data set in the process of caching the to-be-trained data set;

a parameter monitoring module, used to monitor the current performance parameters of the training nodes that do not cache the data set to be trained;

A node selection module is used to select a destination node that satisfies the disk performance overhead in the network training cluster based on the current performance parameter;

A node caching module, configured to use the destination node to cache the to-be-trained data set passed in by the source node, so as to perform network training on the to-be-trained data set based on the destination node.
A data set cache device for network training, characterized in that it includes:

memory for storing computer programs;

The processor is configured to implement the steps of the data set caching method for network training according to any one of claims 1 to 7 when executing the computer program.
A computer-readable storage medium, characterized in that, a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the network training according to any one of claims 1 to 7 is implemented. The steps of the dataset caching method.