CN111860835A - Neural network model training method and device - Google Patents
- Publication number
- CN111860835A (Application CN202010690926.3A)
- Authority
- CN
- China
- Prior art keywords
- data set
- local cache
- node
- container
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a neural network model training method and device. The method comprises the following steps: uploading a data set used for training to a centralized storage device and submitting a training task; determining, by a host according to the training task, a plurality of nodes with computing resources, and splitting the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; creating, on each node, a plurality of containers for calling computing power resources, and deploying a deep learning framework and a script interface; acquiring, by each node from the centralized storage device, metadata information of the corresponding training task segment, and fragmenting the data set corresponding to that segment; and, for each data set fragment in turn, downloading it to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, a script imported through the script interface. The invention can manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and more particularly, to a method and an apparatus for training a neural network model.
Background
As intelligent technology is combined with the real economy and iterated, algorithms, computing power, and data all come at a premium, and most enterprises lack the technical capability and the budget to obtain them. For most enterprises, the cost of recruiting senior AI experts and investing in research and development time is too high, data management is difficult, data redundancy is high, and training is slow. It is therefore highly significant to package complex AI technology into a zero-threshold AI platform so that different industries can be empowered by AI.
No effective solution is currently available for the problems in the prior art of deep learning data being difficult to manage, highly redundant, slow to process, and costly.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a neural network model training method and apparatus, which can centrally manage a data set, reduce data redundancy, increase training speed, and reduce resource cost.
In view of the above, a first aspect of the embodiments of the present invention provides a neural network model training method, including the following steps:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
Determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
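For illustration only, this pipelined dependency rule can be expressed in a few lines of Python; the function and its arguments below are assumptions made for the sketch, not part of the claimed method:

```python
# Pipeline dependency rule: step s of fragment i may start only after
# step s of fragment i-1 and step s-1 of fragment i have both finished.
def can_start(finished: set, fragment: int, step: int) -> bool:
    same_step_of_previous_fragment_done = fragment == 0 or (fragment - 1, step) in finished
    previous_step_of_this_fragment_done = step == 0 or (fragment, step - 1) in finished
    return same_step_of_previous_fragment_done and previous_step_of_this_fragment_done
```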
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
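A minimal sketch of such fragmentation, assuming for illustration the 64 MB unit fragment size mentioned later in the description; the helper name and the (offset, length) representation are assumptions:

```python
import math

UNIT_FRAGMENT_SIZE = 64 * 1024 * 1024  # assumed preset unit fragment size (64 MB)

def plan_fragments(dataset_size: int, unit: int = UNIT_FRAGMENT_SIZE):
    """Split a dataset of `dataset_size` bytes into (offset, length) fragments."""
    count = math.ceil(dataset_size / unit)
    return [(i * unit, min(unit, dataset_size - i * unit)) for i in range(count)]

# Example: a 1280 MB data set yields 20 fragments of 64 MB each.
fragments = plan_fragments(1280 * 1024 * 1024)
assert len(fragments) == 20
```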
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
In some embodiments, training the deep learning framework using the data set fragments comprises: converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to the computing power resources to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
In some embodiments, the centralized storage device, the local cache, and the container memory communicate on a data plane; the node agent process and the environment agent process communicate on a control plane different from the data plane.
In some implementations, the computing power resources include a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
In some embodiments, the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
A second aspect of an embodiment of the present invention provides a neural network model training device, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
Uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment; fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
The invention has the following beneficial technical effects. In the neural network model training method and device provided by the embodiments of the invention, a data set used for training is uploaded to a centralized storage device, and a training task is submitted based on the data set and a script used for executing the training; a host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a neural network model training method provided by the present invention;
FIG. 2 is a schematic diagram of the overall structure of the neural network model training method provided in the present invention;
FIG. 3 is a schematic diagram of an intra-node flow of a neural network model training method provided in the present invention;
FIG. 4 is a schematic diagram of a multi-container-dataset fragment execution sequence of the neural network model training method provided by the present invention;
fig. 5 is a schematic flow chart of the neural network model training method provided by the present invention, which is centered on a node.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish between two entities or parameters that share the same name but are not identical. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this point is not repeated in the following embodiments.
In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of a neural network model training method capable of centrally managing data sets and reducing data redundancy. Fig. 1 is a schematic flow chart of a neural network model training method provided by the present invention.
The neural network model training method, as shown in fig. 1, includes the following steps:
step S101: uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
step S103: determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
step S105: creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
step S107: respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
step S109: sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct relevant hardware to perform the processes, and the processes can be stored in a computer readable storage medium, and when executed, the processes can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment; fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
In some embodiments, training the deep learning framework using the data set fragments comprises: converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to the computing power resources to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
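A hedged sketch of one such training step using TensorFlow, which the description later names as a possible framework; the model, loss function, and optimizer here are placeholders, not the patented implementation:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_on_fragment(samples, labels):
    # Convert the raw fragment data into tensors.
    x = tf.convert_to_tensor(samples, dtype=tf.float32)
    y = tf.convert_to_tensor(labels, dtype=tf.int32)
    with tf.GradientTape() as tape:
        logits = model(x)          # matrix computation on the computing power resource
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    # Use the computation result to update (reconstruct) the framework parameters.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```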
In some embodiments, the centralized storage device, the local cache, and the container memory communicate on a data plane; the node agent process and the environment agent process communicate on a control plane different from the data plane.
In some implementations, the computing power resources include a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
In some embodiments, the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
The following further illustrates embodiments of the invention in accordance with the specific example shown in fig. 2.
For an AI computing platform, a user's training sample data set is first uploaded to the platform and stored in a unified way. When the user needs to train, the training data set to be used is selected first, and the amount of computing resources (e.g., memory, CPU, and GPU) required for training is specified. The AI computing platform then automatically selects hosts in the cluster according to the available computing power, allocates the computing resources, and makes the training data set visible to the computing resources of those hosts.
When a user uploads a data set, it goes to the centralized storage of the platform, such as Lustre FS, NFS, or HDFS (the Lustre file system, network file system, or Hadoop distributed file system). Each Node is a physical machine node to be scheduled, with resources such as a GPU (graphics processing unit), CPU (central processing unit), memory, and SSD (solid state disk). Resources are scheduled according to the computing resources the user has applied for, appropriate Node nodes are selected, Pods are created, and the corresponding centrally stored data sets are mounted into the Pods. The SSD hard disk of each physical machine node serves as the data set cache disk of that node. Once scheduling is completed, the data set is preferentially pulled over the network into the SSD disk of the corresponding computing node.
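A minimal node-selection sketch under the assumption that each node reports its free GPU, CPU, and memory; the dictionary keys and the selection policy below are assumptions for illustration, not the platform's actual scheduler:

```python
def pick_nodes(nodes, need_gpu, need_cpu, need_mem_gb, count):
    """Select `count` nodes whose free resources cover the requested amounts."""
    eligible = [n for n in nodes
                if n["free_gpu"] >= need_gpu
                and n["free_cpu"] >= need_cpu
                and n["free_mem_gb"] >= need_mem_gb]
    # Prefer the least-loaded nodes so the cluster stays balanced.
    eligible.sort(key=lambda n: (n["free_gpu"], n["free_mem_gb"]), reverse=True)
    return eligible[:count]
```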
As shown in FIG. 3, the Node executes the model training task, using the data set stored on the NFS Server. A node agent process runs on the physical machine node used for training. When a training task is created on that node, a container is created for the task. A deep learning framework, the py lib software package, and an env agent process used for training are preset in each container. When the deep learning training task starts (i.e., when the container starts), the env agent process is started automatically.
The node agent is used to access the NFS mount path and to acquire information about the data set needed for the training, such as the data set size and the number of files in the data set. The node agent copies the corresponding data set from the NFS directory to the local cache (i.e., the local disk cache) and, after the copy completes, puts the corresponding file into the local cache queue (indicating that the data is ready in the disk cache).
The local disk cache directory is mounted into each container, and the env agent process in the container can communicate with the node agent through an interface. The env agent process can query, through this interface, the state of the local cache queue and its corresponding files. The env agent reads the data from the queue, loads it into the memory of the corresponding container, dequeues it from the local disk cache queue, and adds it to the env agent's memory queue (indicating that the data is ready in memory), as sketched below.
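A simplified sketch of this hand-off from the local disk cache queue to the container memory queue; the queue objects and the loop below are illustrative assumptions, not the actual agent implementation:

```python
import queue
import threading

local_cache_queue = queue.Queue()       # file paths ready on the node's SSD cache
container_memory_queue = queue.Queue()  # fragment bytes ready in container memory

def env_agent_loop():
    """Move fragments that are ready on disk into container memory."""
    while True:
        path = local_cache_queue.get()    # blocks until a fragment file is ready
        if path is None:                  # sentinel: no more fragments
            break
        with open(path, "rb") as f:
            data = f.read()               # load the fragment into memory
        container_memory_queue.put(data)  # the data is now ready in memory

threading.Thread(target=env_agent_loop, daemon=True).start()
```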
The env agent process provides an interface layer for Python (py lib in the figure). This py lib can be preinstalled in the container by means of pip. Thus, when a user writes a training script, the data set can be used by importing the package and then calling the Python API of py lib.
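A hypothetical usage sketch of such an interface in a user training script; the package name `pylib` and the functions `open_dataset` and `iter_fragments` are assumptions, since the description does not fix the API:

```python
# Hypothetical user training script; all pylib names below are assumptions.
import pylib                  # the pip-preinstalled interface package (py lib)
import tensorflow as tf

dataset = pylib.open_dataset("my-training-set")   # assumed API
for fragment in pylib.iter_fragments(dataset):    # fragments arrive as they are cached
    x = tf.convert_to_tensor(fragment.features)
    y = tf.convert_to_tensor(fragment.labels)
    # ... feed (x, y) into the training step of the chosen framework ...
```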
The transmission of a data set goes through six stages (a pipeline sketch follows the list below):
1. acquiring information such as the total size of the data set on the NFS Server, the number of files in the data set, and the MD5 digest of the data set, and fragmenting the data set;
2. reading the specified fragment of the specified data set on the NFS Server, transmitting it to the local node, and writing it to the local cache disk;
3. opening the file of the specified fragment from the local cache disk and reading the data into memory;
4. the training framework reads the specified fragment data from memory and converts the data into a tensor;
5. when needed, the training framework transfers the specified data to the GPU video memory;
6. the GPU reads the tensor data from the video memory and performs matrix computation.
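The sketch below models how such stages can be chained into a pipeline, with one worker thread and one queue per hand-off; the placeholder stage functions stand in for stages 2 to 4, and this is an illustration of the idea rather than the platform's implementation:

```python
import queue
import threading

def run_stage(work, inbox, outbox):
    """Generic pipeline worker: take a fragment, process it, pass it downstream."""
    while True:
        item = inbox.get()
        if item is None:        # sentinel: shut this stage down and tell the next one
            outbox.put(None)
            break
        outbox.put(work(item))

# Placeholder stage functions (e.g. download to cache, read to memory, convert to tensor).
stages = [lambda fragment: fragment] * 3
queues = [queue.Queue() for _ in range(len(stages) + 1)]
for work, inbox, outbox in zip(stages, queues, queues[1:]):
    threading.Thread(target=run_stage, args=(work, inbox, outbox), daemon=True).start()

for fragment_id in range(20):   # e.g. 20 fragments of a 1280 MB data set
    queues[0].put(fragment_id)
queues[0].put(None)             # signal the end of the data set
```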
In a specific data slicing scheme, one 64 MB data block may form one fragment. Stages 1 to 4 are handled by the mechanism provided by the platform, while stages 5 and 6 require a deep learning framework such as TensorFlow. When the six stages are processed as a pipeline, the practical effect is as shown in FIG. 4. Assume the number of phases in the data pipeline is m, the execution time of each phase is t, and the data set has n fragments. Before this method is adopted, for a 1280 MB data set (20 fragments), the time T required for data transmission is:
T = n * m * t = 20 * 6 * t = 120t (m = 6, n = 20)
Using this method, the 1280 MB data set is divided into 20 fragments, and the required transmission time T becomes:
T = m * t + (n - 1) * t = 6t + 19t = 25t (m = 6, n = 20)
It follows that the time required for data transmission is greatly reduced with this technique, roughly a 4.8-fold reduction in this example.
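The two transmission-time formulas above can be checked with a few lines of arithmetic (t is left symbolic, in units of one phase time):

```python
m, n = 6, 20                    # pipeline phases, data set fragments
serial_time = n * m             # 120 t: each fragment passes through all phases alone
pipelined_time = m + (n - 1)    # 25 t: fill the pipeline once, then one t per fragment
speedup = serial_time / pipelined_time  # 4.8
```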
The overall node-centric process flow is shown in FIG. 5. The user uploads the data set to the shared storage of the AIStation platform; the user writes a training script that uses the API provided by py lib, specifies the resources required by the training task, and selects the data set required for training; after the user submits the training task, it is dispatched to the designated node, the corresponding container image is obtained, and the designated container is created according to the resource requirements of the training task; and a series of pipelined actions, such as data set acquisition, caching, and writing into memory, is then carried out.
It can be seen from the above embodiments that, in the neural network model training method provided by the embodiments of the present invention, the data set used for training is uploaded to the centralized storage device, and the training task is submitted based on the data set and the script used for executing the training; the host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
It should be particularly noted that, the steps in the embodiments of the neural network model training method described above may be mutually intersected, replaced, added, or deleted, and therefore, these reasonable permutation and combination transformations should also belong to the scope of the present invention, and should not limit the scope of the present invention to the described embodiments.
In view of the above, according to a second aspect of the embodiments of the present invention, an embodiment of a neural network model training apparatus capable of centrally managing data sets and reducing data redundancy is provided. The neural network model training device includes:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
It can be seen from the foregoing embodiments that, in the neural network model training device provided by the embodiments of the present invention, the data set used for training is uploaded to the centralized storage device, and the training task is submitted based on the data set and the script used for executing the training; the host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
It should be particularly noted that the above embodiment of the neural network model training device uses the embodiment of the neural network model training method to describe the working process of each module in detail, and those skilled in the art can readily appreciate that these modules may likewise be applied to other embodiments of the neural network model training method. Of course, since the steps in the embodiment of the neural network model training method may be intersected, replaced, added, or deleted with respect to one another, neural network model training devices obtained through such reasonable permutations, combinations, and transformations shall also fall within the scope of the present invention, and the scope of protection shall not be limited to the described embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the spirit of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the scope of protection of the embodiments of the present invention.
Claims (10)
1. A neural network model training method is characterized by comprising the following steps:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to the training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing power resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
2. The method of claim 1, wherein acquiring, by each of the nodes, metadata information of the corresponding training task segment from the centralized storage device comprises: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information comprises: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
3. The method of claim 2, wherein sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface comprises:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into a local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into a container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
4. The method of claim 3, wherein training the deep learning framework using the data set fragments comprises:
converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to a computing power resource to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
5. The method of claim 3, wherein the centralized storage device, the local cache, and the container memory communicate on a data plane; and the node agent process and the environment agent process communicate on a control plane different from the data plane.
6. The method of claim 1, wherein the computing power resources comprise a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
7. The method of claim 1, wherein the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
8. A neural network model training device, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
uploading a data set used for training to the centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by the host according to the training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing power resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
9. The apparatus of claim 8, wherein acquiring, by each of the nodes, metadata information of the corresponding training task segment from the centralized storage device comprises: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information comprises: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
10. The apparatus of claim 8, wherein sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface comprises:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into a local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into a container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010690926.3A CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010690926.3A CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111860835A true CN111860835A (en) | 2020-10-30 |
Family
ID=73000501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010690926.3A Withdrawn CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860835A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700004A (en) * | 2020-12-25 | 2021-04-23 | 南方电网深圳数字电网研究院有限公司 | Deep learning model training method and device based on container technology and storage medium |
CN112882999A (en) * | 2021-01-31 | 2021-06-01 | 云知声智能科技股份有限公司 | Training acceleration method, device and system based on distributed cache affinity scheduling |
CN113469372A (en) * | 2021-07-02 | 2021-10-01 | 北京市商汤科技开发有限公司 | Reinforcement learning training method, device, electronic equipment and storage medium |
CN113569987A (en) * | 2021-08-19 | 2021-10-29 | 北京沃东天骏信息技术有限公司 | Model training method and device |
CN113792885A (en) * | 2021-08-20 | 2021-12-14 | 山东英信计算机技术有限公司 | Execution method and related device for deep learning training |
WO2022161081A1 (en) * | 2021-01-28 | 2022-08-04 | 华为技术有限公司 | Training method, apparatus and system for integrated learning model, and related device |
CN115022405A (en) * | 2022-08-10 | 2022-09-06 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115114022A (en) * | 2022-06-24 | 2022-09-27 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for using GPU resources |
CN115509644A (en) * | 2022-11-21 | 2022-12-23 | 北京邮电大学 | Calculation force unloading method and device, electronic equipment and storage medium |
GB2611764A (en) * | 2021-10-08 | 2023-04-19 | Samsung Electronics Co Ltd | Method, system and apparatus for image orientation correction |
CN116136838A (en) * | 2023-04-19 | 2023-05-19 | 之江实验室 | Method and device for fast loading deep learning training data set into temporary buffer memory |
WO2023226284A1 (en) * | 2022-05-26 | 2023-11-30 | 鹏城实验室 | Deep learning model training method and apparatus, device and storage medium |
WO2023241312A1 (en) * | 2022-06-16 | 2023-12-21 | 北京火山引擎科技有限公司 | Model training method and apparatus |
-
2020
- 2020-07-17 CN CN202010690926.3A patent/CN111860835A/en not_active Withdrawn
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700004A (en) * | 2020-12-25 | 2021-04-23 | 南方电网深圳数字电网研究院有限公司 | Deep learning model training method and device based on container technology and storage medium |
WO2022161081A1 (en) * | 2021-01-28 | 2022-08-04 | 华为技术有限公司 | Training method, apparatus and system for integrated learning model, and related device |
CN112882999A (en) * | 2021-01-31 | 2021-06-01 | 云知声智能科技股份有限公司 | Training acceleration method, device and system based on distributed cache affinity scheduling |
CN113469372A (en) * | 2021-07-02 | 2021-10-01 | 北京市商汤科技开发有限公司 | Reinforcement learning training method, device, electronic equipment and storage medium |
CN113569987A (en) * | 2021-08-19 | 2021-10-29 | 北京沃东天骏信息技术有限公司 | Model training method and device |
CN113792885A (en) * | 2021-08-20 | 2021-12-14 | 山东英信计算机技术有限公司 | Execution method and related device for deep learning training |
GB2611764A (en) * | 2021-10-08 | 2023-04-19 | Samsung Electronics Co Ltd | Method, system and apparatus for image orientation correction |
WO2023226284A1 (en) * | 2022-05-26 | 2023-11-30 | 鹏城实验室 | Deep learning model training method and apparatus, device and storage medium |
WO2023241312A1 (en) * | 2022-06-16 | 2023-12-21 | 北京火山引擎科技有限公司 | Model training method and apparatus |
CN115114022A (en) * | 2022-06-24 | 2022-09-27 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for using GPU resources |
CN115114022B (en) * | 2022-06-24 | 2024-10-15 | 苏州浪潮智能科技有限公司 | Method, system, equipment and medium for using GPU (graphics processing Unit) resources |
CN115022405A (en) * | 2022-08-10 | 2022-09-06 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115022405B (en) * | 2022-08-10 | 2022-10-25 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115509644A (en) * | 2022-11-21 | 2022-12-23 | 北京邮电大学 | Calculation force unloading method and device, electronic equipment and storage medium |
CN116136838A (en) * | 2023-04-19 | 2023-05-19 | 之江实验室 | Method and device for fast loading deep learning training data set into temporary buffer memory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111860835A (en) | Neural network model training method and device | |
US9389995B2 (en) | Optimization of Map-Reduce shuffle performance through snuffler I/O pipeline actions and planning | |
CN110262901B (en) | Data processing method and data processing system | |
US9331943B2 (en) | Asynchronous scheduling informed by job characteristics and anticipatory provisioning of data for real-time, parallel processing | |
US20200410031A1 (en) | Systems and methods for cloud computing | |
US20190378016A1 (en) | Distributed computing architecture for large model deep learning | |
CN111309649B (en) | Data transmission and task processing method, device and equipment | |
US9558216B2 (en) | Moving tables across nodes in an in-memory database instance | |
CN107077390A (en) | A kind of task processing method and network interface card | |
US11817999B1 (en) | Computer-based systems for management of big data development platforms based on machine learning techniques and methods of use thereof | |
US20200004464A1 (en) | Method and apparatus for storing data | |
CN112597126B (en) | Data migration method and device | |
CN112948025B (en) | Data loading method and device, storage medium, computing equipment and computing system | |
CN111611622A (en) | Block chain-based file storage method and electronic equipment | |
US20240330410A1 (en) | Managing and streaming a plurality of large-scale datasets | |
Ashu et al. | Intelligent data compression policy for Hadoop performance optimization | |
CN112965939A (en) | File merging method, device and equipment | |
CN115185679A (en) | Task processing method and device for artificial intelligence algorithm, server and storage medium | |
CN106649716A (en) | Multithread-based online file format conversion method and system | |
CN112230956A (en) | Artificial intelligence model updating method, system, electronic equipment and storage medium | |
CN110955461B (en) | Processing method, device, system, server and storage medium for computing task | |
CN111444148A (en) | Data transmission method and device based on MapReduce | |
Liu et al. | A large-scale rendering system based on hadoop | |
CN112182111B (en) | Block chain based distributed system layered processing method and electronic equipment | |
US11809992B1 (en) | Applying compression profiles across similar neural network architectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20201030 |