CN111860835A - Neural network model training method and device - Google Patents
- Publication number
- CN111860835A (Application CN202010690926.3A)
- Authority
- CN
- China
- Prior art keywords
- data set
- local cache
- node
- container
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a neural network model training method and device. The method comprises the following steps: uploading a data set used for training to a centralized storage device and submitting a training task; determining, by a host according to the training task, a plurality of nodes with computing resources, and splitting the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; creating, on each node, a plurality of containers for calling computing power resources, and deploying a deep learning framework and a script interface; acquiring, by each node from the centralized storage device, metadata information of the corresponding training task segment, and fragmenting the data set corresponding to that segment; and, for each data set fragment in turn, downloading it to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, a script imported through the script interface. The invention can manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
Description
Technical Field
The present invention relates to the field of artificial intelligence, and more particularly, to a method and an apparatus for training a neural network model.
Background
As intelligent technology is combined with the real economy and iterated, algorithms, computing power, and data all come at a premium, and most enterprises lack the technical capability and the budget to obtain them. For most enterprises, the cost of recruiting senior AI experts and investing in research and development time is too high, data management is difficult, data redundancy is high, and training is slow. It is therefore highly significant to package complex AI technology into a zero-threshold AI platform so that different industries can be empowered by AI.
No effective solution is currently available for the problems in the prior art of deep learning data being difficult to manage, highly redundant, slow to process, and costly.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a neural network model training method and apparatus, which can centrally manage a data set, reduce data redundancy, increase training speed, and reduce resource cost.
In view of the above, a first aspect of the embodiments of the present invention provides a neural network model training method, including the following steps:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
Determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
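For illustration only, this pipelined dependency rule can be expressed in a few lines of Python; the function and its arguments below are assumptions made for the sketch, not part of the claimed method:

```python
# Pipeline dependency rule: step s of fragment i may start only after
# step s of fragment i-1 and step s-1 of fragment i have both finished.
def can_start(finished: set, fragment: int, step: int) -> bool:
    same_step_of_previous_fragment_done = fragment == 0 or (fragment - 1, step) in finished
    previous_step_of_this_fragment_done = step == 0 or (fragment, step - 1) in finished
    return same_step_of_previous_fragment_done and previous_step_of_this_fragment_done
```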
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
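A minimal sketch of such fragmentation, assuming for illustration the 64 MB unit fragment size mentioned later in the description; the helper name and the (offset, length) representation are assumptions:

```python
import math

UNIT_FRAGMENT_SIZE = 64 * 1024 * 1024  # assumed preset unit fragment size (64 MB)

def plan_fragments(dataset_size: int, unit: int = UNIT_FRAGMENT_SIZE):
    """Split a dataset of `dataset_size` bytes into (offset, length) fragments."""
    count = math.ceil(dataset_size / unit)
    return [(i * unit, min(unit, dataset_size - i * unit)) for i in range(count)]

# Example: a 1280 MB data set yields 20 fragments of 64 MB each.
fragments = plan_fragments(1280 * 1024 * 1024)
assert len(fragments) == 20
```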
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
In some embodiments, training the deep learning framework using the data set fragments comprises: converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to the computing power resources to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
In some embodiments, the centralized storage device, the local cache, and the container memory communicate on a data plane; the node agent process and the environment agent process communicate on a control plane different from the data plane.
In some implementations, the computing power resources include a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
In some embodiments, the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
A second aspect of an embodiment of the present invention provides a neural network model training device, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
Uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment; fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
The invention has the following beneficial technical effects. In the neural network model training method and device provided by the embodiments of the invention, a data set used for training is uploaded to a centralized storage device, and a training task is submitted based on the data set and a script used for executing the training; a host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a neural network model training method provided by the present invention;
FIG. 2 is a schematic diagram of the overall structure of the neural network model training method provided in the present invention;
FIG. 3 is a schematic diagram of an intra-node flow of a neural network model training method provided in the present invention;
FIG. 4 is a schematic diagram of a multi-container-dataset fragment execution sequence of the neural network model training method provided by the present invention;
fig. 5 is a schematic flow chart of the neural network model training method provided by the present invention, which is centered on a node.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish between two entities or parameters that share the same name but are not identical. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this point is not repeated in the following embodiments.
In view of the above, a first aspect of the embodiments of the present invention provides an embodiment of a neural network model training method capable of centrally managing data sets and reducing data redundancy. Fig. 1 is a schematic flow chart of a neural network model training method provided by the present invention.
The neural network model training method, as shown in fig. 1, includes the following steps:
step S101: uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
step S103: determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
step S105: creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
step S107: respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
step S109: sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct relevant hardware to perform the processes, and the processes can be stored in a computer readable storage medium, and when executed, the processes can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment; fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
In some embodiments, training the deep learning framework using the data set fragments comprises: converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to the computing power resources to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
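A hedged sketch of one such training step using TensorFlow, which the description later names as a possible framework; the model, loss function, and optimizer here are placeholders, not the patented implementation:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_on_fragment(samples, labels):
    # Convert the raw fragment data into tensors.
    x = tf.convert_to_tensor(samples, dtype=tf.float32)
    y = tf.convert_to_tensor(labels, dtype=tf.int32)
    with tf.GradientTape() as tape:
        logits = model(x)          # matrix computation on the computing power resource
        loss = loss_fn(y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    # Use the computation result to update (reconstruct) the framework parameters.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return float(loss)
```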
In some embodiments, the centralized storage device, the local cache, and the container memory communicate on a data plane; the node agent process and the environment agent process communicate on a control plane different from the data plane.
In some implementations, the computing power resources include a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
In some embodiments, the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
The following further illustrates embodiments of the invention in accordance with the specific example shown in fig. 2.
For an AI computing platform, a user's training sample data set is first uploaded to the platform and stored in a unified way. When the user needs to train, the training data set to be used is selected first, and the amount of computing resources (e.g., memory, CPU, and GPU) required for training is specified. The AI computing platform then automatically selects hosts in the cluster according to the available computing power, allocates the computing resources, and makes the training data set visible to the computing resources of those hosts.
When a user uploads a data set, it goes to the centralized storage of the platform, such as Lustre FS, NFS, or HDFS (the Lustre file system, network file system, or Hadoop distributed file system). Each Node is a physical machine node to be scheduled, with resources such as a GPU (graphics processing unit), CPU (central processing unit), memory, and SSD (solid state disk). Resources are scheduled according to the computing resources the user has applied for, appropriate Node nodes are selected, Pods are created, and the corresponding centrally stored data sets are mounted into the Pods. The SSD hard disk of each physical machine node serves as the data set cache disk of that node. Once scheduling is completed, the data set is preferentially pulled over the network into the SSD disk of the corresponding computing node.
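A minimal node-selection sketch under the assumption that each node reports its free GPU, CPU, and memory; the dictionary keys and the selection policy below are assumptions for illustration, not the platform's actual scheduler:

```python
def pick_nodes(nodes, need_gpu, need_cpu, need_mem_gb, count):
    """Select `count` nodes whose free resources cover the requested amounts."""
    eligible = [n for n in nodes
                if n["free_gpu"] >= need_gpu
                and n["free_cpu"] >= need_cpu
                and n["free_mem_gb"] >= need_mem_gb]
    # Prefer the least-loaded nodes so the cluster stays balanced.
    eligible.sort(key=lambda n: (n["free_gpu"], n["free_mem_gb"]), reverse=True)
    return eligible[:count]
```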
As shown in FIG. 3, the Node executes the model training task, using the data set stored on the NFS Server. A node agent process runs on the physical machine node used for training. When a training task is created on that node, a container is created for the task. A deep learning framework, the py lib software package, and an env agent process used for training are preset in each container. When the deep learning training task starts (i.e., when the container starts), the env agent process is started automatically.
The node agent is used to access the NFS mount path and to acquire information about the data set needed for the training, such as the data set size and the number of files in the data set. The node agent copies the corresponding data set from the NFS directory to the local cache (i.e., the local disk cache) and, after the copy completes, puts the corresponding file into the local cache queue (indicating that the data is ready in the disk cache).
The local disk cache directory is mounted into each container, and the env agent process in the container can communicate with the node agent through an interface. The env agent process can query, through this interface, the state of the local cache queue and its corresponding files. The env agent reads the data from the queue, loads it into the memory of the corresponding container, dequeues it from the local disk cache queue, and adds it to the env agent's memory queue (indicating that the data is ready in memory), as sketched below.
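A simplified sketch of this hand-off from the local disk cache queue to the container memory queue; the queue objects and the loop below are illustrative assumptions, not the actual agent implementation:

```python
import queue
import threading

local_cache_queue = queue.Queue()       # file paths ready on the node's SSD cache
container_memory_queue = queue.Queue()  # fragment bytes ready in container memory

def env_agent_loop():
    """Move fragments that are ready on disk into container memory."""
    while True:
        path = local_cache_queue.get()    # blocks until a fragment file is ready
        if path is None:                  # sentinel: no more fragments
            break
        with open(path, "rb") as f:
            data = f.read()               # load the fragment into memory
        container_memory_queue.put(data)  # the data is now ready in memory

threading.Thread(target=env_agent_loop, daemon=True).start()
```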
The env agent process provides an interface layer for Python (py lib in the figure). This py lib can be preinstalled in the container by means of pip. Thus, when a user writes a training script, the data set can be used by importing the package and then calling the Python API of py lib.
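A hypothetical usage sketch of such an interface in a user training script; the package name `pylib` and the functions `open_dataset` and `iter_fragments` are assumptions, since the description does not fix the API:

```python
# Hypothetical user training script; all pylib names below are assumptions.
import pylib                  # the pip-preinstalled interface package (py lib)
import tensorflow as tf

dataset = pylib.open_dataset("my-training-set")   # assumed API
for fragment in pylib.iter_fragments(dataset):    # fragments arrive as they are cached
    x = tf.convert_to_tensor(fragment.features)
    y = tf.convert_to_tensor(fragment.labels)
    # ... feed (x, y) into the training step of the chosen framework ...
```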
The transmission of a data set goes through six stages (a pipeline sketch follows the list below):
1. acquiring information such as the total size of the data set on the NFS Server, the number of files in the data set, and the MD5 digest of the data set, and fragmenting the data set;
2. reading the specified fragment of the specified data set on the NFS Server, transmitting it to the local node, and writing it to the local cache disk;
3. opening the file of the specified fragment from the local cache disk and reading the data into memory;
4. the training framework reads the specified fragment data from memory and converts the data into a tensor;
5. when needed, the training framework transfers the specified data to the GPU video memory;
6. the GPU reads the tensor data from the video memory and performs matrix computation.
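The sketch below models how such stages can be chained into a pipeline, with one worker thread and one queue per hand-off; the placeholder stage functions stand in for stages 2 to 4, and this is an illustration of the idea rather than the platform's implementation:

```python
import queue
import threading

def run_stage(work, inbox, outbox):
    """Generic pipeline worker: take a fragment, process it, pass it downstream."""
    while True:
        item = inbox.get()
        if item is None:        # sentinel: shut this stage down and tell the next one
            outbox.put(None)
            break
        outbox.put(work(item))

# Placeholder stage functions (e.g. download to cache, read to memory, convert to tensor).
stages = [lambda fragment: fragment] * 3
queues = [queue.Queue() for _ in range(len(stages) + 1)]
for work, inbox, outbox in zip(stages, queues, queues[1:]):
    threading.Thread(target=run_stage, args=(work, inbox, outbox), daemon=True).start()

for fragment_id in range(20):   # e.g. 20 fragments of a 1280 MB data set
    queues[0].put(fragment_id)
queues[0].put(None)             # signal the end of the data set
```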
In a specific data slicing scheme, one 64 MB data block may form one fragment. Stages 1 to 4 are handled by the mechanism provided by the platform, while stages 5 and 6 require a deep learning framework such as TensorFlow. When the six stages are processed as a pipeline, the practical effect is as shown in FIG. 4. Assume the number of phases in the data pipeline is m, the execution time of each phase is t, and the data set has n fragments. Before this method is adopted, for a 1280 MB data set (20 fragments), the time T required for data transmission is:
T = n * m * t = 20 * 6 * t = 120t (m = 6, n = 20)
Using this method, the 1280 MB data set is divided into 20 fragments, and the required transmission time T becomes:
T = m * t + (n - 1) * t = 6t + 19t = 25t (m = 6, n = 20)
It follows that the time required for data transmission is greatly reduced with this technique, roughly a 4.8-fold reduction in this example.
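The two transmission-time formulas above can be checked with a few lines of arithmetic (t is left symbolic, in units of one phase time):

```python
m, n = 6, 20                    # pipeline phases, data set fragments
serial_time = n * m             # 120 t: each fragment passes through all phases alone
pipelined_time = m + (n - 1)    # 25 t: fill the pipeline once, then one t per fragment
speedup = serial_time / pipelined_time  # 4.8
```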
The overall node-centric process flow is shown in FIG. 5. The user uploads the data set to the shared storage of the AIStation platform; the user writes a training script that uses the API provided by py lib, specifies the resources required by the training task, and selects the data set required for training; after the user submits the training task, it is dispatched to the designated node, the corresponding container image is obtained, and the designated container is created according to the resource requirements of the training task; and a series of pipelined actions, such as data set acquisition, caching, and writing into memory, is then carried out.
It can be seen from the above embodiments that, in the neural network model training method provided by the embodiments of the present invention, the data set used for training is uploaded to the centralized storage device, and the training task is submitted based on the data set and the script used for executing the training; the host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
It should be particularly noted that, the steps in the embodiments of the neural network model training method described above may be mutually intersected, replaced, added, or deleted, and therefore, these reasonable permutation and combination transformations should also belong to the scope of the present invention, and should not limit the scope of the present invention to the described embodiments.
In view of the above, according to a second aspect of the embodiments of the present invention, an embodiment of a neural network model training apparatus capable of centrally managing data sets and reducing data redundancy is provided. The neural network model training device includes:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to a training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
In some embodiments, acquiring, by each node, the metadata information of the corresponding training task segment from the centralized storage device includes: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information includes: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
In some embodiments, sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface consists of the following steps:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into the local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into the container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
It can be seen from the foregoing embodiments that, in the neural network model training device provided by the embodiments of the present invention, the data set used for training is uploaded to the centralized storage device, and the training task is submitted based on the data set and the script used for executing the training; the host determines a plurality of nodes with computing resources according to the training task and splits the training task into a plurality of training task segments that are respectively dispatched to the plurality of nodes; a plurality of containers for calling computing power resources are created on each node, and a deep learning framework and a script interface are deployed for each container; each node acquires from the centralized storage device the metadata information of its corresponding training task segment and fragments the data set corresponding to that segment according to the metadata information; and the following steps are executed sequentially for each data set fragment: downloading to a local cache of the node, adding to a local cache queue, loading into a container memory, adding to a container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment. This technical solution makes it possible to manage data sets centrally, reduce data redundancy, improve training speed, and reduce resource cost.
It should be particularly noted that the above embodiment of the neural network model training device uses the embodiment of the neural network model training method to describe the working process of each module in detail, and those skilled in the art can readily appreciate that these modules may likewise be applied to other embodiments of the neural network model training method. Of course, since the steps in the embodiment of the neural network model training method may be intersected, replaced, added, or deleted with respect to one another, neural network model training devices obtained through such reasonable permutations, combinations, and transformations shall also fall within the scope of the present invention, and the scope of protection shall not be limited to the described embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the spirit of the embodiments of the invention, technical features of the above embodiment or of different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made within the spirit and principles of the embodiments of the present invention are intended to be included within the scope of protection of the embodiments of the present invention.
Claims (10)
1. A neural network model training method is characterized by comprising the following steps:
uploading a data set used for training to a centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by a host according to the training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing power resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
2. The method of claim 1, wherein acquiring, by each of the nodes, metadata information of the corresponding training task segment from the centralized storage device comprises: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information comprises: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
3. The method of claim 2, wherein sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface comprises:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into a local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into a container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
4. The method of claim 3, wherein training the deep learning framework using the data set fragments comprises:
converting, by the deep learning framework, the data of the data set fragments into tensors, sending the tensors to a computing power resource to perform matrix computation, and reconstructing the parameters of the deep learning framework using the results of the matrix computation.
5. The method of claim 3, wherein the centralized storage device, the local cache, and the container memory communicate on a data plane; and the node agent process and the environment agent process communicate on a control plane different from the data plane.
6. The method of claim 1, wherein the computing power resources comprise a graphics processing unit, a central processing unit, internal memory, and/or a solid state disk.
7. The method of claim 1, wherein the centralized storage device uses a network file system, a Hadoop distributed file system, or a Lustre file system.
8. A neural network model training device, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed sequentially performing the steps of:
uploading a data set used for training to the centralized storage device, and submitting a training task based on the data set and a script used for executing the training;
determining a plurality of nodes with computing resources by the host according to the training task, and splitting the training task into a plurality of training task segments to be respectively dispatched to the plurality of nodes;
creating a plurality of containers for calling computational power resources on each node, and deploying a deep learning framework and a script interface for each container;
respectively acquiring, by each node, metadata information of the corresponding training task segment from the centralized storage device, and fragmenting the data set corresponding to the training task segment according to the metadata information;
sequentially executing the following steps for each data set fragment: downloading the fragment to a local cache of the node, adding it to a local cache queue, loading it into a container memory, adding it to a container memory queue, and having the container call computing power resources to execute, on the deep learning framework, the script imported through the script interface, wherein a given step for a subsequent data set fragment is executed in response to completion of both the same step for the previous data set fragment and the preceding step for that subsequent fragment.
9. The apparatus of claim 8, wherein acquiring, by each of the nodes, metadata information of the corresponding training task segment from the centralized storage device comprises: acquiring the data set size, the number of files in the data set, and the data set message digest of the corresponding training task segment;
fragmenting the data set corresponding to the training task segment according to the metadata information comprises: fragmenting the data set corresponding to the training task segment according to the data set size and a preset unit fragment size.
10. The apparatus of claim 8, wherein sequentially executing, for each data set fragment, the steps of downloading to the local cache of the node, adding to the local cache queue, loading into the container memory, adding to the container memory queue, and having the container call computing resources to execute, on the deep learning framework, the script imported through the script interface comprises:
controlling, by a node agent process of the node, the local cache to read the data set fragment from the centralized storage device so as to download and save it into the local cache;
controlling, by the node agent process, the local cache to put the data set fragment saved in the local cache into a local cache queue in file form;
determining, by an environment agent process of the container through the node agent process, the files of data set fragments present in the local cache queue, and controlling the container memory to read those files from the local cache queue so as to load and save them into the container memory;
controlling, by the environment agent process, the container memory to remove the data set fragment saved in the container memory from the local cache queue in file form and to place it into a container memory queue;
importing and executing, by the environment agent process, the script in a packaged manner using a preset script library as the script interface, so as to train the deep learning framework with the data set fragments.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010690926.3A CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010690926.3A CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111860835A true CN111860835A (en) | 2020-10-30 |
Family
ID=73000501
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010690926.3A Withdrawn CN111860835A (en) | 2020-07-17 | 2020-07-17 | Neural network model training method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860835A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700004A (en) * | 2020-12-25 | 2021-04-23 | 南方电网深圳数字电网研究院有限公司 | Deep learning model training method and device based on container technology and storage medium |
CN112882999A (en) * | 2021-01-31 | 2021-06-01 | 云知声智能科技股份有限公司 | Training acceleration method, device and system based on distributed cache affinity scheduling |
CN113469372A (en) * | 2021-07-02 | 2021-10-01 | 北京市商汤科技开发有限公司 | Reinforcement learning training method, device, electronic equipment and storage medium |
CN113569987A (en) * | 2021-08-19 | 2021-10-29 | 北京沃东天骏信息技术有限公司 | Model training method and device |
CN113792885A (en) * | 2021-08-20 | 2021-12-14 | 山东英信计算机技术有限公司 | Execution method and related device for deep learning training |
WO2022161081A1 (en) * | 2021-01-28 | 2022-08-04 | 华为技术有限公司 | Training method, apparatus and system for integrated learning model, and related device |
CN115022405A (en) * | 2022-08-10 | 2022-09-06 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115114022A (en) * | 2022-06-24 | 2022-09-27 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for using GPU resources |
CN115509644A (en) * | 2022-11-21 | 2022-12-23 | 北京邮电大学 | Calculation force unloading method and device, electronic equipment and storage medium |
GB2611764A (en) * | 2021-10-08 | 2023-04-19 | Samsung Electronics Co Ltd | Method, system and apparatus for image orientation correction |
CN116136838A (en) * | 2023-04-19 | 2023-05-19 | 之江实验室 | Method and device for fast loading deep learning training data set into temporary buffer memory |
WO2023226284A1 (en) * | 2022-05-26 | 2023-11-30 | 鹏城实验室 | Deep learning model training method and apparatus, device and storage medium |
WO2023241312A1 (en) * | 2022-06-16 | 2023-12-21 | 北京火山引擎科技有限公司 | Model training method and apparatus |
-
2020
- 2020-07-17 CN CN202010690926.3A patent/CN111860835A/en not_active Withdrawn
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112700004A (en) * | 2020-12-25 | 2021-04-23 | 南方电网深圳数字电网研究院有限公司 | Deep learning model training method and device based on container technology and storage medium |
WO2022161081A1 (en) * | 2021-01-28 | 2022-08-04 | 华为技术有限公司 | Training method, apparatus and system for integrated learning model, and related device |
CN112882999A (en) * | 2021-01-31 | 2021-06-01 | 云知声智能科技股份有限公司 | Training acceleration method, device and system based on distributed cache affinity scheduling |
CN113469372A (en) * | 2021-07-02 | 2021-10-01 | 北京市商汤科技开发有限公司 | Reinforcement learning training method, device, electronic equipment and storage medium |
CN113569987A (en) * | 2021-08-19 | 2021-10-29 | 北京沃东天骏信息技术有限公司 | Model training method and device |
CN113792885A (en) * | 2021-08-20 | 2021-12-14 | 山东英信计算机技术有限公司 | Execution method and related device for deep learning training |
GB2611764A (en) * | 2021-10-08 | 2023-04-19 | Samsung Electronics Co Ltd | Method, system and apparatus for image orientation correction |
WO2023226284A1 (en) * | 2022-05-26 | 2023-11-30 | 鹏城实验室 | Deep learning model training method and apparatus, device and storage medium |
WO2023241312A1 (en) * | 2022-06-16 | 2023-12-21 | 北京火山引擎科技有限公司 | Model training method and apparatus |
CN115114022A (en) * | 2022-06-24 | 2022-09-27 | 苏州浪潮智能科技有限公司 | Method, system, device and medium for using GPU resources |
CN115114022B (en) * | 2022-06-24 | 2024-10-15 | 苏州浪潮智能科技有限公司 | Method, system, equipment and medium for using GPU (graphics processing Unit) resources |
CN115022405A (en) * | 2022-08-10 | 2022-09-06 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115022405B (en) * | 2022-08-10 | 2022-10-25 | 合肥中科类脑智能技术有限公司 | Intelligent cache acceleration system and method of deep learning cloud platform |
CN115509644A (en) * | 2022-11-21 | 2022-12-23 | 北京邮电大学 | Calculation force unloading method and device, electronic equipment and storage medium |
CN116136838A (en) * | 2023-04-19 | 2023-05-19 | 之江实验室 | Method and device for fast loading deep learning training data set into temporary buffer memory |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111860835A (en) | Neural network model training method and device | |
US9389995B2 (en) | Optimization of Map-Reduce shuffle performance through snuffler I/O pipeline actions and planning | |
CN110262901B (en) | Data processing method and data processing system | |
US9331943B2 (en) | Asynchronous scheduling informed by job characteristics and anticipatory provisioning of data for real-time, parallel processing | |
US20200410031A1 (en) | Systems and methods for cloud computing | |
US20190378016A1 (en) | Distributed computing architecture for large model deep learning | |
CN111309649B (en) | Data transmission and task processing method, device and equipment | |
US9558216B2 (en) | Moving tables across nodes in an in-memory database instance | |
CN107077390A (en) | A kind of task processing method and network interface card | |
US11817999B1 (en) | Computer-based systems for management of big data development platforms based on machine learning techniques and methods of use thereof | |
US20200004464A1 (en) | Method and apparatus for storing data | |
CN112597126B (en) | Data migration method and device | |
CN112948025B (en) | Data loading method and device, storage medium, computing equipment and computing system | |
CN111611622A (en) | Block chain-based file storage method and electronic equipment | |
US20240330410A1 (en) | Managing and streaming a plurality of large-scale datasets | |
Ashu et al. | Intelligent data compression policy for Hadoop performance optimization | |
CN112965939A (en) | File merging method, device and equipment | |
CN115185679A (en) | Task processing method and device for artificial intelligence algorithm, server and storage medium | |
CN106649716A (en) | Multithread-based online file format conversion method and system | |
CN112230956A (en) | Artificial intelligence model updating method, system, electronic equipment and storage medium | |
CN110955461B (en) | Processing method, device, system, server and storage medium for computing task | |
CN111444148A (en) | Data transmission method and device based on MapReduce | |
Liu et al. | A large-scale rendering system based on hadoop | |
CN112182111B (en) | Block chain based distributed system layered processing method and electronic equipment | |
US11809992B1 (en) | Applying compression profiles across similar neural network architectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20201030 |