CN113849295A - Model training method and device and computer readable storage medium - Google Patents


Info

Publication number
CN113849295A
CN113849295A (application CN202010600109.4A)
Authority
CN
China
Prior art keywords
resource
node
training
identifier
job
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010600109.4A
Other languages
Chinese (zh)
Inventor
王国威
包小明
徐华
周敏均
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202010600109.4A
Publication of CN113849295A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The application discloses a model training method and device and a computer readable storage medium, and belongs to the field of communications. The method comprises the following steps: a management node schedules a first model training task, where the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, and the first intelligent model is obtained by configuring the algorithm corresponding to the first parameter adjustment job based on a first parameter value set; the management node determines a first computing node according to the job identifier, where the first computing node has at least one of first training data and an idle first resource, the first resource being the resources required for processing the first parameter adjustment job and the first training data being the training data required for training an intelligent model of the first parameter adjustment job; and the management node sends a first training request to the first computing node, where the first training request is used by the first computing node to train the first intelligent model according to at least one of the first resource and the first training data. The method and the device can improve the efficiency of model training.

Description

Model training method and device and computer readable storage medium
Technical Field
The present application relates to the field of communications, and in particular, to a method and an apparatus for model training, and a computer-readable storage medium.
Background
Intelligent algorithms such as deep learning algorithms are trained to obtain an intelligent model with a specific function, where the specific function may be image recognition, speech recognition and synthesis, natural language processing, or the like. Training an intelligent algorithm means continuously adjusting the values of the super parameters (hyperparameters) and the common parameters of the algorithm so that the algorithm becomes an intelligent model with the specific function. The super parameters define the structure of the intelligent model, the training process, and so on; the common parameters define the function the intelligent model implements.
At present, an intelligent algorithm can be trained using a computing cluster, with a cloud storage system storing the training samples required for the training. To train an intelligent model, a user configures an intelligent algorithm and at least one super parameter in the computing cluster. The computing cluster initializes a value for each super parameter and configures the intelligent algorithm according to these initial values to obtain a first intelligent model. The cluster allocates resources to the first intelligent model, retrieves training data from the cloud storage system, and trains the first intelligent model with the allocated resources using the training data. During this training, the computing cluster continuously adjusts the values of the common parameters of the first intelligent model until the first intelligent model converges or fails to converge, or the number of training iterations reaches a specified number.
When training stops, the computing cluster obtains a training result for the first intelligent model. If the training result does not meet a specified condition, the cluster configures a new value for each super parameter according to information such as the current value of each super parameter and the training result, and configures the intelligent algorithm according to the new values to obtain a second intelligent model. Resources are allocated to the second intelligent model, training data are retrieved from the cloud storage system, and the second intelligent model is trained with the allocated resources using the training data. The values of the common parameters of the second intelligent model are continuously adjusted during training until the second intelligent model converges or fails to converge, or the number of training iterations reaches the specified number.
When training of the second intelligent model stops, the computing cluster again obtains its training result. If that result does not meet the specified condition, the process of obtaining and training a second intelligent model is repeated; if it does, the second intelligent model is the finally trained model with the specific function.
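The iterative procedure above can be pictured as a two-level loop: an outer loop that reconfigures super parameter values, and an inner training step (which adjusts the common parameters) stubbed out here. This is an illustrative Python sketch only; the function names and the toy "training" and search rules are assumptions, not part of the described scheme.

```python
# Illustrative sketch of the tuning loop described above. All names and the
# toy rules are assumptions; the inner training (which adjusts the common
# parameters) is replaced by a stub whose "result" improves with a
# hypothetical "width" super parameter.

def train_model(hp):
    # Stand-in for training until convergence or an iteration limit.
    return {"accuracy": min(0.5 + 0.1 * hp["width"], 0.99), "hp": hp}

def meets_condition(result):
    # The "specified condition" on the training result.
    return result["accuracy"] >= 0.9

def propose_hyperparams(hp, result):
    # New super parameter values derived from the current values and result.
    return {"width": hp["width"] + 1}

def tune(initial_hp, max_rounds=10):
    hp, result = initial_hp, None
    for _ in range(max_rounds):
        result = train_model(hp)        # inner loop: common parameters
        if meets_condition(result):     # stop when the result is good enough
            break
        hp = propose_hyperparams(hp, result)  # outer loop: super parameters
    return result
```

Note that in the prior art as described, every pass through the outer loop re-allocates resources and re-fetches training data, which is the inefficiency the application addresses.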
In the process of implementing the present application, the inventors found that the prior art has at least the following problem:
in the above process, every time an intelligent model is configured, resources must be allocated to it anew and training data must be retrieved again from the cloud storage system, which increases time consumption and reduces the efficiency of model training.
Disclosure of Invention
The application provides a method and a device for model training and a computer readable storage medium, so as to improve the efficiency of model training. The technical scheme is as follows:
In a first aspect, a method for model training is provided: a management node schedules a first model training task, where the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set includes a first parameter value for each of at least one super parameter corresponding to the first parameter adjustment job. The management node determines a first computing node from the node cluster according to the job identifier, where the first computing node has at least one of the first training data and an idle first resource, the first resource being the resources required for processing a model training task of the first parameter adjustment job and the first training data being the training data required for training an intelligent model corresponding to the first parameter adjustment job. The management node sends a first training request to the first computing node, where the first training request includes the first model training task and is used by the first computing node to train the first intelligent model according to at least one of the first resource and the first training data.
In this way, after the first computing node receives the first training request including the first model training task, it is not necessary to allocate the first resource to the first model training task and/or obtain the first training data, so that the time for allocating the first resource and/or the time for obtaining the first training data are saved, and the efficiency for training the first intelligent model is improved.
In one possible implementation, the management node determines the first computing node from the node cluster according to the resource correspondence, the data correspondence, and the job identifier. Any record in the resource correspondence includes a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster, a resource identifier, and a resource state, where the resource identifier identifies the resources included by the computing node that are required for processing a model training task of the parameter adjustment job, and the resource state describes whether those resources are currently idle. Any record in the data correspondence includes a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster, and a data identifier, where the data identifier identifies the training data included by the computing node that are required for training the intelligent model corresponding to the parameter adjustment job. The context information of the first parameter adjustment job can thus be recorded through the resource correspondence and the data correspondence, so that when a computing node is allocated to the first model training task of the first parameter adjustment job, a computing node that includes the first resource and/or the first training data can be accurately determined.
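The two correspondences above can be pictured as lists of records keyed by job identifier. The following is a minimal Python sketch with illustrative field and function names (none of these identifiers come from the application itself):

```python
from dataclasses import dataclass

# Hypothetical in-memory shape of the resource and data correspondences
# described above; field names are illustrative.

@dataclass
class ResourceRecord:
    job_id: str       # job identifier of the parameter adjustment job
    node_id: str      # node identifier of a compute node in the cluster
    resource_id: str  # identifies the resources the node holds for this job
    idle: bool        # resource state: whether the resources are idle now

@dataclass
class DataRecord:
    job_id: str
    node_id: str
    data_id: str      # identifies the training data the node holds for this job

def nodes_for_job(job_id, resource_table, data_table):
    """Nodes holding the job's idle resources and/or its training data."""
    with_idle_resource = {r.node_id for r in resource_table
                          if r.job_id == job_id and r.idle}
    with_data = {d.node_id for d in data_table if d.job_id == job_id}
    return with_idle_resource | with_data
```

A lookup like `nodes_for_job` is the core of determining the candidate computing nodes from the job identifier alone.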
In another possible implementation, the management node determines, according to the resource correspondence, the data correspondence, and the job identifier, N computing nodes in the node cluster that include the first training data and/or the first resource, where N is an integer greater than 0. When at least one target node exists among the N computing nodes, the management node selects one of the at least one target node as the first computing node. A target node includes the idle first resource; or includes the first training data and the idle first resource; or includes the first training data and includes unprotected resources whose size exceeds the size of the resources required for processing the first model training task, where the unprotected resources are the resources in the target node other than protected resources, a protected resource being a resource allocated to a parameter adjustment job whose corresponding protection time period has not ended. Selecting the first computing node from the target nodes ensures that the first computing node has the first training data and sufficient resources to train the first intelligent model, improving the training success rate.
In another possible implementation, the management node determines at least one target node according to the job identifier, and selects one target node as the first computing node according to load information and/or node attribute information of each of the at least one target node. A target node includes the idle first resource; or includes the first training data and the idle first resource; or includes the first training data and includes unprotected resources whose size exceeds the size of the resources required for processing the first model training task, where the unprotected resources are the resources in the target node other than protected resources, a protected resource being a resource allocated to a parameter adjustment job whose corresponding protection time period has not ended. Because the selection uses the load information and/or node attribute information of each target node, one or more requirements can be satisfied; for example, selecting the first computing node according to the load information of each computing node can satisfy a load-balancing requirement or an energy-saving requirement.
In another possible implementation, when no target node exists among the N computing nodes, the management node detects whether a computing node among the N becomes a target node within a first time period, where the start time of the first time period is the time at which the first model training task is scheduled, the length of the first time period is a first threshold, and the N computing nodes are the computing nodes that include the first training data and/or the first resource. If the management node detects that a computing node becomes a target node within the first time period, it determines the detected target node as the first computing node. A target node includes the idle first resource; or includes the first training data and the idle first resource; or includes the first training data and includes unprotected resources whose size exceeds the size of the resources required for processing the first model training task, where the unprotected resources are the resources in the target node other than protected resources, a protected resource being a resource allocated to a parameter adjustment job whose corresponding protection time period has not ended.
When no target node exists among the N computing nodes, a computing node is not immediately allocated to the first model training task; instead, the management node waits within the first time period for a computing node to become a target node and, if one does, allocates that target node to the first model training task. The first time period is usually short, so when the target node processes the first model training task it does not need to allocate the first resource and/or obtain the first training data, which improves model training efficiency.
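The target-node test and the bounded wait can be sketched as follows. This is an assumption-laden Python illustration: node fields and function names are invented, and the second qualifying alternative (training data plus idle first resource) is folded into the first test since holding the idle first resource alone already qualifies.

```python
import time

# Hypothetical sketch of the target-node conditions and the first-time-period
# wait described above; all node fields and names are illustrative.

def is_target_node(node, needed_size):
    # Qualifies if it holds the idle first resource, or holds the first
    # training data plus enough unprotected resources for the task.
    unprotected = node["total_resources"] - node["protected_resources"]
    return (node["idle_first_resource"]
            or (node["has_first_training_data"] and unprotected > needed_size))

def pick_first_compute_node(candidates, needed_size, first_threshold, poll=0.005):
    # Wait up to first_threshold (the first time period) for one of the N
    # candidate nodes to become a target node; None means the period ended
    # and a fallback second computing node must be chosen instead.
    deadline = time.monotonic() + first_threshold
    while time.monotonic() < deadline:
        for node in candidates:
            if is_target_node(node, needed_size):
                return node
        time.sleep(poll)
    return None
```

Returning `None` corresponds to the fallback path in which the management node selects a second computing node with enough unprotected resources after the first time period ends.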
In another possible implementation, any record in the resource correspondence further includes the resource size of the resource identified by the resource identifier. If the management node detects that no computing node becomes a target node within the first time period, after the first time period ends it determines a second computing node from the node cluster according to the resource correspondence, where the size of the unprotected resources included in the second computing node is larger than the size of the resources required for processing the first model training task. The management node sends a second training request to the second computing node, where the second training request includes the first model training task and is used by the second computing node to train the first intelligent model.
In another possible implementation, the management node receives a first deletion request, where the first deletion request includes a node identifier of the first computing node and a resource identifier of the first resource. The first deletion request is sent by the first computing node after a first protection time period ends, where the start time of the first protection time period is the time at which the first resource was last used and the length of the first protection time period is a second threshold. The management node deletes the record including the node identifier of the first computing node and the resource identifier of the first resource from the resource correspondence. Thus, when the first computing node releases the first resource, the resource correspondence is updated in time, ensuring the accuracy of its contents.
In another possible implementation, the management node receives a second deletion request, where the second deletion request includes a node identifier of the first computing node and a data identifier of the first training data. The second deletion request is sent by the first computing node after a second protection time period ends, where the start time of the second protection time period is the time at which the first training data were last used and the length of the second protection time period is a third threshold. The management node deletes the record including the node identifier of the first computing node and the data identifier of the first training data from the data correspondence. Thus, when the first computing node deletes the first training data, the data correspondence is updated in time, ensuring the accuracy of its contents.
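The protection time periods behind these deletion requests can be sketched as a last-used timestamp plus a threshold. A minimal Python illustration, with assumed names; a real compute node would track one such period per held resource and per cached data set:

```python
import time

# Hypothetical sketch of a protection time period: after the first resource
# (or the first training data) was last used, it is retained for a threshold
# (the second/third threshold above) before the compute node sends the
# corresponding delete request to the management node.

class ProtectionPeriod:
    def __init__(self, threshold_seconds):
        self.threshold = threshold_seconds
        self.last_used = time.monotonic()   # start of the protection period

    def touch(self):
        # Reusing the resource/data restarts the protection period.
        self.last_used = time.monotonic()

    def ended(self):
        # When True, the compute node may release the resource (or delete
        # the data) and ask the management node to drop the record.
        return time.monotonic() - self.last_used >= self.threshold
```

The `touch` call models the reuse that makes caching pay off: as long as new tasks of the same parameter adjustment job keep arriving within the threshold, the resource and data records stay alive.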
In another possible implementation, the management node sends a third training request to the first computing node, where the third training request includes a second model training task, the second model training task includes a second intelligent model and the job identifier of the first parameter adjustment job, the second model training task is a model training task included in the 1st batch of tasks corresponding to the first parameter adjustment job, the second intelligent model is obtained by configuring the algorithm based on a second parameter value set, the second parameter value set includes a second parameter value for each super parameter, and the third training request is used by the first computing node to allocate the first resource for training the second intelligent model and to obtain the first training data for training the second intelligent model. The management node receives a storage request sent by the first computing node, where the storage request includes a data identifier of the first training data, a resource identifier of the first resource, and a resource state. The management node stores the correspondence among the job identifier, the node identifier of the first computing node, the resource identifier of the first resource, and the resource state in the resource correspondence, and stores the correspondence among the job identifier, the node identifier of the first computing node, and the data identifier of the first training data in the data correspondence. The context information for training the first parameter adjustment job is thus preserved, ensuring that the ith batch of tasks of the first parameter adjustment job, i = 2, 3, …, can be allocated to a computing node that has the resources and/or training data required for processing the first parameter adjustment job.
In a second aspect, a method for model training is provided: a computing node receives a first training request sent by the management node, where the first training request includes a first model training task, the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, the first parameter value set includes a first parameter value for each of at least one super parameter corresponding to the first parameter adjustment job, and the computing node has at least one of a first resource and first training data bound to the first parameter adjustment job. The computing node acquires at least one of the first resource and the first training data according to the job identifier, and trains the first intelligent model according to at least one of the first resource and the first training data.
The computing node is provided with first resources and/or first training data required for processing the first model training task, so that when the computing node receives the first model training task, the time for allocating the first resources and/or the time for acquiring the first training data can be saved, and the efficiency of training the intelligent model is improved.
In a possible implementation, the computing node receives a third training request, where the third training request includes a second model training task, the second model training task includes a second intelligent model and the job identifier of the first parameter adjustment job, the second model training task is a model training task included in the 1st batch of tasks corresponding to the first parameter adjustment job, the second intelligent model is obtained by configuring the algorithm based on a second parameter value set, and the second parameter value set includes a second parameter value for each super parameter. The computing node allocates, from its unprotected resources, the first resource for training the second intelligent model, and acquires the first training data for training the second intelligent model, where the unprotected resources are the resources in the computing node other than protected resources, a protected resource being a resource allocated to a parameter adjustment job whose corresponding protection time period has not ended. The computing node trains the second intelligent model according to the first resource and the first training data. Because the computing node allocates the first resource from the unprotected resources only, the protected resources are not occupied and remain available for the model training tasks they are bound to, so that when the computing node receives such a task it does not need to allocate resources for it, improving the efficiency of processing those tasks.
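The allocation rule in this implementation can be sketched in a few lines: the node carves the first resource out of its unprotected pool only, never touching protected resources still inside a protection time period. An illustrative Python sketch with assumed names, treating resource amounts as plain numbers (e.g. GPU counts):

```python
# Hypothetical sketch of allocating the first resource for a 1st-batch task
# strictly from the unprotected pool; names and the numeric resource model
# are assumptions.

def allocate_from_unprotected(total_resources, protected_resources, needed):
    unprotected = total_resources - protected_resources
    if needed > unprotected:
        # Protected resources are never raided for a new job's task.
        raise RuntimeError("not enough unprotected resources on this node")
    # The allocated share becomes the (now protected) first resource.
    return {"first_resource": needed,
            "unprotected_left": unprotected - needed}
```

The refusal branch is the important property: a burst of new jobs cannot evict resources that another parameter adjustment job's later tasks are expected to reuse.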
In another possible implementation, the computing node sends a storage request, where the storage request includes a data identifier of the first training data, a resource identifier of the first resource, and a resource state. The storage request is used by the management node to store the correspondence among the job identifier, the node identifier of the first computing node, the resource identifier of the first resource, and the resource state in the resource correspondence, and to store the correspondence among the job identifier, the node identifier of the first computing node, and the data identifier of the first training data in the data correspondence. This ensures that the management node stores the context information for training the first parameter adjustment job.
In another possible implementation, the computing node sends a first delete request after a first protection time period ends, where the first delete request includes the node identifier of the computing node and the resource identifier of the first resource, the start time of the first protection time period is the time at which the computing node last used the first resource, the length of the first protection time period is a second threshold, and the first delete request is used by the management node to delete the record including the node identifier of the computing node and the resource identifier of the first resource from the resource correspondence. Thus, when the computing node releases the first resource, the resource correspondence in the management node is updated in time, ensuring the accuracy of its contents.
In another possible implementation, the computing node sends a second delete request after a second protection time period ends, where the second delete request includes the node identifier of the computing node and the data identifier of the first training data, the start time of the second protection time period is the time at which the computing node last used the first training data, the length of the second protection time period is a third threshold, and the second delete request is used by the management node to delete the record including the node identifier of the computing node and the data identifier of the first training data from the data correspondence. Thus, when the computing node deletes the first training data, the data correspondence in the management node is updated in time, ensuring the accuracy of its contents.
In a third aspect, the present application provides an apparatus for model training, configured to perform the method of the first aspect or any one of the possible implementations of the first aspect. In particular, the apparatus comprises means for performing the method of the first aspect or any one of its possible implementations.
In a fourth aspect, the present application provides an apparatus for model training, configured to perform the method of the second aspect or any one of the possible implementations of the second aspect. In particular, the apparatus comprises means for performing the method of the second aspect or any one of its possible implementations.
In a fifth aspect, the present application provides an apparatus for model training, the apparatus comprising: a processor, a memory, and a network interface. The processor, the memory and the network interface can be connected through a bus system. The memory is configured to store one or more programs, and the processor is configured to execute the one or more programs in the memory, so that the apparatus performs the method of the first aspect or any possible implementation manner of the first aspect.
In a sixth aspect, the present application provides an apparatus for model training, the apparatus comprising: a processor, a memory, and a network interface. The processor, the memory and the network interface can be connected through a bus system. The memory is configured to store one or more programs, and the processor is configured to execute the one or more programs in the memory to cause the apparatus to perform the method of the second aspect or any possible implementation manner of the second aspect.
In a seventh aspect, the present application provides a computer-readable storage medium having program code stored therein, which when run on a computer, causes the computer to perform the above-mentioned first aspect, second aspect, any of the possible implementations of the first aspect, or the method in any of the possible implementations of the second aspect.
In an eighth aspect, the present application provides a computer program product comprising program code which, when run on a computer, causes the computer to perform the above-mentioned first aspect, second aspect, any of the possible implementations of the first aspect, or the method in any of the possible implementations of the second aspect.
In a ninth aspect, the present application provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for implementing the method of the first aspect, the second aspect, any possible implementation manner of the first aspect, or any possible implementation manner of the second aspect.
In a tenth aspect, the present application provides a system for model training, the system comprising the apparatus of the third aspect and the apparatus of the fourth aspect, or the apparatus of the fifth aspect and the apparatus of the sixth aspect.
Drawings
Fig. 1 is a schematic diagram of a network architecture provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for model training provided by an embodiment of the present application;
FIG. 3 is a flow chart of another method for model training provided by embodiments of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for model training according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another model training apparatus provided in the embodiments of the present application;
FIG. 6 is a schematic structural diagram of another model training apparatus provided in the embodiments of the present application;
FIG. 7 is a schematic structural diagram of another model training apparatus provided in the embodiments of the present application;
fig. 8 is a schematic structural diagram of a system for model training according to an embodiment of the present application.
Detailed Description
Referring to fig. 1, an embodiment of the present application provides a system for model training, which includes a management node, a node cluster, and a storage system. The node cluster includes at least one computing node, each computing node includes resources for training the intelligent model, and the resources may be one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a memory, and the like. The storage system is used for storing training data sets required by training the intelligent model.
Optionally, the training data set comprises a plurality of training samples.
Optionally, a network connection is established between the management node and each computing node in the node cluster, and a network connection is established between each computing node in the node cluster and the storage system.
Optionally, when the user needs the system to train an intelligent model, a parameter adjustment job may be submitted to the management node. For convenience of explanation, the parameter adjustment job is referred to as a first parameter adjustment job, and the first parameter adjustment job includes information such as at least one super parameter, an algorithm, a resource name and a resource size required by the first parameter adjustment job, and a storage location of a training data set required by the first parameter adjustment job, where the training data set includes a plurality of training samples.
The management node receives the first parameter adjustment job, initializes a parameter value for each of the at least one super parameter, configures the algorithm according to those parameter values to obtain an intelligent model, and divides the training data set into a plurality of pieces of training data, where each piece of training data includes at least one training sample. The management node then generates a first batch of tasks corresponding to the first parameter adjustment job, the first batch including at least one model training task. Any one of these model training tasks includes information such as the intelligent model, the job identifier of the first parameter adjustment job, the resource name and resource size, the storage location of the training data set, and the offset and size of one piece of training data within the training data set. Finally, the management node allocates a computing node to each model training task included in the first batch and sends each model training task to its corresponding computing node.
Alternatively, the size of the training data may be the number of training samples included in the training data.
Optionally, the parameter value of each of the at least one super parameter is used to define the structure and training process of the intelligent model, and the like.
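To make the job contents concrete, a first parameter adjustment job such as the one described above could be represented as a plain record. Every field name and the storage path below are illustrative assumptions; the application does not prescribe any particular format:

```python
# Hypothetical representation of a first parameter adjustment job.
# The job identifier "IZ1" follows the examples later in this text;
# all other values are placeholders.
first_tuning_job = {
    "job_id": "IZ1",
    "hyperparameters": {"learning_rate": None, "batch_size": None},
    "algorithm": "neural_network",        # e.g. a machine learning algorithm
    "resource_name": "GPU",               # resource required per training task
    "resource_size": 1,
    "dataset_location": "s3://bucket/dataset",  # hypothetical storage location
}
```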
For any computing node: the computing node receives a model training task, allocates the resources required by the first parameter adjustment job according to the resource name and resource size included in the task, and acquires one piece of training data, that is, the training data required by the first parameter adjustment job, from the storage system according to the storage location of the training data set and the offset and size of that piece of training data included in the task. The computing node then trains the intelligent model included in the model training task with the allocated resources and the acquired training data.
The computing node further sends the job identifier of the first parameter adjustment job, the data identifier of the training data, the resource identifier of the resource and the resource state to the management node, wherein the resource state is a use state. Wherein the data identification is used to identify the training data in the compute node and the resource identification is used to identify the resource in the compute node.
It should be noted that after the computing node allocates the resources required by the first parameter adjustment job, it determines a protection time period for those resources: within the protection time period the resources are not released and are not allocated to model training tasks of any parameter adjustment job other than the first. Likewise, after the computing node acquires the training data required by the first parameter adjustment job, it determines a protection time period for the training data, within which the training data is not deleted.
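A minimal sketch of the protection rule just described, assuming a fixed-length protection time period and illustrative identifiers (the application fixes neither):

```python
import time

GUARD_SECONDS = 300  # hypothetical length of the protection time period


class AllocatedResource:
    """A resource a computing node has allocated to a parameter
    adjustment job; within its protection time period it is neither
    released nor assigned to tasks of a different job."""

    def __init__(self, resource_id, job_id, allocated_at=None):
        self.resource_id = resource_id
        self.job_id = job_id
        self.guard_start = time.time() if allocated_at is None else allocated_at

    def is_protected(self, now):
        # True while the protection time period has not yet ended.
        return now - self.guard_start < GUARD_SECONDS

    def can_serve(self, job_id, now):
        # During the protection period only tasks of the owning job may
        # use the resource; after it ends, any job may.
        return job_id == self.job_id or not self.is_protected(now)
```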
The management node receives the job identification of the first parameter adjustment job, the data identification of the training data, the resource identification and the resource state of the resource, correspondingly stores the corresponding relation among the job identification of the first parameter adjustment job, the node identification of the computing node, the resource identification of the resource and the resource state in the resource corresponding relation, and correspondingly stores the corresponding relation among the job identification of the first parameter adjustment job, the node identification of the computing node and the data identification of the training data in the data corresponding relation.
Optionally, after the intelligent model is trained, the computing node sends a training result of the intelligent model, a resource identifier of the resource, and a resource state to the management node, where the resource state is an idle state. And the management node receives the training result, the resource identifier and the resource state of the resource, acquires the resource state of the resource from the resource corresponding relation according to the node identifier of the computing node and the resource identifier of the resource, and updates the resource state of the resource into an idle state.
The management node may receive training results sent by the at least one computing node assigned model training tasks, that is, at least one training result. If the at least one training result does not satisfy a specified condition, the management node configures a new parameter value for each super parameter according to information such as the current value of each super parameter and the at least one training result, and configures the algorithm according to the new parameter values, thereby obtaining a new intelligent model. It then generates a second batch of tasks corresponding to the first parameter adjustment job, the second batch including at least one model training task; any such model training task includes information such as the new intelligent model, the job identifier of the first parameter adjustment job, the resource name and resource size, the storage location of the training data set, and the offset and size of one piece of training data within the training data set.
The management node allocates, according to the resource correspondence and the data correspondence, to each model training task included in the second batch of tasks a computing node that holds the resources (whose resource state is idle) and/or the training data required by the first parameter adjustment job, and sends one model training task to that computing node. Because the computing node already holds the resources and/or training data required by the first parameter adjustment job, after receiving the model training task it can directly use them to train the intelligent model included in the task, thereby improving the efficiency of training the model.
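The placement preference just described can be sketched as follows. The record layout mirrors the resource and data correspondences of tables 1 and 2, but all field names here are illustrative:

```python
def pick_node_for_task(job_id, resource_records, data_records, cluster_nodes):
    """Prefer a node holding an idle resource recorded for this job,
    then a node caching this job's training data, else any node."""
    for rec in resource_records:
        if rec["job_id"] == job_id and rec["state"] == "idle":
            return rec["node_id"]
    for rec in data_records:
        if rec["job_id"] == job_id:
            return rec["node_id"]
    return cluster_nodes[0]  # fallback: no affinity information available
```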
When the management node generates the i-th batch of tasks of the first parameter adjustment job (i = 3, 4, …), it allocates a computing node for each model training task included in the i-th batch in the same manner as for the second batch. The detailed implementation is described in the embodiment shown in fig. 3.
For convenience of description, each model training task in the ith task is referred to as a first model training task, and the intelligent model in the first model training task is referred to as a first intelligent model. And each model training task in the first batch of tasks is called a second model training task, and the intelligent model in the second model training task is called a second intelligent model.
Referring to fig. 2, an embodiment of the present application provides a method for training a model, where an intelligent model trained by the method is an intelligent model included in each of a first batch of tasks corresponding to a parameter adjustment job. The method is applicable to the system shown in fig. 1, and comprises the following steps:
step 201: the management node receives a first parameter adjustment operation, wherein the first parameter adjustment operation comprises at least one super parameter, an algorithm, a resource name and a resource size required by the first parameter adjustment operation, a storage position of a training data set required by the first parameter adjustment operation and other information.
When a user needs to train an intelligent model, a first parameter adjustment operation can be configured in a terminal corresponding to the user, and the first parameter adjustment operation is sent to the management node.
Alternatively, the algorithm may be a machine learning algorithm or the like, for example, a neural network algorithm.
The resource name and resource size required for the first parameter adjustment job may be those required for processing one model training task.
The training data set required for the first parameter adjustment operation includes a plurality of training samples. The training data set required for the first parameter adjustment job may be stored in a storage system.
Step 202: the management node generates a first batch of tasks corresponding to the first parameter adjustment operation, wherein the first batch of tasks comprise at least one second model training task, and for any second model training task in the first batch of tasks, the second model training task comprises a second intelligent model, an operation identifier of the first parameter adjustment operation, information such as the name and the size of the resource, the storage position of the training data set, the offset of a piece of training data in the training data set, the size of the training data and the like.
In this step, for each super parameter in the at least one super parameter, the management node configures M parameter values of each super parameter, where M is an integer greater than 0, to obtain M second parameter value sets. And configuring the algorithm according to each second parameter value set to obtain M second intelligent models, and dividing the training data set into a plurality of training data, wherein each training data comprises at least one training sample.
The management node then generates M model training jobs corresponding to the first parameter adjustment job, where each model training job may include Y second model training tasks, Y being an integer greater than 1. The second model training tasks included in the M model training jobs constitute the first batch of tasks of the first parameter adjustment job; that is, the first batch of tasks includes M × Y second model training tasks. Any second model training task includes information such as a second intelligent model, the job identifier of the first parameter adjustment job, the name and size of the resource, the storage location of the training data set, and the offset and size of one piece of training data within the training data set.
Optionally, for any one of the M second intelligent models, the management node may generate Y second model training tasks for that model, each of which includes the model, thereby obtaining one model training job. Generating Y second model training tasks for each second intelligent model yields the M model training jobs, whose M × Y second model training tasks together form the first batch of tasks.
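The generation of the first batch can be sketched as below. The task fields are illustrative; the real tasks also carry the resource requirements and data offsets described above:

```python
def build_first_batch(job_id, models, y):
    """Generate the first batch: one model training job per configured
    model (M jobs in total), each containing Y tasks, so M * Y tasks."""
    batch = []
    for model_job_index, model in enumerate(models):
        for task_index in range(y):
            batch.append({
                "job_id": job_id,          # the parameter adjustment job
                "model": model,            # the configured intelligent model
                "model_job": model_job_index,  # which of the M training jobs
                "task": task_index,        # which of the Y tasks in that job
            })
    return batch
```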
Optionally, when the management node divides the training data set into a plurality of pieces of training data, the size of any piece of training data is the number of training samples it includes.
Alternatively, the pieces of training data may include equal or unequal numbers of training samples.
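A minimal sketch of dividing the training data set into pieces described by an offset and a size, allowing unequal piece sizes as noted above:

```python
def split_dataset(total_samples, pieces):
    """Divide a data set of total_samples training samples into the
    given number of pieces, each described by (offset, size)."""
    base, extra = divmod(total_samples, pieces)
    result, offset = [], 0
    for i in range(pieces):
        # The first `extra` pieces absorb the remainder, so sizes may
        # differ by at most one sample.
        size = base + (1 if i < extra else 0)
        result.append((offset, size))
        offset += size
    return result
```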
Optionally, the management node may store the M model training jobs in a scheduling queue.
Step 203: and the management node distributes a computing node for each second model training task included in the first batch of tasks in the node cluster, and sends a training request to a computing node corresponding to any one second model training task, wherein the training request includes the second model training task.
In this step, the management node may obtain the size of the unprotected resources included in each computing node in the node cluster. A protected resource in a computing node is a resource that the node has allocated to a parameter adjustment job and whose protection time period has not yet ended; all other resources are unprotected. For any second model training task included in the first batch of tasks, the management node selects from the node cluster a computing node whose unprotected resource size is greater than or equal to the resource size included in the task, and sends a training request including that second model training task to the selected computing node.
Alternatively, the management node may query each compute node in the node cluster for the resource size of the unprotected resource that each compute node includes.
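The capacity check described in this step can be sketched as follows; representing each node's unprotected resources as a single scalar size is a simplification:

```python
def select_compute_node(unprotected_sizes, required_size):
    """Return the id of a node whose unprotected resource size is at
    least the task's requirement, or None when no node qualifies."""
    for node_id, free in unprotected_sizes.items():
        if free >= required_size:
            return node_id
    return None
```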
Optionally, the implementation of this step is described in detail through the following example:
and the management node schedules a model training job from the scheduling queue and schedules a second model training task from second model training tasks included in the model training job. The management node selects one computing node with the resource size of the unprotected resource larger than or equal to the resource size included by the scheduled second model task from the node cluster, and sends a training request to the selected computing node, wherein the training request includes the scheduled second model training task. And the management node continues to schedule other second model training tasks included in the model training job until the other second model training tasks included in the model training job are scheduled.
Then, the management node schedules another model training job from the scheduling queue and schedules the second model training tasks included in it in the same manner. The management node repeats this process until the model training tasks included in all model training jobs corresponding to the first parameter adjustment job have been scheduled.
Step 204: the computing node receives the training request, wherein the training request comprises a second model training task, and acquires resources and training data required for processing the second model training task.
In this step, the computing node receives the training request, where the training request includes a second model training task, and the second model training task includes a second intelligent model, a job identifier of the first parameter adjustment job, information such as the name and size of the resource, the storage location of the training data set, the offset of a piece of training data in the training data set, and the size of the training data.
The computing node distributes resources required for processing the second model training task according to the resource name and the resource size included in the second model training task, and acquires the training data set from a storage system according to the storage position of the training data set included in the second model training task; and acquiring a piece of training data required for processing the second model training task from the training data set according to the offset of a piece of training data corresponding to the second model training task in the training data set and the size of the piece of training data.
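Reading one piece of training data by its offset and size, as the computing node does here, reduces to taking a slice; this sketch treats the stored training data set as an in-memory sequence purely for illustration:

```python
def read_training_piece(dataset, offset, size):
    """Fetch one piece of training data from the training data set by
    its offset (in samples) and its size (number of samples)."""
    return dataset[offset:offset + size]
```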
Optionally, the computing node further allocates a resource identifier to the resources required for processing the second model training task, where the resource identifier identifies those resources within the computing node, and stores the job identifier of the first parameter adjustment job and the resource identifier correspondingly in a job-identifier-to-resource-identifier correspondence.
Optionally, the computing node further allocates a data identifier to the training data required for processing the second model training task, where the data identifier identifies the training data within the computing node, and stores the job identifier of the first parameter adjustment job and the data identifier correspondingly in a job-identifier-to-data-identifier correspondence.
Optionally, the computing node further allocates a first protection time period to the resources, where the starting time of the first protection time period is the time at which the resources start to be used and its length is a second threshold. Because the resources are used in step 205 immediately after being allocated, the starting time of the first protection time period equals the time at which the resources are allocated.
Optionally, the computing node further allocates a second protection time period to the training data, where the starting time of the second protection time period is the time at which the training data starts to be used and its length is a third threshold. Because the training data is used in step 205 immediately after being acquired, the starting time of the second protection time period equals the time at which the training data is acquired.
Optionally, the computing node further sends a storage request to the management node, where the storage request includes a job identifier of the first parameter adjustment job, a data identifier of the training data, a resource identifier of the resource, and a resource state, and the resource state is a use state.
The management node receives the storage request, forms a record by the operation identification of the first parameter adjustment operation, the node identification of the computing node, the resource identification and the resource state and stores the record in the resource corresponding relation; and forming a record by the operation identification of the first parameter adjustment operation, the node identification of the computing node and the data identification of the training data, and storing the record in a data corresponding relation.
Optionally, the storage request may further include a task identifier of the second model training task. Correspondingly, the record saved in the resource corresponding relationship also includes the task identifier, and the record saved in the data corresponding relationship also includes the task identifier.
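The management node's bookkeeping in this step can be sketched as a small store holding the two correspondences; field names follow tables 1 and 2 below but are illustrative:

```python
class CorrespondenceStore:
    """Management-node storage for the resource correspondence and the
    data correspondence: one record of each per storage request."""

    def __init__(self):
        self.resource_records = []  # job id, node id, resource id, state
        self.data_records = []      # job id, node id, data id

    def save(self, job_id, node_id, resource_id, state, data_id):
        self.resource_records.append({
            "job_id": job_id, "node_id": node_id,
            "resource_id": resource_id, "state": state})
        self.data_records.append({
            "job_id": job_id, "node_id": node_id, "data_id": data_id})

    def set_resource_state(self, job_id, node_id, resource_id, state):
        # Used when a compute node later reports that training stopped
        # and the resource became idle.
        for rec in self.resource_records:
            if (rec["job_id"] == job_id and rec["node_id"] == node_id
                    and rec["resource_id"] == resource_id):
                rec["state"] = state
```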
For example, assuming that the first batch of tasks includes second model training tasks 1, 2, 3, and 4, the management node assigns compute nodes 1, 2, 3, and 4 to the second model training tasks 1, 2, 3, and 4, respectively. The management node sends a training request 1 to the computing node 1, wherein the training request 1 comprises a second model training task 1; sending a training request 2 to the computing node 2, the training request 2 comprising a second model training task 2; sending a training request 3 to the computing node 3, the training request 3 comprising a second model training task 3; a training request 4 is sent to the compute node 4, the training request 4 comprising a second model training task 4.
The computing node 1 receives a training request 1 comprising a second model training task 1, allocates resources according to the resource name and the resource size of the second model training task 1, and acquires a part of training data according to the storage position of a training data set included in the second model training task 1 and the offset and the size of the part of training data corresponding to the second model training task 1; and sending a storage request 1 to the management node, wherein the storage request 1 comprises a job identification IZ1 of the first parameter adjustment job, a data identification ID1 of the piece of training data, a resource identification IR1 of the resource and a resource state, and the resource state is a use state.
The computing node 2 receives a training request 2 comprising a second model training task 2, allocates resources according to the resource name and the resource size of the second model training task 2, and acquires a part of training data according to the storage position of a training data set included in the second model training task 2 and the offset and the size of the part of training data corresponding to the second model training task 2; and sending a storage request 2 to the management node, wherein the storage request 2 comprises a job identification IZ1 of the first parameter adjustment job, a data identification ID2 of the piece of training data, a resource identification IR2 of the resource and a resource state, and the resource state is a use state.
The computing node 3 receives a training request 3 including a second model training task 3, allocates resources according to the resource name and the resource size included in the second model training task 3, and acquires a part of training data according to the storage position of a training data set included in the second model training task 3 and the offset and the size of the part of training data corresponding to the second model training task 3; and sending a storage request 3 to the management node, wherein the storage request 3 comprises a job identification IZ1 of the first parameter adjustment job, a data identification ID3 of the piece of training data, a resource identification IR3 of the resource and a resource state, and the resource state is a use state.
The computing node 4 receives a training request 4 including a second model training task 4, allocates resources according to the resource name and the resource size included in the second model training task 4, and acquires a part of training data according to the storage position of a training data set included in the second model training task 4 and the offset and the size of the part of training data corresponding to the second model training task 4; and sending a storage request 4 to the management node, wherein the storage request 4 comprises the job identification IZ1 of the first parameter adjustment job, the data identification ID4 of the piece of training data, the resource identification IR4 of the resource and the resource state, and the resource state is a use state.
The management node receives storage request 1, combines the node identifier IN1 of computing node 1 with the job identifier IZ1 of the first parameter adjustment job, the resource identifier IR1, and the resource state included in storage request 1 into one record, and stores the record in the resource correspondence shown in table 1 below. It does the same on receiving storage request 2 (node identifier IN2, job identifier IZ1, resource identifier IR2, resource state), storage request 3 (node identifier IN3, job identifier IZ1, resource identifier IR3, resource state), and storage request 4 (node identifier IN4, job identifier IZ1, resource identifier IR4, resource state), storing each record in the resource correspondence shown in table 1 below.
TABLE 1
Job identification | Node identification | Resource identification | Resource state
IZ1                | IN1                 | IR1                     | In use
IZ1                | IN2                 | IR2                     | In use
IZ1                | IN3                 | IR3                     | In use
IZ1                | IN4                 | IR4                     | In use
The management node also combines the node identifier IN1 of computing node 1 with the job identifier IZ1 of the first parameter adjustment job and the data identifier ID1 included in storage request 1 into one record and stores it in the data correspondence shown in table 2 below. It likewise stores a record for storage request 2 (IN2, IZ1, ID2), for storage request 3 (IN3, IZ1, ID3), and for storage request 4 (IN4, IZ1, ID4) in the data correspondence shown in table 2 below.
TABLE 2
Job identification | Node identification | Data identification
IZ1                | IN1                 | ID1
IZ1                | IN2                 | ID2
IZ1                | IN3                 | ID3
IZ1                | IN4                 | ID4
Optionally, when the user needs to query the management node for the computing nodes, resources, or training data corresponding to the first parameter adjustment job, the user may input the job identifier of the first parameter adjustment job at the management node.
And the management node inquires information such as node identification, resource state and the like of the computing node corresponding to the first parameter adjustment operation from the resource corresponding relation according to the operation identification of the first parameter adjustment operation, and displays the inquired information. And/or the management node inquires information such as node identification, data identification and the like of the computing node corresponding to the first parameter adjustment operation from the data corresponding relation according to the operation identification of the first parameter adjustment operation, and displays the inquired information.
Step 205: and the computing node trains a second intelligent model through the resource according to the training data.
During training of the second intelligent model, the computing node continuously adjusts the parameter values of the common parameters of the second intelligent model; these common parameters determine the function of the model. For example, if an intelligent model for speech recognition needs to be trained, the parameter values of the common parameters of the second intelligent model can be continuously adjusted with the training data so that the second intelligent model acquires a speech recognition function.
The computing node continues adjusting the parameter values of the common parameters of the second intelligent model until the model converges, fails to converge, or the number of training iterations reaches a specified number. The computing node then obtains the training result of this training of the second intelligent model and sends a notification message to the management node, the notification message including the training result and the job identifier of the first parameter adjustment job.
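The stopping rule described above (convergence or an iteration limit) can be sketched as follows; the convergence criterion used here, a small change in loss between rounds, is an assumed stand-in for whatever criterion the model actually uses:

```python
def train_until_stop(step, max_rounds, tolerance):
    """Run training rounds until the loss change falls below the
    tolerance (taken as convergence) or the round limit is reached;
    return the final loss and the number of rounds actually run."""
    previous = None
    loss, rounds = None, 0
    for rounds in range(1, max_rounds + 1):
        loss = step(rounds)  # one training round; returns the loss
        if previous is not None and abs(previous - loss) < tolerance:
            break
        previous = loss
    return loss, rounds
```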
Any computing node that receives a second model training task trains the second intelligent model according to the operations of steps 204 and 205, and after training ends sends the management node a notification message including the training result and the job identifier of the first parameter adjustment job.
The management node receives the notification messages sent by the computing nodes. If the training results included in these notification messages do not meet a specified condition, the management node obtains the current parameter value of each of the at least one super parameter corresponding to the first parameter adjustment job and, based on those current values and the received training results, reconfigures X parameter values for each super parameter, X being an integer greater than 0, to obtain X first parameter value sets. It configures the algorithm corresponding to the first parameter adjustment job according to each first parameter value set to obtain X first intelligent models, and generates X model training jobs corresponding to the first parameter adjustment job, where each model training job may include Y first model training tasks. The first model training tasks included in the X model training jobs constitute the second batch of tasks of the first parameter adjustment job; that is, the second batch of tasks includes X × Y first model training tasks. Any first model training task includes information such as a first intelligent model, the job identifier of the first parameter adjustment job, the name and size of the resource, the storage location of the training data set, and the offset and size of one piece of training data within the training data set. The management node then sends each first model training task included in the second batch of tasks to computing nodes in the node cluster to train the first intelligent model included in each task; the detailed implementation of this process is described in the embodiment shown in fig. 3 below and is not repeated here.
Optionally, the management node stores the X model training jobs in a scheduling queue.
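One hedged sketch of how the management node might derive the X new parameter value sets is random perturbation of the current values; this is an assumed strategy, since the application does not commit to any particular search method:

```python
import random


def reconfigure_hyperparameters(current_values, x, seed=0):
    """Derive X new parameter value sets from the current super
    parameter values by random scaling in [0.5, 1.5]. Note that this
    sketch yields floats even for integer-valued parameters."""
    rng = random.Random(seed)
    return [
        {name: value * rng.uniform(0.5, 1.5)
         for name, value in current_values.items()}
        for _ in range(x)
    ]
```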
Optionally, when one of the above computing nodes stops training the second intelligent model, it sends an update request to the management node, where the update request includes the job identifier of the first parameter adjustment job, the resource identifier of the resource, and the resource state, which is the idle state. The management node receives the update request, locates the corresponding record in the resource correspondence according to the job identifier of the first parameter adjustment job, the resource identifier of the resource, and the node identifier of the computing node, and sets the resource state of the resource to the idle state.
For example, for the above computing node 1, when it stops training the second intelligent model, computing node 1 sends update request 1 to the management node, where update request 1 includes the job identifier IZ1 of the first parameter adjustment job, the resource identifier IR1 of the resource, and the resource state, and the resource state is the idle state. The management node receives update request 1 and, according to the job identifier IZ1, the resource identifier IR1, and the node identifier IN1 of computing node 1, sets the resource state of the resource in the resource correspondence shown in table 1 to the idle state, as shown in table 3 below.
Likewise, when it stops training the second intelligent model, computing node 2 sends update request 2 to the management node, where update request 2 includes the job identifier IZ1 of the first parameter adjustment job, the resource identifier IR2 of the resource, and the resource state (idle state). The management node receives update request 2 and, according to the job identifier IZ1, the resource identifier IR2, and the node identifier IN2 of computing node 2, sets the resource state of the resource in the resource correspondence shown in table 1 to the idle state, as shown in table 3 below.
When it stops training the second intelligent model, computing node 3 sends update request 3 to the management node, where update request 3 includes the job identifier IZ1 of the first parameter adjustment job, the resource identifier IR3 of the resource, and the resource state (idle state). The management node receives update request 3 and, according to the job identifier IZ1, the resource identifier IR3, and the node identifier IN3 of computing node 3, sets the resource state of the resource in the resource correspondence shown in table 1 to the idle state, as shown in table 3 below.
When it stops training the second intelligent model, computing node 4 sends update request 4 to the management node, where update request 4 includes the job identifier IZ1 of the first parameter adjustment job, the resource identifier IR4 of the resource, and the resource state (idle state). The management node receives update request 4 and, according to the job identifier IZ1, the resource identifier IR4, and the node identifier IN4 of computing node 4, sets the resource state of the resource in the resource correspondence shown in table 1 to the idle state, as shown in table 3 below.
TABLE 3
| Job identifier | Node identifier | Resource identifier | Resource state |
| IZ1 | IN1 | IR1 | Idle state |
| IZ1 | IN2 | IR2 | Idle state |
| IZ1 | IN3 | IR3 | Idle state |
| IZ1 | IN4 | IR4 | Idle state |
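The update handling that produces table 3 can be sketched as below. This is a minimal sketch under assumed field names; the patent does not prescribe a data layout for the resource correspondence.

```python
# Each record of the resource correspondence holds a job identifier, a node
# identifier, a resource identifier and a resource state (cf. tables 1 and 3).
resource_map = [
    {"job": "IZ1", "node": "IN1", "res": "IR1", "state": "in_use"},
    {"job": "IZ1", "node": "IN2", "res": "IR2", "state": "in_use"},
]

def handle_update_request(resource_map, job_id, node_id, res_id, state):
    """Locate the record by job, node and resource identifiers and set its state."""
    for rec in resource_map:
        if (rec["job"], rec["node"], rec["res"]) == (job_id, node_id, res_id):
            rec["state"] = state
            return True
    return False  # no matching record in the correspondence

# Compute node 1 stops training and reports its resource as idle:
handle_update_request(resource_map, "IZ1", "IN1", "IR1", "idle")
```

Each update request touches exactly one record, so the remaining records keep their previous state.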
In this embodiment of the application, the management node allocates a computing node for each second model training task included in the first batch of tasks corresponding to the first parameter adjustment job. The computing node acquires the resources and training data required for processing the second model training task and sends a storage request to the management node, where the storage request includes the job identifier of the first parameter adjustment job, the data identifier of the training data, the resource identifier of the resources, and the resource state. The management node forms one record from the job identifier, the node identifier of the computing node, the resource identifier, and the resource state and stores it in the resource correspondence, and forms another record from the job identifier, the node identifier of the computing node, and the data identifier and stores it in the data correspondence. Therefore, when the management node allocates computing nodes for the model training tasks included in the ith batch of tasks corresponding to the first parameter adjustment job, where i = 2, 3, …, it preferentially allocates tasks to computing nodes that already hold the resources and/or training data required for processing the model training tasks of the first parameter adjustment job. Such a computing node does not need to acquire the resources and/or training data again when processing a model training task in the ith batch, which reduces the time consumed by model training and improves model training efficiency.
Referring to fig. 3, an embodiment of the present application provides a method for training a model, where the intelligent model trained by the method is the intelligent model included in each model training task of the ith batch of tasks corresponding to a parameter adjustment job, where i = 2, 3, …. The method is applicable to the system shown in fig. 1 and includes the following steps:
step 301: the management node schedules a first model training task, and the first model training task comprises a first intelligent model and a job identification of a first parameter adjustment job.
Optionally, the first model training task further includes information such as the name and size of the resource required for processing the first model training task, the storage location of the training data set corresponding to the first parameter adjustment job, and the offset and size, within the training data set, of the piece of training data required for processing the first model training task.
Optionally, the scheduling queue of the management node includes model training jobs corresponding to the first parameter adjustment job, each model training job includes at least one first model training task, and the first model training tasks included in each model training job constitute the ith batch of tasks of the first parameter adjustment job.
In this step, the management node schedules a model training job from the scheduling queue, and schedules a first model training task from among first model training tasks included in the model training job.
For the model training jobs in the scheduling queue, the model training jobs are obtained by:
The management node receives the notification messages sent by the computing nodes, where each notification message includes a training result obtained by training the (i-1)th batch of tasks of the first parameter adjustment job. If the training results included in the notification messages do not satisfy the specified condition, the management node acquires the current parameter value of each hyperparameter in the at least one hyperparameter corresponding to the first parameter adjustment job, and reconfigures X parameter values for each hyperparameter according to the current parameter values and the training results, where X is an integer greater than 0, obtaining X first parameter value sets; any one of the X first parameter value sets includes one parameter value of each hyperparameter. The management node then configures the algorithm corresponding to the first parameter adjustment job according to each first parameter value set to obtain X first intelligent models, and generates X model training jobs corresponding to the first parameter adjustment job, where each model training job may include Y first model training tasks. The first model training tasks included in the X model training jobs constitute the ith batch of tasks of the first parameter adjustment job, that is, the ith batch of tasks includes X × Y first model training tasks. Any first model training task includes the first intelligent model, the job identifier of the first parameter adjustment job, information such as the name and size of the required resource, the storage location of the training data set, and the offset and size of a piece of training data in the training data set. The management node saves the X model training jobs to the scheduling queue.
Step 302: the management node determines a first computing node from the node cluster according to the job identification of the first parameter adjustment job, wherein the first computing node comprises at least one of first training data and idle first resources, the first resources are resources required by a model training task for processing the first parameter adjustment job, and the first training data are training data required by an intelligent model corresponding to the first parameter adjustment job.
Optionally, the management node determines the first computing node from the node cluster according to the resource correspondence, the data correspondence, and the job identifier of the first parameter adjustment job. This can be implemented through the following operations 3021 to 3022:
3021: The management node determines, according to the resource correspondence, the data correspondence, and the job identifier of the first parameter adjustment job, N computing nodes in the node cluster that include the first training data and/or the first resource, where N is an integer greater than 0.
Optionally, the management node obtains, from the resource correspondence according to the job identifier of the first parameter adjustment job, the node identifier of each corresponding computing node, the resource identifier of the first resource on each computing node, and the resource state; and obtains, from the data correspondence according to the job identifier of the first parameter adjustment job, the node identifier of each corresponding computing node and the data identifier of the first training data on each computing node. Assuming that the two lookups together yield the node identifiers of N computing nodes, the N computing nodes including the first training data and/or the first resource are determined.
For example, assume that a first model training task is scheduled, and the task includes the first intelligent model, the job identifier IZ1 of the first parameter adjustment job, information such as the resource name and resource size, the storage location of the training data set, and the offset and size of a piece of training data in the training data set.
According to the job identifier IZ1 of the first parameter adjustment job, the management node obtains, from the resource correspondence shown in table 3, the node identifier IN1 of computing node 1 together with the resource identifier IR1 of the first resource on computing node 1 and its resource state (idle state), the node identifier IN2 of computing node 2 together with the resource identifier IR2 and its resource state (idle state), the node identifier IN3 of computing node 3 together with the resource identifier IR3 and its resource state (idle state), and the node identifier IN4 of computing node 4 together with the resource identifier IR4 and its resource state (idle state). In addition,
according to the job identifier IZ1 of the first parameter adjustment job, the management node obtains, from the data correspondence shown in table 2, the node identifier IN1 of computing node 1 and the data identifier ID1 of the first training data on computing node 1, the node identifier IN2 of computing node 2 and the data identifier ID2, the node identifier IN3 of computing node 3 and the data identifier ID3, and the node identifier IN4 of computing node 4 and the data identifier ID4. The two lookups yield the node identifiers of 4 computing nodes, so computing nodes 1, 2, 3 and 4, which include the first training data and/or the first resource, are determined.
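The two lookups of operation 3021 can be sketched as the union below. This is a hedged sketch under assumed field names; the patent only specifies which correspondences are consulted, not their representation.

```python
def candidate_nodes(job_id, resource_map, data_map):
    """Operation 3021: the N nodes holding an idle first resource and/or
    the first training data for this parameter adjustment job."""
    by_resource = {r["node"] for r in resource_map
                   if r["job"] == job_id and r["state"] == "idle"}
    by_data = {d["node"] for d in data_map if d["job"] == job_id}
    return by_resource | by_data  # union: resource and/or data locality

resource_map = [{"job": "IZ1", "node": "IN1", "res": "IR1", "state": "idle"}]
data_map = [{"job": "IZ1", "node": "IN2", "data": "ID2"}]
nodes = candidate_nodes("IZ1", resource_map, data_map)
```

In the worked example above, all four nodes would appear in both sets, so N = 4.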
3022: when at least one target node exists in the N computing nodes, the management node selects one target node from the at least one target node as a first computing node.
The target node includes an idle first resource; or the target node includes the first training data and the size of the unprotected resources included in the target node exceeds the size of the resources required for processing the first model training task. The unprotected resources are the resources in the target node other than the protected resources; the protected resources are resources that have been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
For example, compute nodes 1, 2, 3, and 4 include first training data and/or idle first resources, and the management node may select compute node 1 from compute nodes 1, 2, 3, and 4 as the first compute node.
In this operation, the management node selects one target node from the at least one target node as the first computing node according to the load information and/or node attribute information of each target node.
Optionally, the management node scores each target node according to load information and/or node attribute information of each target node, and selects one target node with the highest score from each target node as the first computing node, or selects one target node with the score exceeding a score threshold as the first computing node.
Optionally, the management node scores each target node according to a specified rule. Each specified rule corresponds to a requirement, and different requirements use different scoring rules.
For example, when load balancing across the computing nodes in the node cluster is desired, the specified rule defines that the lighter a computing node's load, the higher its score, and the heavier its load, the lower its score.
For another example, when it is desired to concentrate the load in the node cluster on one or a few nodes so that nodes without load can be powered off to save energy, the specified rule defines that the heavier a computing node's load, the higher its score, and the lighter its load, the lower its score.
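The two opposite scoring rules can be sketched as follows. The 0–100 scale, the load fraction, and the function names are illustrative assumptions; the patent leaves the exact scoring formula open.

```python
def score(load_fraction, rule):
    """Score a target node under one of the two example rules."""
    if rule == "balance":       # lighter load -> higher score
        return round(100 * (1 - load_fraction))
    if rule == "consolidate":   # heavier load -> higher score
        return round(100 * load_fraction)
    raise ValueError(f"unknown rule: {rule}")

def pick_first_node(targets, rule):
    """Select the highest-scoring target node as the first computing node."""
    return max(targets, key=lambda t: score(t["load"], rule))["node"]

targets = [{"node": "IN1", "load": 0.2}, {"node": "IN2", "load": 0.8}]
pick_first_node(targets, "balance")      # lightly loaded IN1 wins
pick_first_node(targets, "consolidate")  # heavily loaded IN2 wins
```

A score-threshold variant, as the text also allows, would simply return the first target whose score exceeds the threshold.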
Optionally, the management node has a delay scheduling function, so that when no target node exists among the N computing nodes, it does not immediately allocate a computing node to the first model training task from the whole node cluster. Instead, within a first time period it detects whether any computing node becomes a target node; the starting time of the first time period is the time at which the first model training task is scheduled, and the duration of the first time period is a first threshold. If a computing node becomes a target node within the first time period, the detected target node is determined as the first computing node. In this way, a computing node with the idle first resource and/or the first training data is allocated to the first model training task, which saves the time consumed in allocating resources to the first model training task and/or the time consumed by the computing node in acquiring the training data, improving model training efficiency.
If no computing node is detected to become a target node within the first time period, the management node determines a second computing node from the node cluster after the first time period ends, where the size of the unprotected resources included in the second computing node is larger than the size of the resources required for processing the first model training task.
Alternatively, the first threshold may be configured in advance by an administrator in the management node.
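The delay-scheduling behaviour can be sketched as below. The polling loop, the poll interval, and the function names are assumptions; the patent specifies only the first time period and the fallback, not how the detection is performed.

```python
import time

def schedule_with_delay(find_target, find_fallback, first_threshold, poll=0.01):
    """Wait up to first_threshold seconds for a target node to appear,
    then fall back to any node with enough unprotected resources."""
    deadline = time.monotonic() + first_threshold
    while time.monotonic() < deadline:
        node = find_target()    # a node with an idle first resource and/or data
        if node is not None:
            return node         # detected target node becomes the first node
        time.sleep(poll)
    return find_fallback()      # second computing node after the period ends

# A target node appears on the second probe, within the first time period:
probes = iter([None, "IN3"])
node = schedule_with_delay(lambda: next(probes), lambda: "IN9",
                           first_threshold=1.0)
```

If no target ever appears, the fallback path yields the second computing node instead.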
Optionally, the management node schedules the next first model training task from the first model training tasks included in the model training job and repeats this step until every first model training task included in the model training job has been scheduled. The operation of step 303 below is then performed.
Step 303: the management node sends a first training request to the first compute node, the first training request including a first model training task.
Optionally, when the first computing node includes an idle first resource, the first training request further includes a resource identifier of the first resource. When the first computing node includes first training data and an idle first resource, the first training request further includes a resource identification of the first resource and a data identification of the first training data. When the first computing node includes the first training data, the first training request also includes a data identification of the first training data.
Optionally, in step 302, the management node determines a first computing node for each first model training task included in the model training job. Therefore, in this step, for any first model training task included in the model training job, the management node sends a first training request to the first computing node corresponding to that task, where the first training request includes the first model training task. Following this step, the management node sends a first training request to the first computing node corresponding to each first model training task included in the model training job.
Then, the management node schedules the next model training job from the scheduling queue. The management node processes each first model training task included in the next model training job according to the above steps 301 to 303 until each first model training task included in each model training job in the scheduling queue is scheduled.
Step 304: the first computing node trains a first intelligent model through a first resource according to the first training data.
The first computing node may be in any one of three cases. In the first case, the first computing node includes an idle first resource. In the second, the first computing node includes the first training data and an idle first resource. In the third, the first computing node includes the first training data, and the size of the unprotected resources included in the first computing node exceeds the size of the resources required for processing the first model training task.
For the first case, the first computing node includes an idle first resource. In this step, the first computing node acquires the local first resource, acquires the training data set from the storage system according to the storage location of the training data set included in the first model training task, and acquires the piece of training data corresponding to the first model training task from the training data set according to that piece's offset and size, that is, acquires the first training data. It then trains the first intelligent model through the first resource according to the first training data.
Optionally, the first computing node further allocates a data identifier of the first training data, and sends a storage request to the management node, where the storage request includes a job identifier of the first parameter adjustment job and the data identifier of the first training data. And the management node receives the storage request, and forms a record by the node identifier of the first computing node, the job identifier of the first parameter adjustment job included in the storage request and the data identifier of the first training data and stores the record in the data corresponding relation.
Optionally, the first computing node further allocates a first protection time period to the first resource, where a starting time of the first protection time period is a time when the first resource starts to be used, and a time length of the first protection time period is a second threshold.
Optionally, the first computing node further allocates a second protection time period to the first training data, where a starting time of the second protection time period is a time when the first training data is acquired, and a time length of the second protection time period is a third threshold.
For the second case, the first computing node includes first training data and idle first resources, and in this step, the first computing node obtains local first training data and first resources, and trains the first intelligent model through the first resources according to the first training data.
Optionally, the first computing node further allocates a first protection time period to the first resource, where a starting time of the first protection time period is a time when the first resource starts to be used, and a time length of the first protection time period is a second threshold.
Optionally, the first computing node further allocates a second protection time period to the first training data, where a starting time of the second protection time period is a time when the first training data starts to be used, and a time length of the second protection time period is a third threshold.
Optionally, the second threshold or the third threshold may be configured in advance by an administrator in each computing node in the node cluster.
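The protection time periods and the unprotected-resource test used when choosing a target node can be sketched together as follows. Plain floats stand in for times, and the field names are assumptions; only the "start + threshold" structure comes from the text.

```python
def start_protection(now, threshold):
    """A protection time period starts when the resource or training data
    starts to be used, and lasts for the configured threshold."""
    return {"start": now, "end": now + threshold}

def unprotected_size(total_size, allocations, now):
    """Unprotected resources = total capacity minus every allocation whose
    protection time period has not yet ended."""
    protected = sum(a["size"] for a in allocations
                    if a["guard"]["end"] > now)
    return total_size - protected

# One 4-unit resource allocated at t=0 with a second threshold of 10:
allocs = [{"size": 4, "guard": start_protection(now=0.0, threshold=10.0)}]
free_at_5 = unprotected_size(16, allocs, now=5.0)    # guard still active
free_at_15 = unprotected_size(16, allocs, now=15.0)  # guard has ended
```

A node qualifies as a target node in the third case only when `unprotected_size` exceeds the resource size required by the first model training task.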
In the first and second cases, the first computing node sends an update request to the management node, where the update request includes a job identifier of the first parameter adjustment job, a resource identifier of the first resource, and a resource status, and the resource status is a use status.
And the management node receives the updating request, and updates the resource state of the first resource stored in the resource corresponding relation into a use state according to the job identification of the first parameter adjusting job, the node identification of the first computing node and the resource identification of the first resource.
For the third case described above, the first computing node includes the first training data, and the size of the unprotected resources included in the first computing node exceeds the size of the resources required for processing the first model training task. In this step, the first computing node acquires the local first training data, allocates the first resource from its unprotected resources according to the resource name and resource size included in the first model training task, and trains the first intelligent model through the first resource according to the first training data. The unprotected resources are the resources in the first computing node other than the protected resources; the protected resources are resources that the first computing node has allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
Optionally, the first computing node further allocates a resource identifier to the first resource and sends a storage request to the management node, where the storage request includes the job identifier of the first parameter adjustment job, the resource identifier of the first resource, and the resource state, and the resource state is the use state. The management node receives the storage request, forms a record from the job identifier of the first parameter adjustment job included in the storage request, the node identifier of the first computing node, the resource identifier of the first resource, and the resource state, and stores the record in the resource correspondence.
Optionally, the first computing node further allocates a first protection time period to the first resource, where a starting time of the first protection time period is a time for allocating the first resource, and a time length of the first protection time period is a second threshold.
Optionally, the first computing node further allocates a second protection time period to the first training data, where a starting time of the second protection time period is a time when the first training data starts to be used, and a time length of the second protection time period is a third threshold.
Optionally, the process by which the first computing node acquires the local first training data may be: when the first training request includes the data identifier of the first training data, acquiring the first training data locally according to that data identifier; or, when the first training request does not include the data identifier of the first training data, acquiring the data identifier of the first training data from the correspondence between job identifiers and data identifiers according to the job identifier of the first parameter adjustment job included in the first model training task, and then acquiring the first training data locally according to that data identifier.
Optionally, the process by which the first computing node acquires the local first resource may be: when the first training request includes the resource identifier of the first resource, acquiring the first resource locally according to that resource identifier; or, when the first training request does not include the resource identifier of the first resource, acquiring the resource identifier of the first resource from the correspondence between job identifiers and resource identifiers according to the job identifier of the first parameter adjustment job included in the first model training task, and then acquiring the local first resource according to that resource identifier.
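Both local-lookup paths in the last two paragraphs share the same shape, sketched below. The helper name and the dictionary-based correspondence are assumptions used for illustration.

```python
def resolve_local(request_id, job_id, local_map):
    """Return the local identifier of this job's resource or training data:
    use the identifier carried in the first training request if present,
    otherwise resolve it from the local job-identifier correspondence."""
    if request_id is not None:
        return request_id            # identifier carried in the training request
    return local_map[job_id]         # fall back to the local correspondence

job_to_data = {"IZ1": "ID1"}
resolve_local("ID1", "IZ1", job_to_data)  # path 1: identifier in the request
resolve_local(None, "IZ1", job_to_data)   # path 2: resolved locally by job id
```

The same function applies unchanged to resource identifiers with a job-to-resource correspondence.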
The first computing node continuously adjusts the parameter values of the ordinary parameters (model parameters) of the first intelligent model during training, until the first intelligent model converges or fails to converge, or the number of training iterations reaches a specified number. The first computing node then obtains the training result of the first intelligent model and sends a notification message to the management node, where the notification message includes the training result and the job identifier of the first parameter adjustment job.
Optionally, in a case that the management node allocates a second computing node to the first model training task, the management node sends a second training request to the second computing node, where the second training request includes the first model training task. And the second computing node receives a second training request, wherein the second training request comprises a first model training task, and the first model training task comprises information such as a first intelligent model, a job identification of a first parameter adjusting job, a resource name and a resource size, a storage position of the training data set, an offset of a piece of training data in the training data set, a size of the training data and the like.
The second computing node allocates the first resource required for processing the first model training task according to the resource name and resource size included in the first model training task, and acquires the training data set from the storage system according to the storage location of the training data set included in the first model training task. It then acquires, from the training data set, the piece of training data required for processing the first model training task according to that piece's offset and size within the training data set, obtaining the first training data. According to the first training data, it trains the first intelligent model included in the first model training task through the first resource until the first intelligent model converges or fails to converge, or the number of training iterations reaches the specified number. The second computing node then obtains the training result of the first intelligent model and sends a notification message to the management node, where the notification message includes the training result and the job identifier of the first parameter adjustment job.
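The offset-and-size lookup into the shared training data set can be sketched as a byte-range slice. The byte-level representation of the training data set is an assumption for illustration.

```python
def read_piece(dataset: bytes, offset: int, size: int) -> bytes:
    """Return the piece of training data for one task:
    dataset[offset : offset + size]."""
    if offset + size > len(dataset):
        raise ValueError("piece extends past the end of the training data set")
    return dataset[offset:offset + size]

dataset = bytes(range(100))                      # stand-in for the stored set
piece = read_piece(dataset, offset=10, size=5)   # this task's training data
```

Because each task carries its own offset and size, every computing node can extract its piece independently after fetching the data set once.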
Optionally, the second computing node further allocates a resource identifier for the first resource required for processing the first model training task, where the resource identifier identifies the first resource within the computing node, and stores the job identifier of the first parameter adjustment job and the resource identifier correspondingly in the correspondence between job identifiers and resource identifiers.
Optionally, the second computing node further allocates a data identifier for the first training data required for processing the first model training task, where the data identifier identifies the first training data within the computing node, and stores the job identifier of the first parameter adjustment job and the data identifier correspondingly in the correspondence between job identifiers and data identifiers.
Optionally, the second computing node further allocates a first protection time period to the first resource, where the starting time of the first protection time period is the time at which the first resource is used, and its duration is a second threshold. Since the first resource is used right after it is allocated, the starting time of the first protection time period here equals the time of allocating the first resource.
Optionally, the second computing node further allocates a second protection time period to the first training data, where the starting time of the second protection time period is the time at which the first training data is used, and its duration is a third threshold. Since the first training data is used right after it is acquired, the starting time of the second protection time period here equals the time of acquiring the first training data.
Optionally, the second computing node further sends a storage request to the management node, where the storage request includes a job identifier of the first parameter adjustment job, a data identifier of the first training data, a resource identifier of the first resource, and a resource status, and the resource status is a use status.
The management node receives the storage request, forms a record by the job identification of the first parameter adjustment job, the node identification of the second computing node, the resource identification and the resource state and stores the record in the resource corresponding relation; and forming a record by the operation identification of the first parameter adjustment operation, the node identification of the second computing node and the data identification of the training data, and storing the record in a data corresponding relation.
Optionally, the management node may receive notification messages sent by different computing nodes. When the training results included in the notification messages do not satisfy the specified condition, it acquires the (i+1)th batch of tasks corresponding to the first parameter adjustment job and then starts execution again from step 301. When the training results included in the notification messages satisfy the specified condition, it stops further training of the intelligent model of the first parameter adjustment job.
After the training of the intelligent model of the first parameter adjustment job stops, the first resource and/or the first training data in the first computing node will no longer be used.
Optionally, after the first protection time period corresponding to the first resource in the first computing node is ended, the first resource may be released, or the first resource may not be released. When the first resource is released, the first computing node sends a first deletion request to the management node, wherein the first deletion request comprises a node identifier of the first computing node and a resource identifier of the first resource. The management node receives the first deletion request, and deletes the record comprising the node identifier of the first computing node and the resource identifier of the first resource from the resource corresponding relation.
Optionally, after the second protection period corresponding to the first training data in the first computing node is ended, the first training data may be deleted, or the first training data may not be deleted. When the first training data is deleted, the first computing node sends a second deletion request to the management node, wherein the second deletion request comprises the node identification of the first computing node and the data identification of the first training data. And the management node receives the second deletion request and deletes the record comprising the node identifier of the first computing node and the data identifier of the first training data from the data corresponding relation.
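The release and deletion decisions in the two paragraphs above both hinge on whether a protection period — starting at the last use of the resource or training data, and lasting a configured threshold (the second or third threshold, respectively) — has ended. A minimal sketch of that check, with assumed function and parameter names:

```python
def protection_expired(last_used_at, threshold, now):
    """True once the protection period — starting at the last use and lasting
    `threshold` time units — has ended, so the resource may be released or
    the training data deleted (names are illustrative)."""
    return now - last_used_at >= threshold
```

Only after this returns true would the computing node send the corresponding deletion request to the management node; it may also choose not to release or delete the item at all.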
In the embodiment of the application, the management node allocates a computing node to each second model training task included in the first batch of tasks corresponding to the first parameter adjustment job. The computing node acquires the resources and training data required for processing the second model training task and sends a storage request to the management node, where the storage request includes the job identifier of the first parameter adjustment job, a data identifier of the training data, a resource identifier of the resources, and a resource state. The management node forms a record from the job identifier, the node identifier of the computing node, the resource identifier, and the resource state and stores the record in the resource corresponding relation, and forms a record from the job identifier, the node identifier of the computing node, and the data identifier and stores the record in the data corresponding relation. Therefore, when the management node allocates a computing node to a model training task included in the ith batch of tasks corresponding to the first parameter adjustment job (i = 2, 3, ...), it preferentially allocates the task to a computing node that already includes the resources and/or training data required for processing the model training tasks of the first parameter adjustment job. Such a computing node does not need to acquire the resources and/or training data when processing the model training task included in the ith batch of tasks, which reduces the time consumed by model training and improves the efficiency of model training.
Referring to fig. 4, an apparatus 400 for model training is provided in the embodiment of the present application, where the apparatus 400 is deployed on a management node in the embodiment shown in fig. 1, fig. 2, or fig. 3, and includes:
a processing unit 401, configured to schedule a first model training task, where the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set includes a first parameter value of each super parameter in at least one super parameter corresponding to the first parameter adjustment job;
the processing unit 401 is further configured to determine a first computing node from the node cluster according to the job identifier, where the first computing node has at least one of first training data and idle first resources, the first resources are resources required for processing a model training task of a first parameter adjustment job, and the first training data is training data required for training an intelligent model corresponding to the first parameter adjustment job;
a transceiver unit 402, configured to send a first training request to a first computing node, where the first training request includes a first model training task, and the first training request is used for the first computing node to train a first intelligent model according to at least one of a first resource and first training data.
Optionally, for the detailed implementation process in which the processing unit 401 determines the first computing node, refer to the relevant content in step 302 of the embodiment shown in fig. 3, which will not be described in detail here.
Optionally, the processing unit 401 is configured to:
determining a first computing node from the node cluster according to the resource corresponding relation, the data corresponding relation and the job identifier;
any record in the resource corresponding relation comprises a job identifier of the parameter adjustment job, a node identifier of a computing node in the node cluster, a resource identifier and a resource state, wherein the resource identifier is used for identifying resources which are included by the computing node and are needed by a model training task for processing the parameter adjustment job, and the resource state is used for describing whether the resources are idle at present;
any record in the data corresponding relation comprises a job identifier of the parameter adjustment job, a node identifier of a computing node in the node cluster and a data identifier, wherein the data identifier is used for identifying training data which are included by the computing node and are needed for training an intelligent model corresponding to the parameter adjustment job.
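Using the same illustrative record layout sketched earlier, the lookup of candidate nodes that the two correspondences enable could look like the following; the function name and field names are assumptions, not part of the patent.

```python
def find_candidate_nodes(job_id, resource_corr, data_corr):
    """Return the ids of nodes that, per the correspondences, already hold
    the job's required resources and/or training data."""
    nodes = {rec["node"] for rec in resource_corr if rec["job"] == job_id}
    nodes |= {rec["node"] for rec in data_corr if rec["job"] == job_id}
    return nodes
```

The result is the set of computing nodes from which the first computing node is then chosen.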
Optionally, the processing unit 401 is configured to:
determining, according to the resource corresponding relation, the data corresponding relation and the job identifier, N computing nodes in the node cluster that include the first training data and/or the first resources, wherein N is an integer greater than 0;
when at least one target node exists in the N computing nodes, selecting one target node from the at least one target node as a first computing node;
the target node comprises idle first resources, or the target node comprises first training data and unprotected resources included by the target node exceed the size of resources required for processing the first model training task, the unprotected resources are other resources except for protected resources in the target node, the protected resources are resources allocated to the parameter adjustment operation, and the protection time period corresponding to the protected resources is not finished.
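The two-branch target-node condition described above can be sketched as a predicate. This is an illustrative reading only; the dictionary field names and resource-size arithmetic are assumptions introduced here for clarity.

```python
def is_target_node(node, required_size):
    """A node is a target if it holds the idle first resources, or holds the
    first training data and its unprotected resources (total minus protected)
    exceed the size required to process the first model training task."""
    if node["has_idle_first_resource"]:
        return True
    unprotected = node["total_resource"] - node["protected_resource"]
    return node["has_first_training_data"] and unprotected > required_size
```

A node with the training data but whose resources are almost entirely protected (allocated to other parameter adjustment jobs whose protection periods have not ended) would thus not qualify.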
Optionally, the detailed implementation process of the processing unit 401 determining N computing nodes may refer to relevant contents in step 3021 in the embodiment shown in fig. 3, and will not be described in detail here.
Optionally, the processing unit 401 is configured to:
determining at least one target node according to the job identification, and selecting one target node from each target node as a first computing node according to the load information and/or the node attribute information of each target node in the at least one target node;
the target node comprises idle first resources, or the target node comprises first training data and unprotected resources included by the target node exceed the size of resources required for processing the first model training task, the unprotected resources are other resources except for protected resources in the target node, the protected resources are resources allocated to the parameter adjustment operation, and the protection time period corresponding to the protected resources is not finished.
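Selection among several target nodes according to load information could be sketched as below. The least-loaded choice and the load mapping are assumptions for illustration; the patent leaves the exact use of load and node attribute information open, and node-attribute tie-breaking is omitted here.

```python
def pick_first_compute_node(target_nodes, load):
    """Choose the least-loaded target node as the first computing node;
    `load` maps a node id to its current load (illustrative metric)."""
    return min(target_nodes, key=lambda n: load[n])
```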
Optionally, the detailed implementation process of the processing unit 401 selecting one target node as the first computing node may refer to relevant contents in step 3022 of the embodiment shown in fig. 3, and will not be described in detail here.
Optionally, the processing unit 401 is further configured to:
when a target node does not exist in the N computing nodes, detecting whether a computing node becomes the target node in the N computing nodes in a first time period, wherein the starting time of the first time period is the time for scheduling the first model training task, the time length of the first time period is a first threshold value, and the N computing nodes are computing nodes comprising first training data and/or first resources;
detecting that a computing node becomes a target node in a first time period, and determining the detected target node as a first computing node;
the target node comprises idle first resources, or the target node comprises first training data and unprotected resources included by the target node exceed the size of resources required for processing the first model training task, the unprotected resources are other resources except for protected resources in the target node, the protected resources are resources allocated to the parameter adjustment operation, and the protection time period corresponding to the protected resources is not finished.
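The detection within the first time period can be sketched as a bounded polling loop. The polling approach, interval, and injected clock/sleep functions are illustrative assumptions; the patent only specifies that detection starts when the first model training task is scheduled and lasts for the first threshold.

```python
def detect_target_node(nodes, is_target, deadline, poll_interval, now, sleep):
    """Poll the N candidate nodes until one becomes a target node or the
    first time period (ending at `deadline`) expires; return None on timeout."""
    while now() < deadline:
        for node in nodes:
            if is_target(node):
                return node
        sleep(poll_interval)
    return None
```

If this returns None, the management node falls back to choosing a second computing node by unprotected resource size, as described next.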
Optionally, any record in the resource correspondence further includes the resource size of the resource identified by the resource identifier,
the processing unit 401 is further configured to detect that no computing node becomes a target node in a first time period, and after the first time period ends, determine a second computing node from the node cluster according to the resource correspondence, where the size of an unprotected resource included in the second computing node is larger than the size of a resource required for processing the first model training task;
the transceiving unit 402 is further configured to send a second training request to the second computing node, where the second training request includes the first model training task, and the second training request is used for the second computing node to train the first intelligent model.
Optionally, the transceiver unit 402 is further configured to receive a first deletion request, where the first deletion request includes a node identifier of the computing node and a resource identifier of the first resource, the first deletion request is sent by the first computing node after a first protection time period ends, a starting time of the first protection time period is a time when the first resource is used last time, and a time length of the first protection time period is a second threshold;
the processing unit 401 is further configured to delete a record including the node identifier of the first computing node and the resource identifier of the first resource from the resource correspondence.
Optionally, the transceiver 402 is further configured to receive a second deletion request, where the second deletion request includes a node identifier of the first computing node and a data identifier of the first training data, the second deletion request is sent by the first computing node after a second protection time period ends, a starting time of the second protection time period is a time when the first training data is used last time, and a time length of the second protection time period is a third threshold;
the processing unit 401 is further configured to delete a record including a node identifier of the first computing node and a data identifier of the first training data from the data correspondence.
Optionally, the transceiver 402 is further configured to send a third training request to the first computing node, where the third training request includes a second model training task, the second model training task includes a second intelligent model and a job identifier of the first parameter adjustment job, the second model training task is a model training task included in the 1 st batch of tasks corresponding to the first parameter adjustment job, the second intelligent model is obtained by configuring the algorithm based on a second parameter value set, the second parameter value set includes a second parameter value of each super parameter, and the third training request is used for the first computing node to allocate a first resource used for training the second intelligent model and obtain first training data used for training the second intelligent model; receiving a storage request sent by a first computing node, wherein the storage request comprises a data identifier of first training data, a resource identifier of first resources and a resource state;
the processing unit 401 is further configured to store a corresponding relationship among the job identifier, the node identifier of the first computing node, the resource identifier of the first resource, and the resource state in a resource corresponding relationship; and storing the correspondence among the job identification, the node identification of the first compute node, and the data identification of the first training data in a data correspondence.
Optionally, the detailed implementation process in which the transceiver unit 402 sends the third training request may refer to relevant contents in step 203 of the embodiment shown in fig. 2, and will not be described in detail here.
Optionally, the detailed implementation process of the processing unit 401 for saving the content in the resource corresponding relationship and the data corresponding relationship may refer to the related content in step 204 in the embodiment shown in fig. 2, and will not be described in detail here.
In this embodiment of the application, the processing unit schedules a first model training task, where the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set includes a first parameter value of each super parameter in at least one super parameter corresponding to the first parameter adjustment job. And determining a first computing node from the node cluster according to the job identification, wherein the first computing node has at least one of first training data and idle first resources, the first resources are resources required for processing a model training task of the first parameter adjustment job, and the first training data are training data required for training an intelligent model corresponding to the first parameter adjustment job. The transceiver unit sends a first training request to the first computing node, the first training request including a first model training task, the first training request being used by the first computing node to train a first intelligent model according to at least one of the first resource and the first training data. In this way, after the first computing node receives the first training request including the first model training task, it may not be necessary to allocate the first resource to the first model training task and/or obtain the first training data, so that the time for allocating the first resource and/or the time for obtaining the first training data is saved, and the efficiency of training the first intelligent model is improved.
Referring to fig. 5, an embodiment of the present application provides an apparatus 500 for model training, where the apparatus 500 is deployed on a computing node in the embodiment shown in fig. 1, fig. 2, or fig. 3, and includes:
a transceiver unit 501, configured to receive a first training request sent by a management node, where the first training request includes a first model training task, the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, the first parameter value set includes a first parameter value of each super parameter in at least one super parameter corresponding to the first parameter adjustment job, and the apparatus 500 has at least one of a first resource and first training data bound to the first parameter adjustment job;
a processing unit 502, configured to obtain at least one of a first resource and first training data according to the job identifier; the first intelligent model is trained based on at least one of the first resources and the first training data.
Optionally, the detailed implementation process of the processing unit 502 for training the first intelligent model can be referred to the relevant content in step 304 in the embodiment shown in fig. 3, and will not be described in detail here.
Optionally, the transceiver unit 501 is further configured to receive a third training request, where the third training request includes a second model training task, the second model training task includes a second intelligent model and a job identifier of a first parameter adjustment job, the second model training task is a model training task included in a 1 st batch of tasks corresponding to the first parameter adjustment job, the second intelligent model is obtained by configuring the algorithm based on a second parameter value set, and the second parameter value set includes a second parameter value of each super parameter;
the processing unit 502 is further configured to allocate a first resource for training a second intelligent model from an unprotected resource, and acquire first training data for training the second intelligent model, where the unprotected resource is a resource other than a protected resource in the apparatus 500, the protected resource is a resource allocated to a parameter adjustment job, and a protection time period corresponding to the protected resource has not yet ended; a second intelligent model is trained based on the first resources and the first training data.
Optionally, the detailed implementation processes of allocating the first resource, acquiring the first training data and training the second intelligent model by the processing unit 502 may refer to relevant contents in steps 204 and 205 in the embodiment shown in fig. 2, and will not be described in detail here.
Optionally, the transceiving unit 501 is further configured to send a storage request, where the storage request includes a data identifier of the first training data, a resource identifier of the first resource, and a resource state, and the storage request is used to manage that the node stores a correspondence between the job identifier, the node identifier of the apparatus 500, the resource identifier of the first resource, and the resource state in the resource correspondence, and stores a correspondence between the job identifier, the node identifier of the apparatus 500, and the data identifier of the first training data in the data correspondence.
Optionally, the transceiver unit 501 is further configured to send a first deletion request after a first protection time period ends, where the first deletion request includes a node identifier of the apparatus 500 and a resource identifier of the first resource, a starting time of the first protection time period is a time when the apparatus 500 uses the first resource last time, a time length of the first protection time period is a second threshold, and the first deletion request is used by a management node to delete a record including the node identifier of the apparatus 500 and the resource identifier of the first resource from a resource correspondence relationship.
Optionally, the transceiver unit 501 is further configured to send a second deletion request after a second protection time period ends, where the second deletion request includes a node identifier of the apparatus 500 and a data identifier of the first training data, a starting time of the second protection time period is a time when the apparatus 500 uses the first training data for the last time, a time length of the second protection time period is a third threshold, and the second deletion request is used by a management node to delete a record including the node identifier of the apparatus 500 and the data identifier of the first training data from a data correspondence relationship.
In this embodiment of the application, a transceiver unit receives a first training request sent by a management node, where the first training request includes a first model training task, the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set includes a first parameter value of each super parameter in at least one super parameter corresponding to the first parameter adjustment job. Since the apparatus locally has at least one of the first resource and the first training data bound to the first parameter adjustment job, the processing unit is able to acquire at least one of the first resource and the first training data based on the job identifier, and trains the first intelligent model based on at least one of the first resource and the first training data. Therefore, when the processing unit receives the first model training task, the time for allocating the first resource and/or the time for acquiring the first training data can be saved, and the efficiency of training the intelligent model is improved.
Referring to fig. 6, an embodiment of the present application provides a schematic diagram of an apparatus 600 for model training. The apparatus 600 may be a management node in any of the embodiments described above. The apparatus 600 comprises at least one processor 601, a bus system 602, a memory 603 and at least one network interface 604.
The apparatus 600 is a hardware structure apparatus, and can be used to implement the functional modules in the apparatus 400 described in fig. 4. For example, it will be appreciated by those skilled in the art that the processing unit 401 in the apparatus 400 shown in fig. 4 may be implemented by the at least one processor 601 calling code in the memory 603, and the transceiving unit 402 in the apparatus 400 shown in fig. 4 may be implemented by the network interface 604.
Alternatively, the processor 601 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the program in the solutions of the present application.
The bus system 602 may include a path that carries information between the components.
The network interface 604 is used for communicating with other devices or a communication network.
The memory 603 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus. The memory may alternatively be integrated with the processor.
The memory 603 is used to store application program code for executing the solutions of this application, and execution is controlled by the processor 601. The processor 601 is configured to execute the application program code stored in the memory 603 to implement the functions of the method in this patent.
In particular implementations, the processor 601 may include one or more CPUs, such as CPU0 and CPU1 in fig. 6.
In particular implementations, the apparatus 600 may include multiple processors, such as the processor 601 and the processor 607 in fig. 6. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
Referring to fig. 7, an embodiment of the present application provides a schematic diagram of an apparatus 700 for model training. The apparatus 700 may be a computing node in any of the embodiments described above. The apparatus 700 comprises at least one processor 701, a bus system 702, a memory 703 and at least one network interface 704.
The apparatus 700 is a hardware structure apparatus, and can be used to implement the functional modules in the apparatus 500 described in fig. 5. For example, it will be appreciated by those skilled in the art that the processing unit 502 in the apparatus 500 shown in fig. 5 may be implemented by the at least one processor 701 calling code in the memory 703, and the transceiving unit 501 in the apparatus 500 shown in fig. 5 may be implemented by the network interface 704.
Alternatively, the processor 701 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the program in the solutions of the present application.
The bus system 702 may include a path that transfers information between the components.
The network interface 704 is used for communication with other devices or communication networks.
The memory 703 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including a compact disc, a laser disc, an optical disc, a digital versatile disc, a Blu-ray disc, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus. The memory may alternatively be integrated with the processor.
The memory 703 is used to store application program code for executing the solutions of this application, and execution is controlled by the processor 701. The processor 701 is configured to execute the application program code stored in the memory 703 to implement the functions of the method in this patent.
In particular implementations, the processor 701 may include one or more CPUs, such as CPU0 and CPU1 in fig. 7.
In particular implementations, the apparatus 700 may include multiple processors, such as the processor 701 and the processor 707 in fig. 7. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
The embodiment of the present application provides a system for model training, which includes the apparatus 400 provided in the embodiment shown in fig. 4 and the apparatus 500 provided in the embodiment shown in fig. 5, or includes the apparatus 600 provided in the embodiment shown in fig. 6 and the apparatus 700 provided in the embodiment shown in fig. 7.
Referring to fig. 8, the apparatus 400 provided in the embodiment shown in fig. 4 or the apparatus 600 provided in the embodiment shown in fig. 6 is a management node 801, and the apparatus 500 provided in the embodiment shown in fig. 5 or the apparatus 700 provided in the embodiment shown in fig. 7 is a computing node 802.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only an example of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the principles of the present application should be included in the scope of the present application.

Claims (22)

1. A method of model training, the method comprising:
the management node schedules a first model training task, wherein the first model training task comprises a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set comprises a first parameter value of each super parameter in at least one super parameter corresponding to the first parameter adjustment job;
the management node determines a first computing node from a node cluster according to the job identification, wherein the first computing node has at least one of first training data and idle first resources, the first resources are resources required for processing a model training task of the first parameter adjustment job, and the first training data are training data required for training an intelligent model corresponding to the first parameter adjustment job;
the management node sends a first training request to the first computing node, the first training request including the first model training task, the first training request being used by the first computing node to train the first intelligent model according to at least one of the first resource and the first training data.
2. The method of claim 1, wherein the managing node determining a first computing node from a cluster of nodes based on the job identification comprises:
the management node determines a first computing node from the node cluster according to the resource corresponding relation, the data corresponding relation and the operation identification;
any record in the resource corresponding relation comprises a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster, a resource identifier and a resource state, wherein the resource identifier is used for identifying resources which are included by the computing node and are needed by a model training task for processing the parameter adjustment job, and the resource state is used for describing whether the resources are idle at present;
any record in the data corresponding relation comprises a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster and a data identifier, wherein the data identifier is used for identifying training data which are included by the computing node and are needed for training an intelligent model corresponding to the parameter adjustment job.
3. The method of claim 2, wherein the management node determining a first compute node from the cluster of nodes based on the resource correspondence, the data correspondence, and the job identification, comprises:
the management node determines N computing nodes including the first training data and/or the first resource in the node cluster according to the resource corresponding relation, the data corresponding relation and the job identification, wherein N is an integer greater than 0;
when at least one target node exists in the N computing nodes, the management node selects one target node from the at least one target node as a first computing node;
the target node includes the idle first resources, or the target node includes the first training data and the unprotected resource included by the target node exceeds the resource size required for processing the first model training task, the unprotected resource is a resource other than a protected resource in the target node, the protected resource is a resource allocated to a parameter adjustment job, and a protection time period corresponding to the protected resource has not yet ended.
4. The method of claim 1 or 3, wherein the management node determining a first computing node from the node cluster according to the job identifier comprises:
the management node determines at least one target node according to the job identifier, and selects one target node from the at least one target node as the first computing node according to load information and/or node attribute information of each target node in the at least one target node;
wherein a target node includes the first resource in an idle state, or a target node includes the first training data and the unprotected resources included by the target node exceed the resources required for processing the first model training task, the unprotected resources being resources in the target node other than protected resources, and a protected resource being a resource that has been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
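When several target nodes qualify, claim 4 selects one using load information and/or node attribute information, without fixing a formula. One plausible tie-breaking rule, given purely as an illustrative assumption, is to take the least-loaded candidate:

```python
def pick_first_compute_node(target_nodes):
    """Select one target node as the first computing node.

    `target_nodes` is a list of (node_id, load, attrs) tuples: `load` might be
    current CPU/GPU utilisation, and `attrs` node attribute information such as
    accelerator type. This sketch simply prefers the lowest load; a real
    scheduler could also weight node attributes into the score.
    """
    return min(target_nodes, key=lambda t: t[1])[0]

# Usage: three candidate target nodes with different loads.
chosen = pick_first_compute_node([("n1", 0.8, {}), ("n2", 0.2, {}), ("n3", 0.5, {})])
```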
5. The method of claim 1, 3, or 4, further comprising:
when no target node exists among N computing nodes, the management node detects, within a first time period, whether a computing node among the N computing nodes becomes a target node, wherein the start time of the first time period is the time at which the first model training task is scheduled, the length of the first time period is a first threshold, and the N computing nodes are computing nodes that include the first training data and/or the first resource;
when the management node detects, within the first time period, that a computing node has become a target node, the management node determines the detected target node as the first computing node;
wherein a target node includes the first resource in an idle state, or a target node includes the first training data and the unprotected resources included by the target node exceed the resources required for processing the first model training task, the unprotected resources being resources in the target node other than protected resources, and a protected resource being a resource that has been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
6. The method of any one of claims 2 to 5, further comprising:
the management node receives a first deletion request, wherein the first deletion request includes a node identifier of the first computing node and a resource identifier of the first resource, the first deletion request is sent by the first computing node after a first protection time period ends, the start time of the first protection time period is the time at which the first resource was last used, and the length of the first protection time period is a second threshold;
the management node deletes, from the resource correspondence, the record that includes the node identifier of the first computing node and the resource identifier of the first resource.
7. The method of any one of claims 2 to 6, further comprising:
the management node receives a second deletion request, wherein the second deletion request includes a node identifier of the first computing node and a data identifier of the first training data, the second deletion request is sent by the first computing node after a second protection time period ends, the start time of the second protection time period is the time at which the first training data was last used, and the length of the second protection time period is a third threshold;
the management node deletes, from the data correspondence, the record that includes the node identifier of the first computing node and the data identifier of the first training data.
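Claims 6 and 7 describe the same garbage-collection pattern for both tables: the computing node waits out a protection period measured from last use, then asks the management node to drop the matching record. A minimal sketch of both sides (the dict-based record layout and names are illustrative assumptions):

```python
def protection_ended(last_used: float, threshold: float, now: float) -> bool:
    # The compute node sends its deletion request only once the protection
    # period, started at the resource's (or data's) last use, has elapsed.
    return now - last_used >= threshold

def handle_deletion_request(table, node_id, key_id):
    """Management-node side: delete every record matching the node and key.

    `table` is a list of dicts like {"job_id": ..., "node_id": ..., "key_id": ...},
    standing in for either the resource correspondence (key_id = resource
    identifier) or the data correspondence (key_id = data identifier).
    """
    table[:] = [r for r in table
                if not (r["node_id"] == node_id and r["key_id"] == key_id)]
```

Because both correspondences share this shape, a single handler can serve the first and second deletion requests alike.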
8. A method of model training, the method comprising:
a computing node receives a first training request sent by a management node, wherein the first training request includes a first model training task, the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, the first parameter value set includes a first parameter value of each hyperparameter in at least one hyperparameter corresponding to the first parameter adjustment job, and the computing node has at least one of a first resource and first training data bound to the first parameter adjustment job;
the computing node acquires at least one of the first resource and the first training data according to the job identifier;
the computing node trains the first intelligent model according to at least one of the first resource and the first training data.
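On the computing-node side, claim 8 amounts to: receive the task, resolve the locally bound resource and/or training data by job identifier, then train. A hedged sketch of that flow, with the lookup maps and the returned summary as illustrative placeholders rather than anything prescribed by the patent:

```python
def handle_training_request(request, local_resources, local_data):
    """Process a first training request on a computing node.

    `request` carries the first model training task: the first intelligent
    model plus the job identifier of the first parameter adjustment job.
    `local_resources` / `local_data` map job identifiers to whatever this
    node has bound to that job (either may lack an entry).
    """
    job_id = request["job_id"]
    # Acquire whichever of the first resource / first training data is bound locally.
    resource = local_resources.get(job_id)
    data = local_data.get(job_id)
    # Train the first intelligent model with what is available; here the
    # "training" step is stubbed out as a summary of what would be used.
    return {"model": request["model"], "used_resource": resource, "trained_on": data}

result = handle_training_request(
    {"job_id": "j1", "model": "m1"},
    local_resources={"j1": "gpu-0"},
    local_data={"j1": "dataset-1"},
)
```

The point of the job-identifier indirection is that the node can reuse the already-bound resource and cached data across successive training tasks of the same parameter adjustment job.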
9. The method of claim 8, further comprising:
the computing node sends a first deletion request after a first protection time period ends, wherein the first deletion request includes a node identifier of the computing node and a resource identifier of the first resource, the start time of the first protection time period is the time at which the computing node last used the first resource, and the length of the first protection time period is a second threshold; the first deletion request is used by the management node to delete, from a resource correspondence, the record that includes the node identifier of the computing node and the resource identifier of the first resource, the resource correspondence storing correspondences among a job identifier of a parameter adjustment job, a node identifier of a computing node, a resource identifier, and a resource state.
10. The method of claim 8 or 9, further comprising:
the computing node sends a second deletion request after a second protection time period ends, wherein the second deletion request includes a node identifier of the computing node and a data identifier of the first training data, the start time of the second protection time period is the time at which the computing node last used the first training data, and the length of the second protection time period is a third threshold; the second deletion request is used by the management node to delete, from a data correspondence, the record that includes the node identifier of the computing node and the data identifier of the first training data, the data correspondence storing correspondences among a job identifier of a parameter adjustment job, a node identifier of a computing node, and a data identifier.
11. An apparatus for model training, the apparatus comprising:
a processing unit, configured to schedule a first model training task, wherein the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, and the first parameter value set includes a first parameter value of each hyperparameter in at least one hyperparameter corresponding to the first parameter adjustment job;
the processing unit is further configured to determine a first computing node from a node cluster according to the job identifier, wherein the first computing node has at least one of first training data and an idle first resource, the first resource is a resource required for processing a model training task of the first parameter adjustment job, and the first training data is training data required for training an intelligent model corresponding to the first parameter adjustment job;
a transceiver unit, configured to send a first training request to the first computing node, wherein the first training request includes the first model training task, and the first training request is used by the first computing node to train the first intelligent model according to at least one of the first resource and the first training data.
12. The apparatus of claim 11, wherein the processing unit is configured to:
determine the first computing node from the node cluster according to a resource correspondence, a data correspondence, and the job identifier;
wherein any record in the resource correspondence includes a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster, a resource identifier, and a resource state, the resource identifier identifies resources, included by the computing node, that are required for processing a model training task of the parameter adjustment job, and the resource state describes whether the resources are currently idle;
any record in the data correspondence includes a job identifier of a parameter adjustment job, a node identifier of a computing node in the node cluster, and a data identifier, wherein the data identifier identifies training data, included by the computing node, that is required for training an intelligent model corresponding to the parameter adjustment job.
13. The apparatus of claim 12, wherein the processing unit is configured to:
determine, according to the resource correspondence, the data correspondence, and the job identifier, N computing nodes in the node cluster that include the first training data and/or the first resource, wherein N is an integer greater than 0;
when at least one target node exists among the N computing nodes, select one target node from the at least one target node as the first computing node;
wherein a target node includes the first resource in an idle state, or a target node includes the first training data and the unprotected resources included by the target node exceed the resources required for processing the first model training task, the unprotected resources being resources in the target node other than protected resources, and a protected resource being a resource that has been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
14. The apparatus of claim 11 or 13, wherein the processing unit is configured to:
determine at least one target node according to the job identifier, and select one target node from the at least one target node as the first computing node according to load information and/or node attribute information of each target node in the at least one target node;
wherein a target node includes the first resource in an idle state, or a target node includes the first training data and the unprotected resources included by the target node exceed the resources required for processing the first model training task, the unprotected resources being resources in the target node other than protected resources, and a protected resource being a resource that has been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
15. The apparatus of claim 11, 13, or 14, wherein the processing unit is further configured to:
when no target node exists among N computing nodes, detect, within a first time period, whether a computing node among the N computing nodes becomes a target node, wherein the start time of the first time period is the time at which the first model training task is scheduled, the length of the first time period is a first threshold, and the N computing nodes are computing nodes that include the first training data and/or the first resource;
when it is detected, within the first time period, that a computing node has become a target node, determine the detected target node as the first computing node;
wherein a target node includes the first resource in an idle state, or a target node includes the first training data and the unprotected resources included by the target node exceed the resources required for processing the first model training task, the unprotected resources being resources in the target node other than protected resources, and a protected resource being a resource that has been allocated to a parameter adjustment job and whose corresponding protection time period has not yet ended.
16. The apparatus of any one of claims 11 to 15, wherein:
the transceiver unit is further configured to receive a first deletion request, wherein the first deletion request includes a node identifier of the first computing node and a resource identifier of the first resource, the first deletion request is sent by the first computing node after a first protection time period ends, the start time of the first protection time period is the time at which the first resource was last used, and the length of the first protection time period is a second threshold;
the processing unit is further configured to delete, from the resource correspondence, the record that includes the node identifier of the first computing node and the resource identifier of the first resource.
17. The apparatus of any one of claims 11 to 16, wherein:
the transceiver unit is further configured to receive a second deletion request, wherein the second deletion request includes a node identifier of the first computing node and a data identifier of the first training data, the second deletion request is sent by the first computing node after a second protection time period ends, the start time of the second protection time period is the time at which the first training data was last used, and the length of the second protection time period is a third threshold;
the processing unit is further configured to delete, from the data correspondence, the record that includes the node identifier of the first computing node and the data identifier of the first training data.
18. An apparatus for model training, the apparatus comprising:
a transceiver unit, configured to receive a first training request sent by a management node, wherein the first training request includes a first model training task, the first model training task includes a first intelligent model and a job identifier of a first parameter adjustment job, the first intelligent model is obtained by configuring an algorithm corresponding to the first parameter adjustment job based on a first parameter value set, the first parameter value set includes a first parameter value of each hyperparameter in at least one hyperparameter corresponding to the first parameter adjustment job, and the apparatus has at least one of a first resource and first training data bound to the first parameter adjustment job;
a processing unit, configured to acquire at least one of the first resource and the first training data according to the job identifier, and to train the first intelligent model according to at least one of the first resource and the first training data.
19. The apparatus of claim 18, wherein:
the transceiver unit is further configured to send a first deletion request after a first protection time period ends, wherein the first deletion request includes a node identifier of the apparatus and a resource identifier of the first resource, the start time of the first protection time period is the time at which the apparatus last used the first resource, and the length of the first protection time period is a second threshold; the first deletion request is used by the management node to delete, from a resource correspondence, the record that includes the node identifier of the apparatus and the resource identifier of the first resource, the resource correspondence storing correspondences among a job identifier of a parameter adjustment job, a node identifier of a computing node, a resource identifier, and a resource state.
20. The apparatus of claim 18 or 19, wherein:
the transceiver unit is further configured to send a second deletion request after a second protection time period ends, wherein the second deletion request includes a node identifier of the apparatus and a data identifier of the first training data, the start time of the second protection time period is the time at which the apparatus last used the first training data, and the length of the second protection time period is a third threshold; the second deletion request is used by the management node to delete, from a data correspondence, the record that includes the node identifier of the apparatus and the data identifier of the first training data, the data correspondence storing correspondences among a job identifier of a parameter adjustment job, a node identifier of a computing node, and a data identifier.
21. An apparatus for model training, the apparatus comprising a processor and a memory, the processor executing a program in the memory to cause the apparatus to perform the method of any one of claims 1 to 10.
22. A computer-readable storage medium characterized in that the computer-readable storage medium stores a program for implementing the method of any one of claims 1 to 10.
CN202010600109.4A 2020-06-28 2020-06-28 Model training method and device and computer readable storage medium Pending CN113849295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010600109.4A CN113849295A (en) 2020-06-28 2020-06-28 Model training method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113849295A (en) 2021-12-28

Family

ID=78972689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010600109.4A Pending CN113849295A (en) 2020-06-28 2020-06-28 Model training method and device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113849295A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220247625A1 (en) * 2021-01-29 2022-08-04 Capital One Services, Llc Platform for establishing computing node clusters in different environments
US11770295B2 (en) * 2021-01-29 2023-09-26 Capital One Services, Llc Platform for establishing computing node clusters in different environments
WO2024067404A1 (en) * 2022-09-27 2024-04-04 华为技术有限公司 Model training management method, apparatus and system
CN118245811A (en) * 2024-05-29 2024-06-25 苏州元脑智能科技有限公司 Model parameter management method and device, storage medium and electronic equipment
CN118245811B (en) * 2024-05-29 2024-09-24 苏州元脑智能科技有限公司 Model parameter management method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113849295A (en) Model training method and device and computer readable storage medium
CN110442451B (en) Deep learning-oriented multi-type GPU cluster resource management scheduling method and system
US8266289B2 (en) Concurrent data processing in a distributed system
KR20190132475A (en) Training machine learning models for large distributed systems using job servers
CN109788046B (en) Multi-strategy edge computing resource scheduling method based on improved bee colony algorithm
US20200174844A1 (en) System and method for resource partitioning in distributed computing
CN113946431B (en) Resource scheduling method, system, medium and computing device
WO2023198061A1 (en) Container scheduling method, electronic device, and storage medium
CN113225269B (en) Container-based workflow scheduling method, device and system and storage medium
CN112114973A (en) Data processing method and device
Glazebrook et al. On the optimal allocation of service to impatient tasks
US20230037293A1 (en) Systems and methods of hybrid centralized distributive scheduling on shared physical hosts
CN111709723A (en) RPA business process intelligent processing method, device, computer equipment and storage medium
CN116627661B (en) Method and system for scheduling computing power resources
JP2023532358A (en) Resource scheduling method, resource scheduling system, and equipment
CN109189581B (en) Job scheduling method and device
CN108984286A (en) A kind of resource regulating method and system of cloud computing platform
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
CA2631255A1 (en) Scalable scheduling of tasks in heterogeneous systems
US20170344266A1 (en) Methods for dynamic resource reservation based on classified i/o requests and devices thereof
US6782535B1 (en) Dynamic queue width system and method
CN112181661B (en) Task scheduling method
CN107526632B (en) Process pool expansion method and device
JP2012038275A (en) Transaction calculation simulation system, method, and program
CN113176933B (en) Dynamic cloud network interconnection method for massive workflow tasks

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20220221

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technologies Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

SE01 Entry into force of request for substantive examination