CN117556921A - Model training method and device, electronic equipment and storage medium

Model training method and device, electronic equipment and storage medium

Info

Publication number
CN117556921A
CN117556921A (application CN202311491387.0A)
Authority
CN
China
Prior art keywords
model
training
model parameters
node
target
Prior art date
Legal status
Pending
Application number
CN202311491387.0A
Other languages
Chinese (zh)
Inventor
郭熹
贺鸣
秦守浩
程新洲
张珂珂
Current Assignee
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN202311491387.0A
Publication of CN117556921A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a model training method and device, an electronic device, and a storage medium, relating to the technical field of artificial intelligence and addressing the low model training efficiency of existing parallel computing frameworks. The method comprises the following steps: acquiring first model parameters of a target model; sending the first model parameters to each training node and to the verification node, so that each training node trains the target model according to the first model parameters and the verification node verifies the target model according to the first model parameters; receiving a first verification result of the first model parameters sent by the verification node, where the first verification result is pass or fail; and, when the first verification result is pass, obtaining a trained target model, the model parameters in the trained target model being the first model parameters.

Description

Model training method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method, a model training device, electronic equipment and a storage medium.
Background
Traditional computing is usually borne independently by a central processing unit (CPU). However, as data scale and computational complexity grow, a single CPU often cannot meet the demand, whereas heterogeneous computing resources offer strong parallel processing capability and low energy consumption and play an important role in processing large-scale data and complex computing tasks. Parallel computing therefore uses computing resources on different machines to increase computing speed and processing capacity. Against the background of the rapid development of large-scale data and artificial intelligence, the parallel computing framework has gradually become an important means of realizing parallel computing.
However, the model training method used in the parallel computing framework is time-consuming, and its model training efficiency is low.
Disclosure of Invention
The application provides a model training method and device, an electronic device, and a storage medium. By running the training process and the verification process synchronously, and stopping training to obtain the final model as soon as a verification result of pass is returned, the model training efficiency in a parallel computing framework can be effectively improved.
In a first aspect, the present application provides a model training method, the method comprising: acquiring first model parameters of a target model; sending the first model parameters to each training node and to the verification node, so that each training node trains the target model according to the first model parameters and the verification node verifies the target model according to the first model parameters; receiving a first verification result of the first model parameters sent by the verification node, where the first verification result is pass or fail; and, when the first verification result is pass, obtaining a trained target model, the model parameters in the trained target model being the first model parameters.
The technical scheme provided by the application brings at least the following beneficial effects. The first model parameters are sent to each training node and to the verification node, so that each training node trains the target model according to the first model parameters while the verification node verifies the target model according to the first model parameters. Model training proceeds in the training process, the model to be verified is checked periodically in the verification process, and once a verification result of pass is returned both processes are terminated and the final model is obtained. Compared with the model training method in the existing parallel computing framework, in which the training nodes must complete a preset number of training iterations before the parameter server obtains the final model, the training process here can be terminated as soon as the verification result is pass, without completing the preset number of iterations. This effectively saves model training time, improves model training efficiency, and thereby improves the computing efficiency of the parallel computing framework.
In a possible implementation manner, obtaining a first model parameter of the target model includes: acquiring initial model parameters of a target model; transmitting the target model and the initial model parameters to each training node so that each training node trains the target model according to the initial model parameters; receiving initial gradient information sent by each training node; the initial gradient information is used for indicating the updating direction of the initial model parameters; and determining a first model parameter according to the initial gradient information.
In a possible implementation manner, determining the first model parameters according to the initial gradient information includes: determining the model parameters updated for the first time as the first model parameters according to the initial gradient information.
In a possible implementation manner, determining the first model parameters according to the initial gradient information includes: determining the model parameters updated for the first time according to the initial gradient information; and updating the model parameters updated for the first time one or more times to obtain the model parameters updated for the Nth time as the first model parameters, where N is a positive integer greater than or equal to 2.
In one possible implementation, N is one update number in a preset update number set; the update times set includes a plurality of update times.
In one possible implementation, N is the number of updates of the model parameters completed most recently before the target moment; the target moment is obtained by adding a preset period to the initial moment; the initial moment is the moment at which the initial model parameters are acquired, or the moment at which the initial model parameters are sent to each training node.
In another possible implementation manner, when the number of training iterations of each training node reaches M, target model parameters are acquired; the target model parameters are the model parameters used by the target training node in its Mth training iteration; the target training node is, among the plurality of training nodes, the last training node whose number of training iterations reaches M. The target model parameters are sent to the verification node so that the verification node verifies the target model according to the target model parameters; a second verification result of the target model parameters sent by the verification node is received, where the second verification result is pass or fail; and, when the second verification result is pass, a trained target model is obtained, the model parameters in the trained target model being the target model parameters.
In a second aspect, the present application provides a model training apparatus, the apparatus comprising: an acquisition module and a processing module.
And the acquisition module is used for acquiring the first model parameters of the target model.
The processing module is used for sending the first model parameters to each training node and the verification node so that each training node trains the target model according to the first model parameters and the verification node verifies the target model according to the first model parameters; receiving a first verification result of a first model parameter sent by a verification node; the first verification result includes pass or fail; and under the condition that the first verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are first model parameters.
Optionally, the obtaining module is specifically configured to obtain initial model parameters of the target model.
Optionally, the processing module is specifically configured to send the target model and the initial model parameters to each training node, so that each training node trains the target model according to the initial model parameters; receiving initial gradient information sent by each training node; the initial gradient information is used for indicating the updating direction of the initial model parameters; and determining a first model parameter according to the initial gradient information.
Optionally, the processing module is specifically configured to determine, according to the initial gradient information, a model parameter updated for the first time as a first model parameter.
Optionally, the processing module is specifically configured to determine, according to the initial gradient information, the model parameters updated for the first time; and to update the model parameters updated for the first time one or more times to obtain the model parameters updated for the Nth time as the first model parameters, where N is a positive integer greater than or equal to 2.
Optionally, the processing module is specifically configured to obtain the target model parameters when the number of training iterations of each training node reaches M; the target model parameters are the model parameters used by the target training node in its Mth training iteration; the target training node is, among the plurality of training nodes, the last training node whose number of training iterations reaches M; send the target model parameters to the verification node so that the verification node verifies the target model according to the target model parameters; receive a second verification result of the target model parameters sent by the verification node, where the second verification result is pass or fail; and, when the second verification result is pass, obtain a trained target model, the model parameters in the trained target model being the target model parameters.
In a third aspect, the present application provides an electronic device, comprising: a processor and a memory; the memory stores instructions executable by the processor; the processor is configured to execute instructions that, when executed, cause the electronic device to implement a model training method as in the first aspect and any one of its possible implementations.
In a fourth aspect, the present application provides a computer-readable storage medium comprising: a software instruction; the software instructions, when executed in an electronic device, cause the electronic device to implement a model training method as in the first aspect and any one of its possible implementations.
The advantages of the second to fourth aspects described above may be referred to in the first aspect, and are not described here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of a model training method in a conventional parallel computing framework;
FIG. 2 is a schematic diagram of the components of a model training system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a model training system according to the present disclosure;
FIG. 4 is a schematic flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 5 is a schematic flow chart of another embodiment of a model training method according to the present disclosure;
FIG. 6 is a schematic flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart of a model training method according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the components of a model training device according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the composition of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In addition, in the description of the embodiments of the present application, "/" means or, unless otherwise indicated, for example, a/B may mean a or B. "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present application, "plurality" means two or more than two.
In order to clearly describe the technical solutions of the embodiments of the present application, in the embodiments of the present application, the terms "first", "second", and the like are used to distinguish the same item or similar items having substantially the same function and effect, and those skilled in the art will understand that the terms "first", "second", and the like are not limited in number and execution order.
Traditional computing is usually borne independently by a central processing unit (CPU). However, as data scale and computational complexity grow, a single CPU often cannot meet the demand, whereas heterogeneous computing resources offer strong parallel processing capability and low energy consumption and play an important role in processing large-scale data and complex computing tasks. Parallel computing therefore uses computing resources on different machines to increase computing speed and processing capacity. Against the background of the rapid development of large-scale data and artificial intelligence, the parallel computing framework has gradually become an important means of realizing parallel computing.
FIG. 1 is a flow chart of a model training method in a conventional parallel computing framework. As shown in FIG. 1, the model training method comprises five steps: initialization of all components, parameter distribution, training-node work (gradient computation), gradient gathering and model updating, and the master node blocking to wait for the model state. The training-node workflow may specifically comprise acquiring parameters, modifying the training-node model parameters, randomly sampling a small batch of data, computing the output by forward propagation, computing the gradient by backward propagation, and returning the gradient. The gradient gathering and model updating process may specifically comprise obtaining the gradients of all training nodes, computing the average gradient, modifying the training model gradient, optimizing and updating the parameters, and outputting the model state.
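A minimal sketch of this conventional loop is given below, written as plain Python under assumed interfaces: each node object is assumed to expose a hypothetical compute_gradient(params) method, parameters and gradients are dicts of numeric values, and the learning rate is an illustrative value. The point to note is the fixed iteration budget with no early stop.

```python
def conventional_training(params, nodes, num_iterations, learning_rate=0.01):
    for _ in range(num_iterations):                                  # preset number of training iterations
        grads = [node.compute_gradient(params) for node in nodes]    # training-node workflow
        averaged = {name: sum(g[name] for g in grads) / len(grads)   # gather and average gradients
                    for name in params}
        params = {name: value - learning_rate * averaged[name]       # optimize and update parameters
                  for name, value in params.items()}
    return params                                                    # master node unblocks only after all iterations
```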
However, this model training method is time-consuming and has low model training efficiency.
Based on this, the embodiments of the present application provide a model training method and device, an electronic device, and a storage medium. By running the training process and the verification process synchronously, and stopping training to obtain the final model once the verification result is pass, the model training efficiency in the parallel computing framework can be effectively improved.
For ease of understanding, the following detailed description is made in connection with the accompanying drawings.
Fig. 2 is a schematic diagram of the composition of a model training system according to an embodiment of the present application. As shown in fig. 2, the system may include a master node 100, a parameter server 200, a verification node 300, and heterogeneous working nodes 400; the heterogeneous working nodes 400 may specifically include a plurality of heterogeneous working nodes (two of them, heterogeneous working node 1 and heterogeneous working node 2, are illustrated in fig. 2). The master node 100 and the parameter server 200, the parameter server 200 and the verification node 300, and the parameter server 200 and the heterogeneous working nodes 400 may be connected by a wired network or a wireless network.
The master node 100 may be a computing device with computing processing capabilities, such as a server or a specialized artificial intelligence (artificial intelligence, AI) chip (e.g., one or more graphics processors (graphics processing unit, GPU) or neural network processor (neural network processing unit, NPU)).
The server may be a single server, or may be a server cluster formed by a plurality of servers. In some implementations, the server cluster may also be a distributed cluster. Optionally, the server may also be implemented in a cloud platform, which may include, for example, a private cloud, public cloud, hybrid cloud, community cloud (community cloud), distributed cloud, inter-cloud, multi-cloud (multi-cloud), and the like, or any combination thereof.
The master node 100 is used to control the entire training process while monitoring and maintaining the state of global shared variables (e.g., hyper-parameters such as number of training steps, learning rate, etc.).
In some possible embodiments, the master node 100 may also be used to initialize the parameter server 200, the authentication node 300, and the heterogeneous working node 400. For example, the bias matrix and weights in the training model are initialized.
The specific form of the parameter server 200 may be described in the above description about the server in the master node 100, and will not be described again.
The parameter server 200 is configured to send model parameters of the training model to the verification node 300 and the heterogeneous working node 400, and receive the verification result (passed or failed) sent by the verification node 300 and the gradient information sent by the heterogeneous working node 400, and the specific processing procedure may be described with reference to the model training method provided in the following method embodiment, which is not described herein again.
In some possible embodiments, the parameter server 200 may be further configured to receive the gradient information sent by the heterogeneous working nodes, aggregate the training model parameters, perform weighted integration of the training model parameters in the local CPU (for example, the parameter server may complete the weighted integration by mean weighting), and replace the training model parameters to form an updated target model; details can be found in the related art and are not repeated here.
The specific form of the verification node 300 may be described with reference to the master node 100, which is not described herein.
The validation node 300 is used to validate the training model.
Heterogeneous working node 400 may include a plurality of heterogeneous working nodes. The specific form of the heterogeneous operating node may be described with reference to the master node 100, which is not described herein.
In some possible embodiments, heterogeneous working node 1 may be a GPU working node, which may be deployed on a GPU server to accelerate the computation flow with the GPU. The GPU working node acquires a portion of the training data, calculates the corresponding gradient information using the latest training model parameters provided by the parameter server, and finally returns the gradient information to the parameter server. In addition, the GPU working node can process multiple batches of data simultaneously to realize gradient aggregation calculation.
In some possible embodiments, heterogeneous working node 2 may be an NPU working node, which may be deployed on an NPU server to accelerate the computation flow with the NPU. The NPU working node acquires a portion of the training data, calculates the corresponding gradient information using the latest training model parameters provided by the parameter server, and finally returns the gradient information to the parameter server. In addition, the NPU working node can process multiple batches of data simultaneously to realize gradient aggregation calculation.
The heterogeneous working node 400 is configured to receive the training model parameters sent by the parameter server 200, complete a training task, update gradient information, and return updated gradient information to the parameter server 200.
In some possible embodiments, when the heterogeneous working node 400 performs a model training task, multiple accelerator cards (GPU or NPU) and the local central processing unit (CPU) may be combined within the heterogeneous working node to accelerate completion of the model training task and improve computing efficiency. For example, the heterogeneous working node 400 may transfer the received training model parameters to the accelerator card so that the training task is performed on the accelerator card and the gradient information is updated there; the updated gradient information is then transferred to the local CPU, and the CPU exchanges it with the parameter server 200 over the network to deliver the gradients with which the training model needs to be updated.
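A minimal sketch of this working-node flow, using PyTorch-style device placement purely as an assumed implementation (the model, data, and device name are illustrative assumptions): the received parameters live in the model, the training step runs on the accelerator card, and the resulting gradients are copied back to the local CPU before being returned to the parameter server.

```python
import torch

def working_node_step(model, batch_inputs, batch_targets, device="cuda"):
    model = model.to(device)                                   # hand the model to the accelerator card
    inputs = batch_inputs.to(device)
    targets = batch_targets.to(device)
    model.zero_grad()                                          # clear any stale gradients
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()                                            # gradient computed on the accelerator
    return {name: p.grad.detach().cpu()                        # updated gradients back to the local CPU
            for name, p in model.named_parameters()}
```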
In some possible embodiments, the model training system provided in fig. 2 in the embodiments of the present application may form the application layer of a parallel computing framework. In this case, fig. 3 is another schematic diagram of the composition of the model training system provided in the embodiment of the present application. As shown in fig. 3, the model training system may further include a resource scheduling layer, which may specifically include the intelligent computing power cluster 500, and a data layer, which may specifically include the data nodes 600; the data nodes 600 may specifically include a plurality of data nodes (two of them, data node 1 and data node 2, are illustrated in fig. 3). The master node 100 and the intelligent computing power cluster 500, as well as the heterogeneous working nodes 400 and the data nodes 600, may each be connected by a wired network or a wireless network.
The intelligent computing power cluster 500 may be a cluster composed of a plurality of computing nodes. The specific form of these computing nodes may be described with reference to the master node 100 in fig. 2, and will not be repeated here.
The intelligent computing power cluster 500 is used to provide computing power.
The details of the data node 600 may be described with reference to the master node 100 in fig. 2, and will not be repeated here.
The data node 600 is used to calculate and sort data.
The execution subject of the model training method provided in the embodiments of the present application may be the parameter server 200 described above. Alternatively, the execution subject may be a processor (for example, a CPU) of the parameter server 200, an application (APP) with computing functions installed in the parameter server 200, a software system or platform deployed in the parameter server 200, or a functional module with a model training function in the parameter server 200.
For simplicity of description, the following description will take the parameter server 200 as an execution body for example.
The model training method provided in the embodiment of the application is described below with reference to the accompanying drawings.
Fig. 4 is a flow chart of a model training method according to an embodiment of the present application. As shown in fig. 4, the model training method includes S101 to S104 applied to the parameter server 200.
S101, a parameter server acquires first model parameters of a target model.
The target model is a model which can be verified after being trained in a preset period or after being trained for a preset number of times; the first model parameters are model parameters (e.g., parameters such as weighting coefficients, bias terms, etc.) of the target model.
The specific process of S101 may be described with reference to S1011 to S1014 in fig. 5, which will not be described here.
S102, the parameter server sends first model parameters to each training node and each verification node, so that each training node trains the target model according to the first model parameters, and the verification node verifies the target model according to the first model parameters.
The specific process of each training node for training the target model according to the first model parameter may be described with reference to S1012 in fig. 5, which is not described herein.
The verification of the target model by the verification node according to the first model parameter may be specifically implemented as verification of the accuracy of the target model, and the specific process may be described with reference to S202 in fig. 7 below, which is not described herein.
S103, the parameter server receives a first verification result of the first model parameter sent by the verification node.
Wherein the first verification result includes pass or fail.
And S104, obtaining the trained target model by the parameter server under the condition that the first verification result is passed.
The model parameters in the target model after training are first model parameters.
According to the model training method, the first model parameters are sent to each training node and to the verification node, so that each training node trains the target model according to the first model parameters and the verification node verifies the target model according to the first model parameters. Model training proceeds in the training process, the model to be verified is checked periodically in the verification process, and once a verification result of pass is returned both processes are terminated and the final model is obtained. Compared with the model training method in the existing parallel computing framework, in which the training nodes must complete a preset number of training iterations before the parameter server obtains the final model, the training process here can be terminated as soon as the verification result is pass, without completing the preset number of iterations. This effectively saves model training time, improves model training efficiency, and thereby improves the computing efficiency of the parallel computing framework.
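A minimal sketch of S101 to S104 from the parameter server's point of view is given below, under assumed interfaces: collect_gradients, poll_verification, and apply_gradients are hypothetical callables standing in for the real node communication, and verification is polled on every update here only for simplicity (the patent triggers it periodically or at preset update counts).

```python
def train_until_verified(initial_params, collect_gradients, poll_verification,
                         apply_gradients, max_updates=10_000):
    """collect_gradients(params): send params to the training nodes, return their gradients;
    poll_verification(params): send params to the verification node, return 'pass', 'fail' or None."""
    params = initial_params                        # S101: obtain the first model parameters
    for _ in range(max_updates):
        grads = collect_gradients(params)          # S102: training side
        if poll_verification(params) == "pass":    # S102/S103/S104: verification side
            return params                          # early stop: no fixed iteration budget needed
        params = apply_gradients(params, grads)
    return params
```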
S101 is described below.
Fig. 5 is another flow chart of a model training method according to an embodiment of the present application. As shown in fig. 5, S101 may specifically include S1011 to S1014.
S1011, the parameter server acquires initial model parameters of the target model.
Alternatively, as described above, the master node may also be used to initialize the parameter server. During initialization (for example, hardware detection, virtual console setting, and disk array creation), the parameter server presets an initial model (an algorithm or calculation formula) as the target model according to the task requirements and acquires the initial model parameters of the target model (for example, weighting coefficients and bias terms).
S1012, the parameter server sends the target model and the initial model parameters to each training node, so that each training node trains the target model according to the initial model parameters.
Illustratively, the parameter server sends the initial model parameters of the target model to the heterogeneous working nodes, which receive the initial model parameters and overwrite their local model parameters. Each heterogeneous working node then randomly selects a small batch of data and, according to the initial model parameters, computes the output by forward propagation and the gradient by backward propagation, thereby training the target model.
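A minimal sketch of S1012 on a single training node, with a pure-numpy linear model standing in for the target model (an illustrative assumption): the received parameters overwrite the local ones, a small batch is sampled at random, the output is computed by forward propagation and the gradient by backward propagation on a squared loss.

```python
import numpy as np

def training_node_step(received_params, features, labels, batch_size=32, rng=None):
    rng = rng or np.random.default_rng()
    w, b = received_params["w"], received_params["b"]                 # overwrite local model parameters
    idx = rng.choice(len(features), size=batch_size, replace=False)  # random small batch
    x, y = features[idx], labels[idx]
    pred = x @ w + b                                                  # forward propagation: compute output
    err = pred - y
    grad_w = 2.0 * x.T @ err / batch_size                             # backward propagation: compute gradient
    grad_b = 2.0 * float(err.mean())
    return {"w": grad_w, "b": grad_b}                                 # gradient returned to the parameter server
```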
S1013, the parameter server receives the initial gradient information sent by each training node.
The initial gradient information is used for indicating the updating direction of the initial model parameters. The initial gradient information may be obtained by the training node during the training process, and the specific obtaining process may be described with reference to the related art, which is not described in detail.
S1014, the parameter server determines a first model parameter according to the initial gradient information.
In one possible implementation, the parameter server may determine, as the first model parameter, a model parameter that is updated for the first time based on the initial gradient information.
For example, the parameter server may determine the first model parameters using an optimization algorithm. In the gradient descent method, for instance, the gradient of the loss function with respect to the model parameters (i.e., the rate of change of the loss function with respect to the model parameters) is first computed, and the model parameters are then updated according to the gradient information so that the value of the loss function keeps decreasing. Other optimization algorithms, such as gradient ascent, Newton's method, the conjugate gradient method, and quasi-Newton methods, may also be selected and adjusted according to actual requirements; the embodiments of the present application do not limit this.
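A minimal sketch of one gradient-descent update as described above; the parameter and gradient dictionaries and the learning rate are illustrative assumptions.

```python
def gradient_descent_step(params, gradients, learning_rate=0.01):
    """Move every parameter against its gradient so the value of the loss function keeps decreasing."""
    return {name: value - learning_rate * gradients[name] for name, value in params.items()}
```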
In another possible implementation, the parameter server may use the model parameters obtained after a preset number of updates (more than one) as the first model parameters. In this case, fig. 6 is a schematic flow chart of a model training method according to an embodiment of the present application. As shown in fig. 6, S1014 may specifically include S10141 to S10142.
S10141, the parameter server determines the model parameters updated for the first time according to the initial gradient information.
Alternatively, as described above, the parameter server may be further configured to receive the gradient information from each heterogeneous working node, aggregate the training model parameters, and perform weighted integration of the training model parameters in the local CPU (for example, by mean weighting). The parameter server may thus receive the initial gradient information from each heterogeneous working node, aggregate the training model parameters, perform the weighted integration in the local CPU, and thereby determine the model parameters updated for the first time.
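A minimal sketch of the weighted integration performed in the local CPU: the gradients reported by the heterogeneous working nodes are combined with per-node weights (uniform weights reproduce the mean-weighting example above), and a single update yields the first-time-updated model parameters. The names, weights, and learning rate are illustrative assumptions.

```python
def weighted_integration(params, node_gradients, weights=None, learning_rate=0.01):
    """node_gradients: list of {param_name: gradient} dicts, one per working node."""
    if weights is None:
        weights = [1.0 / len(node_gradients)] * len(node_gradients)   # mean weighting
    combined = {name: sum(w * g[name] for w, g in zip(weights, node_gradients))
                for name in params}
    return {name: value - learning_rate * combined[name] for name, value in params.items()}
```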
S10142, the parameter server updates the model parameters updated for the first time one or more times, and obtains the model parameters updated for the Nth time as the first model parameters.
Wherein N is a positive integer greater than or equal to 2.
In one possible implementation, N is one update number in a preset update number set; the update times set includes a plurality of update times.
Alternatively, the update times set may be an arithmetic sequence (e.g., {5, 10, 15, 20, …, 100}).
The way the parameter server 200 updates the model parameters may be as described in S10141 in fig. 6, and is not repeated here. Taking the update times set {5, 10, 15, 20, …, 100} as an example, the parameter server may first send the model parameters after the 5th update to the verification node as the first model parameters for verification, then the model parameters after the 10th update, then the model parameters after the 15th update, and so on, until the number of updates of the model parameters reaches the maximum value in the update times set or the verification result of the model parameters updated at some point is pass.
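A minimal sketch of count-triggered verification, assuming the update times set {5, 10, 15, ..., 100} from the example above; the set itself is configurable.

```python
def should_verify(update_count, verification_points=frozenset(range(5, 101, 5))):
    """True when the current number of parameter updates is one of the preset verification points."""
    return update_count in verification_points

# e.g. after the 15th update: should_verify(15) returns True, so the parameters updated for
# the 15th time would be sent to the verification node as the first model parameters.
```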
In another possible implementation, N is the number of times the model parameters were updated last time from the target moment; the target time is obtained by adding a preset period to the initial time; the initial time is the time of acquiring the initial model parameters, or the time of sending the initial parameter model to each training node.
Illustratively, taking a preset period of 10 minutes as an example and assuming that a heterogeneous working node can complete 3 training iterations in 10 minutes, the parameter server 200 may take the number of iterations that can be completed within 10 minutes (3) as N. That is, the parameter server may first send the model parameters after 10 minutes of training (3 updates) to the verification node as the first model parameters for verification, then the parameters after 20 minutes (6 updates), then the parameters after 30 minutes (9 updates), and so on, until the number of updates of the model parameters reaches the preset number of training iterations or the verification result of the model parameters updated at some point is pass.
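A minimal sketch of period-triggered verification, assuming the 10-minute period from the example above; the clock handling and the class shape are illustrative assumptions.

```python
import time

class PeriodicVerifier:
    def __init__(self, period_seconds=600):          # 10-minute preset period
        self.period = period_seconds
        self.next_deadline = time.monotonic() + period_seconds

    def due(self):
        """True once per elapsed period; the caller then sends the latest parameters
        (the ones updated for the Nth time) to the verification node."""
        if time.monotonic() >= self.next_deadline:
            self.next_deadline += self.period
            return True
        return False
```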
In some possible embodiments, when the number of training iterations reaches a preset number, the parameter server may obtain the trained model parameters and have them verified. In this case, fig. 7 is a schematic flow chart of a model training method according to an embodiment of the present application. As shown in fig. 7, the method may further include S201 to S204.
S201, under the condition that training times of each training node reach M times, the parameter server acquires target model parameters.
The target model parameters are the model parameters used by the target training node in its Mth training iteration. The target training node is, among the plurality of training nodes, the last training node whose number of training iterations reaches M. M may be preset by the administrator in the training nodes or the parameter server as a training-count threshold; for example, M may be 1000, 5000, or 10000. The specific value of M is not limited in the embodiments of the present application.
The specific process of the parameter server obtaining the target model parameters may be described with reference to S1014 in fig. 5, which is not described herein.
S202, the parameter server sends the target model parameters to the verification node so that the verification node verifies the target model according to the target model parameters.
Illustratively, taking verification of the accuracy of the target model as an example, the process by which the verification node 300 verifies the target model may specifically include steps 1 to 6 below; a code sketch of the accuracy-stack check follows the steps.
Step 1, before receiving the target model parameters, the verification node may initialize an accuracy stack (accStack).
And step 2, after receiving the target model parameters, the verification node replaces the existing training model parameters to obtain the target model.
And 3, the verification node uses a preset verification data set to verify the target model, outputs the accuracy rate, and stores the accuracy rate into an accuracy rate stack.
And step 4, the verification node judges whether the accuracy stack storing the accuracy meets the preset condition.
Alternatively, the preset condition may be that the value at the top of the accuracy stack has not increased for several consecutive checks, or that the values at the top of the accuracy stack remain greater than a preset accuracy threshold. This is not limited in the embodiments of the present application, and the preset condition may be adjusted according to actual requirements.
And 5, under the condition that the accuracy stack meets the preset condition, the verification node stores the target model, determines that the verification result of the target model is passed, and interacts with the master node to enable the master node to finish the training task.
And step 6, under the condition that the accuracy stack does not meet the preset condition, the verification node determines that the verification result of the target model is not passed, and interacts with the master node to enable the master node to restart the training task.
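A minimal sketch of the accuracy-stack check in steps 1 to 6, assuming the preset condition "the top of the stack has not increased for a number of consecutive checks, or the recent values all exceed a threshold"; the concrete patience and threshold values are illustrative assumptions.

```python
class AccuracyStack:
    def __init__(self, patience=3, threshold=0.95):
        self.stack = []                    # step 1: initialise accStack
        self.patience = patience
        self.threshold = threshold

    def push(self, accuracy):
        self.stack.append(accuracy)        # step 3: store the verified accuracy on the stack

    def verification_passed(self):
        """Step 4: evaluate the preset condition on the recent stack values."""
        if len(self.stack) <= self.patience:
            return False
        recent = self.stack[-(self.patience + 1):]
        converged = all(later <= earlier for earlier, later in zip(recent, recent[1:]))
        above_threshold = all(acc > self.threshold for acc in recent[1:])
        return converged or above_threshold   # steps 5/6: pass ends training, fail keeps training
```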
S203, the parameter server receives a second verification result of the target model parameter sent by the verification node.
Wherein the second verification result includes pass or fail.
And S204, obtaining the trained target model by the parameter server under the condition that the second verification result is passed.
The model parameters in the trained target model are target model parameters.
Optionally, if the second verification result is passed, the parameter server may substitute the target model parameter into the target model to obtain a trained target model; alternatively, the parameter server may obtain the trained object model from the validation node.
Optionally, if the second verification result is not passed, the master node restarts the model training task to continue training the target model.
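A minimal sketch of the M-times fallback in S201 to S204: once every training node has completed M iterations, the parameters used by the last node to reach M are verified and, on a pass result, adopted as the trained target model. The value of M, the data shapes, and the verify callable are illustrative assumptions.

```python
def finish_after_m_rounds(node_update_counts, node_params, verify, m=1000):
    """node_update_counts / node_params: per-node iteration counters and latest parameters;
    verify(params): returns True for a pass result, False otherwise."""
    if not all(count >= m for count in node_update_counts.values()):
        return None                                                     # keep training
    target_node = min(node_update_counts, key=node_update_counts.get)   # last node to reach M
    target_params = node_params[target_node]
    return target_params if verify(target_params) else None             # None: master node restarts training
```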
The solution provided in the embodiments of the present application has been described above mainly from the perspective of the method. To achieve the above functions, it includes corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is implemented as hardware or as computer-software-driven hardware depends on the particular application and the design constraints of the solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
In an exemplary embodiment, the embodiment of the application also provides a model training device. Fig. 8 is a schematic diagram of the composition of the model training device according to the embodiment of the present application. As shown in fig. 8, the model training apparatus includes: an acquisition module 801 and a processing module 802.
An obtaining module 801, configured to obtain a first model parameter of a target model;
a processing module 802, configured to send first model parameters to each training node and the verification node, so that each training node trains the target model according to the first model parameters, and the verification node verifies the target model according to the first model parameters; receiving a first verification result of a first model parameter sent by a verification node; the first verification result includes pass or fail; and under the condition that the first verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are first model parameters.
In some possible embodiments, the obtaining module 801 is specifically configured to obtain initial model parameters of the target model.
In some possible embodiments, the processing module 802 is specifically configured to send the target model and the initial model parameters to each training node, so that each training node trains the target model according to the initial model parameters; receiving initial gradient information sent by each training node; the initial gradient information is used for indicating the updating direction of the initial model parameters; and determining a first model parameter according to the initial gradient information.
In some possible embodiments, the processing module 802 is specifically configured to determine, as the first model parameter, a model parameter that is updated for the first time according to the initial gradient information.
In some possible embodiments, the processing module 802 is specifically configured to determine, according to the initial gradient information, a model parameter that is updated for the first time; and updating the model parameters updated for the first time for one or more times to obtain model parameters updated for the nth time as first model parameters, wherein N is a positive integer greater than or equal to 2.
In some possible embodiments, the processing module 802 is specifically configured to obtain the target model parameter when the training number of each training node reaches M times; the target model parameters are model parameters adopted by the target training node in the M-th training; the target training node is a training node with the last training time reaching M times in a plurality of training nodes; sending the target model parameters to the verification node so that the verification node verifies the target model according to the target model parameters; receiving a second verification result of the target model parameters sent by the verification node; the second verification result includes pass or fail; and under the condition that the second verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are target model parameters.
In an exemplary embodiment, an electronic device is also provided in an embodiment of the present application. Fig. 9 is a schematic diagram of the composition of an electronic device according to an embodiment of the present application. As shown in fig. 9, the electronic device may include: processor 10, memory 20, communication line 30, and communication interface 40, and input-output interface 50.
The processor 10, the memory 20, the communication interface 40, and the input/output interface 50 may be connected by a communication line 30.
The processor 10 is configured to execute the instructions stored in the memory 20 to implement the model training method provided in the above embodiments of the present application. The processor 10 may be a CPU, a general-purpose processor, a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller (MCU)/single-chip microcomputer, a programmable logic device (PLD), or any combination thereof. The processor 10 may also be any other apparatus having a processing function, such as a circuit, a device, or a software module, which is not limited in this embodiment. In one example, the processor 10 may include one or more CPUs, such as CPU0 and CPU1 in fig. 9. As an alternative implementation, the electronic device may include multiple processors; for example, it may include a processor 60 (illustrated in phantom in fig. 9) in addition to the processor 10.
The memory 20 is used for storing instructions; for example, the instructions may be a computer program. Alternatively, the memory 20 may be a read-only memory (ROM) or another type of static storage device that can store static information and/or instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and/or instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices, which is not limited in this application.
It should be noted that, the memory 20 may exist separately from the processor 10 or may be integrated with the processor 10. The memory 20 may be located inside the electronic device or outside the electronic device, which is not limited in this embodiment of the present application.
A communication line 30 for communicating information between the components comprised by the electronic device.
A communication interface 40 for communicating with other devices (e.g., the parameter server 200 of fig. 2 described above) or other communication networks. The other communication network may be an ethernet, a radio access network (radio access network, RAN), a wireless local area network (wireless local area networks, WLAN), etc. The communication interface 40 may be a module, a circuit, a transceiver, or any device capable of enabling communication.
And an input-output interface 50 for implementing man-machine interaction between the user and the electronic device. Such as enabling action interactions or information interactions between a user and an electronic device.
The input/output interface 50 may be a mouse, a keyboard, a display screen, or a touch display screen, for example. The action interaction or information interaction between the user and the electronic equipment can be realized through a mouse, a keyboard, a display screen, a touch display screen or the like.
It should be noted that the structure shown in fig. 9 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than those shown in fig. 9 (for example, only the processor 10 and the memory 20), or a combination of some components, or a different arrangement of components.
In an exemplary embodiment, the present application also provides a computer program product, which when run on a computer causes the computer to perform the above-mentioned related method steps to implement the model training method in the above-mentioned embodiments.
In an exemplary embodiment, the present application also provides a computer-readable storage medium having program instructions stored thereon; the program instructions, when executed by an electronic device, cause the electronic device to implement the method described in the previous embodiments. The computer-readable storage medium may be a non-transitory computer-readable storage medium, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, or an optical data storage device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer-executable instructions. When the computer-executable instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are fully or partially produced. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer-executable instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Although the present application has been described herein in connection with various embodiments, other variations of the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed application, from a study of the figures, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Although the present application has been described in connection with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined in the appended claims and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the present application. It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A model training method, characterized in that the method is applied to a parameter server; the parameter server is connected with a plurality of training nodes and verification nodes; the method comprises the following steps:
acquiring a first model parameter of a target model;
transmitting the first model parameters to each training node and the verification node, so that each training node trains the target model according to the first model parameters, and the verification node verifies the target model according to the first model parameters;
receiving a first verification result of the first model parameter sent by the verification node; the first verification result comprises pass or fail;
and under the condition that the first verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are the first model parameters.
2. The method of claim 1, wherein the obtaining the first model parameters of the target model comprises:
acquiring initial model parameters of the target model;
transmitting the target model and the initial model parameters to each training node so that each training node trains the target model according to the initial model parameters;
receiving initial gradient information sent by each training node; the initial gradient information is used for indicating the updating direction of the initial model parameters;
and determining the first model parameters according to the initial gradient information.
3. The method of claim 2, wherein determining the first model parameters from the initial gradient information comprises:
and determining the model parameters updated for the first time as the first model parameters according to the initial gradient information.
4. The method of claim 2, wherein determining the first model parameters from the initial gradient information comprises:
determining model parameters updated for the first time according to the initial gradient information;
and updating the model parameters updated for the first time one or more times to obtain the model parameters updated for the Nth time as the first model parameters, wherein N is a positive integer greater than or equal to 2.
5. A method according to claim 4, wherein N is one of a set of preset update times; the set of update times includes a plurality of update times.
6. A method according to claim 4, wherein N is the number of updates of the model parameters completed most recently before the target moment; the target moment is obtained by adding a preset period to the initial moment; the initial moment is the moment at which the initial model parameters are acquired, or the moment at which the initial model parameters are sent to each training node.
7. The method according to any one of claims 1-6, further comprising:
acquiring target model parameters when the number of training iterations of each training node reaches M; the target model parameters are the model parameters used by the target training node in its Mth training iteration; the target training node is, among the plurality of training nodes, the last training node whose number of training iterations reaches M;
transmitting the target model parameters to the verification node so that the verification node verifies the target model according to the target model parameters;
receiving a second verification result of the target model parameters sent by the verification node; the second verification result comprises pass or fail;
and under the condition that the second verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are the target model parameters.
8. A model training apparatus, the apparatus comprising:
the device comprises an acquisition module and a processing module;
the acquisition module is used for acquiring first model parameters of the target model;
the processing module is used for sending the first model parameters to each training node and the verification node, so that each training node trains the target model according to the first model parameters, and the verification node verifies the target model according to the first model parameters; receiving a first verification result of the first model parameter sent by the verification node; the first verification result comprises pass or fail; and under the condition that the first verification result is passed, obtaining a trained target model, wherein model parameters in the trained target model are the first model parameters.
9. An electronic device, the electronic device comprising: a processor and a memory;
the memory stores instructions executable by the processor;
the processor is configured to, when executing the instructions, cause the electronic device to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium, the computer-readable storage medium comprising: computer software instructions;
the computer software instructions, when run on an electronic device, cause the electronic device to implement the method of any one of claims 1 to 7.
CN202311491387.0A (filed 2023-11-09, priority 2023-11-09): Model training method and device, electronic equipment and storage medium. Status: Pending. Publication: CN117556921A.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311491387.0A CN117556921A (en) 2023-11-09 2023-11-09 Model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311491387.0A CN117556921A (en) 2023-11-09 2023-11-09 Model training method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117556921A 2024-02-13

Family

ID=89812059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311491387.0A Pending CN117556921A (en) 2023-11-09 2023-11-09 Model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117556921A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination