WO2020015734A1 - Method and apparatus for updating parameters - Google Patents

Method and apparatus for updating parameters

Info

Publication number
WO2020015734A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
node
batchsize
samples
training samples
Prior art date
Application number
PCT/CN2019/096774
Other languages
English (en)
French (fr)
Inventor
杨威
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杭州海康威视数字技术股份有限公司
Publication of WO2020015734A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/60: Software deployment
    • G06F 8/65: Updates
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present invention relates to the field of computer technology, and in particular, to a method and device for updating parameters.
  • Deep learning has become a popular research direction due to its powerful representation and fitting capabilities. Deep learning models need to be trained before use.
  • the training can use gradient descent.
  • the training process can be as follows: input sample data into the model to be trained to obtain training values, and calculate the loss function loss according to the training values and the true values of the sample data. Calculate the gradient data of each parameter in the model to be trained according to the loss, calculate the update parameters corresponding to each parameter according to the gradient data, and update the parameters in the model to be trained. Repeat the above process until the loss is less than the preset loss threshold. End training.
  • the model to be trained may be stored on each training node participating in the training, and the training node may be a terminal or a server.
  • Multiple training nodes use the same number (which can be called the BatchSize) of different sample data to separately calculate the gradient data of each parameter; the average of the gradient data of each parameter obtained by the multiple training nodes is then calculated; the update parameters are determined according to the average of the gradient data; and each training node then updates its parameters simultaneously according to the update parameters. The above processing is then repeated to continue the training.
  • This technique can be called a deep learning synchronous data parallel training technique.
  • Because the training nodes have different configurations, their computing capabilities differ. Therefore, when the above training nodes simultaneously calculate gradient data from sample data of the same BatchSize, a training node with strong computing power completes the calculation first and then has to wait for the training nodes with weak computing power to finish before subsequent processing can be performed; this wastes the computing resources of the training nodes and reduces training efficiency.
  • a method for updating parameters includes:
  • the performance parameters include at least one parameter of a central processing unit CPU model, a number of CPUs, a graphics processor GPU model, and a time spent processing a preset number of training samples.
  • Optionally, the determining, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset duration as the BatchSize corresponding to each training node includes: according to a pre-stored correspondence between performance parameters and the number of training samples processed per unit duration, and the performance parameters of each training node, determining the corresponding number of training samples processed per unit duration; and, according to that number, determining the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node.
  • Optionally, the determining, according to the number of training samples processed per unit duration, the number of training samples that each training node can process within the preset duration as the BatchSize corresponding to each training node includes: determining the product of the number of training samples processed per unit duration and a preset value as that number, as the BatchSize corresponding to each training node; or determining the number of training samples processed per unit duration itself as that number, as the BatchSize corresponding to each training node.
  • Optionally, after the determined BatchSize is sent to the corresponding training nodes respectively, the method further includes: receiving the gradient data sent by each training node; calculating the average value of the received gradient data; determining the update parameters of the model to be trained according to the average value; and sending the update parameters to each training node.
  • a method for updating parameters includes:
  • receiving the BatchSize corresponding to the present training node sent by a central node; obtaining a corresponding number of training samples according to the BatchSize; and performing training processing on the model to be trained according to the obtained training samples.
  • Optionally, the training samples include sample input data and output reference data, and the performing training processing on the model to be trained according to the obtained training samples includes: inputting the sample input data of the obtained training samples into the model to be trained to obtain the output data corresponding to the training samples; determining the gradient data corresponding to each parameter to be trained according to the output reference data in the training samples and the output data; sending the gradient data to the central node; and receiving the update parameters sent by the central node and updating each parameter to be trained according to the update parameters.
  • an apparatus for updating parameters includes:
  • An acquisition module for acquiring performance parameters of each training node
  • a determining module configured to determine, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset length of time, as the BatchSize corresponding to each training node;
  • a sending module is configured to send the determined BatchSize to the corresponding training nodes respectively.
  • the performance parameters include at least one parameter of a central processing unit CPU model, a number of CPUs, a graphics processor GPU model, and a time spent processing a preset number of training samples.
  • the determining module is configured to: according to a pre-stored correspondence between performance parameters and the number of training samples processed per unit duration, and the performance parameters of each training node, determine the corresponding number of training samples processed per unit duration; and, according to that number, determine the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to each training node.
  • Optionally, the determining module is configured to: determine the product of the number of training samples processed per unit duration and a preset value as the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node; or determine the number of training samples processed per unit duration itself as that number, as the BatchSize corresponding to each training node.
  • the apparatus further includes:
  • a receiving module, configured to receive the gradient data sent by each training node after the determined BatchSize is sent to the corresponding training nodes respectively;
  • a calculation module for calculating an average value of the received gradient data
  • the determining module is further configured to determine an update parameter of a model to be trained according to the average value
  • the sending module is further configured to send the update parameter to each training node.
  • an apparatus for updating parameters includes:
  • a receiving module for receiving the BatchSize corresponding to the training node sent by the central node
  • An acquisition module configured to acquire a corresponding number of training samples according to the BatchSize
  • a training module is configured to perform training processing on a model to be trained according to the obtained training samples.
  • the training samples include sample input data and output reference data
  • the training module is configured to: input the sample input data of the obtained training samples into the model to be trained to obtain the output data corresponding to the training samples; determine the gradient data corresponding to each parameter to be trained according to the output reference data in the training samples and the output data; send the gradient data to the central node; and receive the update parameters sent by the central node and update each parameter to be trained according to the update parameters.
  • a system for updating parameters includes a central node and a training node, where:
  • the central node for performing the method according to the first aspect
  • the training node is configured to execute the method described in the second aspect.
  • In another aspect, a computer device is provided, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the bus; the memory is used to store a computer program; and the processor is configured to execute the program stored in the memory to implement the method steps described in the first aspect or the second aspect.
  • In another aspect, a computer-readable storage medium is provided, storing at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method of updating parameters described in the first aspect or the second aspect.
  • In the embodiments of the present invention, the central node determines the BatchSize of each training node according to the performance parameters of that training node, and each training node obtains a corresponding number of training samples according to its BatchSize and performs training processing on the model to be trained. The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • FIG. 1 is a schematic flowchart of a device interaction for updating parameters according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a device interaction for updating parameters according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a method for updating parameters according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for updating parameters according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a method for updating parameters according to an embodiment of the present invention.
  • FIG. 6 is a flowchart of a method for updating parameters according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an apparatus for updating parameters according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of an apparatus for updating parameters according to an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an apparatus for updating parameters according to an embodiment of the present invention.
  • FIG. 10 is a schematic structural diagram of a central node according to an embodiment of the present invention.
  • FIG. 11 is a schematic structural diagram of a training node according to an embodiment of the present invention.
  • An embodiment of the present invention provides a method for updating parameters.
  • the method may be implemented by a central node and a training node together.
  • the central node may be a terminal or a server; the training node may be a terminal or a server.
  • In one case, the training nodes and the central node are all servers and together form a server group; in another case, the training nodes and the central node are all terminals.
  • The central node and the training nodes can be separate physical devices. As shown in Figure 1, in that case the central node is only used for data interaction with the training nodes and does not itself participate in the process of training the model to be trained using the training samples. The central node and a training node can also be virtual modules built in the same physical device; as shown in Figure 2, the physical device that integrates the central node and a training node both performs data interaction with the training nodes in other physical devices and participates in the process of training the model to be trained using the training samples. The present invention does not limit this.
  • the central node may include components such as a processor, a memory, and a transceiver.
  • The processor may be a CPU (Central Processing Unit) or the like, and can be used to obtain the performance parameters of each training node, determine the BatchSize of each training node according to the performance parameters, and calculate the average value of the received gradient data.
  • The memory may be a RAM (Random Access Memory), a Flash memory, or the like, and can be used to store received data, data required for processing, and data generated during processing, such as the performance parameters of each training node, the BatchSize of each training node, and the gradient data.
  • The transceiver can be used for data transmission with the terminals or other servers, and may include an antenna, a matching circuit, a modem, and the like.
  • the central node may further include a screen, an image detection component, an audio output component, an audio input component, and the like.
  • the screen can be used to display training results.
  • the image detection means may be a camera or the like.
  • Audio output components can be speakers, headphones, etc.
  • the audio input means may be a microphone or the like.
  • the training node may include components such as a processor, a memory, and a transceiver.
  • The processor may be a CPU (Central Processing Unit) or the like, and can be used to obtain training samples according to the BatchSize, obtain the output data corresponding to the training samples, calculate the gradient data, and update the parameters to be trained according to the update parameters.
  • The memory may be a RAM (Random Access Memory), a Flash memory, or the like, and can be used to store received data, data required for processing, and data generated during processing, such as the BatchSize, the training samples, the output data corresponding to the training samples, the gradient data corresponding to the parameters to be trained, and the update parameters.
  • the transceiver can be used for data transmission with the terminal or other servers, for example, sending the gradient data corresponding to the parameters to be trained to the central node, and receiving the corresponding BatchSize sent by the central node.
  • the transceiver can include an antenna, a matching circuit, and a modem.
  • The training node may further include a screen, an image detection component, an audio output component, an audio input component, and the like. The screen can be used to display the training results and so on.
  • The transceiver can be used for data transmission with other devices and may include an antenna, a matching circuit, and a modem.
  • An embodiment of the present invention provides a method for updating parameters.
  • the method is applied to a central node.
  • a processing flow of the method may include the following steps:
  • step 301 performance parameters of each training node are obtained.
  • step 302 according to the performance parameters of each training node, the number of training samples that can be processed by each training node within a preset duration is determined as the BatchSize corresponding to each training node.
  • step 303 the determined BatchSize is sent to the corresponding training nodes respectively.
  • In this way, the central node determines the BatchSize of each training node according to the performance parameters of that training node, so that each training node obtains a corresponding number of training samples according to its BatchSize and performs training processing on the model to be trained. The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
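  • As an illustration only (not part of the claimed method), the following Python sketch shows one way steps 301 to 303 could be orchestrated at the central node. The transport function and the numeric performance score are placeholders introduced here; the concrete BatchSize rule based on the lookup table and the preset duration is sketched separately after the detailed example later in the description.

```python
# Minimal sketch of steps 301-303 at the central node (illustrative assumptions only).

def send_to_node(node_id, payload):
    """Placeholder transport: in a real system this would be a network send."""
    print(f"-> {node_id}: {payload}")

def assign_batchsizes(node_performance, samples_per_score=10):
    """node_performance maps node id -> a numeric performance score (a stand-in
    for the performance parameters collected in step 301)."""
    batchsizes = {}
    for node_id, score in node_performance.items():               # step 301: collected performance parameters
        batchsizes[node_id] = samples_per_score * score           # step 302: stronger node -> larger BatchSize
        send_to_node(node_id, {"BatchSize": batchsizes[node_id]})  # step 303: send the BatchSize to that node
    return batchsizes

assign_batchsizes({"node-A": 3, "node-B": 1})   # node-A is assigned 30 samples per pass, node-B only 10
```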
  • An embodiment of the present invention provides a method for updating parameters.
  • the method is applied to a training node.
  • a processing flow of the method may include the following steps:
  • step 401 a BatchSize corresponding to the training node sent by the central node is received.
  • step 402 a corresponding number of training samples are obtained according to the BatchSize.
  • step 403 a training process is performed on the model to be trained according to the obtained training samples.
  • the training node obtains a BatchSize determined based on the computing performance of the training node, obtains a corresponding number of training samples according to the BatchSize, and performs training processing on the model to be trained.
  • The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
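  • The training-node side of steps 401 to 403 can be sketched as follows; the three helper functions are placeholders introduced for illustration (the real transport, sample acquisition, and gradient computation are described in the detailed embodiments below), not functions defined by the patent.

```python
# Sketch of one iteration of steps 401-403 on a training node (placeholders throughout).

def receive_batchsize():
    """Placeholder for step 401: in practice the BatchSize arrives from the central node."""
    return 4

def fetch_samples(batch_size):
    """Placeholder for step 402: return `batch_size` (sample input, output reference) pairs."""
    return [([float(i)], [1.0]) for i in range(batch_size)]

def train_step(samples):
    """Placeholder for step 403: process the samples and return gradient data."""
    return {"w0": sum(x[0] for x, _ in samples) / len(samples)}

def training_node_iteration():
    batch_size = receive_batchsize()        # step 401
    samples = fetch_samples(batch_size)     # step 402
    return train_step(samples)              # step 403 (detailed as steps 601-603 later)

print(training_node_iteration())
```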
  • An embodiment of the present invention provides a method for updating parameters.
  • the method is applied to a central node and a training node.
  • a processing flow of the method may include the following steps:
  • step 501 the central node obtains performance parameters of each training node.
  • the central node and the training node may be started first. After startup, the central node sends a performance parameter acquisition request to each training node.
  • A training node that receives the performance parameter acquisition request stores and parses it, obtains its pre-stored performance parameters and its own node identifier, and sends the performance parameters and the node identifier to the central node.
  • After receiving the performance parameters and node identifiers sent by each training node, the central node stores them and generates a node identifier table covering all training nodes participating in the training; the table contains the node identifiers and the corresponding performance parameters.
  • the above performance parameters include at least one parameter of a central processing unit CPU model, a number of CPUs, a graphics processor GPU model, and a time spent processing a preset number of training samples.
  • The CPU model is a designation that the CPU manufacturer assigns to a CPU product according to its market positioning, for ease of classification and management.
  • the CPU model can be said to be an important identifier for distinguishing the performance of the CPU.
  • a CPU model can represent a fixed number of CPU cores and a fixed core frequency.
  • the time taken to process a preset number of training samples is a parameter obtained and stored in advance by the training node.
  • This parameter may be obtained by a technician testing a training node according to a preset number of training samples in advance, and then recording and storing the time spent processing the preset number of training samples.
  • the parameter may also be that after each training node is started, each training node automatically obtains a preset number of training samples for testing, and automatically records and stores the time consumed for processing the preset number of training samples.
  • The above testing process may be: inputting the sample input data of a preset number of training samples into the model to be trained to obtain output data, and determining the gradient data of each parameter to be trained in the model according to the output data and the output reference data in the training samples; the measured time is the total time from inputting the sample data to obtaining the gradient data.
  • any solution that can acquire the time spent processing a preset number of training samples is acceptable, and the present invention is not limited thereto.
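  • As a small illustration of the automatic test described above, the sketch below measures how long a node takes to process a preset number of training samples; `process_samples` is a hypothetical stand-in for running the samples through the model to be trained and computing the gradient data.

```python
import time

def process_samples(samples):
    """Placeholder: run the samples through the model and compute gradient data."""
    return sum(sum(s) for s in samples)

def time_preset_number_of_samples(samples):
    """Total time from inputting the sample data to obtaining the gradient data."""
    start = time.perf_counter()
    process_samples(samples)
    return time.perf_counter() - start

preset_samples = [[0.1, 0.2, 0.3]] * 1000                    # a preset number of test samples
elapsed = time_preset_number_of_samples(preset_samples)
print(f"processed {len(preset_samples)} samples in {elapsed:.6f}s")  # can be stored as a performance parameter
```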
  • In step 502, according to the performance parameters of each training node, the central node determines the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to that training node.
  • In the embodiment of the present invention, the central node determines, according to the performance parameters of each training node, the number of training samples that each training node can process within the preset duration, as the BatchSize of that node. In this way, the number of training samples processed by each training node is determined by its computing performance: a training node with stronger computing performance processes more training samples per training pass, and a training node with weaker computing performance processes fewer.
  • As a result, the time that each training node spends processing its training samples is approximately the same. This avoids the situation in which the training nodes consume different amounts of time per pass and training nodes with strong computing performance must wait for training nodes with weak computing performance to finish processing their training samples before the next operation can proceed, thereby avoiding wasted computing resources and further improving training efficiency.
  • the central node may calculate the BatchSize of each training node according to the correspondence between the pre-stored performance parameters and the number of training samples processed in a unit duration.
  • The corresponding processing steps may be as follows: according to the pre-stored correspondence between performance parameters and the number of training samples processed per unit duration, and the performance parameters of each training node, determine the corresponding number of training samples processed per unit duration; then, according to that number, determine the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node.
  • The BatchSize indicates the number of training samples that a training node processes simultaneously during one training pass.
  • The preset duration corresponds roughly to the time each training node is expected to spend completing one training pass with its training samples.
  • the following uses the performance parameters of one training node as an example for description.
  • the central node obtains the correspondence between the pre-stored performance parameters and the number of processing training samples per unit time.
  • the correspondence is the result of a large number of experiments performed by the technicians in advance, and is then stored in the central node in the form of a correspondence table.
  • For example, a technician can conduct experiments with training nodes having different CPU models, use these training nodes to process training samples to obtain gradient data, measure the number of training samples that each training node can process per unit duration, and record that number against the corresponding CPU model to obtain the correspondence table.
  • There may be multiple correspondence tables between performance parameters and the number of training samples processed per unit duration, including at least a correspondence table of CPU model versus the number of training samples processed per unit duration (shown as Table 1 below), a correspondence table of the number of CPUs versus that number (shown as Table 2 below), and a correspondence table of GPU model versus that number (shown as Table 3 below). If the performance parameter is the time taken to process a preset number of training samples, the number of training samples processed per unit duration can be calculated from it directly.
  • When determining a BatchSize, the central node may first determine the type of the performance parameter of the training node, obtain the corresponding correspondence table, and query it for the number of training samples processed per unit duration corresponding to that performance parameter. For example, if the performance parameter of the training node is the CPU model and the CPU model is Intel i5, the central node obtains Table 1, queries the number of training samples processed per unit duration corresponding to Intel i5, and determines that the training node processes 380 training samples per unit duration.
  • the performance parameters of the training node may be a combination of multiple parameters among the above parameters.
  • In that case, the number of training samples processed per unit duration corresponding to each parameter may be determined first, and the average of these numbers may then be used as the number of training samples processed per unit duration for that training node.
  • Alternatively, different training nodes can be selected for experiments based on a combination of parameters to construct a correspondence table. For example, training nodes with different combinations of CPU model and GPU model can be selected for testing; these training nodes are used to process training samples to obtain gradient data, the number of training samples each node can process per unit duration is obtained after the experiment, and that number is recorded against the corresponding CPU model and GPU model to obtain the correspondence table.
  • Then, according to the number of training samples processed per unit duration, the number of training samples that the training node can process within the preset duration is determined. The product of the number of training samples processed per unit duration and a preset value may be determined as the number of training samples that the training node can process within the preset duration, i.e., as the BatchSize corresponding to that training node; alternatively, the number of training samples processed per unit duration may itself be determined as that number and used as the BatchSize.
  • For example, if the number of training samples a training node processes per unit duration is 380, the unit duration is 10 s, and the preset duration is 30 s, then the preset value is 30/10 = 3 and the BatchSize corresponding to the training node is 380 × 3 = 1140.
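  • The lookup and the two options described above can be sketched as follows; the table contents (only the Intel i5 entry of 380 is quoted from the example above, the rest is invented) and the function name are illustrative assumptions.

```python
# Hypothetical correspondence table (cf. Table 1): CPU model -> samples processed per unit duration.
SAMPLES_PER_UNIT = {"Intel i5": 380, "Intel i7": 520}

def batchsize_from_cpu_model(cpu_model, unit_duration_s=10.0, preset_duration_s=30.0, scale_to_preset=True):
    per_unit = SAMPLES_PER_UNIT[cpu_model]          # samples this node processes per unit duration
    if scale_to_preset:
        # Option 1: product of the per-unit count and a preset value (preset duration / unit duration).
        return int(per_unit * (preset_duration_s / unit_duration_s))
    # Option 2: use the per-unit count itself as the BatchSize.
    return per_unit

print(batchsize_from_cpu_model("Intel i5"))                        # 380 * 3 = 1140
print(batchsize_from_cpu_model("Intel i5", scale_to_preset=False)) # 380
```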
  • The above steps take the determination of one training node's BatchSize by the central node as an example; for the other training nodes, the central node performs the same processing steps to determine their BatchSize, and details are not repeated here.
  • step 503 the central node sends the determined BatchSize to the corresponding training nodes.
  • After determining the BatchSize of each training node through the above steps, the central node obtains the node identifier and the BatchSize of each training node, obtains the node address corresponding to the node identifier, and sends the corresponding BatchSize to that training node according to the node address.
  • step 504 the training node receives the BatchSize corresponding to the training node sent by the central node.
  • the training node receives the BatchSize sent by the central node. Then, the previously stored model to be trained is initialized according to the received BatchSize and the initial parameters of the model.
  • the initial parameters of the model can be stored in the training node in advance, and when used, the training node can directly obtain it from its own memory.
  • Alternatively, the initial model parameters may be stored in the central node; the training node sends an initial parameter acquisition request to the central node, the central node sends the pre-stored initial parameters of the model to the training node according to the request, and the training node receives and stores the initial parameters of the model. The present invention does not limit this.
  • step 505 according to the BatchSize, the training node obtains a corresponding number of training samples.
  • After receiving the BatchSize sent by the central node, the training node obtains a corresponding number of training samples according to the BatchSize and starts processing these training samples for training.
  • The training samples can come from any known training data set; which data set is used is chosen according to the needs of the user (that is, a technician). Among all the training nodes that participate in training the model with training samples, the training samples used by different nodes are not the same.
  • the following provides several options for training nodes to obtain training samples:
  • Training samples can be stored in the central node in advance.
  • When a training node needs to obtain training samples, it sends a training sample acquisition request to the central node.
  • The training sample acquisition request carries the node identifier and the BatchSize of the training node.
  • After receiving the training sample acquisition request, the central node obtains a corresponding number of training samples according to the BatchSize and sends the training samples to the training node according to the node identifier.
  • When allocating training samples, the central node may use a sequential allocation algorithm, assigning training samples to each training node in the order of the training sample set, or any random non-repeating allocation algorithm that assigns non-overlapping training samples to each training node; a minimal sketch of a sequential allocation is given below.
  • Alternatively, the central node segments the stored training data set according to the number of training nodes, with each segment of the training data set corresponding to one training node.
  • any scheme that allows the central node to allocate different training samples to each training node is acceptable, and the present invention does not limit this.
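  • The following sketch is one possible reading of the sequential, non-overlapping allocation described above, not the patent's exact algorithm.

```python
def allocate_sequentially(dataset, batchsizes, cursor=0):
    """Hand each training node the next BatchSize samples of the data set, in order
    and without overlap; `batchsizes` maps node id -> BatchSize. Returns the new cursor."""
    allocation = {}
    for node_id, bs in batchsizes.items():
        allocation[node_id] = dataset[cursor:cursor + bs]
        cursor += bs
    return allocation, cursor

dataset = list(range(100))                                        # stand-in for a training sample set
alloc, cursor = allocate_sequentially(dataset, {"node-A": 30, "node-B": 10})
print(len(alloc["node-A"]), len(alloc["node-B"]), cursor)          # 30 10 40
```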
  • the training samples can be stored in each training node in advance, and the training nodes can directly obtain a corresponding number of training samples in their respective memories according to the BatchSize.
  • the training samples stored in each training node may be different parts of the same training data set, that is, the training samples stored in each training node are different.
  • Alternatively, the same training data set is stored in each training node, but the copies in different training nodes carry different segmentation identifiers, and each training node can obtain, as its training samples, only the part of the training data set designated by its segmentation identifier.
  • any scheme that allows each training node to obtain different training samples is acceptable, and the present invention does not limit this.
  • the training samples can be stored in an independent storage server in advance.
  • When a training node needs to obtain training samples, it sends a training sample acquisition request to the storage server.
  • the training sample acquisition request carries the node ID and BatchSize of the training node.
  • the storage server obtains a corresponding number of training samples according to the BatchSize, and then sends the training samples to the training node according to the node identifier.
  • the processing speed and bandwidth of the storage server are higher than those of the central node. Therefore, the training efficiency of the scheme of storing training samples on the storage server is higher than that of the scheme of storing training samples on the central node.
  • step 506 the training node performs training processing on the model to be trained according to the obtained training samples.
  • the training node uses the obtained training samples to perform training processing on the model to be trained.
  • the above training samples may include sample input data and output reference data.
  • The processing steps for the training node and the central node to train the model to be trained may be as follows. In step 601, the training node inputs the sample input data of the obtained training samples into the model to be trained and obtains the output data corresponding to the training samples. In step 602, according to the output reference data in the training samples and the output data, the training node determines the gradient data corresponding to each parameter to be trained in the model to be trained. In step 603, the training node sends the gradient data to the central node. In step 604, the central node receives the gradient data sent by the training nodes. In step 605, the central node calculates the average value of the received gradient data. In step 606, the central node determines the update parameters of the model to be trained according to the average value. In step 607, the central node sends the update parameters to each training node. In step 608, the training node receives the update parameters sent by the central node and, according to the update parameters, updates each parameter to be trained in the model to be trained.
  • each training node obtains a corresponding number of training samples according to the corresponding BatchSize
  • the interaction between a training node and a central node is taken as an example.
  • Assume the BatchSize is n; that is, the training node obtains n training samples.
  • The training node inputs the sample input data of the n obtained training samples into the model to be trained, and the model to be trained processes the n sample input data at the same time and outputs n output data.
  • Each training node obtains a corresponding number of training samples according to its BatchSize and processes these training samples simultaneously; the time consumed is approximately the above-mentioned preset duration, that is, the time consumed by each training node is about the same, which avoids wasting computing resources and further improves training efficiency.
  • the Loss corresponding to each of the n output data is calculated according to the n output data and the output reference data in the corresponding training sample and the calculation formula for calculating the loss function Loss.
  • the output data and the output reference data may both be in the form of long vectors, and the vector lengths of the output data and the output reference data are the same.
  • The Loss can be calculated according to the calculation formula of a cross-entropy loss function, for example the standard form shown as formula (1) below, where y denotes the output reference data, p denotes the output data, and N is the vector length of the output data (which equals that of the output reference data):

$$Loss = -\frac{1}{N}\sum_{j=1}^{N}\Big[y_j\log p_j + (1 - y_j)\log(1 - p_j)\Big] \qquad (1)$$
  • After the n Loss values are calculated, their average is taken as the common Loss of this batch of training samples. Whether the training needs to continue can then be judged based on the common Loss.
  • One feasible judgment scheme is to determine whether the common Loss is less than a preset Loss threshold. If the common Loss is less than the preset Loss threshold, the gap between the output data corresponding to the sample input data and the output reference data is already very small, so the training can be stopped and the training process ends.
  • Another feasible judgment scheme is to determine whether the common Loss has converged, that is, whether the common Loss of this iteration has changed relative to the common Loss of the previous iteration. If it has not changed, the common Loss has converged and continuing the training cannot make it smaller, so the training is stopped and the training process ends.
  • In either case, the training node can send a notification message to the central node to notify it that the current model meets the training end condition, and after receiving the notification message the central node can end the training of the model.
  • If the common Loss is greater than or equal to the preset Loss threshold, or the common Loss has not converged, continuing the training can reduce the common Loss and improve the accuracy of the model, so the training continues.
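  • Putting the above together, the following sketch computes a per-sample Loss of the standard cross-entropy form given as formula (1), averages it into the common Loss of the batch, and applies the two stopping checks just described; the threshold and the convergence tolerance are illustrative values, since the patent only specifies a preset Loss threshold and an unchanged common Loss.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Formula (1): y is the output reference vector, p the output vector, N their common length."""
    n = len(y)
    return -sum(yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
                for yi, pi in zip(y, p)) / n

def common_loss(batch):
    """Average of the n per-sample Loss values of this batch of training samples."""
    return sum(cross_entropy(y, p) for y, p in batch) / len(batch)

def should_stop(loss, prev_loss, loss_threshold=0.05, tol=1e-6):
    """Stop when the common Loss is below the preset threshold or has stopped changing."""
    converged = prev_loss is not None and abs(loss - prev_loss) < tol
    return loss < loss_threshold or converged

batch = [([1.0, 0.0], [0.9, 0.2]), ([0.0, 1.0], [0.1, 0.8])]   # (output reference, output data) pairs
loss = common_loss(batch)
print(loss, should_stop(loss, prev_loss=None))
```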
  • If the training needs to continue, the gradient data of each parameter to be trained is calculated. The gradient data can be calculated, for example, as in formula (2) below, where W_i is the current value of the i-th parameter to be trained, Loss is the common Loss, and ΔW_i is the gradient data of that parameter; the training node then sends the obtained gradient data together with its own node identifier to the central node.

$$\Delta W_i = \frac{\partial Loss}{\partial W_i} \qquad (2)$$
  • After the central node receives the gradient data and node identifiers sent by the training nodes, it compares the received node identifiers with the node identifiers in the previously generated node identifier table. If the received node identifiers match all the node identifiers in the table, all the training nodes participating in the model training have completed processing their training samples and returned their gradient data.
  • The central node then calculates, for each parameter to be trained, the average of the corresponding gradient data among all the gradient data received, finally obtaining the average gradient of every parameter to be trained. The central node then calculates the update parameter of each parameter to be trained according to the average gradient of that parameter and a preset learning rate; the calculation formula may be, for example, the standard gradient descent update shown as formula (3) below, where W_{i+1} is the update parameter of the parameter to be trained, W_i is its current value, α is the preset learning rate, and the overlined ΔW_i is the average of the gradient data of that parameter:

$$W_{i+1} = W_i - \alpha \cdot \overline{\Delta W_i} \qquad (3)$$
  • After obtaining the update parameters of each parameter to be trained, the central node sends the update parameters to each training node separately. After receiving the update parameters, each training node updates every parameter to be trained in the model to be trained with the received update parameters. In this way, one pass of deep learning synchronous data parallel training is completed. After updating the model parameters, each training node can obtain the next BatchSize of training samples and perform the above training process again on the newly obtained training samples.
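  • To tie formulas (2) and (3) together, here is a minimal sketch of the central node's aggregation step under the reconstructed update rule above: it averages the gradient data reported by the training nodes for each parameter to be trained, applies the preset learning rate, and returns the update parameters to be broadcast; the dictionary layout and the learning-rate value are assumptions made for illustration.

```python
def aggregate_and_update(current_params, node_gradients, learning_rate=0.01):
    """current_params: {name: value} of each parameter to be trained (W_i).
    node_gradients: one {name: gradient} dict per training node (the reported gradient data).
    Returns W_{i+1} = W_i - learning_rate * mean(gradient) for every parameter, as in formula (3)."""
    updated = {}
    for name, w in current_params.items():
        grads = [g[name] for g in node_gradients]       # gradient data for this parameter from every node
        avg_grad = sum(grads) / len(grads)              # step 605: average of the received gradient data
        updated[name] = w - learning_rate * avg_grad    # step 606: update parameter from the average and learning rate
    return updated                                      # step 607: broadcast to every training node

params = {"w0": 0.5, "b0": 0.1}
reports = [{"w0": 0.2, "b0": -0.1}, {"w0": 0.4, "b0": 0.1}]   # gradient data from two training nodes
print(aggregate_and_update(params, reports))                   # w0 -> 0.5 - 0.01 * 0.3 ≈ 0.497, b0 unchanged
```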
  • In addition, some of the training samples can be set aside as test samples, and the model is tested using these test samples.
  • The test data obtained are compared with the corresponding output reference data to determine whether each test result is correct; here, the test data are the output data obtained by inputting the sample input data of the test samples into the model.
  • The ratio of correct to incorrect results among the obtained test data is calculated as the training accuracy.
  • the training accuracy indicates the accuracy of a model output. When the training accuracy reaches the preset training accuracy threshold, it indicates that the model is sufficiently accurate and no further training is required. Therefore, when the training accuracy reaches the preset training accuracy threshold, the training is stopped and the training process ends. At this time, the training node may send a notification message to the center node to notify the center node that the current model meets the training end condition, and the center node may end the training of the model after receiving the notification message.
  • a threshold for the number of iterations can also be set in advance, and the training is stopped when the number of iterations for the training reaches the threshold for the number of iterations, and the training process ends.
  • the training node may send a notification message to the center node to notify the center node that the current model meets the training end condition, and the center node may end the training of the model after receiving the notification message.
  • The way of judging when to stop the training is not limited to the above-mentioned methods; any method that can effectively judge when to stop the training may be used, and the present invention does not limit this.
  • The above training process is repeated until one of the above conditions for stopping the training is reached; the training then stops, and each training node obtains the same trained model.
  • the above-mentioned central node may be a terminal or a server.
  • the training node can be a terminal or a server.
  • The central node and the training nodes can be separate physical devices, in which case the central node is only used for data interaction with the training nodes and does not participate in the process of training the model to be trained using the training samples.
  • The central node and a training node can also be virtual modules established in the same physical device; in that case the physical device that integrates the central node and a training node both interacts with the training nodes in other physical devices and participates in the process of training the model to be trained using the training samples.
  • The data communication between the physical device where the central node is located and the physical devices where the other training nodes are located can use a parallel communication protocol.
  • the central node determines the batch size of each training node according to the performance parameters of each training node, and the training node obtains a corresponding number of training samples according to the batch size, and performs training processing on the model to be trained.
  • The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • an embodiment of the present invention further provides a device for updating parameters.
  • The device may be the central node in the foregoing embodiments. As shown in FIG. 7, the device includes an obtaining module 710, a determining module 720, and a sending module 730.
  • the obtaining module 710 is configured to obtain performance parameters of each training node
  • the determining module 720 is configured to determine, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset length of time, as the BatchSize corresponding to each training node;
  • the sending module 730 is configured to send the determined BatchSize to the corresponding training nodes, respectively.
  • the performance parameters include at least one parameter of a central processing unit CPU model, a number of CPUs, a graphics processor GPU model, and a time spent processing a preset number of training samples.
  • The determining module 720 is configured to: according to a pre-stored correspondence between performance parameters and the number of training samples processed per unit duration, and the performance parameters of each training node, determine the corresponding number of training samples processed per unit duration; and, according to that number, determine the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to each training node.
  • Optionally, the determining module 720 is configured to: determine the product of the number of training samples processed per unit duration and a preset value as the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node; or determine the number of training samples processed per unit duration itself as that number, as the BatchSize corresponding to each training node.
  • the apparatus further includes:
  • the receiving module 740 is configured to receive the gradient data sent by each training node after the determined BatchSize is sent to the corresponding training nodes respectively;
  • a calculation module 750 configured to calculate an average value of the received gradient data
  • the determining module 720 is further configured to determine update parameters of a model to be trained according to the average value
  • the sending module 730 is further configured to send the update parameter to each training node.
  • In this way, the central node determines the BatchSize of each training node according to the performance parameters of that training node, so that each training node obtains a corresponding number of training samples according to its BatchSize and performs training processing on the model to be trained. The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • an embodiment of the present invention further provides a device for updating parameters.
  • the device may be a training node in the foregoing embodiment.
  • As shown in FIG. 9, the device includes a receiving module 910, an obtaining module 920, and a training module 930.
  • the receiving module 910 is configured to receive a BatchSize corresponding to the training node sent by the central node;
  • the obtaining module 920 is configured to obtain a corresponding number of training samples according to the BatchSize;
  • the training module 930 is configured to perform training processing on a model to be trained according to the obtained training samples.
  • the training samples include sample input data and output reference data
  • the training module 930 is configured to: input the sample input data of the obtained training samples into the model to be trained to obtain the output data corresponding to the training samples; determine the gradient data corresponding to each parameter to be trained according to the output reference data in the training samples and the output data; send the gradient data to the central node; and receive the update parameters sent by the central node and update each parameter to be trained according to the update parameters.
  • the training node obtains a BatchSize determined based on the computing performance of the training node, obtains a corresponding number of training samples according to the BatchSize, and performs training processing on the model to be trained.
  • The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • an embodiment of the present invention further provides a system for updating parameters.
  • the system includes a central node and a training node, where:
  • the central node is configured to obtain the performance parameters of each training node; determine, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to each training node; and send the determined BatchSize to the corresponding training nodes respectively;
  • the training node is configured to receive a BatchSize corresponding to the training node sent by the central node; obtain a corresponding number of training samples according to the BatchSize; and perform training processing on the model to be trained according to the obtained training samples.
  • the central node determines the batch size of each training node according to the performance parameters of each training node, and the training node obtains a corresponding number of training samples according to the batch size, and performs training processing on the model to be trained.
  • The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • It should be noted that the device for updating parameters provided in the foregoing embodiments is described using the above division of functional modules only as an example.
  • In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the device for updating parameters and the method for updating parameters provided by the foregoing embodiments belong to the same concept. For specific implementation processes, refer to the method embodiments, and details are not described herein again.
  • FIG. 10 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • the computer device may be a central node in the foregoing embodiment.
  • The computer device 1000 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 1001 and one or more memories 1002.
  • The memory 1002 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1001 to implement the method steps of updating parameters performed by the central node described above: obtaining performance parameters of each training node; determining, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to each training node; and sending the determined BatchSize to the corresponding training nodes respectively.
  • Optionally, the at least one instruction is loaded and executed by the processor 1001 to implement the following method steps: according to a pre-stored correspondence between performance parameters and the number of training samples processed per unit duration, and the performance parameters of each training node, determining the corresponding number of training samples processed per unit duration; and, according to that number, determining the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node.
  • Optionally, the at least one instruction is loaded and executed by the processor 1001 to implement the following method steps: determining the product of the number of training samples processed per unit duration and a preset value as the number of training samples that each training node can process within the preset duration, as the BatchSize corresponding to each training node; or determining the number of training samples processed per unit duration itself as that number, as the BatchSize corresponding to each training node.
  • Optionally, the at least one instruction is loaded and executed by the processor 1001 to implement the following method steps: receiving the gradient data sent by each training node; calculating the average value of the received gradient data; determining the update parameters of the model to be trained according to the average value; and sending the update parameters to each training node.
  • In this way, the central node determines the BatchSize of each training node according to the performance parameters of that training node, so that each training node obtains a corresponding number of training samples according to its BatchSize and performs training processing on the model to be trained. The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • the computer device may be a training node in the foregoing embodiment.
  • The computer device 1100 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 1101 and one or more memories 1102.
  • The memory 1102 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1101 to implement the method steps of updating parameters performed by the training node described above: receiving the BatchSize corresponding to the training node sent by the central node; obtaining a corresponding number of training samples according to the BatchSize; and performing training processing on the model to be trained according to the obtained training samples.
  • Optionally, the at least one instruction is loaded and executed by the processor 1101 to further implement the following method steps: inputting the sample input data of the obtained training samples into the model to be trained to obtain the output data corresponding to the training samples; determining the gradient data corresponding to each parameter to be trained in the model to be trained according to the output reference data in the training samples and the output data; sending the gradient data to the central node; and receiving the update parameters sent by the central node and updating each parameter to be trained in the model to be trained according to the update parameters.
  • the training node obtains a BatchSize determined based on the computing performance of the training node, obtains a corresponding number of training samples according to the BatchSize, and performs training processing on the model to be trained.
  • The BatchSize of each training node is thus determined according to the computing performance of that training node: a training node with stronger computing performance is assigned a larger BatchSize, and a training node with weaker computing performance is assigned a smaller one. In this way, in the same period of time, the training nodes with stronger computing performance use more training samples to calculate gradient data and the training nodes with weaker computing performance use fewer, so each training node consumes almost the same amount of time. This avoids the situation in which a training node with strong computing power completes its calculation first and then waits for the training nodes with weak computing power to finish before subsequent processing can be performed, thereby avoiding wasting the computing resources of the training nodes and improving training efficiency.
  • An embodiment of the present invention further provides a computer-readable storage medium storing at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the method of updating parameters in the above embodiments.
  • the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
  • the program may be stored in a computer-readable storage medium.
  • the storage medium mentioned may be a read-only memory, a magnetic disk or an optical disk.
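
By way of illustration only, the following Python sketch shows one way a central node could turn per-node throughput figures into per-node BatchSize values so that every node finishes a synchronous step in roughly the same preset time. The class and function names (NodePerf, assign_batch_sizes, preset_seconds) are hypothetical and not part of the claimed method.

```python
# Minimal sketch (not the patented implementation): assign a per-node
# BatchSize in proportion to each node's measured throughput, so that every
# node needs roughly the same wall-clock time per synchronous step.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodePerf:
    node_id: str
    samples_per_second: float  # measured or looked up from a performance table


def assign_batch_sizes(nodes: List[NodePerf], preset_seconds: float) -> Dict[str, int]:
    """Give each node the number of samples it can process in `preset_seconds`."""
    return {
        n.node_id: max(1, int(n.samples_per_second * preset_seconds))
        for n in nodes
    }


if __name__ == "__main__":
    nodes = [
        NodePerf("node-a", samples_per_second=38.0),  # stronger node
        NodePerf("node-b", samples_per_second=20.0),  # weaker node
    ]
    print(assign_batch_sizes(nodes, preset_seconds=30.0))
    # {'node-a': 1140, 'node-b': 600} -- both nodes finish in about 30 s
```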


Abstract

The present invention discloses a method and device for updating parameters, belonging to the field of computer technology. The method includes: obtaining performance parameters of each training node; determining, according to the performance parameters of each training node, the number of training samples that each training node can process within a preset duration, as the BatchSize corresponding to each training node; and sending the determined BatchSizes to the corresponding training nodes respectively. The present invention can improve the efficiency of model training.

Description

更新参数的方法和装置
本申请要求于2018年7月20日提交的申请号为201810803723.3、发明名称为“更新参数的方法和装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及计算机技术领域,特别涉及一种更新参数的方法和装置。
背景技术
深度学习由于具有强大的表征能力和拟合能力,成为当前热门的研究方向。深度学习模型在使用前需要进行训练,训练可以采用梯度下降法,训练过程可以如下:将样本数据输入到待训练模型中,得到训练值,根据训练值与样本数据的真值计算损失函数loss,根据loss计算待训练模型中的每个参数的梯度数据,根据梯度数据计算每个参数对应的更新参数,并对待训练模型中的参数进行更新,重复上述过程,直到loss小于预设loss阈值时,结束训练。
为了缩短训练花费的时间,提高训练的效率,以一次训练为例,可以将待训练模型分别存储到参与训练的各个训练节点上,该训练节点可以是终端也可以是服务器。多个训练节点使用相同数量(可称为BatchSize)的不同样本数据分别计算各个参数的梯度数据,然后计算多个训练节点得到的每个参数的梯度数据的平均值,然后根据梯度数据的平均值确定更新参数,然后每个训练节点根据更新参数同时进行参数更新。然后再重复上述处理,继续进行训练。这种技术可以被称为深度学习同步数据并行训练技术。
在实现本发明的过程中,发明人发现相关技术至少存在以下问题:
由于训练节点的配置不同,导致训练节点的计算能力不同,因此,上述各个训练节点同时根据相同BatchSize的样本数据计算梯度数据时,会出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,这样,会浪费训练节点的计算资源,使得训练的效率降低。
发明内容
为了解决相关技术的问题,本发明实施例提供了一种更新参数的方法和装置。所述技术方案如下:
第一方面,提供了一种更新参数的方法,所述方法包括:
获取每个训练节点的性能参数;
根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;
将确定出的BatchSize,分别发送给对应的训练节点。
可选地,所述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
可选地,所述根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize,包括:
根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize,包括:
将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述将确定出的BatchSize,分别发送给对应的训练节点之后,还包括:
接收所述每个训练节点发送的梯度数据;
计算接收到的梯度数据的平均值;
根据所述平均值,确定待训练的模型的更新参数;
将所述更新参数发送给每个训练节点。
第二方面,提供了一种更新参数的方法,所述方法包括:
接收中心节点发送的本训练节点对应的BatchSize;
根据所述BatchSize,获取对应数目的训练样本;
根据获取的训练样本,对待训练的模型进行训练处理。
可选地,所述训练样本包括样本输入数据和输出参考数据;
所述根据获取的训练样本,对待训练的模型进行训练处理,包括:
将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
将所述梯度数据发送给所述中心节点;
接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
第三方面,提供了一种更新参数的装置,所述装置包括:
获取模块,用于获取每个训练节点的性能参数;
确定模块,用于根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;
发送模块,用于将确定出的BatchSize,分别发送给对应的训练节点。
可选地,所述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
可选地,所述确定模块,用于:
根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述确定模块,用于:
将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述装置还包括:
接收模块,用于将确定出的BatchSize,分别发送给对应的训练节点之后,接收所述每个训练节点发送的梯度数据;
计算模块,用于计算接收到的梯度数据的平均值;
所述确定模块,还用于根据所述平均值,确定待训练的模型的更新参数;
所述发送模块,还用于将所述更新参数发送给每个训练节点。
第四方面,提供了一种更新参数的装置,所述装置包括:
接收模块,用于接收中心节点发送的本训练节点对应的BatchSize;
获取模块,用于根据所述BatchSize,获取对应数目的训练样本;
训练模块,用于根据获取的训练样本,对待训练的模型进行训练处理。
可选地,所述训练样本包括样本输入数据和输出参考数据;
所述训练模块,用于:
将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
将所述梯度数据发送给所述中心节点;
接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
第五方面,提供了一种更新参数的系统,所述系统包括中心节点和训练节点,其中:
所述中心节点,用于执行第一方面所述的方法
所述训练节点,用于执行第二方面所述的方法。
第六方面,提供了一种计算机设备,所述计算机设备包括处理器、通信接口、存储器和通信总线,其中,处理器,通信接口,存储器通过总线完成相互间的通信;存储器,用于存放计算机程序;处理器,用于执行存储器上所存放 的程序,实现上述第一方面或第二方面中任一所述的方法步骤。
第七方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上述第一方面或第二方面所述的更新参数的方法。
本发明实施例提供的技术方案带来的有益效果至少包括:
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的样本数据批次数目BatchSize,训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本发明实施例提供的一种更新参数的设备交互的流程示意图;
图2是本发明实施例提供的一种更新参数的设备交互的流程示意图;
图3是本发明实施例提供的一种更新参数的方法的流程图;
图4是本发明实施例提供的一种更新参数的方法的流程图;
图5是本发明实施例提供的一种更新参数的方法的流程图;
图6是本发明实施例提供的一种更新参数的方法的流程图;
图7是本发明实施例提供的一种更新参数的装置的结构示意图;
图8是本发明实施例提供的一种更新参数的装置的结构示意图;
图9是本发明实施例提供的一种更新参数的装置的结构示意图;
图10是本发明实施例提供的一种中心节点结构示意图;
图11是本发明实施例提供的一种训练节点结构示意图。
具体实施方式
为使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明实施方式作进一步地详细描述。
本发明实施例提供了一种更新参数的方法,该方法可以由中心节点和训练节点共同实现。其中,中心节点可以是终端,也可以是服务器;训练节点可以是终端,也可以是服务器。例如,一种情况,训练节点和中心节点均为服务器,组成服务器组,另一种情况,训练节点和中心节点均为终端。中心节点和训练节点可以是各自独立的实体设备,如图1所示,中心节点仅用于和训练节点进行数据交互,本身不参与使用训练样本对待训练的模型进行训练的过程;中心节点和训练节点也可以是在实体设备中建立的虚拟模块,中心节点可以和训练节点集成在一个实体设备中,如图2所示,则中心节点和训练节点集成的实体设备既要与其他实体设备中的训练节点进行数据交互,也要参与使用训练样本对待训练的模型进行训练的过程,本发明对此不做限定。
中心节点可以包括处理器、存储器、收发器等部件。处理器,可以为CPU(Central Processing Unit,中央处理单元)等,可以根据性能参数确定训练节点的BatchSize、获取每个训练节点的性能参数、计算接收到的梯度数据的平均值等处理。存储器,可以为RAM(Random Access Memory,随机存取存储器),Flash(闪存)等,可以用于存储接收到的数据、处理过程所需的数据、处理过程中生成的数据等,如每个训练节点的性能参数、每个训练节点的BatchSize、训练节点发送的梯度数据、梯度数据的平均值、待训练的模型的更新参数等。收发器,可以用于与终端或其它服务器进行数据传输,例如,向训练节点发送待训练的模型的更新参数,向每个训练节点发送对应的BatchSize,收发器可以包括天线、匹配电路、调制解调器等。中心节点还可以包括屏幕、图像检测部件、音频输出部件和音频输入部件等。屏幕可以用于显示训练结果。图像检测部件可以是摄像头等。音频输出部件可以是音箱、耳机等。音频输入部件可以是麦克风等。
训练节点可以包括处理器、存储器、收发器等部件。处理器,可以为CPU(Central Processing Unit,中央处理单元)等,可以用于根据BatchSize获取训练样本、根据训练样本得到训练样本对应的输出数据、计算梯度数据、根据更新参数对待训练参数进行更新等处理。存储器,可以为RAM(Random Access Memory,随机存取存储器),Flash(闪存)等,可以用于存储接收到的数据、处理过程所需的数据、处理过程中生成的数据等,如BatchSize、训练样本、训练样本对应的输出数据、待训练参数对应的梯度数据、更新参数等。收发器,可以用于与终端或其它服务器进行数据传输,例如,向中心节点发送待训练参数对应的梯度数据,接收中心节点发送的对应的BatchSize,收发器可以包括天线、匹配电路、调制解调器等。终端还可以包括屏幕、图像检测部件、音频输出部件和音频输入部件等。屏幕可以用于显示训练结果等。收发器,可以用于与其它设备进行数据传输,例如,接收服务器发送的设备列表和控制页面,可以包括天线、匹配电路、调制解调器等。
本发明实施例提供了一种更新参数的方法,该方法应用于中心节点,如图3所示,该方法的处理流程可以包括如下的步骤:
在步骤301中,获取每个训练节点的性能参数。
在步骤302中,根据每个训练节点的性能参数,分别确定每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize。
在步骤303中,将确定出的BatchSize,分别发送给对应的训练节点。
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的BatchSize,使得训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
本发明实施例提供了一种更新参数的方法,该方法应用于训练节点,如图4所示,该方法的处理流程可以包括如下的步骤:
在步骤401中,接收中心节点发送的本训练节点对应的BatchSize。
在步骤402中,根据BatchSize,获取对应数目的训练样本。
在步骤403中,根据获取的训练样本,对待训练的模型进行训练处理。
本发明实施例中,训练节点获取基于本训练节点的计算性能确定的BatchSize,并根据该BatchSize获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,这样每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
本发明实施例提供了一种更新参数的方法,该方法应用于中心节点和训练节点,如图5所示,该方法的处理流程可以包括如下的步骤:
在步骤501中,中心节点获取每个训练节点的性能参数。
一个可能的实施例中,用户想要通过中心节点和多个训练节点对待训练的模型进行训练时,可以先启动中心节点和训练节点。启动后,中心节点向每个训练节点发送性能参数获取请求,接收到性能参数获取请求的训练节点存储并解析性能参数获取请求,然后获取预先存储的性能参数以及自身的节点标识,将性能参数和节点标识发送给中心节点。中心节点接收到各训练节点发送的性能参数和节点标识后,将各训练节点的性能参数和节点标识对应存储,并生成一个参与训练的所有训练节点的节点标识表,该表包括所有训练节点的节点标志和对应的性能参数。
可选地,上述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
其中,CPU型号是指CPU厂商会根据CPU产品的市场定位来给CPU产品确定一个型号以便于分类和管理,一般而言CPU型号可以说是用于区分CPU性能的重要标识。一般来说,一个CPU型号可以代表固定的CPU的内核数量以及 固定的核心频率。
处理预设数目个训练样本所耗时长是训练节点预先获得并存储的参数。该参数可以是技术人员预先根据预设数目个训练样本对训练节点进行测试,然后记录处理预设数目个训练样本所消耗的时长并存储得到的。另外,该参数也可以是在各训练节点启动后,各训练节点自动获取预设数目个训练样本进行测试,自动记录并存储处理预设数目个训练样本所消耗的时长。上述测试过程可以是:将预设数目个训练样本的样本输入数据,分别输入待训练的模型中,得到输出数据,根据输出数据和训练样本中的输出参考数据,确定待训练的模型中的每个待训练参数对应的梯度数据。统计的时长是由输入样本输入数据到得到梯度数据的总耗时。除了上述的方法之外,任一可以获取处理预设数目个训练样本所耗时长的方案皆可,本发明对此不做限定。
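As a rough illustration of the timing test described in the preceding paragraph, the following Python sketch measures how long one node takes to turn a preset number of training samples into gradient data. `compute_gradients` is a hypothetical placeholder for the node's actual forward/backward step, not an API defined by this application.

```python
# Illustrative sketch: time one pass over a preset number of training samples,
# from feeding in the sample input data to obtaining the gradient data.

import time
from typing import Callable, Sequence


def measure_processing_time(samples: Sequence,
                            compute_gradients: Callable[[Sequence], None]) -> float:
    """Return the seconds spent turning `samples` into gradient data."""
    start = time.perf_counter()
    compute_gradients(samples)          # forward pass + loss + gradient computation
    return time.perf_counter() - start


# Example interpretation: 100 samples processed in `elapsed` seconds implies a
# rate of 100 / elapsed samples per second, which can later be scaled to any
# unit duration (e.g. a 20 s unit gives 20 / elapsed * 100 samples).
```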
在步骤502中,根据每个训练节点的性能参数,中心节点分别确定每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize。
一个可能的实施例中,获取每个训练节点的性能参数后,中心节点根据每个训练节点的性能参数,确定每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize。这样,可以根据每个训练节点的计算性能,确定它们每次训练处理的训练样本的数目,计算性能强一些的训练节点每次处理的训练样本的数目多一些,计算性能弱一些的训练节点每次处理的训练样本的数目少一些,这样,各训练节点处理训练样本所消耗的时间大致相同,避免出现训练节点每次训练消耗的时间不相同,使得计算性能强的训练节点需要等待计算性能弱的训练节点完成训练样本处理后,才能进行下一步操作的情况,避免出现计算资源的浪费,可以进一步提高训练效率。
可选地,中心节点可以根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,计算得到每个训练节点的BatchSize,相应的处理步骤可以如下:根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据单位时长内处理训练样本的数目,确定每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize。
其中,BatchSize用于指示训练节点在使用训练样本训练时,一次训练过程中同时处理训练样本的数目。预设时长相当于每个训练节点使用训练样本完成 一次训练所消耗的时长。
一个可能的实施例中,获取每个训练节点的性能参数后,下面以一个训练节点的性能参数为例进行说明。
中心节点获取预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,该对应关系为技术人员预先进行大量试验得到的结果,然后以对应关系表的形式存储在中心节点。例如,以CPU型号为例,技术人员可以使用不同CPU型号的训练节点分别进行试验,使用这些训练节点分别处理训练样本得到梯度数据,经过试验得出这些训练节点在单位时长内可以处理的训练样本的数目,与CPU型号对应记录,得到对应关系表。基于上述性能参数包括CPU型号、CPU个数、GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数,因此性能参数与单位时长内处理训练样本的数目的对应关系表可以是多个表,至少包括CPU型号与单位时长内处理训练样本的数目的对应关系表(如下表1所示)、CPU个数与单位时长内处理训练样本的数目的对应关系表(如下表2所示)以及GPU型号与单位时长内处理训练样本的数目的对应关系表(如下表3所示)这三个表。如果性能参数为处理预设数目个训练样本所耗时长,可以通过算法计算得到单位时长内处理训练样本的数目,举例来说,假设某一训练节点的性能参数为处理预设数目个训练样本所耗时长,其中,预设数目为100,所耗时长为10s,而待计算的单位时长内处理训练样本的数目中的单位时长为20s,则可以计算20s÷10s×100=200,可以得到单位时长内处理训练样本的数目为200。
表1
CPU型号 单位时长处理训练样本的数目
Intel i3 300
Intel i5 380
…… ……
AMD Ryzen 7 2700 330
表2
CPU内核数目 单位时长处理训练样本的数目
2 150
4 270
6 380
8 500
表3
GPU型号 单位时长处理训练样本的数目
Intel(R)HD Graphics 630 350
AMD Radeon(TM)R9 270 480
…… ……
GeForce GTX 760 200
中心节点可以确定训练节点的性能参数的种类,然后获取对应的对应关系表,在对应关系表中查询训练节点的性能参数对应的单位时长内处理训练样本的数目。例如,中心节点确定训练节点的性能参数CPU型号为Intel i5,则中心节点获取表1对应的对应关系表,并查询Intel i5对应的单位时长内处理训练样本的数目,可以确定该训练节点的性能参数对应的单位时长内处理训练样本的数目为380。
需要说明的是,训练节点的性能参数可以是上述参数中的多种参数共同组成,这种情况下,可以先确定每种参数对应的单位时长内处理训练样本的数目,然后计算确定出的多个单位时长内处理训练样本的数目的平均值,作为该训练节点的单位时长内处理训练样本的数目。或者,可以基于多种参数选取不同的训练节点进行试验,构建对应关系表。例如,可以针对CPU型号和GPU型号这两个参数选取不同的训练节点进行试验,使用这些训练节点分别处理训练样本得到梯度数据,经过试验得出这些训练节点在单位时长内可以处理的训练样本的数目,与CPU型号和GPU型号对应记录,得到对应关系表。
确定出训练节点的单位时长内处理训练样本的数目后,根据确定的单位时 长内处理训练样本的数目,确定该训练节点在预设时长内能够处理的训练样本的数目。可以将单位时长内处理训练样本的数目与预设数值的乘积,确定为每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize;或者,将单位时长内处理训练样本的数目,确定为每个训练节点在预设时长内能够处理的训练样本的数目,作为每个训练节点对应的BatchSize。例如,确定出训练节点的单位时长内处理训练样本的数目为380,且单位时长为10s,而上述预设时长设置为30s,则可以计算30s÷10s×380=1140,即该训练节点在预设时长内能够处理的训练样本的数目为1140。将计算得到的结果作为该训练节点对应的BatchSize。
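The BatchSize computation described above (the per-unit-duration sample count scaled to the preset duration, e.g. 380 × 30 s / 10 s = 1140) can be sketched as follows. The dictionary-based lookup tables and constant names are assumptions for illustration only; the table contents reuse the example values from Tables 1 and 3.

```python
# Sketch of step 502 under stated assumptions: the correspondence tables are
# modelled as dictionaries, and the BatchSize is the per-unit-duration sample
# count scaled to the preset duration.

CPU_MODEL_TABLE = {"Intel i3": 300, "Intel i5": 380, "AMD Ryzen 7 2700": 330}
GPU_MODEL_TABLE = {"Intel(R)HD Graphics 630": 350, "GeForce GTX 760": 200}

UNIT_SECONDS = 10.0      # duration the table values refer to (assumed)
PRESET_SECONDS = 30.0    # time budget for one synchronous training step (assumed)


def batch_size_from_cpu_model(cpu_model: str) -> int:
    """Look up samples-per-unit-duration and scale it to the preset duration."""
    per_unit = CPU_MODEL_TABLE[cpu_model]
    return int(per_unit * PRESET_SECONDS / UNIT_SECONDS)


print(batch_size_from_cpu_model("Intel i5"))  # 380 * 30 / 10 = 1140
```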
需要说明的是,上述步骤以中心节点确定一个训练节点的BatchSize为例进行说明,对于每个参与训练的训练节点,中心节点在确定其BatchSize时,均可按照上述处理步骤进行处理,本发明在此不做赘述。
在步骤503中,中心节点将确定出的BatchSize,分别发送给对应的训练节点。
一个可能的实施例中,通过上述步骤确定每个训练节点的BatchSize后,中心节点获取训练节点的节点标识以及该训练节点的BatchSize,然后根据该训练节点的节点标识,获取该训练节点对应的节点地址,根据该节点地址,向该训练节点发送该训练节点对应的BatchSize。
在步骤504中,训练节点接收中心节点发送的本训练节点对应的BatchSize。
一个可能的实施例中,通过上述步骤中心节点向各训练节点发送对应的BatchSize后,训练节点接收中心节点发送的BatchSize。然后根据接收到的BatchSize以及模型初始参数对预先存储的待训练的模型进行初始化。模型初始参数可以预先存储在训练节点中,使用时训练节点直接从自身的存储器中获取即可。模型初始参数也可以是存储在中心节点中,训练节点向中心节点发送初始参数获取请求,中心节点根据训练节点发送的初始参数获取请求,将预先存储的模型初始参数发送给训练节点,训练节点接收并存储模型初始参数,以此方式得到的。本发明对此不做限制。
在步骤505中,根据BatchSize,训练节点获取对应数目的训练样本。
一个可能的实施例中,接收到中心节点发送的BatchSize后,训练节点根据BatchSize获取对应数目的训练样本,开始处理训练样本进行训练。
需要说明的是,训练样本可以已知的任何一种训练数据集,具体是哪一种 训练数据集根据用户(即技术人员)的需求进行挑选,且所有参与使用训练样本对模型进行训练的训练节点,使用的训练样本均不相同。下面提供几种可选的训练节点获取训练样本的方案:
方案一、训练样本可以预先存储在中心节点中,当训练节点需要获取训练样本时,向中心节点发送训练样本获取请求,该训练样本获取请求中携带有该训练节点的节点标识以及BatchSize,中心节点接收到训练样本获取请求后,根据BatchSize获取到对应数目的训练样本后,根据节点标识向训练节点发送训练样本。其中,为了给每个训练节点分配不同的训练样本,中心节点可以采用顺序分配算法,按照训练样本集中的顺序给每个训练节点分配训练样本,也可以采用任何一种随机不重复分配算法,随机给各训练节点分配不重复的训练样本。或者,中心节点按照训练节点的数目,将存储的训练数据集分段,每段训练数据集对应一个训练节点。除了上述方案之外,任一可以使中心节点为每个训练节点分配不同的训练样本的方案皆可,本发明对此不做限制。
方案二、训练样本可以预先存储在每个训练节点中,训练节点可以根据BatchSize直接在各自的存储器中获取对应数目的训练样本。其中,为了使每个训练节点获取不同的训练样本,每个训练节点中存储的训练样本可以是同一个训练数据集中的不同部分,即每个训练节点中存储的训练样本不相同。或者,在每个训练节点中存储相同的训练数据集,但不同训练节点中的训练数据集带有不同的分段标识,每个训练节点只能够获取分段标识区分出来的训练数据集中的部分训练样本。除了上述方案之外,任一可以使每个训练节点获取不同的训练样本的方案皆可,本发明对此不做限制。
方案三、训练样本可以预先存储在独立的存储服务器中,当训练节点需要获取训练样本时,向存储服务器发送训练样本获取请求,该训练样本获取请求中携带有该训练节点的节点标识以及BatchSize,存储服务器接收到训练样本获取请求后,根据BatchSize获取到对应数目的训练样本后,根据节点标识向训练节点发送训练样本。存储服务器给每个训练节点分配不同训练样本的方案可以参考上述方案一的方案,本发明在此不做赘述。通常来说,存储服务器的处理速度和带宽比中心节点要高一些,因此,将训练样本存储在存储服务器的方案的训练效率要比将训练样本存储在中心节点的方案的训练效率高。
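As a sketch of allocation scheme 1 above (sequential, non-repeating assignment of training samples by the central node, which also applies to the storage server of scheme 3), the following hypothetical allocator hands out consecutive slices of the training set; the class name and in-memory dataset are illustrative assumptions.

```python
# Illustrative sketch: hand out consecutive, non-overlapping slices of the
# training set, one slice of `batch_size` samples per requesting node.

from typing import List, Sequence


class SequentialSampleAllocator:
    def __init__(self, dataset: Sequence):
        self._dataset = dataset
        self._cursor = 0

    def next_batch(self, batch_size: int) -> List:
        """Return the next `batch_size` samples, wrapping around at the end."""
        batch = []
        for _ in range(batch_size):
            batch.append(self._dataset[self._cursor % len(self._dataset)])
            self._cursor += 1
        return batch


allocator = SequentialSampleAllocator(dataset=list(range(1000)))
print(allocator.next_batch(4))   # [0, 1, 2, 3]  -> e.g. for a stronger node
print(allocator.next_batch(2))   # [4, 5]        -> e.g. for a weaker node
```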
在步骤506中,根据获取的训练样本,训练节点对待训练的模型进行训练处理。
一个可能的实施例中,通过上述步骤获取到训练样本后,训练节点使用获取到的训练样本,对待训练的模型进行训练处理。
可选地,上述训练样本可以包括样本输入数据和输出参考数据,如图6所示,根据训练样本,训练节点和中心节点对待训练的模型进行训练的处理步骤可以如下:在步骤601中,训练节点将获取的训练样本中的样本输入数据输入待训练的模型,得到训练样本对应的输出数据;在步骤602中,根据训练样本中的输出参考数据和输出数据,训练节点确定待训练的模型中的每个待训练参数对应的梯度数据;在步骤603中,训练节点将梯度数据发送给中心节点;在步骤604中,中心节点接收训练节点发送的梯度数据;在步骤605中,中心节点计算接收到的梯度数据的平均值;在步骤606中,根据平均值,中心节点确定待训练的模型的更新参数;在步骤607中,中心节点将更新参数发送给每个训练节点;在步骤608中,训练节点接收中心节点发送的更新参数,根据更新参数,对待训练的模型中的每个待训练参数进行参数更新。
一个可能的实施例中,各训练节点根据对应的BatchSize获取对应数目的训练样本后,以一个训练节点与中心节点的交互为例,假设BatchSize的数目为n,即该训练节点一次获取到n个训练样本,该训练节点将获取到的n个训练样本中的样本输入数据输入到待训练的模型中,待训练的模型同时对n个样本输入数据进行处理,输出n个输出数据。
需要说明的是,对于多个训练节点来说,由于上述步骤是根据每个训练节点的计算性能确定每个训练节点的BatchSize,计算性能强的训练节点同时处理的训练样本的数目多一些,计算性能弱的训练节点同时处理的训练样本的数目少一些,且根据计算每个训练节点的BatchSize的过程可以得知,每个训练节点根据BatchSize获取对应数目的训练样本并同时处理这些训练样本所耗时长为上述的预设时长,即每个训练节点所耗时间相同,这样可以避免出现计算资源的浪费,可以进一步提高训练效率。
然后,根据n个输出数据分别与各自对应的训练样本中的输出参考数据,以及计算损失函数值Loss的计算公式,计算n个输出数据各自对应的Loss。其中,输出数据和输出参考数据均可以是长向量的形式,且输出数据和输出参考数据的向量长度相同。优选地,可以根据交叉熵损失函数的计算公式来计算Loss,如下述公式(1)。
$$Loss = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1-y_i)\log(1-p_i) \,\right] \qquad (1)$$
其中,y表示输出参考数据,p表示输出数据,N表示输出参考数据或输出数据的向量长度。
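A small numerical sketch of formula (1), assuming the binary cross-entropy form reconstructed above (the original filing shows the formula only as an embedded image):

```python
# Element-wise binary cross-entropy between an output vector p and its
# reference vector y, averaged over the vector length N.

import numpy as np


def cross_entropy_loss(y: np.ndarray, p: np.ndarray, eps: float = 1e-12) -> float:
    p = np.clip(p, eps, 1.0 - eps)          # avoid log(0)
    n = y.shape[0]
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).sum() / n)


y = np.array([1.0, 0.0, 0.0])               # output reference data
p = np.array([0.8, 0.1, 0.1])               # model output data
print(cross_entropy_loss(y, p))             # ~0.145
```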
计算得到n个Loss后,计算n个Loss的平均值,作为这批训练样本的共同Loss。然后,可以根据共同Loss来判断是否需要继续训练。一种可行的判断方案可以是,判断共同Loss是否小于预设的Loss阈值,如果共同Loss小于预设的Loss阈值,则说明样本输入数据对应的输出数据和样本参考数据的差距很小,可以认为该模型已经训练完成,则停止训练,并将当前模型中的各待训练参数的参数值作为训练好的模型中的参数值,训练过程结束。另外一种可行的判断方案可以是,判断共同Loss是否收敛,即此次迭代的共同Loss与上次迭代的共同Loss是否发生变化,如果没有变化,则说明共同Loss收敛,继续训练也不能使共同Loss变小,因此,停止训练,训练结束。
如果共同Loss小于预设的Loss阈值,或共同Loss收敛,则此时训练节点可以向中心节点发送通知消息,以通知中心节点当前的模型满足训练结束条件,中心节点接收到通知消息后可以结束对模型的训练。
如果共同Loss大于或等于预设的Loss阈值,或共同Loss没收敛,则说明继续训练可以减小共同Loss,使得模型的精度提高,因此,可以继续训练。根据共同Loss以及模型中每个待训练参数的初始值,计算每个待训练参数的梯度数据,计算梯度数据的计算公式可以如下述公式(2)所示,然后将得到的梯度数据以及自身的节点标识发送给中心节点。
$$\Delta W_i = \frac{\partial Loss}{\partial W_i} \qquad (2)$$
其中，$W_i$ 为该待训练参数当前的参数值，Loss 为共同 Loss，$\Delta W_i$ 为梯度数据。
中心节点接收到训练节点发送的梯度数据和节点标识后,将接收到的至少一个节点标识与先前生成的节点标识表中的节点标识进行对比,如果接收到的节点标识与节点标识表中的节点标识相同,则说明参与模型训练的所有训练节点都完成了对训练样本的处理并返回了梯度数据。中心节点分别计算接收到的所有梯度数据中,每个待训练参数对应的梯度数据的平均值,最终得到所有待训练参数的梯度数据的平均值。然后,中心节点根据每个待训练参数的梯度数据的平均值以及预设的学习率,计算每个待训练参数的更新参数,计算公式可以如下述公式(3)。
$$W_{i+1} = W_i - \alpha \cdot \overline{\Delta W_i} \qquad (3)$$
其中，$W_{i+1}$ 为该待训练参数的更新参数，α 为预设的学习率，$\overline{\Delta W_i}$ 为梯度数据的平均值。
得到每个待训练参数的更新参数后,中心节点将更新参数分别发送给每个训练节点。训练节点接收到更新参数后,将待训练的模型中的各待训练参数更新为接收到的更新参数。这样,便完成了一次深度学习同步数据并行训练。各训练节点在对模型进行参数更新后,可以继续获取下一组BatchSize个训练样本,基于新获取的训练样本再次执行上述的训练过程。
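The central node's side of one synchronous round, i.e. averaging the per-parameter gradients received from all nodes and applying formula (3), can be sketched as follows. The dictionary-based representation of parameters and gradients is an assumption for illustration only.

```python
# Sketch of one synchronous update round: average each trainable parameter's
# gradient across all nodes, then apply W_{i+1} = W_i - alpha * mean_gradient.

from typing import Dict, List

Gradients = Dict[str, float]


def average_gradients(per_node_grads: List[Gradients]) -> Gradients:
    keys = per_node_grads[0].keys()
    return {k: sum(g[k] for g in per_node_grads) / len(per_node_grads) for k in keys}


def apply_update(params: Dict[str, float], mean_grads: Gradients, lr: float) -> Dict[str, float]:
    return {k: params[k] - lr * mean_grads[k] for k in params}


params = {"w1": 0.50, "w2": -0.20}
grads_from_nodes = [{"w1": 0.10, "w2": -0.40}, {"w1": 0.30, "w2": -0.20}]
mean_grads = average_gradients(grads_from_nodes)      # {'w1': 0.2, 'w2': -0.3}
print(apply_update(params, mean_grads, lr=0.1))       # {'w1': 0.48, 'w2': -0.17}
```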
需要说明的是,除了上述根据共同Loss判断是否需要继续训练之外,还可以根据训练精度或迭代次数来判断。
在对模型进行训练的过程中,可以将部分训练样本划分成测试样本,使用测试样本对模型进行测试,将得到的测试数据与对应的输出参考数据进行对比,确定得到的测试数据是否正确,然后计算得到的测试数据中正确的测试数据与错误的测试数据的比值,即为训练精度。其中,测试数据即为将测试样本中的样本输入数据输入到模型中得到的输出数据。训练精度表示一个模型输出的正确率。当训练精度达到预设的训练精度阈值时,说明该模型已经足够精确,无需继续训练,因此当训练精度达到预设的训练精度阈值时,停止训练,训练过程结束。此时训练节点可以向中心节点发送通知消息,以通知中心节点当前的模型满足训练结束条件,中心节点接收到通知消息后可以结束对模型的训练。
也可以根据训练精度是否收敛来确定是否需要继续训练,如果训练精度收敛,说明继续训练也无法提升训练精度,因此,当训练精度收敛时停止训练,训练过程结束。此时训练节点可以向中心节点发送通知消息,以通知中心节点当前的模型满足训练结束条件,中心节点接收到通知消息后可以结束对模型的训练。
另外,还可以预先设定迭代次数阈值,当训练的迭代次数达到迭代次数阈值时停止训练,训练过程结束。此时训练节点可以向中心节点发送通知消息,以通知中心节点当前的模型满足训练结束条件,中心节点接收到通知消息后可以结束对模型的训练。
除上述判断停止训练的方法外,任一可以有效判断停止训练的方法皆可,不限于上述例举的几种方法,本发明对此不做限制。
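The stop conditions discussed above (loss below a preset threshold, loss convergence, or reaching an iteration cap) can be combined into a single check, sketched here with example threshold values that are not prescribed by this application.

```python
# Illustrative combination of the training-stop conditions described above.

def should_stop(loss_history, max_iterations, loss_threshold=0.01, tol=1e-6):
    """Return True when any of the training-stop conditions is met."""
    if len(loss_history) >= max_iterations:                       # iteration cap reached
        return True
    if loss_history and loss_history[-1] < loss_threshold:        # loss small enough
        return True
    if len(loss_history) >= 2 and abs(loss_history[-1] - loss_history[-2]) < tol:
        return True                                               # loss has converged
    return False


print(should_stop([0.5, 0.2, 0.05], max_iterations=100))    # False
print(should_stop([0.5, 0.2, 0.005], max_iterations=100))   # True (below threshold)
```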
重复上述训练过程,直到达到上述停止训练的条件时,停止训练,每个训练节点均得到相同的、训练好的模型。
需要说明的是,上述中心节点可以是终端,也可以是服务器。训练节点可以是终端,也可以是服务器。中心节点和训练节点可以是各自独立的实体设备,中心节点仅用于和训练节点进行数据交互,本身不参与使用训练样本对待训练的模型进行训练的过程;中心节点和训练节点也可以是在实体设备中建立的虚拟模块,中心节点可以和训练节点集成在一个实体设备中,则中心节点和训练节点集成的实体设备既要与其他实体设备中的训练节点进行数据交互,也要参与使用训练样本对待训练的模型进行训练的过程,在这种情况下,中心节点所在的实体设备与其他训练节点所在的实体设备之间的数据交互可以采用并行通讯协议。
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的BatchSize,训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
基于相同的技术构思,本发明实施例还提供了一种更新参数的装置,该装置可以为上述实施例中的中心节点,如图7所示,该装置包括:获取模块710,确定模块720和发送模块730。
该获取模块710被配置为获取每个训练节点的性能参数;
该确定模块720被配置为根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;
该发送模块730被配置为将确定出的BatchSize,分别发送给对应的训练节点。
可选地,所述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
可选地,所述确定模块720被配置为:
根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述确定模块720被配置为:
将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,如图8所示,所述装置还包括:
接收模块740,被配置为将确定出的BatchSize,分别发送给对应的训练节点之后,接收所述每个训练节点发送的梯度数据;
计算模块750,被配置为计算接收到的梯度数据的平均值;
所述确定模块720,还被配置为根据所述平均值,确定待训练的模型的更新参数;
所述发送模块730,还被配置为将所述更新参数发送给每个训练节点。
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的样本数据批次数目BatchSize,使得训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
基于相同的技术构思,本发明实施例还提供了一种更新参数的装置,该装置可以为上述实施例中的训练节点,如图9所示,该装置包括:接收模块910, 获取模块920和训练模块930。
接收模块910,被配置为接收中心节点发送的本训练节点对应的BatchSize;
获取模块920,被配置为根据所述BatchSize,获取对应数目的训练样本;
训练模块930,被配置为根据获取的训练样本,对待训练的模型进行训练处理。
可选地,所述训练样本包括样本输入数据和输出参考数据;
所述训练模块930,被配置为:
将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
将所述梯度数据发送给所述中心节点;
接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
本发明实施例中,训练节点获取基于本训练节点的计算性能确定的BatchSize,并根据该BatchSize获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,这样每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
基于相同的技术构思,本发明实施例还提供了一种更新参数的系统,所述系统包括中心节点和训练节点,其中:
所述中心节点,用于获取每个训练节点的性能参数;根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;将确定出的BatchSize,分别发送给对应的训练节点;
所述训练节点,用于接收中心节点发送的本训练节点对应的BatchSize;根 据所述BatchSize,获取对应数目的训练样本;根据获取的训练样本,对待训练的模型进行训练处理。
在更新参数的系统中,中心节点和训练节点的功能可以参考上述实施例中的更新参数的方法中关于中心节点和训练节点的描述。
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的BatchSize,训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
需要说明的是:上述实施例提供的更新参数的装置在更新参数时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的更新参数的装置与更新参数的方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图10是本发明实施例提供的一种计算机设备的结构示意图,该计算机设备可以是上述实施例中的中心节点。该计算机设备1000可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)1001和一个或一个以上的存储器1002,其中,所述存储器1002中存储有至少一条指令,所述至少一条指令由所述处理器1001加载并执行以实现下述更新参数的方法步骤:
获取每个训练节点的性能参数;
根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;
将确定出的BatchSize,分别发送给对应的训练节点。
可选的,所述至少一条指令由所述处理器1001加载并执行以实现下述方法步骤:
根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选地,所述至少一条指令由所述处理器1001加载并执行以实现下述方法步骤:
将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
可选的,所述至少一条指令由所述处理器1001加载并执行以实现下述方法步骤:
接收所述每个训练节点发送的梯度数据;
计算接收到的梯度数据的平均值;
根据所述平均值,确定待训练的模型的更新参数;
将所述更新参数发送给每个训练节点。
本发明实施例中,中心节点根据每个训练节点的性能参数,确定每个训练节点的BatchSize,使得训练节点根据BatchSize,获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,使得每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算 完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
图11是本发明实施例提供的一种计算机设备的结构示意图,该计算机设备可以是上述实施例中的训练节点。该计算机设备1100可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)1101和一个或一个以上的存储器1102,其中,所述存储器1102中存储有至少一条指令,所述至少一条指令由所述处理器1101加载并执行以实现下述更新参数的方法步骤:
接收中心节点发送的本训练节点对应的BatchSize;
根据所述BatchSize,获取对应数目的训练样本;
根据获取的训练样本,对待训练的模型进行训练处理。
可选的,所述至少一条指令由所述处理器1101加载并执行以实现下述方法步骤:
将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
将所述梯度数据发送给所述中心节点;
接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
本发明实施例中,训练节点获取基于本训练节点的计算性能确定的BatchSize,并根据该BatchSize获取对应数目的训练样本,对待训练的模型进行训练处理。这样,根据训练节点的计算性能确定训练节点的BatchSize,计算性能强一些的训练节点的BatchSize大一些,计算性能弱一些的训练节点的BatchSize小一些,即计算性能强一些的训练节点同时使用数目较多的训练样本来计算梯度数据,计算性能弱一些的训练节点同时使用数目较少的训练样本来计算梯度数据,这样每个训练节点消耗的时间几乎相同,可以避免出现计算能力强的训练节点先完成计算,然后等待计算能力弱的训练节点计算完成才能进行后续处理的情况,避免浪费训练节点的计算资源,进而提高了训练的效率。
在示例性实施例中，还提供了一种计算机可读存储介质，存储介质中存储有至少一条指令、至少一段程序、代码集或指令集，至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现上述实施例中的更新参数的方法。例如，所述计算机可读存储介质可以是ROM、随机存取存储器（RAM）、CD-ROM、磁带、软盘和光数据存储设备等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (15)

  1. 一种更新参数的方法,其特征在于,所述方法包括:
    获取每个训练节点的性能参数;
    根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的样本数据批次数目BatchSize;
    将确定出的BatchSize,分别发送给对应的训练节点。
  2. 根据权利要求1所述的方法,其特征在于,所述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
  3. 根据权利要求1所述的方法,其特征在于,所述根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize,包括:
    根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize,包括:
    将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
    将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
  5. 根据权利要求1所述的方法,其特征在于,所述将确定出的BatchSize,分别发送给对应的训练节点之后,还包括:
    接收所述每个训练节点发送的梯度数据;
    计算接收到的梯度数据的平均值;
    根据所述平均值,确定待训练的模型的更新参数;
    将所述更新参数发送给每个训练节点。
  6. 一种更新参数的方法,其特征在于,所述方法包括:
    接收中心节点发送的本训练节点对应的BatchSize;
    根据所述BatchSize,获取对应数目的训练样本;
    根据获取的训练样本,对待训练的模型进行训练处理。
  7. 根据权利要求6所述的方法,其特征在于,所述训练样本包括样本输入数据和输出参考数据;
    所述根据获取的训练样本,对待训练的模型进行训练处理,包括:
    将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
    根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
    将所述梯度数据发送给所述中心节点;
    接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
  8. 一种更新参数的装置,其特征在于,所述装置包括:
    获取模块,用于获取每个训练节点的性能参数;
    确定模块,用于根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的样本数据批次数目BatchSize;
    发送模块,用于将确定出的BatchSize,分别发送给对应的训练节点。
  9. 根据权利要求8所述的装置,其特征在于,所述性能参数包括中央处理器CPU型号、CPU个数、图形处理器GPU型号、处理预设数目个训练样本所耗时长中的至少一个参数。
  10. 根据权利要求8所述的装置,其特征在于,所述确定模块,用于:
    根据预先存储的性能参数与单位时长内处理训练样本的数目的对应关系,以及所述每个训练节点的性能参数,确定对应的单位时长内处理训练样本的数目,根据所述单位时长内处理训练样本的数目,确定所述每个训练节点在预设 时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
  11. 根据权利要求10所述的装置,其特征在于,所述确定模块,用于:
    将所述单位时长内处理训练样本的数目与预设数值的乘积,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize;或者,
    将所述单位时长内处理训练样本的数目,确定为所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的BatchSize。
  12. 根据权利要求8所述的装置,其特征在于,所述装置还包括:
    接收模块,用于将确定出的BatchSize,分别发送给对应的训练节点之后,接收所述每个训练节点发送的梯度数据;
    计算模块,用于计算接收到的梯度数据的平均值;
    所述确定模块,还用于根据所述平均值,确定待训练的模型的更新参数;
    所述发送模块,还用于将所述更新参数发送给每个训练节点。
  13. 一种更新参数的装置,其特征在于,所述装置包括:
    接收模块,用于接收中心节点发送的本训练节点对应的BatchSize;
    获取模块,用于根据所述BatchSize,获取对应数目的训练样本;
    训练模块,用于根据获取的训练样本,对待训练的模型进行训练处理。
  14. 根据权利要求13所述的装置,其特征在于,所述训练样本包括样本输入数据和输出参考数据;
    所述训练模块,用于:
    将获取的训练样本中的样本输入数据输入待训练的模型,得到所述训练样本对应的输出数据;
    根据所述训练样本中的输出参考数据和所述输出数据,确定所述待训练的模型中的每个待训练参数对应的梯度数据;
    将所述梯度数据发送给所述中心节点;
    接收所述中心节点发送的更新参数,根据所述更新参数,对所述待训练的模型中的每个待训练参数进行参数更新。
  15. 一种更新参数的系统,其特征在于,所述系统包括中心节点和训练节 点,其中:
    所述中心节点,用于获取每个训练节点的性能参数;根据所述每个训练节点的性能参数,分别确定所述每个训练节点在预设时长内能够处理的训练样本的数目,作为所述每个训练节点对应的样本数据批次数目BatchSize;将确定出的BatchSize,分别发送给对应的训练节点;
    所述训练节点,用于接收中心节点发送的本训练节点对应的BatchSize;根据所述BatchSize,获取对应数目的训练样本;根据获取的训练样本,对待训练的模型进行训练处理。
PCT/CN2019/096774 2018-07-20 2019-07-19 更新参数的方法和装置 WO2020015734A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810803723.3A CN110737446B (zh) 2018-07-20 2018-07-20 更新参数的方法和装置
CN201810803723.3 2018-07-20

Publications (1)

Publication Number Publication Date
WO2020015734A1 true WO2020015734A1 (zh) 2020-01-23

Family

ID=69165013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096774 WO2020015734A1 (zh) 2018-07-20 2019-07-19 更新参数的方法和装置

Country Status (2)

Country Link
CN (1) CN110737446B (zh)
WO (1) WO2020015734A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298569A (zh) * 2010-06-24 2011-12-28 微软公司 在线学习算法的并行化
CN106203395A (zh) * 2016-07-26 2016-12-07 厦门大学 基于多任务深度学习的人脸属性识别方法
CN107203809A (zh) * 2017-04-20 2017-09-26 华中科技大学 一种基于Keras的深度学习自动化调参方法及系统
US20170308789A1 (en) * 2014-09-12 2017-10-26 Microsoft Technology Licensing, Llc Computing system for training neural networks

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184367B (zh) * 2014-06-09 2018-08-14 讯飞智元信息科技有限公司 深度神经网络的模型参数训练方法及系统
CN104035751B (zh) * 2014-06-20 2016-10-12 深圳市腾讯计算机系统有限公司 基于多图形处理器的数据并行处理方法及装置
CN106033371B (zh) * 2015-03-13 2019-06-21 杭州海康威视数字技术股份有限公司 一种视频分析任务的调度方法及系统
CN106991072B (zh) * 2016-01-21 2022-12-06 杭州海康威视数字技术股份有限公司 在线自学习事件检测模型更新方法及装置
CN107622274B (zh) * 2016-07-15 2020-06-02 北京市商汤科技开发有限公司 用于图像处理的神经网络训练方法、装置以及计算机设备
CN107665349B (zh) * 2016-07-29 2020-12-04 腾讯科技(深圳)有限公司 一种分类模型中多个目标的训练方法和装置
CN106293942A (zh) * 2016-08-10 2017-01-04 中国科学技术大学苏州研究院 基于多机多卡的神经网络负载均衡优化方法和系统
US20180144244A1 (en) * 2016-11-23 2018-05-24 Vital Images, Inc. Distributed clinical workflow training of deep learning neural networks


Also Published As

Publication number Publication date
CN110737446B (zh) 2021-10-12
CN110737446A (zh) 2020-01-31


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19838506

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19838506

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.08.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19838506

Country of ref document: EP

Kind code of ref document: A1