WO2023024252A1 - Network model training method and apparatus, electronic device, and readable storage medium - Google Patents

Network model training method and apparatus, electronic device, and readable storage medium

Info

Publication number
WO2023024252A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
data
model
network
parameter
Prior art date
Application number
PCT/CN2021/127535
Other languages
English (en)
French (fr)
Inventor
周镇镇
李峰
刘红丽
张潇澜
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Priority to US 18/259,682 (published as US20240265251A1)
Publication of WO2023024252A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06N 20/00 Machine learning

Definitions

  • the present application relates to the field of computer technology, in particular to a network model training method, a network model training device, electronic equipment, and a computer-readable storage medium.
  • at present, to speed up training, network models are usually built and trained on electronic devices with strong computing power, such as servers, and after training are sent to terminal devices such as mobile phones and personal computers to run; alternatively, a model may be designated to be trained on one kind of device and executed on another. Since server devices and terminal devices have different computing capabilities for the same type of network layer, the execution latencies of the layers of a given network usually differ across device kinds, so a network model trained on one kind of device exhibits high latency when running on another.
  • the purpose of this application is to provide a network model training method, a network model training apparatus, an electronic device, and a computer-readable storage medium, such that the finally trained target model has minimal latency when running on devices of other device types.
  • the application provides a network model training method, including:
  • the initial model includes an embedding layer, the embedding layer is constructed on the basis of preset network layer delay information, the preset network layer delay information includes mutually corresponding network layer types and at least two classes of delay data, and each class of delay data corresponds to a different device type;
  • the target model is obtained based on the initial model.
  • the generating process of the preset network layer delay information includes:
  • the preset network layer delay information is generated by using the correspondence between the second delay data, the network layer type of the network layer, and the device type.
  • the calculating a target loss value by using the target delay data, the training data and the output data includes:
  • the weighted summation is performed by using the precision loss value and the target delay data to obtain the target loss value.
  • the initial model is a hyperparameterized network constructed from a search space according to neural network model training rules; the network architecture of the initial model corresponds to a target directed acyclic graph, the target directed acyclic graph has several directed edges, and each directed edge has several branches;
  • Said inputting the training data into the initial model to obtain the output data includes:
  • the parameter adjustment of the initial model by using the target loss value includes:
  • the target parameter corresponding to the activation branch is updated using the target loss value; wherein the parameter type of the most recently updated historical parameter differs from that of the target parameter.
  • the determining the activation branch corresponding to each of the directed edges according to the target parameter includes:
  • if the target parameter is a weight parameter, the activation branch is determined randomly;
  • if the target parameter is an architecture parameter, the activation branch is selected according to a multinomial distribution sampling principle.
  • using the target loss value to update the target parameter corresponding to the activation branch includes:
  • if the target parameter is a weight parameter, the weight parameter of the activation branch is updated by stochastic gradient descent using the target loss value;
  • if the target parameter is an architecture parameter, an update parameter is calculated according to a preset update rule using the target loss value, and the architecture parameter of the activation branch is updated with it.
  • the obtaining the target model based on the initial model includes:
  • the obtaining of target delay data corresponding to other device types includes: inputting the current device type, each target network layer type, and target data into the embedding layer to obtain the target delay data corresponding to the target data;
  • the target data includes an input data scale and/or a target device type.
  • the application provides a network model training device, including:
  • the input module is used to obtain training data, and input the training data into the initial model to obtain output data;
  • the initial model includes an embedding layer, the embedding layer is constructed on the basis of preset network layer delay information, the preset network layer delay information includes mutually corresponding network layer types and at least two classes of delay data, and each class of delay data corresponds to a different device type;
  • a delay acquisition module configured to input the current device type and the target network layer type of the target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
  • a parameter adjustment module configured to use the target delay data, the training data and the output data to calculate a target loss value, and use the target loss value to perform parameter adjustment on the initial model
  • the model generation module is used to obtain the target model based on the initial model if the training completion condition is satisfied.
  • the present application provides an electronic device, including a memory and a processor, wherein:
  • the memory is used to store computer programs
  • the processor is configured to execute the computer program to implement the above-mentioned network model training method.
  • the present application provides a computer-readable storage medium for storing a computer program, wherein the above-mentioned network model training method is implemented when the computer program is executed by a processor.
  • in the network model training method provided by this application, training data is obtained and input into the initial model to obtain output data; the initial model includes an embedding layer constructed on the basis of preset network layer delay information, which includes mutually corresponding network layer types and at least two classes of delay data, each class corresponding to a different device type; the current device type and the target network layer type of each target network layer in the initial model are input into the embedding layer to obtain target delay data corresponding to other device types; a target loss value is calculated using the target delay data, the training data, and the output data, and the parameters of the initial model are adjusted with the target loss value; if the training completion condition is satisfied, the target model is obtained on the basis of the initial model.
  • when the initial model is trained, its parameters must be adjusted, and the loss value is the benchmark for adjusting them.
  • because different devices execute different types of network layers with different latencies, the current device type and the network layer type of each target network layer in the initial model can be input into the embedding layer, so that the embedding layer obtains, from the preset network layer delay information, the target delay data of each target network layer on non-current devices, and the target delay data then participates in computing the target loss value. That is, the target loss value is not based on the real latency the initial model incurs when processing the training data on the current device; instead, the theoretical latency the initial model would incur when processing the training data on devices of other device types replaces the real latency.
  • the resulting target loss value therefore reflects the execution latency of the initial model on devices of the other device types, and adjusting the parameters of the initial model with it makes the model better matched to those devices, simulating the effect of training the initial model directly on them, so that the finally trained target model has minimal latency when running on devices of the other device types. This solves the problem of relatively high latency in the related art.
  • the present application also provides a network model training device, an electronic device, and a computer-readable storage medium, which also have the above beneficial effects.
  • Fig. 1 is a flow chart of a network model training method provided by the embodiment of the present application.
  • FIG. 2 is a schematic diagram of a specific neural network architecture search process provided by the embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a network model training device provided in an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a flow chart of a network model training method provided in an embodiment of the present application.
  • the method includes:
  • S101 Obtain training data, and input the training data into the initial model to obtain output data.
  • the target model is trained on a certain type of device and invoked on another type of device, and the device for training the model and the device for invoking the model are of different device types.
  • the type, structure and application scenario of the target model are not limited, therefore, the structure and type of the initial model corresponding to the target model are also not limited.
  • the initial model refers to the unfinished target model, which can be determined as the target model during the training process of the initial model or after the initial model itself satisfies the training completion condition.
  • in one embodiment, the initial model is a model with a fixed structure, in which case the structure of the target model is also fixed; in another embodiment, the initial model is the initial hyperparameter network model of a neural network architecture search process, in which case the structure of the target model differs from that of the initial model, and the model structure of the initial model after neural network architecture search is the structure of the target model.
  • the content and type of the training data may differ according to the application scenario of the target model. Specifically, target models can be classified by purpose: the target model can be an image processing model for image processing, an audio processing model for audio processing, a classification model for classification, a clustering model for clustering, or a recommendation model for recommendation, etc.
  • accordingly, the content of the training data can differ; for example, it can be images, audio, or other data meeting the requirements of the model.
  • Neural Architecture Search (NAS) is a technique for automatically designing neural networks: algorithms automatically design high-performance network structures from a sample set, effectively reducing the cost of using and implementing neural networks.
  • the training data is used as the input of the initial model, and the initial model is used to process it to obtain the output data.
  • the output data may only include the final output data of the initial model.
  • the output data may also include intermediate data obtained during the initial model's processing of the training data.
  • S102 Input the current device type and the target network layer type of the target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types.
  • the initial model includes an embedding layer
  • the embedding layer is constructed based on preset network layer delay information
  • the preset network layer delay information includes corresponding network layer types and at least two types of delay data, and each type of delay data corresponds to different device types. That is, the preset network layer delay information includes multiple sets of information, and each set of information records at least two types of delay data and the corresponding relationship between network layer types corresponding to the two types of delay data.
  • the delay data is data representing the latency of a network layer type running on the corresponding device; its specific form is not limited. The same type of network layer runs with different latencies on different electronic devices, and the layer's parameters also affect the running latency.
  • when a model is trained on device A, the model is usually adjusted according to the execution latency of each network layer on device A; therefore, after training, when the model runs on a device B of a different type from device A, the running latency often cannot reach the minimum.
  • this application pre-generates preset network layer delay information, and builds an embedding layer based on it.
  • the embedding layer is located in the initial model, which can obtain the target delay data corresponding to other device types according to the current device type and the target network layer type of each target network layer in the initial model, so that the target delay data can be used to construct the target loss value later, and then The model parameters are tuned using the target loss value.
  • the current device type refers to the type of device used to train the initial model to obtain the target model, and its specific form is not limited.
  • it can be, for example, a server or a personal computer, or a specific model of server or a specific model of personal computer.
  • the target network layer refers to the network layer in the initial model
  • the target network layer type refers to the specific type of the target network layer, for example, it can be a 3*3 convolutional layer, or it can be a pooling layer, etc.
  • the preset network layer delay information can be mapped to obtain a corresponding one-dimensional array, which is equivalent to reducing the dimensionality of the sparse matrix corresponding to the preset network layer delay information, and building an embedding layer based on the vector.
  • the embedding layer can retrieve and map according to the input information, and obtain corresponding target delay data of other device types.
  • Other device types refer to non-current device types.
  • the preset network layer delay information only includes two types of delay data. In this case, after the current device type is input, only one type of delay data remains as the target delay data. In another embodiment, the preset network layer delay information may include more than or equal to two types of delay data. In this case, after the current device type is input, there are still at least two types of delay data remaining as target delay data. At this time, one of the at least two types of delayed data may also be selected as the target delayed data as required.
  • the process of inputting the current device type and the target network layer type of the target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types may include the following steps:
  • Step 11 Input the current device type, each target network layer type, and target data into the embedding layer to obtain target delay data corresponding to the target data.
  • target data in addition to inputting the current device type and target network layer type into the embedding layer, target data may also be input into the embedding layer, so as to select appropriate target delay data as required.
  • the target data includes input data scale and/or target device type.
  • the target device type is the type of device on which the target model will be invoked. Since the scale of the input data also affects the latency of a network layer, the target data may further include the input data scale so as to obtain more accurate target delay data.
  • the generation process of preset network layer delay information includes:
  • Step 21 Determine several network layers and several preset network models with each network layer.
  • Step 22 Train each preset network model on a device corresponding to each device type to obtain first delay data corresponding to each device type.
  • Step 23 Using the first delay data to obtain second delay data corresponding to the network layer.
  • Step 24 Using the correspondence between the second delay data, the network layer type of the network layer, and the device type, generate preset network layer delay information.
  • the execution delay of the network layer is affected by its parameters, it is impossible to directly use a single network layer to determine its delay, and the entire network model needs to be called to determine it.
  • several types of network layers are selected to generate preset network layer delay information.
  • one type of network layer may appear in multiple different types of network models; therefore, after the network layer is determined, the several preset network models containing it can be further determined.
  • by training the preset network models on several devices of different types, the overall invocation delay of each preset network model on each device can be determined.
  • the first delay data refers to call delay data of a preset network model.
  • a network layer's delay is one part of the first delay data, and the second delay data corresponding to each type of network layer can be obtained from the first delay data.
  • the preset network layer delay information is generated by utilizing the corresponding relationship between the second delay data and the network layer type and device type of the network layer.
  • for example, the network layer conv3*3 appears in models such as R-CNN and Fast R-CNN, so models containing conv3*3 can be determined as preset network models.
  • conv3*3 layers with different input data scales in the preset network models can also be regarded as different types of network layers.
  • the input data scale may include channel number (C-channel), height (H-height), width (W-width) and depth (D-depth).
  • the delay (Latency) corresponding to various types of conv3*3 can be obtained through training, and the statistics can be shown in Table 1:
  • L server represents server-side delay
  • L user represents client-side delay, both of which are second delay data.
  • the input data scales of different network layers differ; that is, the network layer R-CNN-1 conv3*3 and the network layer R-CNN-2 conv3*3 have different input data scales. Based on the above method, the same processing is applied to the other selected network layers to obtain the complete preset network layer delay information.
  • S103 Calculate a target loss value by using the target delay data, training data and output data, and use the target loss value to adjust parameters of the initial model.
  • after the target delay data is obtained, it can be used together with the training data and the output data to calculate the target loss value, which is then used to adjust the parameters of the initial model.
  • the specific calculation method of the target loss value is not limited in this embodiment.
  • for example, the precision loss value can be calculated using the training data and the output data, and the target loss value generated from the precision loss value and the target delay data. That is, the process of calculating the target loss value using the target delay data, the training data, and the output data includes:
  • Step 31 Obtain the precision loss value by using the training data and the output data.
  • Step 32 Perform weighted summation using the precision loss value and the target delay data to obtain the target loss value.
  • the target loss value is calculated by means of weighted summation. In other implementation manners, other calculation methods may also be used to generate the target loss value.
  • the training completion condition refers to the condition for determining the end of the training process of the initial model, which may limit the initial model itself, or may limit the training process of the initial model. For example, it may be a condition that restricts the degree of convergence of the initial model, recognition accuracy, etc., or it may be a condition that restricts the training duration, training rounds, and the like. If the training completion condition is satisfied, the target model can be obtained based on the initial model. This embodiment does not limit the specific method of obtaining the target model. For example, the embedding layer in the initial model can be removed to obtain the target model.
  • applying the network model training method provided by the embodiments of this application, the initial model includes an embedding layer constructed on the basis of preset network layer delay information, which includes mutually corresponding network layer types and at least two classes of delay data.
  • when the initial model is trained, its parameters must be adjusted, and the loss value is the benchmark for adjusting them.
  • because different devices execute different types of network layers with different latencies, the current device type and the network layer type of each target network layer in the initial model can be input into the embedding layer, so that the embedding layer obtains, from the preset network layer delay information, the target delay data of each target network layer on non-current devices, and the target delay data then participates in computing the target loss value. That is, the target loss value is not based on the real latency the initial model incurs on the current device, but on the theoretical latency it would incur on devices of other device types, which replaces the real latency.
  • the resulting target loss value reflects the execution latency of the initial model on devices of the other device types; adjusting the initial model's parameters with it makes the model better matched to those devices, simulating the effect of training directly on them, so the finally trained target model has minimal latency when running on devices of the other device types, solving the problem of relatively high latency in the related art.
  • the initial model is a hyperparameterized network constructed from a search space according to neural network model training rules.
  • the network architecture of the initial model corresponds to a target directed acyclic graph; the target directed acyclic graph has several directed edges, and each directed edge has several branches.
  • each branch can be evaluated, and finally selected and cropped to obtain the target model.
  • the neural network model training rules are used to generate the most initial model, i.e., the initial model, when performing network architecture search.
  • the search space refers to the type of neural network that can be searched, and also defines how to describe the neural network structure.
  • the search space includes network layers such as MBConv3*3_1 (indicating a 3*3 convolution kernel with stride 1), MBConv3*3_2, MBConv3*3_3, MBConv3*3_4, MBConv3*3_5, MBConv3*3_6, MBConv5*5_1, MBConv5*5_2, MBConv5*5_3, MBConv5*5_4, MBConv5*5_5, MBConv5*5_6, MBConv7*7_1, MBConv7*7_2, MBConv7*7_3, MBConv7*7_4, MBConv7*7_5, MBConv7*7_6, Identity, and Zero, where Identity is a placeholder layer and Zero is a zero-operation layer.
  • for an input x, the mixed operation m_o is defined as the output of the N paths: m_o^NAS(x) = Σ_{i=1}^{N} p_i · o_i(x), where p_i = exp(a_i) / Σ_j exp(a_j), and m_o^NAS(x) is the output.
  • a_i is an architecture parameter; an architecture parameter is a parameter used for model architecture selection, which participates in the selection and pruning of branches. Each branch corresponds to one architecture parameter.
  • FIG. 2 is a schematic diagram of a specific neural network architecture search process provided by an embodiment of the present application.
  • the training data is input into the initial model, and the process of obtaining the output data may include:
  • Step 41 Determine the target parameters.
  • Step 42 Determine the activation branch corresponding to each directed edge according to the target parameter, and use the activation branch to process the training data to obtain output data.
  • the target parameter may be a weight parameter or an architecture parameter
  • the weight parameter refers to a parameter representing a weight of a branch, which is used to select a branch in cooperation with the architecture parameter.
  • one of the branches of each directed edge is selected as the activation branch, and the training data is processed by the activation branch to obtain the output data.
  • a binarization gate function may be used to select and activate branches.
  • the binarization gate function is: g = binarize(p_1, ..., p_n) = [1, 0, ..., 0] with probability p_1; ...; [0, 0, ..., 1] with probability p_n.
  • p_1 to p_n are the probability values generated from the architecture parameters of the branches, and p is the parameter used to select the activation branch. The specific content of g is determined from the relation between the value p and the probability values, and g then indicates whether each branch is activated.
  • the simplified mixed operation is m_o^binary(x) = Σ_{i=1}^{N} g_i · o_i(x), where m_o^binary(x) is the simplified m_o^NAS(x).
  • the specific way of selecting p may differ according to the type of the target parameter.
  • the process of determining the activation branch corresponding to each directed edge includes:
  • Step 51 If the target parameter is a weight parameter, randomly determine the activation branch.
  • Step 52 If the target parameter is an architecture parameter, select an activation branch according to the multinomial distribution sampling principle.
  • if the target parameter is a weight parameter, the activation branch is selected randomly. Specifically, a random number generator can pick one value from the set formed by p_1 to p_n as the value of p, from which the value g of the binarization gate function is determined, completing the selection of the activation branch. If the target parameter is an architecture parameter, the activation branches can be selected according to the multinomial distribution sampling principle: each time, two of the N branches are selected as activation branches and the other branches are masked.
  • the process of adjusting the parameters of the initial model using the target loss value may include:
  • Step 43 Utilize the target loss value to update the target parameter corresponding to the active branch.
  • the parameter type of the last updated historical parameter is different from that of the target parameter, that is, the weight parameter and the architecture parameter are updated alternately. It can be understood that by ensuring that the number of active branches is much smaller than the number of all branches, the consumption required for the training process of the initial model is reduced to the level of training a common network model, reducing the consumption of computing resources.
  • the process of using the target loss value to update the target parameter corresponding to the activation branch includes:
  • Step 61 If the target parameter is a weight parameter, use the target loss value to update the weight parameter of the active branch by stochastic gradient descent.
  • Step 62 If the target parameter is an architecture parameter, calculate an update parameter according to a preset update rule using the target loss value, and use the update parameter to update the architecture parameter of the active branch.
  • if the target parameter is a weight parameter, it can be updated using the stochastic gradient descent method.
  • if the target parameter is an architecture parameter, a preset update rule is provided that specifies the calculation of the update parameter. This embodiment does not limit the specific content of the preset update rule; in one implementation, it is: ∂L/∂a_i = Σ_{j=1}^{N} (∂L/∂g_j) · p_j · (δ_ij − p_i), where δ_ij = 1 when i = j and δ_ij = 0 otherwise, and L is the target loss value.
  • the specific calculation process of the target loss value can be as follows: the embedding layer obtains the client-side latency from its input, i.e., L_User = f(T, C, H, W, D) · L_Server, where T is the network layer type and f(T, C, H, W, D) is a mapping matrix formed by the preset network layer delay information, essentially a lookup table over all network layers.
  • for one module of the initial model, the latency of the module is E[latency_i] = Σ_j p_i^j · F(o_i^j), where E[latency_i] denotes the latency of the i-th module, F denotes the above mapping matrix, and F(o_i^j) denotes the predicted latency of branch o_i^j.
  • since the embedding layer performs this lookup, E[latency_i] can be equivalently written as E[latency_i] = Σ_j p_i^j · Embedding(o_i^j), where Embedding denotes the embedding-layer processing. The total latency over all modules, i.e., the target delay data, is E[latency] = Σ_i E[latency_i].
  • the application combines the delay loss and the precision loss, and additionally applies weight decay, to generate the target loss value: loss = λ_1 · loss_CE + λ_2 · E[latency] + ω · ||w||_2^2, where loss is the L above, i.e., the target loss value, λ_1 and λ_2 are weight values, ω is the weight decay constant, and loss_CE is the precision loss.
  • the process of obtaining the target model based on the initial model may include:
  • Step 71 Calculate the branch weight corresponding to each branch by using the architecture parameter and the weight parameter.
  • Step 72 Determine the highest branch weight of each directed edge, and cut out the branches corresponding to the non-highest branch weight in the initial model to obtain the target model.
  • This embodiment does not limit the specific calculation method of the branch weight, for example, it can be obtained by multiplying the architecture parameter and the weight parameter. After obtaining the branch weights, the model structure composed of the branches corresponding to the highest branch weight is the best, so the branches corresponding to the non-highest branch weights are cut off, and the model formed by the remaining branches is the target model.
  • the NAS method provided by this application can perform network architecture search directly on the CIFAR10 dataset or the ImageNet dataset and the target hardware.
  • the backbone network of the hyperparameterized network uses PyramidNet, DenseNet, MobileNet, or other classic networks, with beneficial modifications; for example, the 3*3 Conv layer (3*3 convolutional layer) in PyramidNet is optimized into a tree structure of depth 3 whose leaf nodes have 2 branches.
  • the training set and the validation set are split at a ratio of 0.7 to 0.3, and the network architecture search is performed.
  • the optimizer may be chosen from algorithms such as Adam, SGD, Momentum, NAG, and AdaGrad; a gradient-based algorithm differentiates the loss function and iteratively updates the parameters of the hyperparameterized network and the architecture.
  • the network model training device provided in the embodiment of the present application is introduced below, and the network model training device described below and the network model training method described above can be referred to in correspondence.
  • FIG. 3 is a schematic structural diagram of a network model training device provided in an embodiment of the present application, including:
  • the input module 110 is used to obtain training data, and input the training data into the initial model to obtain output data;
  • the initial model includes an embedding layer, and the embedding layer is constructed based on preset network layer delay information, and the preset network layer delay information includes corresponding network layer types and at least two types of delay data, and each type of delay data corresponds to different device types;
  • the delay acquisition module 120 is used to input the target network layer type of the current device type and the target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
  • the parameter adjustment module 130 is used to calculate the target loss value by using the target delay data, training data and output data, and use the target loss value to adjust the parameters of the initial model;
  • the model generation module 140 is configured to obtain the target model based on the initial model if the training completion condition is satisfied.
  • a preset network model determination module, configured to determine several network layers and several preset network models containing each network layer;
  • the first delay data acquisition module is used to train each preset network model on the device corresponding to each device type, and obtain the first delay data corresponding to each device type;
  • the second delay data acquisition module is used to obtain the second delay data corresponding to the network layer by using the first delay data
  • the preset network layer delay information generation module is used to generate preset network layer delay information by utilizing the correspondence between the second delay data, the network layer type of the network layer, and the device type.
  • the parameter adjustment module 130 includes:
  • a precision loss value calculation unit is used to obtain a precision loss value by using the training data and the output data;
  • the weighted summing unit is configured to perform weighted summation using the precision loss value and the target delay data to obtain the target loss value.
  • the initial model is a hyperparameter network constructed based on the training rules of the neural network model and using the search space.
  • the network architecture of the initial model corresponds to the target directed acyclic graph, and the target directed acyclic graph has several directed edges , each directed edge has several branches;
  • a parameter determination unit configured to determine target parameters
  • the branch activation unit is used to determine the activation branch corresponding to each directed edge according to the target parameter, and use the activation branch to process the training data to obtain output data;
  • the parameter adjustment module 130 includes:
  • the update unit is configured to use the target loss value to update the target parameter corresponding to the activation branch; wherein, the parameter type of the last updated historical parameter is different from that of the target parameter.
  • the branch activation unit includes:
  • a random activation subunit is used to randomly determine the activation branch if the target parameter is a weight parameter
  • the distribution sampling subunit is used to select the activation branch according to the multinomial distribution sampling principle if the target parameter is an architecture parameter.
  • the update unit includes:
  • the stochastic gradient update subunit is used to update the weight parameter of the activation branch by using the target loss value and using the stochastic gradient descent method if the target parameter is a weight parameter;
  • the rule updating sub-unit is configured to use the target loss value to calculate an update parameter according to a preset update rule if the target parameter is an architecture parameter, and use the update parameter to update the architecture parameter of the active branch.
  • model generation module 140 includes:
  • a weight calculation unit configured to calculate branch weights corresponding to each branch by using architecture parameters and weight parameters
  • the clipping unit is configured to determine the highest branch weight of each directed edge, and clip branches corresponding to non-highest branch weights in the initial model to obtain the target model.
  • a preset network layer delay information generating module includes:
  • the delay data selection unit is configured to input the current device type, each target network layer type and target data into the embedding layer to obtain target delay data corresponding to the target data; wherein the target data includes input data scale and/or target device type.
  • the electronic device provided by the embodiment of the present application is introduced below, and the electronic device described below and the network model training method described above can be referred to in correspondence.
  • the electronic device 100 may include a processor 101 and a memory 102 , and may further include one or more of a multimedia component 103 , an information input/information output (I/O) interface 104 and a communication component 105 .
  • the processor 101 is used to control the overall operation of the electronic device 100, so as to complete all or part of the steps in the above-mentioned network model training method;
  • the memory 102 is used to store various types of data to support the operation of the electronic device 100; such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data.
  • the memory 102 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
  • Multimedia components 103 may include screen and audio components.
  • the screen can be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals.
  • an audio component may include a microphone for receiving external audio signals.
  • the received audio signal may be further stored in the memory 102 or sent via the communication component 105 .
  • the audio component also includes at least one speaker for outputting audio signals.
  • the I/O interface 104 provides an interface between the processor 101 and other interface modules, which may be a keyboard, a mouse, buttons, and the like. These buttons can be virtual buttons or physical buttons.
  • the communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices.
  • wireless communication includes, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, so the corresponding communication component 105 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
  • the electronic device 100 may be implemented by one or more Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for executing the network model training method given in the above embodiments.
  • the computer-readable storage medium provided by the embodiment of the present application is introduced below, and the computer-readable storage medium described below and the network model training method described above can be referred to in correspondence.
  • the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the above-mentioned network model training method are realized.
  • the computer-readable storage medium may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
  • each embodiment in this specification is described in a progressive manner, each embodiment focuses on the difference from other embodiments, and the same or similar parts of each embodiment can be referred to each other.
  • the description is relatively simple, and for the related information, please refer to the description of the method part.
  • a software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other known form of storage medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer And Data Communications (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses a network model training method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: obtaining training data, and inputting the training data into an initial model to obtain output data, wherein the initial model includes an embedding layer, the embedding layer is constructed on the basis of preset network layer delay information, the preset network layer delay information includes mutually corresponding network layer types and at least two classes of delay data, and each class of delay data corresponds to a different device type; inputting the current device type and the target network layer type of each target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types; calculating a target loss value using the target delay data, the training data, and the output data, and adjusting the parameters of the initial model using the target loss value; and, if a training completion condition is satisfied, obtaining a target model on the basis of the initial model. The method gives the target model minimal latency when it runs on devices of the other device types.

Description

Network model training method and apparatus, electronic device, and readable storage medium
This application claims priority to Chinese patent application No. 202110971264.1, filed with the China National Intellectual Property Administration on August 24, 2021 and entitled "Network model training method and apparatus, electronic device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a network model training method, a network model training apparatus, an electronic device, and a computer-readable storage medium.
Background
At present, to speed up training, network models are usually built and trained on electronic devices with strong computing power, such as servers, and after training are sent to terminal devices such as mobile phones and personal computers to run; alternatively, a model may be designated to be trained on one kind of device and executed on another. Because server devices and terminal devices have different computing capabilities for the same type of network layer, the execution latencies of the layers of a given network usually differ across device kinds, so a network model trained on one kind of device exhibits high latency when running on another.
Summary
In view of this, the purpose of this application is to provide a network model training method, a network model training apparatus, an electronic device, and a computer-readable storage medium, such that the finally trained target model has minimal latency when running on devices corresponding to other device types.
To solve the above technical problem, this application provides a network model training method, including:
obtaining training data, and inputting the training data into an initial model to obtain output data;
wherein the initial model includes an embedding layer, the embedding layer is constructed on the basis of preset network layer delay information, the preset network layer delay information includes mutually corresponding network layer types and at least two classes of delay data, and each class of delay data corresponds to a different device type;
inputting the current device type and the target network layer type of each target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
calculating a target loss value using the target delay data, the training data, and the output data, and adjusting parameters of the initial model using the target loss value;
if a training completion condition is satisfied, obtaining a target model on the basis of the initial model.
Optionally, the generation process of the preset network layer delay information includes:
determining several network layers, and several preset network models containing each of the network layers;
training each preset network model on a device corresponding to each device type to obtain first delay data corresponding to each device type;
obtaining, from the first delay data, second delay data corresponding to each network layer;
generating the preset network layer delay information from the correspondence among the second delay data, the network layer type of each network layer, and the device type.
Optionally, calculating the target loss value using the target delay data, the training data, and the output data includes:
obtaining a precision loss value using the training data and the output data;
performing a weighted summation of the precision loss value and the target delay data to obtain the target loss value.
Optionally, the initial model is a hyperparameterized network constructed from a search space according to neural network model training rules; the network architecture of the initial model corresponds to a target directed acyclic graph, the target directed acyclic graph has several directed edges, and each directed edge has several branches;
inputting the training data into the initial model to obtain the output data includes:
determining a target parameter;
determining, according to the target parameter, the activation branch corresponding to each directed edge, and processing the training data with the activation branches to obtain the output data;
correspondingly, adjusting the parameters of the initial model using the target loss value includes:
updating, using the target loss value, the target parameter corresponding to the activation branch; wherein the parameter type of the most recently updated historical parameter differs from that of the target parameter.
Optionally, determining the activation branch corresponding to each directed edge according to the target parameter includes:
if the target parameter is a weight parameter, determining the activation branch randomly;
if the target parameter is an architecture parameter, selecting the activation branch according to a multinomial distribution sampling principle.
Optionally, updating the target parameter corresponding to the activation branch using the target loss value includes:
if the target parameter is a weight parameter, updating the weight parameter of the activation branch by stochastic gradient descent using the target loss value;
if the target parameter is an architecture parameter, calculating an update parameter according to a preset update rule using the target loss value, and updating the architecture parameter of the activation branch with the update parameter.
Optionally, obtaining the target model on the basis of the initial model includes:
calculating the branch weight corresponding to each branch using the architecture parameters and the weight parameters;
determining the highest branch weight of each directed edge, and pruning the branches of the initial model that do not correspond to the highest branch weights to obtain the target model.
Optionally, inputting the current device type and the target network layer type of each target network layer in the initial model into the embedding layer to obtain the target delay data corresponding to other device types includes:
inputting the current device type, each target network layer type, and target data into the embedding layer to obtain the target delay data corresponding to the target data; wherein the target data includes an input data scale and/or a target device type.
This application provides a network model training apparatus, including:
an input module, configured to obtain training data and input the training data into an initial model to obtain output data;
wherein the initial model includes an embedding layer, the embedding layer is constructed on the basis of preset network layer delay information, the preset network layer delay information includes mutually corresponding network layer types and at least two classes of delay data, and each class of delay data corresponds to a different device type;
a delay acquisition module, configured to input the current device type and the target network layer type of each target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
a parameter adjustment module, configured to calculate a target loss value using the target delay data, the training data, and the output data, and to adjust parameters of the initial model using the target loss value;
a model generation module, configured to obtain a target model on the basis of the initial model if a training completion condition is satisfied.
This application provides an electronic device, including a memory and a processor, wherein:
the memory is configured to store a computer program;
the processor is configured to execute the computer program to implement the above network model training method.
This application provides a computer-readable storage medium for storing a computer program, wherein the above network model training method is implemented when the computer program is executed by a processor.
In the network model training method provided by this application, training data is obtained and input into an initial model to obtain output data; the initial model includes an embedding layer constructed on the basis of preset network layer delay information, which includes mutually corresponding network layer types and at least two classes of delay data, each class corresponding to a different device type; the current device type and the target network layer type of each target network layer in the initial model are input into the embedding layer to obtain target delay data corresponding to other device types; a target loss value is calculated using the target delay data, the training data, and the output data, and the parameters of the initial model are adjusted with the target loss value; if a training completion condition is satisfied, a target model is obtained on the basis of the initial model.
As can be seen, in this method the initial model includes an embedding layer constructed from preset network layer delay information that records mutually corresponding network layer types and at least two classes of delay data. When the initial model is trained, its parameters must be adjusted, and the loss value is the benchmark for that adjustment. Because different devices execute different types of network layers with different latencies, when training on the current device, in order to obtain a target model with low latency on other devices, the current device type and the network layer type of each target network layer in the initial model can be input into the embedding layer, so that the embedding layer obtains, from the preset network layer delay information, the target delay data of each target network layer on non-current devices, and the target delay data then participates in computing the target loss value. That is, the target loss value is not based on the real latency the initial model incurs when processing the training data on the current device; instead, the theoretical latency the initial model would incur when processing the training data on devices of other device types replaces the real latency. Since the adopted target delay data corresponds not to the current device type but to the other device types, the resulting target loss value reflects the execution latency of the initial model on devices of those other types. Adjusting the initial model's parameters with it makes the initial model better matched to those devices, simulating the effect of training the initial model directly on them, so that the finally trained target model has minimal latency when running on devices of the other device types. This solves the problem of relatively high latency in the related art.
In addition, this application also provides a network model training apparatus, an electronic device, and a computer-readable storage medium, which likewise have the above beneficial effects.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of this application or the related art more clearly, the drawings required in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are merely embodiments of this application; for a person of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a flowchart of a network model training method provided by an embodiment of this application;
Fig. 2 is a schematic diagram of a specific neural network architecture search process provided by an embodiment of this application;
Fig. 3 is a schematic structural diagram of a network model training apparatus provided by an embodiment of this application;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description
To make the purposes, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
Referring to Fig. 1, Fig. 1 is a flowchart of a network model training method provided by an embodiment of this application. The method includes:
S101: Obtain training data, and input the training data into the initial model to obtain output data.
It should be noted that, in this application, the target model is trained on one kind of device and invoked on another kind, and the device that trains the model and the device that invokes it have different device types. The type, structure, and application scenario of the target model are not limited, so neither are the structure and type of the corresponding initial model. The initial model is the not-yet-finished target model; once the training completion condition is satisfied during the training process or by the initial model itself, it can be determined as the target model. In one embodiment, the initial model is a model with a fixed structure, in which case the structure of the target model is also fixed; in another embodiment, the initial model is the initial hyperparameter network model of a neural network architecture search process, in which case the structure of the target model differs from that of the initial model, and the model structure of the initial model after neural network architecture search is the structure of the target model.
As for the specific content of the training data, it is understood that its content and type may vary with the application scenario of the target model. Specifically, target models can be classified by purpose: for example, the target model may be an image processing model for image processing, an audio processing model for audio processing, a classification model for classification, a clustering model for clustering, a recommendation model for recommendation, and so on. Depending on the purpose of the target model, the training data may be images, audio, or other data matching the model's intended use.
With the flourishing of deep learning, and of neural networks in particular, the era of traditional machine-learning feature engineering has been overturned and the wave of artificial intelligence pushed to a historical peak. Yet although neural network models emerge endlessly, the higher a model's performance, the stricter its requirements on hyperparameters tend to be, and a slight difference can make a paper's results irreproducible. Network structure, as a special kind of hyperparameter, plays a pivotal role throughout deep learning. Structures such as the ResNet model, which shines in image classification tasks, and the Transformer model, which dominates machine translation, all come from careful expert design. Behind these refined network structures lie deep theoretical research and extensive experiments, which undoubtedly poses a new challenge. Neural Architecture Search (NAS) is a technique for automatically designing neural networks: algorithms automatically design high-performance network structures from a sample set, which can effectively reduce the cost of using and implementing neural networks.
The training data is used as the input of the initial model, which processes it to produce the output data. The output data may include only the final output of the initial model; in another embodiment, it may also include intermediate data produced while the initial model processes the training data.
S102: Input the current device type and the target network layer type of each target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types.
The initial model includes an embedding layer constructed on the basis of preset network layer delay information, which includes mutually corresponding network layer types and at least two classes of delay data, each class corresponding to a different device type. That is, the preset network layer delay information comprises multiple groups of information, each group recording at least two classes of delay data and the correspondence between those classes and a network layer type. It should be noted that delay data is data characterizing the latency of a network layer type running on the corresponding device; its specific form is not limited. The same type of network layer runs with different latencies on different electronic devices, and the layer's parameters also affect its running latency. When a model is trained on device A, it is usually adjusted according to the execution latency of each network layer on device A; consequently, after training, when the model runs on a device B of a different type from device A, its running latency usually cannot reach the minimum.
To solve this problem, so that a model trained on device A can also reach minimal latency on device B, this application pre-generates the preset network layer delay information and builds an embedding layer on it. The embedding layer resides in the initial model; from the current device type and the target network layer type of each target network layer in the initial model, it yields the target delay data corresponding to the other device types, so that the target delay data can later be used to construct the target loss value, which in turn adjusts the model parameters.
Here, the current device type is the type of the device used to train the initial model into the target model; its specific form is not limited, for example a server or a personal computer, or a specific model of server or personal computer. A target network layer is a network layer in the initial model, and the target network layer type is its specific type, for example a 3*3 convolutional layer or a pooling layer. The preset network layer delay information can be mapped into a corresponding one-dimensional array, which amounts to reducing the dimensionality of the sparse matrix corresponding to the preset network layer delay information, and the embedding layer is built on that vector, as sketched below. The embedding layer retrieves and maps according to the input information to obtain the corresponding target delay data of other device types. Other device types are the non-current device types.
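By way of a non-limiting illustration, the lookup just described can be sketched in Python as follows. The device vocabulary, layer vocabulary, class name, and latency values below are assumptions for illustration only, not part of the original disclosure:

```python
# A minimal sketch of the latency embedding: the sparse (device type, layer
# type) -> latency table is flattened into a dense 1-D weight vector and
# queried like an embedding layer.
import torch
import torch.nn as nn

DEVICE_TYPES = ["server", "client"]           # assumed device vocabulary
LAYER_TYPES = ["conv3x3", "conv5x5", "pool"]  # assumed layer vocabulary

class LatencyEmbedding(nn.Module):
    def __init__(self, latency_table):
        super().__init__()
        # Flatten the (device, layer) latency matrix: one slot per pair.
        weights = torch.tensor(
            [[latency_table[d][l] for l in LAYER_TYPES] for d in DEVICE_TYPES]
        )
        self.table = nn.Embedding.from_pretrained(weights.view(-1, 1), freeze=True)

    def forward(self, current_device, layer_type):
        # Look up the latency of `layer_type` on every *other* device type.
        others = [d for d in DEVICE_TYPES if d != current_device]
        idx = torch.tensor(
            [DEVICE_TYPES.index(d) * len(LAYER_TYPES) + LAYER_TYPES.index(layer_type)
             for d in others]
        )
        return self.table(idx).squeeze(-1)  # the target delay data

# Example with illustrative latencies (milliseconds).
table = {"server": {"conv3x3": 1.2, "conv5x5": 2.9, "pool": 0.4},
         "client": {"conv3x3": 6.5, "conv5x5": 15.1, "pool": 1.8}}
emb = LatencyEmbedding(table)
print(emb("server", "conv3x3"))  # client-side latency of conv3x3
```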
In one embodiment, the preset network layer delay information includes only two classes of delay data; in that case, once the current device type is input, only one class of delay data remains to serve as the target delay data. In another embodiment, the preset network layer delay information may include two or more classes of delay data; in that case, after the current device type is input, at least two classes remain as candidate target delay data, and one of them can be selected as required. Specifically, the process of inputting the current device type and the target network layer types of the target network layers in the initial model into the embedding layer to obtain the target delay data corresponding to other device types may include the following step:
Step 11: Input the current device type, each target network layer type, and target data into the embedding layer to obtain the target delay data corresponding to the target data.
In this embodiment, besides the current device type and the target network layer types, target data may also be input into the embedding layer so that suitable target delay data can be selected as required. The target data includes an input data scale and/or a target device type. The target device type is the type of device on which the target model will be invoked. Since the scale of the input data also affects a layer's latency, the target data may include the input data scale so that more accurate target delay data is obtained.
It is understood that before the embedding layer is constructed on the basis of the preset network layer delay information, the device types, network layer types, and corresponding latencies must first be acquired so that the preset network layer delay information can be generated. Specifically, the generation process of the preset network layer delay information includes:
Step 21: Determine several network layers, and several preset network models containing each network layer.
Step 22: Train each preset network model on a device corresponding to each device type to obtain first delay data corresponding to each device type.
Step 23: Obtain, from the first delay data, second delay data corresponding to each network layer.
Step 24: Generate the preset network layer delay information from the correspondence among the second delay data, the network layer type of each network layer, and the device type.
Because a network layer's execution latency is affected by its parameters, its latency cannot be determined directly from the single layer; the whole network model must be invoked to determine it. Specifically, several types of network layers are selected for generating the preset network layer delay information. A given type of network layer may appear in multiple different types of network models, so once the network layer is determined, the several preset network models containing it can be further determined. By training the preset network models on several devices of different types, the overall invocation latency of each preset network model on each device can be determined. The first delay data is the invocation latency data of a preset network model. A network layer's latency is one part of the first delay data, and the second delay data corresponding to each type of network layer can be derived from the first delay data. After the second delay data is obtained, the preset network layer delay information is generated from its correspondence with the network layer types and device types.
For example, the network layer conv3*3 (a 3*3 convolutional layer) appears in models such as R-CNN (Region-CNN, where CNN refers to Convolutional Neural Networks) and Fast R-CNN (an upgraded version of the R-CNN algorithm); in this case, models containing conv3*3 can be determined as the preset network models. In this embodiment, to obtain more accurate target delay data, conv3*3 layers with different input data scales in the preset network models may also be treated as different types of network layers. For images, the input data scale may include the number of channels (C-channel), height (H-height), width (W-width), and depth (D-depth). The latency corresponding to each kind of conv3*3 can be obtained through training and tabulated as in Table 1:
Table 1: Latency statistics of conv3*3
(Table 1 is reproduced as an image in the original publication; for the conv3*3 layers of each preset network model it lists the input data scale together with the server-side and client-side latencies.)
Here L_server denotes the server-side latency and L_user the client-side latency; both are second delay data. Table 1 covers two device types, client and server. In the layer-count column, different network layers have different input data scales; that is, the network layer R-CNN-1 conv3*3 and the network layer R-CNN-2 conv3*3 have different input data scales. Based on the above approach, the same processing is applied to the other selected network layers to obtain the complete preset network layer delay information, as in the sketch below.
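The following hedged sketch illustrates Steps 21 to 24. The model choice, the profiler-based attribution of per-layer time, and all values are illustrative assumptions standing in for the actual measurement procedure, which the disclosure does not fix:

```python
# Time a preset model end to end on a device ("first delay data"), attribute
# per-layer operator time with the PyTorch profiler ("second delay data"), and
# key the result by layer type and device type.
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

def second_delay_data(model, sample, device_type):
    model.eval()
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
        model(sample)  # whole-model invocation; its total time is the first delay data
    # Attribute operator time to layer types; conv kernels approximate conv3*3.
    conv_ms = sum(e.cpu_time_total for e in prof.key_averages()
                  if "conv" in e.key.lower()) / 1e3
    return {("conv3x3", device_type): conv_ms}

delay_info = {}
sample = torch.randn(1, 3, 224, 224)
for net in [models.resnet18(weights=None)]:   # stand-in for R-CNN-style presets
    delay_info.update(second_delay_data(net, sample, "server"))
print(delay_info)  # e.g. {("conv3x3", "server"): <milliseconds>}
```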
S103: Calculate a target loss value using the target delay data, the training data, and the output data, and adjust the parameters of the initial model using the target loss value.
After the target delay data is obtained, it can be used together with the training data and the output data to calculate the target loss value, which is then used to adjust the parameters of the initial model. This embodiment does not limit the specific way of calculating the target loss value; for example, a precision loss value can be computed from the training data and the output data, and the target loss value generated from the precision loss value and the target delay data. That is, the process of calculating the target loss value from the target delay data, the training data, and the output data includes:
Step 31: Obtain a precision loss value using the training data and the output data.
Step 32: Perform a weighted summation of the precision loss value and the target delay data to obtain the target loss value.
In this embodiment the target loss value is computed by weighted summation; in other embodiments, other calculation methods may be used to generate it.
S104: If the training completion condition is satisfied, obtain the target model on the basis of the initial model.
The training completion condition is the condition that determines the end of the initial model's training process; it may constrain the initial model itself or its training process. For example, it may constrain the initial model's degree of convergence or recognition accuracy, or it may constrain the training duration, the number of training rounds, and the like. If the training completion condition is satisfied, the target model can be obtained on the basis of the initial model. This embodiment does not limit the specific way of obtaining the target model; for example, the embedding layer in the initial model can be removed to yield the target model.
Applying the network model training method provided by the embodiments of this application, the initial model includes an embedding layer constructed from preset network layer delay information that records mutually corresponding network layer types and at least two classes of delay data. When the initial model is trained, its parameters must be adjusted, and the loss value is the benchmark for that adjustment. Because different devices execute different types of network layers with different latencies, when training on the current device, in order to obtain a target model with low latency on other devices, the current device type and the network layer type of each target network layer in the initial model can be input into the embedding layer, so that the embedding layer obtains, from the preset network layer delay information, the target delay data of each target network layer on non-current devices, and the target delay data then participates in computing the target loss value. That is, the target loss value is not based on the real latency the initial model incurs when processing the training data on the current device; instead, the theoretical latency the initial model would incur on devices of other device types replaces the real latency. Since the adopted target delay data corresponds to the other device types rather than the current one, the resulting target loss value reflects the execution latency of the initial model on those devices, and adjusting the initial model's parameters with it makes the model better matched to them, simulating the effect of training the initial model directly on those devices, so that the finally trained target model has minimal latency when running on them. This solves the problem of relatively high latency in the related art.
Based on the above embodiment, in one specific implementation the initial model is a hyperparameterized network constructed from a search space according to neural network model training rules; the network architecture of the initial model corresponds to a target directed acyclic graph, which has several directed edges, each with several branches. During the training of the initial model, each branch can be evaluated, and the branches are finally selected and pruned to obtain the target model.
The neural network model training rules are the rules used to generate the most initial model, i.e., the initial model, when performing network architecture search. The search space is the set of neural network types that can be searched; it also defines how a neural network structure is described. In one embodiment, the search space contains network layers such as MBConv3*3_1 (denoting a 3*3 convolution kernel with stride 1), MBConv3*3_2, MBConv3*3_3, MBConv3*3_4, MBConv3*3_5, MBConv3*3_6, MBConv5*5_1, MBConv5*5_2, MBConv5*5_3, MBConv5*5_4, MBConv5*5_5, MBConv5*5_6, MBConv7*7_1, MBConv7*7_2, MBConv7*7_3, MBConv7*7_4, MBConv7*7_5, MBConv7*7_6, Identity, and Zero, where Identity is a placeholder layer and Zero is a zero-operation layer. Adding a zero-operation layer to the search space allows deeper networks to be built with skip connections, keeping network depth and width in balance; this balance lets the model achieve higher accuracy.
The initial model can be defined as N(e_1, ..., e_n), where e_i denotes a particular edge of the directed acyclic graph and O = {o_i}, i ∈ (1, N), denotes the N optional basic operations, i.e., the N branches. To build a hyperparameterized network containing every architecture in the search space, this embodiment defines the basic operations per edge: each edge is defined as a series of mixed operations with N parallel paths, denoted m_o, and the whole initial model is written N(e_1 = m_o^1, ..., e_n = m_o^n). For an input x, the mixed operation m_o is defined as the output of the N paths, i.e.:
m_o^NAS(x) = Σ_{i=1}^{N} p_i · o_i(x), with p_i = exp(a_i) / Σ_j exp(a_j)
Here m_o^NAS(x) is the output and a_i is an architecture parameter, i.e., a parameter used for model architecture selection that participates in the selection and pruning of branches; each branch corresponds to one architecture parameter. It can be seen that training an ordinary network model requires computing and storing only one branch, whereas training the above initial model in the traditional way would require N times the GPU (graphics processing unit) memory and GPU computing time, as the sketch below illustrates.
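As a non-limiting sketch (class name, channel sizes, and candidate operations are illustrative assumptions), the mixed operation on one edge can be written as:

```python
# One directed edge holding N candidate branches; the architecture parameters
# a_i weight the branch outputs via softmax probabilities p_i.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)                       # the N ops
        self.arch_params = nn.Parameter(torch.zeros(len(branches)))  # a_i

    def forward(self, x):
        # Full NAS mixture: every branch is computed and weighted by p_i.
        p = F.softmax(self.arch_params, dim=0)
        return sum(p[i] * op(x) for i, op in enumerate(self.branches))

# Example edge with three candidate operations (illustrative search space).
edge = MixedOp([nn.Conv2d(16, 16, 3, padding=1),
                nn.Conv2d(16, 16, 5, padding=2),
                nn.Identity()])
out = edge(torch.randn(2, 16, 32, 32))
```

Computing every branch per step is exactly the N-fold memory and compute cost the binarized gate below avoids.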
At present, because a model has many branches, training usually proceeds by training a single module, using a small amount of proxy-task data to train that module's branches, in order to cut training time and resource consumption. After training, the module is reused to obtain the final model. However, even though the module is smaller than the full model and training on little data consumes fewer resources, the model generation process still consumes considerable computing resources.
Referring to Fig. 2, Fig. 2 is a schematic diagram of a specific neural network architecture search process provided by an embodiment of this application. To solve the above problem, when training the above hyperparameterized network, this application inputs the training data into the initial model, and the process of obtaining the output data may include:
Step 41: Determine the target parameter.
Step 42: Determine, according to the target parameter, the activation branch corresponding to each directed edge, and process the training data with the activation branches to obtain the output data.
The target parameter may be a weight parameter or an architecture parameter; a weight parameter is a parameter representing a branch's weight, used together with the architecture parameters for branch selection. In each training iteration, the parameter to be updated in that iteration is selected as the target parameter. After the target parameter is determined, one branch of each directed edge is selected as the activation branch, and the activation branches process the training data to produce the output data. Specifically, a binarization gate function can be used to select and activate branches. The binarization gate function is:
g = binarize(p_1, ..., p_n) = [1, 0, ..., 0] with probability p_1; ...; [0, 0, ..., 1] with probability p_n
where p_1 to p_n are the probability values generated from the architecture parameters of the branches, and p is the parameter used to select the activation branch. The specific content of g is determined from the relation between the value p and the probability values, and g then indicates whether each branch is activated. Simplifying the mixed operation m_o with the binarization gate function gives:
m_o^binary(x) = Σ_{i=1}^{N} g_i · o_i(x)
Here m_o^binary(x) is the simplified m_o^NAS(x). The specific way of selecting p may differ according to the type of the target parameter; for example, in one embodiment, the process of determining the activation branch corresponding to each directed edge according to the target parameter includes (see the sketch after these steps):
Step 51: If the target parameter is a weight parameter, determine the activation branch randomly.
Step 52: If the target parameter is an architecture parameter, select the activation branches according to the multinomial distribution sampling principle.
If the target parameter is a weight parameter, the activation branch is selected randomly. Specifically, a random number generator can pick one value from the set formed by p_1 to p_n as the value of p, from which the value g of the binarization gate function is determined, completing the selection of the activation branch. If the target parameter is an architecture parameter, the activation branches can be selected according to the multinomial distribution sampling principle: each time, two of the N branches are selected as activation branches and the other branches are masked.
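A minimal sketch of the two sampling rules, assuming the gate vector g multiplies the branch outputs (function and argument names are illustrative):

```python
# Sample the binarized gate: only the sampled branches are active in a pass.
import torch
import torch.nn.functional as F

def sample_gate(arch_params, target="weight"):
    n = arch_params.numel()
    g = torch.zeros(n)
    if target == "weight":
        # Weight-parameter step (Step 51): activate one branch at random.
        g[torch.randint(n, (1,))] = 1.0
    else:
        # Architecture-parameter step (Step 52): sample two of the N branches
        # from the multinomial distribution p = softmax(a); mask the rest.
        p = F.softmax(arch_params, dim=0)
        idx = torch.multinomial(p, num_samples=2, replacement=False)
        g[idx] = 1.0
    return g  # multiply branch outputs by g to mask inactive branches

print(sample_gate(torch.zeros(5), target="arch"))
```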
Correspondingly, the process of adjusting the parameters of the initial model using the target loss value may include:
Step 43: Update, using the target loss value, the target parameter corresponding to the activation branch.
Here, the parameter type of the most recently updated historical parameter differs from that of the target parameter; that is, the weight parameters and the architecture parameters are updated alternately. It can be understood that by keeping the number of activation branches far smaller than the total number of branches, the cost of training the initial model is reduced to the level of training an ordinary network model, lowering the consumption of computing resources.
It can be understood that different types of target parameters can be updated in different ways. Specifically, the process of updating the target parameter corresponding to the activation branch using the target loss value includes (see the sketch after this passage):
Step 61: If the target parameter is a weight parameter, update the weight parameter of the activation branch by stochastic gradient descent using the target loss value.
Step 62: If the target parameter is an architecture parameter, calculate an update parameter according to a preset update rule using the target loss value, and update the architecture parameter of the activation branch with the update parameter.
Specifically, if the target parameter is a weight parameter, it can be updated with the stochastic gradient descent method. If the target parameter is an architecture parameter, a preset update rule is provided that specifies how the update parameter is computed. This embodiment does not limit the specific content of the preset update rule; in one implementation it is:
∂L/∂a_i = Σ_{j=1}^{N} (∂L/∂g_j) · p_j · (δ_ij − p_i)
where δ_ij = 1 when i = j, δ_ij = 0 when i ≠ j, and L is the target loss value.
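A minimal sketch of the alternating schedule follows. The helpers weight_parameters(), arch_parameters(), and sample_branches() are assumptions about how the hyperparameterized network exposes its parameter groups and gate sampling; backpropagating through the gate probabilities stands in for the preset update rule above:

```python
# Alternate between a weight-parameter step (SGD) and an
# architecture-parameter step on successive batches.
import torch

def fit(model, loader, loss_fn, epochs=1):
    w_opt = torch.optim.SGD(model.weight_parameters(), lr=0.05, momentum=0.9)
    a_opt = torch.optim.Adam(model.arch_parameters(), lr=3e-4)
    step = 0
    for _ in range(epochs):
        for x, y in loader:
            update_weights = (step % 2 == 0)   # alternate parameter types
            opt = w_opt if update_weights else a_opt
            model.sample_branches(target="weight" if update_weights else "arch")
            loss = loss_fn(model(x), y)        # target loss on active branches
            opt.zero_grad()
            loss.backward()                    # gradients reach only active branches
            opt.step()
            step += 1
```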
Specifically, the target loss value can be computed as follows. Taking the situation shown in Table 1 as an example, the function of the embedding layer is to obtain the client-side latency from its input, i.e.:
L_User = f(T, C, H, W, D) · L_Server
where T is the network layer type and f(T, C, H, W, D) is the mapping matrix formed by the preset network layer delay information, in essence a lookup table over all network layers.
After this mapping, for one module of the initial model, the latency of the module is:
E[latency_i] = Σ_j p_i^j · F(o_i^j)
where E[latency_i] denotes the latency of the i-th module, F denotes the above mapping matrix, and F(o_i^j) denotes the predicted latency of branch o_i^j. In this embodiment, since the embedding layer is used to determine the target delay data, E[latency_i] is equivalent to:
E[latency_i] = Σ_j p_i^j · Embedding(o_i^j)
where Embedding denotes the embedding-layer processing. The total latency over all modules, i.e., the target delay data, is therefore:
E[latency] = Σ_i E[latency_i]
where E[latency] is the target delay data. After the target delay data is obtained, this application combines the delay loss with the precision loss and additionally applies weight decay to generate the target loss value. The target loss value is:
loss = λ_1 · loss_CE + λ_2 · E[latency] + ω · ||w||_2^2
where loss is the L above, i.e., the target loss value, λ_1 and λ_2 are weight values, ω is the weight decay constant, and loss_CE is the precision loss.
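The objective above can be sketched as follows. The per-edge latency vector stands in for the embedding lookup Embedding(o_i^j), and the constants λ_1, λ_2, ω and all values are illustrative assumptions:

```python
# Latency-aware target loss: expected per-module latency from the lookup,
# summed into E[latency], combined with cross-entropy and weight decay.
import torch
import torch.nn.functional as F

def expected_latency(arch_params, branch_latency):
    # E[latency_i] = sum_j p_i^j * Embedding(o_i^j), with p = softmax(a)
    p = F.softmax(arch_params, dim=0)
    return (p * branch_latency).sum()

def target_loss(logits, labels, edges, weights, lam1=1.0, lam2=0.1, omega=1e-4):
    loss_ce = F.cross_entropy(logits, labels)                      # precision loss
    e_latency = sum(expected_latency(a, lat) for a, lat in edges)  # E[latency]
    l2 = sum(w.pow(2).sum() for w in weights)                      # weight decay
    return lam1 * loss_ce + lam2 * e_latency + omega * l2

# Example: two edges of three branches each (latencies in milliseconds).
edges = [(torch.zeros(3, requires_grad=True), torch.tensor([1.2, 2.9, 0.0])),
         (torch.zeros(3, requires_grad=True), torch.tensor([6.5, 15.1, 0.0]))]
```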
After the training completion condition is satisfied, the process of obtaining the target model on the basis of the initial model may include (see the sketch after these steps):
Step 71: Calculate the branch weight corresponding to each branch using the architecture parameters and the weight parameters.
Step 72: Determine the highest branch weight of each directed edge, and prune the branches of the initial model that do not correspond to the highest branch weights to obtain the target model.
This embodiment does not limit the specific way of calculating a branch weight; for example, it can be obtained by multiplying the architecture parameter by the weight parameter. Once the branch weights are obtained, the model structure formed by the branches with the highest branch weights is the best, so the branches not corresponding to the highest branch weights are pruned away, and the model formed by the remaining branches is the target model.
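A minimal sketch of the pruning step, using multiplication as the example combination of architecture and weight parameters mentioned above (all names and values are illustrative):

```python
# Score each branch, keep the top-scoring branch per directed edge, drop the rest.
import torch

def prune_edge(arch_params, weight_scores):
    branch_weight = arch_params * weight_scores  # example combination rule
    return int(torch.argmax(branch_weight))      # index of the surviving branch

edges = [(torch.tensor([0.2, 1.3, -0.5]), torch.tensor([0.9, 1.1, 1.0]))]
kept = [prune_edge(a, w) for a, w in edges]
print(kept)  # e.g. [1]: the target model keeps branch 1 on this edge
```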
It should be noted that, because the neural network architecture search process described in this application consumes relatively few computing resources, it is unnecessary to train with a proxy task; instead, the full training data corresponding to the target task the target model must accomplish can be used directly, improving the target model's performance.
For example, for image classification on the CIFAR10 and ImageNet datasets, unlike the traditional NAS approach of first training a small number of modules on CIFAR10 and then migrating to ImageNet and stacking modules into a model, the NAS approach provided by this application can perform network architecture search directly on the CIFAR10 or ImageNet dataset and the target hardware.
The backbone network of the hyperparameterized network uses PyramidNet, DenseNet, MobileNet, or other classic networks, with beneficial modifications; for example, the 3*3 Conv layer (3*3 convolutional layer) in PyramidNet is optimized into a tree structure of depth 3 whose leaf nodes have 2 branches.
In this embodiment, several thousand images are randomly sampled and split into a training set and a validation set at a ratio of 0.7 to 0.3 for the network architecture search; the optimizer may be chosen from algorithms such as Adam, SGD, Momentum, NAG, and AdaGrad, and a gradient-based algorithm differentiates the loss function and iteratively updates the hyperparameterized network and the architecture parameters, as in the setup sketched below.
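An illustrative setup for the split and optimizer choice just described (dataset size, batch size, and random tensors are placeholders, not the actual experimental configuration):

```python
# Random 0.7/0.3 train/validation split of a few thousand sampled images.
import torch
from torch.utils.data import random_split, DataLoader, TensorDataset

data = TensorDataset(torch.randn(5000, 3, 32, 32), torch.randint(0, 10, (5000,)))
n_train = int(0.7 * len(data))
train_set, val_set = random_split(data, [n_train, len(data) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)
```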
下面对本申请实施例提供的网络模型训练装置进行介绍,下文描述的网络模型训练装置与上文描述的网络模型训练方法可相互对应参照。
请参考图3,图3为本申请实施例提供的一种网络模型训练装置的结构示意图,包括:
输入模块110,用于获取训练数据,并将训练数据输入初始模型,得到输出数据;
其中,初始模型包括嵌入层,嵌入层基于预设网络层延迟信息构建,预设网络层延迟信息包括相互对应的网络层类型和至少两类延迟数据,各类延迟数据对应于不同的设备类型;
延迟获取模块120,用于将当前设备类型、初始模型中目标网络层的目标网络层类型输入嵌入层,得到其他设备类型对应的目标延迟数据;
参数调节模块130,用于利用目标延迟数据、训练数据和输出数据计算目标损失值,并利用目标损失值对初始模型进行参数调节;
模型生成模块140,用于若满足训练完成条件,则基于初始模型得到目标模型。
可选地,包括:
预设网络模型确定模块,用于确定若干个网络层,以及具有各个网络层的若干个预设网络模型;
第一延迟数据获取模块,用于将各个预设网络模型在各个设备类型对应的设备上进行训练,得到各个设备类型对应第一延迟数据;
第二延迟数据获取模块,用于利用第一延迟数据得到与网络层对应的第二延迟数据;
预设网络层延迟信息生成模块,用于利用第二延迟数据、网络层的网络层类型、设备类型之间的对应关系,生成预设网络层延迟信息。
Optionally, the parameter adjustment module 130 includes:
an accuracy loss value computation unit, configured to obtain an accuracy loss value from the training data and the output data;
a weighted summation unit, configured to perform a weighted summation of the accuracy loss value and the target delay data to obtain the target loss value.
Optionally, the initial model is a hyper-parameterized network constructed from a search space based on neural network model training rules; the network architecture of the initial model corresponds to a target directed acyclic graph, the target directed acyclic graph has several directed edges, and each directed edge has several branches;
the input module 110 includes:
a parameter determination unit, configured to determine a target parameter;
a branch activation unit, configured to determine, according to the target parameter, the activation branch corresponding to each directed edge, and process the training data with the activation branches to obtain the output data;
correspondingly, the parameter adjustment module 130 includes:
an update unit, configured to update, with the target loss value, the target parameter corresponding to the activation branch, wherein the history parameter updated last time differs in parameter type from the target parameter.
Optionally, the branch activation unit includes:
a random activation subunit, configured to determine the activation branch randomly if the target parameter is a weight parameter;
a distribution sampling subunit, configured to select the activation branch according to the multinomial-distribution sampling principle if the target parameter is an architecture parameter.
Optionally, the update unit includes:
a stochastic gradient update subunit, configured to, if the target parameter is a weight parameter, update the weight parameter of the activation branch with the target loss value by stochastic gradient descent;
a rule update subunit, configured to, if the target parameter is an architecture parameter, compute an update parameter from the target loss value according to the preset update rule, and update the architecture parameter of the activation branch with the update parameter.
Optionally, the model generation module 140 includes:
a weight computation unit, configured to compute the branch weight corresponding to each branch from the architecture parameters and the weight parameters;
a pruning unit, configured to determine the highest branch weight of each directed edge, and prune the branches in the initial model that do not correspond to the highest branch weights, to obtain the target model.
Optionally, the delay acquisition module 120 includes:
a delay data selection unit, configured to input the current device type, each target network layer type and target data into the embedding layer to obtain the target delay data corresponding to the target data, wherein the target data includes an input data scale and/or a target device type.
The electronic device provided by the embodiments of the present application is introduced below; the electronic device described below and the network model training method described above may be referred to in correspondence with each other.
Please refer to FIG. 4, which is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps of the above network model training method. The memory 102 is configured to store various types of data to support the operation on the electronic device 100; such data may include, for example, instructions of any application program or method operated on the electronic device 100, as well as data related to the application programs. The memory 102 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as one or more of a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, or an optical disk.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is configured to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may be further stored in the memory 102 or sent through the communication component 105. The audio component further includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules; the other interface modules may be a keyboard, a mouse, buttons, and the like, and the buttons may be virtual or physical. The communication component 105 is configured for wired or wireless communication between the electronic device 100 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi component, a Bluetooth component, and an NFC component.
The electronic device 100 may be implemented by one or more of an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and is configured to execute the network model training method given in the above embodiments.
The computer-readable storage medium provided by the embodiments of the present application is introduced below; the computer-readable storage medium described below and the network model training method described above may be referred to in correspondence with each other.
The present application further provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the steps of the above network model training method are implemented.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to in correspondence with each other. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant parts may refer to the description of the method.
Those skilled in the art may further realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may be placed in a random access memory (RAM), an internal memory, a read-only memory (ROM), an electrically programmable ROM, an electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
Finally, it should also be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relation or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article, or device.
Specific examples are used herein to illustrate the principles and implementations of the present application; the descriptions of the above embodiments are only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (11)

  1. A network model training method, comprising:
    acquiring training data, and inputting the training data into an initial model to obtain output data;
    wherein the initial model comprises an embedding layer, the embedding layer is constructed based on preset network layer delay information, the preset network layer delay information comprises network layer types and at least two types of delay data corresponding to each other, and each type of the delay data corresponds to a different device type;
    inputting a current device type and a target network layer type of a target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
    computing a target loss value from the target delay data, the training data and the output data, and adjusting parameters of the initial model with the target loss value;
    if a training completion condition is satisfied, obtaining a target model based on the initial model.
  2. The network model training method according to claim 1, wherein the generation process of the preset network layer delay information comprises:
    determining several network layers, and several preset network models having each of the network layers;
    training each of the preset network models on devices corresponding to each of the device types to obtain first delay data corresponding to each of the device types;
    obtaining, from the first delay data, second delay data corresponding to the network layers;
    generating the preset network layer delay information from the correspondence among the second delay data, the network layer types of the network layers, and the device types.
  3. The network model training method according to claim 1, wherein the computing a target loss value from the target delay data, the training data and the output data comprises:
    obtaining an accuracy loss value from the training data and the output data;
    performing a weighted summation of the accuracy loss value and the target delay data to obtain the target loss value.
  4. The network model training method according to any one of claims 1 to 3, wherein the initial model is a hyper-parameterized network constructed from a search space based on neural network model training rules, the network architecture of the initial model corresponds to a target directed acyclic graph, the target directed acyclic graph has several directed edges, and each of the directed edges has several branches;
    the inputting the training data into an initial model to obtain output data comprises:
    determining a target parameter;
    determining, according to the target parameter, an activation branch corresponding to each of the directed edges, and processing the training data with the activation branch to obtain the output data;
    correspondingly, the adjusting parameters of the initial model with the target loss value comprises:
    updating, with the target loss value, the target parameter corresponding to the activation branch, wherein a history parameter updated last time differs in parameter type from the target parameter.
  5. The network model training method according to claim 4, wherein the determining, according to the target parameter, an activation branch corresponding to each of the directed edges comprises:
    if the target parameter is a weight parameter, determining the activation branch randomly;
    if the target parameter is an architecture parameter, selecting the activation branch according to a multinomial-distribution sampling principle.
  6. The network model training method according to claim 5, wherein the updating, with the target loss value, the target parameter corresponding to the activation branch comprises:
    if the target parameter is a weight parameter, updating the weight parameter of the activation branch with the target loss value by stochastic gradient descent;
    if the target parameter is an architecture parameter, computing an update parameter from the target loss value according to a preset update rule, and updating the architecture parameter of the activation branch with the update parameter.
  7. The network model training method according to claim 4, wherein the obtaining a target model based on the initial model comprises:
    computing a branch weight corresponding to each of the branches from the architecture parameters and the weight parameters;
    determining the highest branch weight of each of the directed edges, and pruning the branches in the initial model that do not correspond to the highest branch weights, to obtain the target model.
  8. The network model training method according to claim 1, wherein the inputting a current device type and a target network layer type of a target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types comprises:
    inputting the current device type, each of the target network layer types and target data into the embedding layer to obtain the target delay data corresponding to the target data, wherein the target data comprises an input data scale and/or a target device type.
  9. A network model training apparatus, comprising:
    an input module, configured to acquire training data and input the training data into an initial model to obtain output data;
    wherein the initial model comprises an embedding layer, the embedding layer is constructed based on preset network layer delay information, the preset network layer delay information comprises network layer types and at least two types of delay data corresponding to each other, and each type of the delay data corresponds to a different device type;
    a delay acquisition module, configured to input a current device type and a target network layer type of a target network layer in the initial model into the embedding layer to obtain target delay data corresponding to other device types;
    a parameter adjustment module, configured to compute a target loss value from the target delay data, the training data and the output data, and adjust parameters of the initial model with the target loss value;
    a model generation module, configured to obtain a target model based on the initial model if a training completion condition is satisfied.
  10. An electronic device, comprising a memory and a processor, wherein:
    the memory is configured to store a computer program;
    the processor is configured to execute the computer program to implement the network model training method according to any one of claims 1 to 8.
  11. A computer-readable storage medium, configured to store a computer program, wherein, when the computer program is executed by a processor, the network model training method according to any one of claims 1 to 8 is implemented.
PCT/CN2021/127535 2021-08-24 2021-10-29 Network model training method and apparatus, electronic device and readable storage medium WO2023024252A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/259,682 US20240265251A1 (en) 2021-08-24 2021-10-29 Network Model Training Method and Apparatus, Electronic Apparatus and Computer-readable Storage Medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110971264.1A CN113420880B (zh) 2021-08-24 2021-08-24 Network model training method and apparatus, electronic device and readable storage medium
CN202110971264.1 2021-08-24

Publications (1)

Publication Number Publication Date
WO2023024252A1 true WO2023024252A1 (zh) 2023-03-02

Family

ID=77719530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/127535 WO2023024252A1 (zh) 2021-08-24 2021-10-29 Network model training method and apparatus, electronic device and readable storage medium

Country Status (3)

Country Link
US (1) US20240265251A1 (zh)
CN (1) CN113420880B (zh)
WO (1) WO2023024252A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420880B (zh) * 2021-08-24 2021-11-19 苏州浪潮智能科技有限公司 Network model training method and apparatus, electronic device and readable storage medium
CN113742089B (zh) * 2021-11-04 2022-02-18 苏州浪潮智能科技有限公司 Method, apparatus and device for allocating neural network computing tasks among heterogeneous resources
CN115081628B (zh) * 2022-08-15 2022-12-09 浙江大华技术股份有限公司 Method and apparatus for determining the adaptation degree of a deep learning model
CN118070879B (zh) * 2024-04-17 2024-07-23 浪潮电子信息产业股份有限公司 Model updating method, apparatus, device and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992935B (zh) * 2014-09-12 2023-08-11 微软技术许可有限责任公司 Computing system for training neural networks
CN110599557B (zh) * 2017-08-30 2022-11-18 深圳市腾讯计算机系统有限公司 Image description generation method, model training method, device and storage medium
CN112016666A (zh) * 2019-05-31 2020-12-01 微软技术许可有限责任公司 Execution of deep learning models
CN111787066B (zh) * 2020-06-06 2023-07-28 王科特 Internet-of-Things data platform based on big data and AI
CN113159284A (zh) * 2021-03-31 2021-07-23 华为技术有限公司 Model training method and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11080596B1 (en) * 2017-06-14 2021-08-03 Amazon Technologies, Inc. Prediction filtering using intermediate model representations
CN111723901A (zh) * 2019-03-19 2020-09-29 百度在线网络技术(北京)有限公司 Neural network model training method and apparatus
US20210012239A1 (en) * 2019-07-12 2021-01-14 Microsoft Technology Licensing, Llc Automated generation of machine learning models for network evaluation
CN111667024A (zh) * 2020-06-30 2020-09-15 腾讯科技(深圳)有限公司 Content pushing method and apparatus, computer device and storage medium
CN113420880A (zh) * 2021-08-24 2021-09-21 苏州浪潮智能科技有限公司 Network model training method and apparatus, electronic device and readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116684480A (zh) * 2023-07-28 2023-09-01 支付宝(杭州)信息技术有限公司 Method and apparatus for determining an information push model and for pushing information
CN116684480B (zh) * 2023-07-28 2023-10-31 支付宝(杭州)信息技术有限公司 Method and apparatus for determining an information push model and for pushing information
CN118378726A (zh) * 2024-06-25 2024-07-23 之江实验室 Model training system and method, storage medium and electronic device

Also Published As

Publication number Publication date
US20240265251A1 (en) 2024-08-08
CN113420880A (zh) 2021-09-21
CN113420880B (zh) 2021-11-19

Similar Documents

Publication Publication Date Title
WO2023024252A1 (zh) Network model training method and apparatus, electronic device and readable storage medium
KR102302609B1 (ko) Neural network architecture optimization
JP7210531B2 (ja) Neural architecture search
US20220391771A1 (en) Method, apparatus, and computer device and storage medium for distributed training of machine learning model
CN111406267B (zh) Neural architecture search using a performance prediction neural network
US12008445B2 (en) Black-box optimization using neural networks
US10984319B2 (en) Neural architecture search
US11449744B2 (en) End-to-end memory networks for contextual language understanding
US20190026639A1 (en) Neural architecture search for convolutional neural networks
WO2023103308A1 (zh) Model training method, text prediction method, apparatuses, electronic device and medium
US20240127058A1 (en) Training neural networks using priority queues
CN110663049A (zh) Neural network optimizer search
JP7542793B2 (ja) Method and system for lightweighting an artificial intelligence inference model
CN113010312A (zh) Hyperparameter tuning method, apparatus and storage medium
CN110874635A (zh) Deep neural network model compression method and apparatus
CN116385059A (zh) Method, apparatus, device and storage medium for updating a behavior data prediction model
Chen et al. A user dependent web service QoS collaborative prediction approach using neighborhood regularized matrix factorization
CN113408702A (zh) Music neural network model pre-training method, electronic device and storage medium
CN111898389B (zh) Information determination method, apparatus, computer device and storage medium
Huang et al. Sampling adaptive learning algorithm for mobile blind source separation
CN118093097B (zh) Data storage cluster resource scheduling method, apparatus, electronic device and medium
US20220019869A1 (en) Hardware-optimized neural architecture search
CN118394768A (zh) Index updating method, apparatus, electronic device and storage medium
CN117494661A (zh) Method, apparatus, device and readable medium for encoding
CN114764455A (zh) Method, apparatus and device for obtaining a representation vector of a streaming media object

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21954772

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21954772

Country of ref document: EP

Kind code of ref document: A1