WO2023273934A1 - Method for selecting model hyperparameters and related apparatus - Google Patents

Method for selecting model hyperparameters and related apparatus

Info

Publication number
WO2023273934A1
WO2023273934A1 PCT/CN2022/099779 CN2022099779W WO2023273934A1 WO 2023273934 A1 WO2023273934 A1 WO 2023273934A1 CN 2022099779 W CN2022099779 W CN 2022099779W WO 2023273934 A1 WO2023273934 A1 WO 2023273934A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
weights
topographic map
training
hyperparameters
Prior art date
Application number
PCT/CN2022/099779
Other languages
English (en)
French (fr)
Inventor
高寒
宋元巍
欧功畅
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023273934A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/05 - Geographic models
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a method for selecting model hyperparameters and related devices.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Deep learning methods are a key driving force in the development of artificial intelligence in recent years and are currently widely used in feature extraction and inference prediction for complex data. For example, deep learning methods applied to computer vision tasks can achieve image enhancement; as another example, deep learning methods applied to data classification tasks can classify data such as text or speech.
  • Deep learning methods usually process data through trained neural network models.
  • Before the model is trained, hyperparameters such as the learning rate or the optimizer need to be set for the model.
  • the hyperparameters set by the user for the model can affect the generalization accuracy of the model. Therefore, in order to improve the generalization accuracy of the model as much as possible, people usually need to choose reasonable hyperparameters before model training.
  • At present, hyperparameters are mainly selected by taking part of the training data from the model's training set and training the model with different hyperparameters based on that part of the training data until training converges; the hyperparameters corresponding to the best-performing trained model are then selected. Since this hyperparameter selection method needs to train the model to convergence, long training is often required, and the hyperparameter selection efficiency is low.
  • the present application provides a method for selecting model hyperparameters, which can improve the selection efficiency of model hyperparameters.
  • the first aspect of the present application provides a method for selecting model hyperparameters, which is applied in the technical field of artificial intelligence.
  • the method includes the following steps: the electronic device obtains multiple sets of hyperparameters of the neural network model.
  • the hyperparameter is a parameter used to control the model training process.
  • Hyperparameters are parameters that are set before starting the model training process, rather than parameter data obtained through training.
  • each set of hyperparameters includes one or more hyperparameters, and the types of hyperparameters included in each set of hyperparameters are the same.
  • For any two of the multiple sets of hyperparameters, the value of at least one hyperparameter is different between the two sets.
  • the multiple sets of hyperparameters acquired by the electronic device all include two hyperparameters, and the two hyperparameters are respectively a learning rate and an optimizer.
  • each set of weights in the multiple sets of weights includes weights obtained through multiple iterations of training.
  • the electronic device can perform multiple separate training runs on the neural network model based on the multiple sets of hyperparameters; in each training run, the number of training iterations and the training data of the neural network model are the same.
  • the electronic device draws a plurality of topographic maps, where each topographic map in the plurality of topographic maps is drawn based on one of the multiple sets of weights, and each of the plurality of topographic maps is used to represent the change trend of the loss function of the model during the training process.
  • In each topographic map, the values of the loss function of the model are determined based on the set of weights corresponding to that topographic map. Since each set of weights includes the weights corresponding to multiple iterations of training, and a loss-function value can be obtained based on the weights obtained by each iteration of training, each set of weights has multiple corresponding loss-function values.
  • the electronic device may draw a topographic map corresponding to each set of weights, so as to obtain multiple topographic maps corresponding to the multiple sets of weights one-to-one. For example, in the case that the electronic device obtains 5 sets of weights, the electronic device can draw 5 topographic maps, and each topographic map is drawn based on a set of weights.
  • the target hyperparameters are the set of hyperparameters corresponding to the target topographic map and are used to train the model, and the target topographic map is the topographic map with the highest flatness among the plurality of topographic maps, where the flatness represents the degree of variation of the model's loss function in the topographic map: the greater the variation of the loss function in the topographic map, the lower the flatness of the topographic map; the smaller the variation, the higher the flatness.
  • the topographic map can represent the change trend of the loss function of the model during the training process.
  • Based on the flatness of the topographic maps, the final hyperparameters used for model training are selected. Since this solution uses only the weights of the model from part of the training process to draw the topographic maps, the model does not need to be trained to convergence, which saves the time for selecting model hyperparameters and improves the selection efficiency of model hyperparameters.
  • the flatness of each topographic map is related to the area of that topographic map and to the sum of its contour-line lengths.
  • Each of the multiple topographic maps includes the same number of contour lines, and the points on one contour line correspond to the same loss value.
  • Specifically, the electronic device may determine the sum of the contour-line lengths of each topographic map in the plurality of topographic maps, where each topographic map includes the same number of contour lines and the points on one contour line correspond to the same loss value. Based on the area of each topographic map and the sum of its contour-line lengths, the electronic device determines the flatness of each topographic map.
  • the sum of the contour-line lengths of a topographic map refers to the sum of the lengths of all contour lines in the topographic map
  • the loss value refers to the value of the loss function.
  • In this way, the flatness corresponding to each topographic map is determined by calculating the lengths of the contour lines in the topographic map, so the flatness can be determined quantitatively, which facilitates identifying the topographic map with the highest flatness and improves the feasibility of the scheme (an illustrative sketch follows).
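  • As an illustrative sketch (not the patented implementation), the flatness described above can be approximated from a loss grid sampled over the topographic map: extract a fixed number of iso-loss contour lines, sum their lengths, and relate the map area to that sum. The use of skimage.measure.find_contours, the grid-cell area unit, and the specific ratio of area to contour length are assumptions made here for illustration.

```python
import numpy as np
from skimage import measure  # find_contours extracts iso-value polylines from a 2D grid


def flatness(loss_grid: np.ndarray, num_contours: int = 100) -> float:
    """Rough flatness score: larger when the loss contour lines are short and sparse."""
    levels = np.linspace(loss_grid.min(), loss_grid.max(), num_contours + 2)[1:-1]
    total_length = 0.0
    for level in levels:                                    # same number of contour levels per map
        for contour in measure.find_contours(loss_grid, level):
            seg = np.diff(contour, axis=0)                  # consecutive-point differences
            total_length += np.sqrt((seg ** 2).sum(axis=1)).sum()
    area = float(loss_grid.shape[0] * loss_grid.shape[1])   # map area, here in grid cells
    return area / (total_length + 1e-12)                    # shorter contours -> flatter map
```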
  • the electronic device respectively trains the model based on the multiple sets of hyperparameters, including: the electronic device acquires a training subset, where the training subset includes part of the training data in the training set of the model; the electronic device uses the training subset to perform multiple iterations of training on the model for each of the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process, where each set of weights in the multiple sets of weights includes the set of weights of the model after each iteration of training.
  • the electronic device uses the training subset to perform multiple iterations of training on the neural network model.
  • the weights in the neural network model will be updated. Therefore, a set of weights obtained by performing multiple iterations of training on the neural network model includes multiple weight sets, and each weight set includes the weights of the neural network model after each iteration of training.
  • the weights used to draw the topographic maps are obtained by performing multiple iterations of training with the training subset for each of the multiple sets of hyperparameters, which avoids training the model to convergence on the entire training set; this saves the time for selecting model hyperparameters and improves the selection efficiency of model hyperparameters.
  • the electronic device draws topographic maps based on the multiple sets of weights to obtain the multiple topographic maps, including: the electronic device performs dimensionality reduction on the first set of weights to obtain two high-dimensional vectors serving as projection directions for a two-dimensional space, where the first set of weights is a set of weights among the multiple sets of weights and includes a weight set of the model after each iteration of training.
  • the dimensions of the two high-dimensional vectors are the same as the number of weights included in the model.
  • For example, assume that the model includes N weights in total, the first set of weights includes 5 weight sets, and each weight set includes N weight parameters. After dimensionality reduction, two high-dimensional vectors can be obtained, and each high-dimensional vector is N-dimensional, that is, each high-dimensional vector includes N elements.
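  • A minimal sketch of one common way to perform the dimensionality reduction mentioned above, assuming PCA (via SVD) over the flattened weight snapshots; the text does not fix the reduction method, so this is an illustrative assumption.

```python
import numpy as np


def projection_directions(weight_sets):
    """weight_sets: the first set of weights, i.e. one flattened N-dimensional weight
    vector of the model after each iteration of training."""
    W = np.stack(weight_sets)                    # shape (number of iterations, N)
    W_centered = W - W.mean(axis=0)              # center the weight trajectory
    # The top-2 right singular vectors span the plane that best captures the trajectory.
    _, _, vt = np.linalg.svd(W_centered, full_matrices=False)
    d1, d2 = vt[0], vt[1]                        # two N-dimensional projection directions
    return d1, d2
```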
  • the electronic device determines weights corresponding to multiple sampling points in a first topographic map based on the first set of weights and the two high-dimensional vectors, where the multiple topographic maps include the first topographic map.
  • the electronic device determines a loss value corresponding to the model based on the weights corresponding to the plurality of sampling points, so as to draw the first topographic map.
  • Specifically, the electronic device may determine, based on the first set of weights and the two high-dimensional vectors, the coordinates in the two-dimensional space of the weights in the first set of weights after each iteration of training;
  • based on the coordinates of the weights after each iteration of training in the two-dimensional space, the electronic device determines the boundary of the first topographic map;
  • the electronic device divides the first topographic map into multiple regions according to the boundary of the first topographic map, so as to obtain a sampling point in each of the multiple regions;
  • the electronic device determines the loss values of the model at the weights corresponding to the sampling points in the first topographic map, so as to draw the first topographic map, where the plurality of topographic maps includes the first topographic map (an illustrative sketch follows).
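  • Continuing the sketch above, the sampling points of the first topographic map can be laid out on a regular grid bounded by the projected training trajectory, mapped back to full weight vectors, and evaluated to obtain loss values; loss_fn, the grid resolution, and the margin around the trajectory are illustrative assumptions.

```python
import numpy as np


def sample_topographic_map(weight_sets, d1, d2, loss_fn, resolution=25, margin=0.25):
    """Returns (xs, ys, loss_grid) that a contour plot can turn into a topographic map."""
    W = np.stack(weight_sets)                       # weights after each iteration, shape (n, N)
    center = W[-1]                                  # reference point, e.g. the last snapshot
    coords = (W - center) @ np.stack([d1, d2]).T    # trajectory coordinates in the 2D plane
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = hi - lo
    xs = np.linspace(lo[0] - margin * span[0], hi[0] + margin * span[0], resolution)
    ys = np.linspace(lo[1] - margin * span[1], hi[1] + margin * span[1], resolution)
    loss_grid = np.empty((resolution, resolution))
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            w = center + x * d1 + y * d2            # weights at this sampling point
            loss_grid[i, j] = loss_fn(w)            # loss value of the model with these weights
    return xs, ys, loss_grid
```

  • The returned grid can then be rendered with a fixed number of contour lines, for example with matplotlib.pyplot.contour(xs, ys, loss_grid, levels=100).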
  • the electronic device determines the loss values corresponding to the model based on the weights corresponding to the plurality of sampling points, including: the electronic device constructs multiple sub-models based on the model and the weights corresponding to the multiple sampling points in the first topographic map, where the multiple sub-models are in one-to-one correspondence with the weights corresponding to the multiple sampling points, and the structures of the multiple sub-models are the same as the structure of the model.
  • the electronic device respectively inputs the same training data into the multiple sub-models to obtain loss values corresponding to the multiple sampling points.
  • the structures of the multiple sub-models are the same, but the weights of the neurons in the multiple sub-models respectively correspond to the weights corresponding to the multiple sampling points.
  • the electronic device can perform inference operations on the multiple sub-models simultaneously and in parallel on the same hardware based on the same input data, so as to obtain loss values corresponding to multiple sampling points.
  • inference operations can be performed on multiple sub-models in parallel, so that the loss values corresponding to multiple sampling points can be obtained at the same time, and the drawing efficiency of topographic maps is improved.
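  • A hedged PyTorch-style sketch of the sub-model idea: copies of the model receive the weights of different sampling points and are evaluated on the same batch. Running the copies truly in parallel on one device is a further optimization not shown here, and the helper below (loop, deep copies, vector_to_parameters) is an illustrative assumption rather than the patented mechanism.

```python
import copy

import torch
from torch.nn.utils import vector_to_parameters


def losses_at_sampling_points(model, sample_weight_vectors, data, target, criterion):
    """Evaluate the same training batch under the weights of each sampling point."""
    losses = []
    for w in sample_weight_vectors:                     # one flat weight vector per sampling point
        sub_model = copy.deepcopy(model)                # same structure as the original model
        vector_to_parameters(torch.as_tensor(w, dtype=torch.float32),
                             sub_model.parameters())    # load this sampling point's weights
        with torch.no_grad():
            losses.append(criterion(sub_model(data), target).item())
    return losses
```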
  • the method further includes: the electronic device determines the ruggedness of each region in the first topographic map, where the ruggedness represents the density of contour lines in the region, and the loss values corresponding to the points on one contour line are the same; the electronic device adds sampling points in the first topographic map according to the ruggedness, so as to update the first topographic map; the density of sampling points in a region of the first topographic map is positively correlated with the ruggedness of that region.
  • the electronic device adds sampling points in the first topographic map according to the ruggedness, including: the electronic device sorts the multiple regions in the first topographic map in descending order of ruggedness to obtain a sorting result of the multiple regions; based on the sorting result, the electronic device sequentially adds sampling points in the multiple regions until the number of added sampling points reaches a preset threshold.
  • determining the ruggedness of each region in the first topographic map includes: the electronic device separately determines the second-order derivative matrix at the sampling point of each region in the first topographic map; the electronic device calculates the two eigenvalues of the second-order derivative matrix and determines the sum of the absolute values of the two eigenvalues, so as to obtain the ruggedness of each region in the first topographic map.
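  • As an illustrative sketch, the second-order derivative matrix of a sampling point can be estimated by finite differences on the sampled loss grid, and the ruggedness taken as the sum of the absolute values of its two eigenvalues; the finite-difference scheme and grid indexing below are assumptions.

```python
import numpy as np


def ruggedness(loss_grid: np.ndarray, i: int, j: int, h: float = 1.0) -> float:
    """Sum of the absolute eigenvalues of the 2x2 second-order derivative (Hessian) matrix
    estimated at interior grid point (i, j) of the sampled loss grid."""
    g = loss_grid
    d_xx = (g[i + 1, j] - 2 * g[i, j] + g[i - 1, j]) / h ** 2
    d_yy = (g[i, j + 1] - 2 * g[i, j] + g[i, j - 1]) / h ** 2
    d_xy = (g[i + 1, j + 1] - g[i + 1, j - 1]
            - g[i - 1, j + 1] + g[i - 1, j - 1]) / (4 * h ** 2)
    hessian = np.array([[d_xx, d_xy], [d_xy, d_yy]])
    eigenvalues = np.linalg.eigvalsh(hessian)       # eigenvalues of the symmetric 2x2 matrix
    return float(np.abs(eigenvalues).sum())
```

  • Regions can then be sorted by this score in descending order, and sampling points added to the most rugged regions first until the preset budget of additional points is reached.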
  • a second aspect of the present application provides an electronic device, including an acquisition unit and a processing unit.
  • the acquiring unit is configured to acquire multiple sets of hyperparameters of the neural network model;
  • the processing unit is configured to perform multiple iterations of training on the model based on the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process, where the multiple sets of weights are in one-to-one correspondence with the multiple sets of hyperparameters, and each set of weights in the multiple sets of weights includes weights obtained by multiple iterations of training;
  • the processing unit is further configured to draw a plurality of topographic maps, where each topographic map in the plurality of topographic maps is drawn based on one of the plurality of sets of weights and is used to indicate the change trend of the loss function of the model during the training process;
  • the processing unit is further configured to obtain target hyperparameters, where the target hyperparameters are the set of hyperparameters corresponding to the target topographic map and are used to train the model, and the target topographic map is the topographic map with the highest flatness among the plurality of topographic maps.
  • the flatness of each topographic map is related to the area of that topographic map and the sum of its contour-line lengths; each topographic map in the multiple topographic maps includes the same number of contour lines, and the points on one contour line correspond to the same loss value.
  • the acquiring unit is further configured to acquire a training subset, where the training subset includes part of the training data in the training set of the model; the processing unit is further configured to use the training subset to perform multiple iterations of training on the model based on the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process, where each set of weights in the multiple sets of weights includes the set of weights of the model after each iteration of training.
  • the processing unit is further configured to perform dimensionality reduction on the first set of weights to obtain two high-dimensional vectors serving as projection directions for a two-dimensional space, where the first set of weights is a set of weights among the multiple sets of weights and includes a weight set of the model after each iteration of training; the processing unit is further configured to determine, based on the first set of weights and the two high-dimensional vectors, the weights corresponding to multiple sampling points in the first topographic map, where the multiple topographic maps include the first topographic map; the processing unit is further configured to determine, based on the weights corresponding to the multiple sampling points, the loss values corresponding to the model, so as to draw the first topographic map.
  • the processing unit is further configured to construct multiple sub-models based on the model and the weights corresponding to the multiple sampling points in the first topographic map, where the multiple sub-models are in one-to-one correspondence with the weights corresponding to the multiple sampling points, and the structures of the multiple sub-models are the same as that of the model; the processing unit is further configured to input the same training data into the multiple sub-models respectively, so as to obtain the loss values corresponding to the multiple sampling points.
  • the processing unit is further configured to determine the ruggedness of each region in the first topographic map, where the ruggedness represents the density of contour lines in the region, and the loss values corresponding to the points on one contour line are the same; the processing unit is further configured to add sampling points in the first topographic map according to the ruggedness, so as to update the first topographic map; the density of sampling points in a region of the first topographic map is positively correlated with the ruggedness of that region.
  • the processing unit is further configured to sort the multiple regions in the first topographic map in descending order of ruggedness, so as to obtain a sorting result of the multiple regions; the processing unit is further configured to sequentially add sampling points in the multiple regions based on the sorting result until the number of added sampling points reaches a preset threshold.
  • the processing unit is further configured to respectively determine the second-order derivative matrix at the sampling point of each region in the first topographic map; the processing unit is further configured to calculate the two eigenvalues of the second-order derivative matrix and determine the sum of the absolute values of the two eigenvalues, so as to obtain the ruggedness of each region in the first topographic map.
  • the third aspect of the present application provides an electronic device, which may include a processor, the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in the above first aspect is implemented.
  • For the steps performed by the processor in each possible implementation of the first aspect, details may refer to the first aspect and are not repeated here.
  • a fourth aspect of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when it is run on a computer, the computer is made to execute the method described in the first aspect above.
  • a fifth aspect of the present application provides a circuit system, where the circuit system includes a processing circuit configured to execute the method described in the first aspect above.
  • the sixth aspect of the present application provides a computer program product, which, when run on a computer, causes the computer to execute the method described in the first aspect above.
  • the seventh aspect of the present application provides a chip system
  • the chip system includes a processor, used to support the server or the threshold value acquisition device to implement the functions involved in the first aspect above, for example, send or process the data and/or information.
  • the chip system further includes a memory, and the memory is used for storing necessary program instructions and data of the server or the communication device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • Fig. 1 is a schematic structural diagram of the main framework of artificial intelligence;
  • FIG. 2 is a schematic diagram of a convolutional neural network provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a convolutional neural network provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a grid search method for selecting hyperparameters in the prior art
  • FIG. 6 is a schematic flowchart of a method for selecting model hyperparameters provided in an embodiment of the present application.
  • Fig. 7 is a comparative schematic diagram of two topographic maps provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of a twin parameter parallel method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of an overall process for drawing a topographic map provided by an embodiment of the present application.
  • Fig. 10 is a schematic flow chart of drawing a topographic map based on sampling points provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of increasing sampling points based on the roughness of the region provided by the embodiment of the present application.
  • FIG. 12 is a schematic diagram of the architecture of a training process visual analysis system provided by the embodiment of the present application.
  • Fig. 13 is a schematic workflow diagram of a training process visual analysis system provided by the embodiment of the present application.
  • FIG. 14 is a schematic workflow diagram of a computing acceleration module provided by an embodiment of the present application.
  • Fig. 15 is a comparative schematic diagram of topographic maps with different learning rates provided by the embodiment of the present application.
  • FIG. 16 is a schematic diagram of a topographic map-based guidance model structure provided by the embodiment of the present application.
  • Fig. 17 is the topographic map of a VGG model without a Batch Normalization layer provided by the embodiment of the present application;
  • Fig. 18 is the topographic map of a VGG model with a Batch Normalization layer added, provided by the embodiment of the present application;
  • Figure 19 is a schematic diagram of the effect comparison before and after adding the Batch Normalization layer to the VGG model provided by the embodiment of the present application;
  • FIG. 20 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Fig. 21 is a schematic structural diagram of the execution device provided by the embodiment of the present application.
  • FIG. 22 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • FIG. 23 is a schematic structural diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • Figure 1 shows a schematic structural diagram of the main framework of artificial intelligence.
  • The artificial intelligence framework above is described below along two dimensions: the “intelligent information chain” (horizontal axis) and the “IT value chain” (vertical axis).
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • The “IT value chain” reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and the information it provides and processes, up to the industrial ecology of the system.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • the basic platform includes related platform guarantees and support such as a distributed computing framework and networks, which may include cloud storage and computing, interconnection networks, etc.
  • sensors communicate with the outside to obtain data, and these data are provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image processing identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is the packaging of the overall solution of artificial intelligence, which commercializes intelligent information decision-making and realizes landing applications. Its application fields mainly include: intelligent electronic equipment, intelligent transportation , smart healthcare, autonomous driving, smart cities, etc.
  • the model training method provided in the embodiment of the present application can be specifically applied to data processing methods such as data training, machine learning, and deep learning, and performs symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on training data.
  • a trained neural network model (such as the target neural network model in the embodiment of the present application) is obtained; and the target neural network model can be used for model reasoning, specifically input data can be input into the target neural network model to obtain output data .
  • the neural network can be composed of neural units, and the neural unit can refer to an operation unit that takes xs (ie input data) and intercept 1 as input, and the output of the operation unit can be:
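  • A plausible reconstruction of the elided expression, using the symbols defined in the items that follow:

$$h_{W,b}(x) = f\Big(\sum_{s=1}^{n} W_s x_s + b\Big)$$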
  • Ws is the weight of xs
  • b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of a convolutional layer and a subsampling layer. The feature extractor can be seen as a filter, and the convolution process can be seen as using a trainable filter to convolve with an input image or convolutional feature map.
  • a convolutional layer refers to a neuron layer that performs convolutional processing on an input signal in a convolutional neural network (for example, the first convolutional layer and the second convolutional layer in this embodiment). In the convolutional layer of a convolutional neural network, a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as a way to extract image information that is independent of location. The underlying principle is that the statistical information of a certain part of the image is the same as that of other parts. That means that the image information learned in one part can also be used in another part. So for all positions on the image, we can use the same learned image information. In the same convolution layer, multiple convolution kernels can be used to extract different image information. Generally, the more the number of convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional layer/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
  • the structure composed of the convolutional layer/pooling layer 120 and the neural network layer 130 can be the first convolutional layer and the second convolutional layer described in this application; the input layer 110 is connected to the convolutional layer/pooling layer 120, the convolutional layer/pooling layer 120 is connected to the neural network layer 130, and the output of the neural network layer 130 can be input to an activation layer, which can perform nonlinear processing on the output of the neural network layer 130.
  • Convolutional/pooling layer 120 may include layers 121-126 as examples.
  • In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer.
  • That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolutional layer 121 can include many convolutional operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • the convolution operator can essentially be a weight matrix, which is usually pre-defined. In the process of performing the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), so as to extract specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends to the entire depth of the input image.
  • convolving with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases instead of using a single weight matrix, multiple weight matrices of the same dimension are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolved image.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • the initial convolutional layer (such as 121) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 126) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • Pooling layer: since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. Among the layers 121-126 shown at 120 in Figure 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • Neural network layer 130: after processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 100 uses the neural network layer 130 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 2) and an output layer 140, and the parameters contained in these hidden layers may be pre-trained on related training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
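  • A minimal PyTorch sketch of the layer arrangement described above, alternating convolutional and pooling layers (like layers 121 to 126) followed by fully connected hidden layers and an output layer; the channel counts, the 3x32x32 input size, and the class count are illustrative assumptions, not the structure of the convolutional neural network 100.

```python
import torch.nn as nn


class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(                 # convolutional layer / pooling layer stack
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.hidden = nn.Sequential(                   # hidden layers of the neural network layer
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 128), nn.ReLU(),     # assumes 3x32x32 input images
        )
        self.output = nn.Linear(128, num_classes)      # output layer producing class scores

    def forward(self, x):
        return self.output(self.hidden(self.features(x)))
```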
  • The output layer 140 has a loss function similar to the categorical cross-entropy and is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 100 (as shown in Figure 2, the propagation from 110 to 140) is completed, the back propagation (as shown in Figure 2, the propagation from 140 to 110) starts to update the weight values and biases of each layer mentioned above, so as to reduce the loss of the convolutional neural network 100, that is, the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 2 is only an example of a convolutional neural network.
  • the convolutional neural network can also exist in the form of other network models, for example, multiple convolutional layers/pooling layers as shown in FIG. 3 are arranged in parallel, and the separately extracted features are all input to the neural network layer 130 for processing.
  • A Deep Neural Network (DNN), also known as a multi-layer neural network, can be divided into three categories of layers: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • The difference between the predicted value and the target value is measured by a loss function or an objective function; the training of the deep neural network thus becomes a process of reducing this loss as much as possible. That is to say, the training process continuously adjusts the weight vectors in the deep neural network based on the loss obtained during training, so that the loss keeps decreasing.
  • the convolutional neural network can use the error back propagation (back propagation, BP) algorithm to correct the size of the parameters in the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, the input signal is passed forward until the output, which produces an error loss, and the parameters in the initial super-resolution model are updated by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
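  • As a concrete illustration of the weight update driven by the error loss, a plain gradient-descent step adjusts each weight w against the gradient of the loss L with learning rate eta; the exact update rule depends on the optimizer and is not fixed by the text above:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$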
  • FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140.
  • When the execution device 120 preprocesses the input data, or when the calculation module 111 of the execution device 120 performs calculation and other related processing (such as implementing the function of the neural network in this application), the execution device 120 can call the data, codes, etc. in the data storage system 150 for corresponding processing, and the data, instructions, etc. obtained by the corresponding processing can also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result to the client device 140, thereby providing it to the user.
  • the client device 140 may be, for example, a control unit in an automatic driving system or a functional algorithm module in a mobile phone, for example, the functional algorithm module may be used to implement related tasks.
  • the training device 120 can generate corresponding target models/rules (such as the target neural network model in this embodiment) based on different training data for different targets or different tasks, and the corresponding target models/rules Rules can then be used to achieve the above-mentioned goals or complete the above-mentioned tasks to provide the user with the desired result.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112. If the client device 140 needs to obtain the user's authorization before automatically sending the input data, the user can set the corresponding permission in the client device 140.
  • the user can view the results output by the execution device 110 on the client device 140, and the specific display form may be specific ways such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal, collecting the input data to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data, and storing them in the database 130.
  • Alternatively, the client device 140 may not be used for collection; instead, the I/O interface 112 directly stores the input data to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure as new sample data in the database 130.
  • FIG. 4 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. For example, in FIG. 4, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the electronic device may be, for example, a server, a surveillance camera, a mobile phone, a personal computer (PC), a notebook computer, a tablet computer, a smart TV, a mobile internet device (mobile internet device, MID) , wearable devices, virtual reality (virtual reality, VR) equipment, augmented reality (augmented reality, AR) equipment, wireless terminals in industrial control, wireless terminals in self driving, remote surgery Wireless terminals in remote medical surgery, wireless terminals in smart grid, wireless terminals in transportation safety, wireless terminals in smart city, smart home wireless terminals, etc.
  • FIG. 5 is a schematic diagram of a grid search method for selecting hyperparameters in the prior art.
  • Assume there are two hyperparameters to be set in the neural network model, namely the learning rate (Learning rate, LR) and the optimizer.
  • the value range of the learning rate includes 0.2, 0.4 and 0.6, and the selection range of the optimizer includes the stochastic gradient descent (Stochastic gradient descent, SGD) optimizer and the Adam optimizer.
  • the optimizer refers to an optimization method for adjusting the weight of the neural network model based on a loss function, and different optimizers use different optimization methods to adjust the weight of the neural network model.
  • the six sets of hyperparameters are: (0.2, Adam), (0.2, SGD), (0.4, Adam), (0.4, SGD), (0.6, Adam) and (0.6, SGD). Therefore, based on these 6 sets of hyperparameters, the user needs to select part of the training data to train the model 6 times, and each training must converge, and finally select a set of suitable hyperparameters. Since the model needs to be trained to converge for each set of hyperparameters, long-term training is often required, resulting in low efficiency in the selection of hyperparameters.
  • The embodiment of the present application provides a method for selecting model hyperparameters: models are trained with different hyperparameters, and a topographic map is drawn based on the weights of the model during the training process; the topographic map can represent the trend of the model's loss function during training.
  • the final hyperparameters used for model training are selected. Since this solution selects the weights of the model during part of the training process to draw the topographic map, it does not need to train the model to convergence, so it can save time in selecting model hyperparameters and improve the selection efficiency of model hyperparameters.
  • FIG. 6 is a schematic flowchart of a method for selecting model hyperparameters provided by an embodiment of the present application. As shown in FIG. 6 , the method for selecting model hyperparameters includes the following steps 601-604.
  • Step 601: obtain multiple sets of hyperparameters of the neural network model.
  • Before performing formal training on the neural network model based on all the data in the training set (that is, the full training set), the electronic device can first obtain multiple sets of hyperparameters of the neural network model, and select a suitable set of hyperparameters from the multiple sets for the training of the neural network model.
  • Hyperparameters are parameters used to control the model training process; they are set before the model training process starts, rather than being parameter data obtained through training. In most cases of model training, hyperparameters need to be optimized, and a set of optimal hyperparameters should be selected for model training to improve the performance of the trained model.
  • each set of hyperparameters includes one or more hyperparameters, and the types of hyperparameters included in each set of hyperparameters are the same.
  • the value of at least one hyperparameter in the two sets of hyperparameters is different.
  • the multiple sets of hyperparameters acquired by the electronic device all include two hyperparameters, and the two hyperparameters are respectively a learning rate and an optimizer.
  • the value range of the learning rate includes 0.2, 0.4 and 0.6;
  • the selection range of the optimizer includes the SGD optimizer and the Adam optimizer.
  • the electronic device can obtain 6 sets of hyperparameters.
  • the six sets of hyperparameters are: (0.2, Adam), (0.2, SGD), (0.4, Adam), (0.4, SGD), (0.6, Adam) and (0.6, SGD).
  • In the process of model training, the optimizer is used to control the direction in which the model adapts to the target problem, and the learning rate is used to control the magnitude of the change with which the model adapts to the target problem.
  • the optimizer indicates how the model adjusts the weights in the model after each iteration of training, and the learning rate indicates how much the model adjusts the weights after each iteration of training.
  • the multiple sets of hyperparameters acquired by the electronic device may also include other hyperparameters, and this embodiment does not specifically limit the types of hyperparameters.
  • the electronic device may acquire multiple sets of hyperparameters of the neural network model by acquiring values of multiple sets of hyperparameters set by the user. For example, the user respectively inputs the hyperparameters included in each group of hyperparameters and the value of the hyperparameters in each group on the electronic device, so that the electronic device can obtain the above-mentioned multiple groups of hyperparameters.
  • the electronic device may also automatically generate multiple sets of hyperparameters of the neural network model by acquiring the value ranges of multiple hyperparameters set by the user.
  • the hyperparameters of the neural network model input by the user on the electronic device include a learning rate and an optimizer, and the learning rate has three different values, and the optimizer has two different values.
  • the electronic device can automatically generate different combinations between the learning rate and the optimizer based on the value ranges of the two hyperparameters of the learning rate and the optimizer, thereby obtaining six sets of hyperparameters.
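  • A minimal sketch of automatically generating the combinations from the user-supplied value ranges; the dictionary layout and names below are assumptions for illustration.

```python
from itertools import product

search_space = {
    "learning_rate": [0.2, 0.4, 0.6],   # three candidate learning rates
    "optimizer": ["SGD", "Adam"],       # two candidate optimizers
}

# Cartesian product of the value ranges yields the six sets of hyperparameters.
hyperparameter_sets = [dict(zip(search_space.keys(), values))
                       for values in product(*search_space.values())]
# e.g. {'learning_rate': 0.2, 'optimizer': 'SGD'}, ..., {'learning_rate': 0.6, 'optimizer': 'Adam'}
```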
  • Step 602: based on the multiple sets of hyperparameters, respectively perform multiple iterations of training on the model to obtain multiple sets of weights of the model during the training process, where the multiple sets of weights are in one-to-one correspondence with the multiple sets of hyperparameters, and each set of weights in the multiple sets of weights includes weights obtained through multiple iterations of training.
  • Specifically, the electronic device may perform multiple separate training runs on the neural network model based on the multiple sets of hyperparameters; in each training run, the number of training iterations and the training data of the neural network model are the same.
  • the electronic device can set the hyperparameters of the neural network model during the training process based on a certain set of hyperparameters among multiple sets of hyperparameters; then, the electronic device performs a certain number of iterative training on the neural network model based on the training data, Thus a set of weights is obtained.
  • the electronic device sequentially sets the hyperparameters of the neural network model during the training process based on multiple sets of hyperparameters, and performs the same number of iterative training on the neural network model after setting the hyperparameters of the neural network model, thereby obtaining multiple sets of weights .
  • each set of weights in the multiple sets of weights is trained based on the same training data, and the multiple sets of weights correspond to the multiple sets of hyperparameters one by one.
  • each set of weights in the multiple sets of weights includes weights obtained by the neural network model after each iterative training in multiple iterative trainings.
  • the weight obtained by the neural network model after an iterative training is all the weight parameters in a model, which is a high-dimensional vector. All weight parameters in a neural network model refer to weight parameters of all neural units in the model.
  • a neural network model includes hundreds of thousands to several million neural units, so the weight obtained by the neural network model after an iterative training actually includes the weight parameters of hundreds of thousands to several million neural units, That is, the weight obtained after an iterative training is a high-dimensional vector with hundreds of thousands to millions of dimensions.
  • the electronic device may first acquire a training subset, where the training subset includes part of the training data in the training set of the neural network model.
  • the training set of the neural network model includes all the training data required in the formal training process of the model, and the training subset only includes part of the training data selected from the training set.
  • the training set may include 1.28 million training data
  • the training subset may include 50,000 training data.
  • the electronic device respectively uses the training subset to perform multiple iterative trainings on the neural network model, so as to obtain multiple sets of weights of the neural network model during the training process.
  • Each set of weights in the set of weights includes a set of weights of the neural network model after each iteration of training.
  • the electronic device uses the training subset to perform multiple iterations of training on the neural network model.
  • the weights in the neural network model will be updated. Therefore, a set of weights obtained by performing multiple iterations of training on the neural network model includes multiple weight sets, and each weight set includes the weights of the neural network model after each iteration of training.
  • the weight ⁇ i is obtained after performing an iterative training on the neural network model.
  • the weight ⁇ i ⁇ R m is all the weights in the neural network model
  • the weight ⁇ i is an m-dimensional high-dimensional vector.
  • ⁇ 1 represents the weight obtained after the first iteration training of the neural network model
  • ⁇ 2 represents the weight obtained after the second iteration training of the neural network model
  • ⁇ n represents the nth iteration of the neural network model Weights obtained after training.
  • Step 603: draw multiple topographic maps, where each topographic map in the multiple topographic maps is drawn based on one of the multiple sets of weights and is used to represent the change trend of the loss function of the model during the training process.
  • the topographic map refers to the topographic map of the loss function, which can map hundreds or thousands of loss function values into the weight space.
  • the electronic device may draw a topographic map corresponding to each set of weights, so as to obtain multiple topographic maps corresponding to the multiple sets of weights one-to-one.
  • In each topographic map, the values of the loss function of the model are determined based on the set of weights corresponding to that topographic map. Since each set of weights includes the weights corresponding to multiple iterations of training, and a loss-function value can be obtained based on the weights obtained by each iteration of training, each set of weights has multiple corresponding loss-function values.
  • For example, if the electronic device obtains 5 sets of weights, the electronic device can draw 5 topographic maps, each of which is drawn based on one set of weights.
  • each set of weights includes weights obtained by multiple iterations of training, and the weights obtained by each iteration of training actually include hundreds of thousands or even millions of weight parameters, that is, the weights obtained by iterative training are a high-dimensional vector. Therefore, in order to facilitate the display of the optimization terrain of the model, the electronic device can reduce the dimensionality of each set of weights during the process of drawing the topographic map. Then, the electronic device draws the topographic map based on the relationship between the dimensionality-reduced parameters and the loss function, so as to show the change trend of the loss function of the neural network model during the training process.
  • Step 604: obtain the target hyperparameters, which are the set of hyperparameters corresponding to the target topographic map and are used to train the model; the target topographic map is the topographic map with the highest flatness among the multiple topographic maps, where flatness is used to represent the degree of variation of the loss function of the model in the topographic map.
  • the electronic device can determine the flatness of each topographic map. Specifically, the more orderly and sparse the contour lines in a topographic map, the smaller the variation of the model's loss function in that map and the higher its flatness; the denser and more disordered the contour lines, the larger the variation of the loss function, the steeper the terrain, and the lower the flatness of the map.
  • a steeper topographic map generally corresponds to lower generalization accuracy of the model, where generalization accuracy refers to the model's ability to fit new data.
  • FIG. 7 is a schematic comparison diagram of two topographic maps provided by the embodiment of the present application.
  • the topographic map on the left in FIG. 7 is a topographic map with relatively high flatness, and the topographic map on the right in FIG. 7 is a topographic map with low flatness.
  • the electronic device can select the set of hyperparameters corresponding to the target topographic map with the highest flatness as the target hyperparameters, so as to perform formal training of the neural network model based on the target hyperparameters.
  • the topographic map can represent the change trend of the loss function of the model during the training process, and the hyperparameters finally used for model training are selected by comparing the flatness of the topographic maps corresponding to different hyperparameters. Since this solution uses the weights of the model from only part of the training process to draw the topographic maps, the model does not need to be trained to convergence, which saves the time spent selecting model hyperparameters and improves the selection efficiency.
  • the electronic device may determine the flatness corresponding to each topographic map by calculating the length of the contour line in the topographic map.
  • the number of contour lines in the topographic map can be set first.
  • a contour line refers to the smooth curve formed by connecting points at which a given quantity of the mapped object, here the value of the loss function (that is, the loss value), is equal.
  • for example, the number of contour lines of the topographic map may be set to 100.
  • the electronic device can draw corresponding contour lines in each topographic map based on the set quantity.
  • each topographic map includes the same number of contour lines.
  • the electronic device determines a sum of contour lengths for each topographic map in the plurality of topographic maps.
  • the sum of contour line lengths in a topographic map refers to the sum of the lengths of all contour lines in the topographic map.
  • the electronic device may determine the sum of the lengths of the contour lines in the topographic map based on Formula 1 below:

      L = Σ_{j=1}^{m} Σ_{i=2}^{k} ‖P_i − P_{i−1}‖        (Formula 1)

  • where L represents the sum of the contour lengths in the topographic map, m represents the number of contour lines in the topographic map, k represents the number of nodes collected at equal intervals along each contour line, P_i represents a point on a contour line, and P_{i−1} represents the point immediately preceding P_i on that contour line.
  • the flatness of each topographic map is determined based on the area of the topographic map and the sum of its contour-line lengths. Since the areas of the topographic maps may differ, the electronic device can obtain the contour-line length per unit area by dividing the sum of the contour-line lengths of a map by its area. In this way, the electronic device obtains the contour length per unit area of each of the multiple topographic maps, so as to determine the flatness of each map. Specifically, the flatness of a topographic map is negatively correlated with its contour length per unit area: the greater the contour length per unit area, the lower the flatness; the smaller the contour length per unit area, the higher the flatness.
  • the flatness of the topographic map may be inversely proportional to the length of the contour line per unit area of the topographic map. After the contour length per unit area of the topographic map is determined, the flatness of the topographic map can be determined by calculating the reciprocal of the contour length per unit area of the topographic map.
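As a rough illustration of the flatness metric just described (not the application's own code), the following sketch sums the contour-line lengths produced by matplotlib for one topographic map and returns the reciprocal of the length per unit area; the function and variable names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def flatness(X, Y, Z, n_levels=100):
    """Flatness of one topographic map: reciprocal of the total contour-line
    length per unit area (Formula 1 plus the area normalisation above)."""
    cs = plt.contour(X, Y, Z, levels=n_levels)
    total_len = 0.0
    for level_segs in cs.allsegs:          # one entry per contour level
        for seg in level_segs:             # each seg is a (k, 2) polyline
            d = np.diff(seg, axis=0)       # P_i - P_{i-1}
            total_len += np.sqrt((d ** 2).sum(axis=1)).sum()
    plt.close()
    area = (X.max() - X.min()) * (Y.max() - Y.min())
    return area / total_len if total_len > 0 else float("inf")

# The target hyperparameters would be the ones whose map has the highest flatness, e.g.:
# best = max(candidate_maps.items(), key=lambda kv: flatness(*kv[1]))[0]
```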
  • the electronic device first performs dimension reduction processing on the first group of weights to obtain two high-dimensional vectors serving as projection directions in a two-dimensional space, so as to project the weights serving as high-dimensional vectors into the two-dimensional space.
  • the first set of weights is a set of weights in the multiple sets of weights, and the first set of weights includes the weights of the model after each iteration of training.
  • the first set of weights can be expressed as (θ_1, θ_2, ..., θ_n), where θ_1 represents the weights obtained after the first iteration of training of the neural network model, θ_2 represents the weights obtained after the second iteration of training, and θ_n represents the weights obtained after the n-th iteration of training.
  • each weight θ_i ∈ ℝ^m contains all the weights in the neural network model, that is, θ_i is an m-dimensional high-dimensional vector, where m represents the number of neural units (weight parameters) in the neural network model.
  • the two high-dimensional vectors obtained after dimensionality reduction can be denoted here as δ and η; both δ and η are m-dimensional high-dimensional vectors.
  • dimensionality reduction refers to reducing the data from the original high-dimensional space to a low-dimensional space, while maintaining some meaningful features of high-dimensional original data in the representation of the low-dimensional space.
  • the method for the electronic device to perform dimensionality reduction processing on the first group of weights may be a common high-dimensional dimensionality reduction method, such as a principal component analysis (Principal component analysis, PCA) method.
  • Using the PCA method as a dimensionality reduction method can effectively retain most of the variance in the weight data, that is, retain the characteristics of the weight data itself as much as possible.
  • the electronic device determines the coordinates, in the two-dimensional space, of the weights in the first set of weights after each iteration of training. Specifically, assuming that the weight obtained after one particular iteration of training is located at the center point (0, 0) of the two-dimensional space, the coordinates of the weights obtained after the remaining iterations can be determined by establishing a system of equations.
  • for example, if the first set of weights is expressed as (θ_1, θ_2, ..., θ_n) and θ_n is located at the center point (0, 0), the coordinates of the other weights in the two-dimensional space can be determined from the following system of equations:

      θ_i − θ_n = x_i·δ + y_i·η,   i = 1, 2, ..., n−1

  • in the above equations, (x_i, y_i) are the unknown coefficients, representing the abscissa and ordinate of (θ_1, θ_2, ..., θ_{n−1}) in the two-dimensional space. Since (θ_1, θ_2, ..., θ_n) and the directions δ and η are known, (x_i, y_i) can be solved by the least squares method, thereby determining the coordinates of the weights (θ_1, θ_2, ..., θ_n) in the two-dimensional space.
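A minimal numerical sketch of this projection step is given below; it assumes the two projection directions are taken as the two leading principal components of the weight trajectory centred on θ_n, and all names are illustrative.

```python
import numpy as np

def project_trajectory(thetas):
    """thetas: list of n flattened weight vectors (theta_1 ... theta_n).
    Returns two m-dimensional projection directions (delta, eta) obtained with
    PCA, and the 2-D coordinates of every theta_i, with theta_n placed at (0, 0)."""
    W = np.stack([t - thetas[-1] for t in thetas[:-1]])   # (n-1, m), centred on theta_n
    # PCA: the two leading right singular vectors span the projection plane
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    delta, eta = vt[0], vt[1]
    D = np.stack([delta, eta], axis=1)                     # (m, 2)
    # Least-squares solution of theta_i - theta_n ~= x_i*delta + y_i*eta
    coords, *_ = np.linalg.lstsq(D, W.T, rcond=None)       # (2, n-1)
    coords = np.concatenate([coords.T, [[0.0, 0.0]]])      # append theta_n at the origin
    return delta, eta, coords
```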
  • the electronic device may further determine the boundary of the first topographic map based on the coordinates of the weights in the two-dimensional space after training for each iteration.
  • the electronic device may determine the maximum and minimum values of the multiple coordinates on the x-axis and the y-axis respectively based on the obtained multiple coordinates (that is, the coordinates of the weights in the two-dimensional space after each iteration training), Thus, the boundary of the first topographic map is determined.
  • the boundary of the first topographic map can be determined based on coordinate points (0.5, 0.15), (0.5, -0.2), (-0.7, 0.15), (-0.7, -0.2).
  • the boundary of the first topographic map may be appropriately extended on the basis of the above four coordinate points, so as to prevent the coordinates corresponding to the first group of weights from being too close to the boundary of the first topographic map.
  • after determining the boundary of the first topographic map, the electronic device divides the first topographic map into multiple regions according to the boundary, so as to obtain a sampling point for each of the multiple regions.
  • the electronic device may divide the first topographic map into regions at equal intervals, so as to divide the first topographic map into a plurality of regions with equal areas. In this way, the electronic device uses each area in the first topographic map as a sampling unit, and determines sampling points in each area. For example, when the electronic device divides the first topographic map into a plurality of square areas, the electronic device respectively determines center points or boundary points of the plurality of square areas as sampling points.
  • the electronic device may implement the region division of the first topographic map based on the following Formula 2:

      x_i = min(x) + i·(max(x) − min(x))/N,   y_i = min(y) + i·(max(y) − min(y))/N,   i = 0, 1, ..., N        (Formula 2)

  • where x and y represent the abscissa and ordinate of the sampling points in the divided regions, the value of i is (0, 1, ..., N), N represents the number of divisions of the first topographic map along each axis, max(x) and min(x) represent the maximum and minimum values of the x-axis among the coordinates of the first set of weights in the two-dimensional space, and max(y) and min(y) represent the maximum and minimum values of the y-axis among those coordinates.
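For example, the equal-interval division could be realised as in the following sketch, under the assumption that N grid divisions are used per axis and that the boundary is slightly extended as described above; the parameter names are illustrative.

```python
import numpy as np

def grid_sampling_points(coords, n=20, margin=0.1):
    """Equal-interval grid over the map boundary (Formula 2). `coords` are the
    2-D coordinates of the weights; `margin` slightly extends the boundary so
    that the trajectory does not touch the edge."""
    x_min, y_min = coords.min(axis=0)
    x_max, y_max = coords.max(axis=0)
    dx, dy = margin * (x_max - x_min), margin * (y_max - y_min)
    xs = np.linspace(x_min - dx, x_max + dx, n + 1)   # x_i = min + i*(max-min)/N
    ys = np.linspace(y_min - dy, y_max + dy, n + 1)
    return np.meshgrid(xs, ys)                        # X, Y of shape (n+1, n+1)
```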
  • the electronic device may determine a loss value corresponding to the model based on the weight corresponding to each sampling point in the first topographic map, so as to draw the first topographic map.
  • the plurality of topographic maps include the first topographic map, and the first topographic map is a topographic map drawn based on a first set of weights among multiple sets of weights.
  • the weight corresponding to each sampling point can be determined based on formulas similar to the above equations. Then, the weight of each sampling point is passed into the model, and the forward calculation is performed on the model to obtain the corresponding loss value of the model, thereby obtaining the three-dimensional coordinates corresponding to each sampling point. Based on the three-dimensional coordinates corresponding to each sampling point, the electronic device can realize the drawing of the topographic map.
  • the weight corresponding to the sampling point can be passed into the model, and a forward inference calculation is performed on the model to obtain the loss value Z, which together with the sampling point's coordinates constitutes a three-dimensional coordinate (x, y, Z) in the topographic map.
  • the loss value Z of a sampling point can be expressed based on the following Formula 3:

      Z = L(θ_n + x·δ + y·η)        (Formula 3)

  • where Z represents the loss value of the sampling point, and L(·) represents the forward inference calculation of the model using the weight corresponding to the sampling point at coordinates (x, y).
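A minimal PyTorch-style sketch of this forward calculation (Formula 3 as reconstructed above) is given below; it assumes theta_n, delta and eta are torch tensors of the flattened model weights and that loss_fn and batch come from the training subset. All of these names are assumptions, not part of this application.

```python
import torch

def loss_at_point(model, loss_fn, batch, theta_n, delta, eta, x, y):
    """Load the weight theta_n + x*delta + y*eta into the model and run one
    forward pass to obtain the loss value Z for the sampling point (x, y)."""
    flat = theta_n + x * delta + y * eta
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            k = p.numel()
            p.copy_(flat[offset:offset + k].view_as(p))
            offset += k
        inputs, targets = batch
        return loss_fn(model(inputs), targets).item()   # Z = L(theta(x, y))
```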
  • the electronic device may use twin parameter parallelism to calculate the loss value corresponding to each sampling point, so as to improve the efficiency of topographic map drawing.
  • the electronic device may start from the same model and set different weight parameters for its neurons, so as to obtain multiple sub-models with the same structure but different weights. Then, based on the same input data, the electronic device executes the inference operations of these sub-models in parallel on the same hardware (such as the same GPU), so as to obtain the loss values of the multiple sub-models.
  • the loss values of the multiple sub-models respectively correspond to the loss values of multiple sampling points.
  • the electronic device constructs multiple sub-models based on the model and the weights corresponding to the multiple sampling points in the first topographic map; the multiple sub-models correspond one-to-one to the weights of the multiple sampling points, and the structures of the multiple sub-models are the same as the structure of the model.
  • the electronic device respectively inputs the same training data into the multiple sub-models to obtain loss values corresponding to the multiple sampling points.
  • the training data for inputting into the multiple sub-models may be part of the data in the training set.
  • the structures of the multiple sub-models are the same, but the weights of the neurons in the multiple sub-models respectively correspond to the weights corresponding to the multiple sampling points.
  • for example, suppose the electronic device obtains weight 1 corresponding to sampling point 1 and weight 2 corresponding to sampling point 2.
  • the electronic device can construct sub-model 1 and sub-model 2 based on weight 1, weight 2 and the neural network model; the structures of sub-model 1 and sub-model 2 are the same, the weights of the neural units in sub-model 1 are weight 1 corresponding to sampling point 1, and the weights of the neural units in sub-model 2 are weight 2 corresponding to sampling point 2.
  • the electronic device can perform inference operations on the multiple sub-models simultaneously and in parallel on the same hardware based on the same input data, so as to obtain loss values corresponding to multiple sampling points.
  • inference operations can be performed on multiple sub-models in parallel, so that the loss values corresponding to multiple sampling points can be obtained at the same time, and the drawing efficiency of topographic maps is improved.
  • FIG. 8 is a schematic diagram of a twin parameter parallel method provided by an embodiment of the present application.
  • the electronic device can construct a twin model based on two sets of weight parameters (weight parameter 1 and weight parameter 2 respectively) and a neural network model.
  • the twin model includes sub-model 1 and sub-model 2 with the same structure.
  • before performing inference operations, the twin model needs to read the weight parameters of its neurons, so as to perform the subsequent inference operations.
  • the neural network model identifies and reads weight parameters through unique variable names. Therefore, in this embodiment, the electronic device may modify the name of each weight parameter to match the name of the corresponding sub-model in the twin model, so as to ensure that each sub-model can read and recognize its own weight parameters. For example, assuming that the names of sub-model 1 and sub-model 2 are Net1 and Net2 respectively, the name of weight parameter 1 corresponding to sub-model 1 is changed to Net1.param, and the name of weight parameter 2 corresponding to sub-model 2 is changed to Net2.param.
  • the twin model reads the weight parameters after their names have been modified, so that sub-model 1 in the twin model is configured with the corresponding weight parameter 1 and sub-model 2 with the corresponding weight parameter 2. Then, a batch of training data is input to the twin model; based on the same batch of training data, inference operations are performed on sub-model 1 and sub-model 2, and the loss functions of the sub-models in the twin model are stitched together using the concat operator to obtain the output of the twin model.
  • the stitched loss function of the twin model is then disassembled by the split operator to generate the loss function corresponding to each sub-model, thereby obtaining the loss values corresponding to the multiple sets of weights, that is, the loss values corresponding to the multiple sampling points.
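The exact concat/split mechanism depends on the framework. As a rough analogue only (not the framework operators described in this application), the following PyTorch sketch builds structurally identical sub-models with different sampling-point weights and evaluates them on the same batch, returning one loss per sampling point. Here load_flat_fn stands for a weight-loading helper such as the one sketched earlier; all names are assumptions.

```python
import copy
import torch
import torch.nn as nn

class TwinModel(nn.Module):
    """Several structurally identical sub-models, each loaded with the weights
    of one sampling point, evaluated in a single forward pass on the same batch."""
    def __init__(self, base_model, flat_weight_list, load_flat_fn):
        super().__init__()
        self.subs = nn.ModuleList()
        for flat in flat_weight_list:
            sub = copy.deepcopy(base_model)
            load_flat_fn(sub, flat)          # write the sampling-point weights into the copy
            self.subs.append(sub)

    def forward(self, inputs, targets, loss_fn):
        # gather the per-sub-model losses into one tensor ("concat"),
        # which can then be split back into one loss value per sampling point
        losses = torch.stack([loss_fn(sub(inputs), targets) for sub in self.subs])
        return losses                        # losses[i] is Z for sampling point i
```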
  • FIG. 9 is a schematic flowchart of a topographic map drawing provided by an embodiment of the present application
  • FIG. 10 is a schematic flowchart of a topographic map drawn based on sampling points provided by an embodiment of this application.
  • an electronic device obtains any set of weights ( ⁇ 1 , ⁇ 2 , ..., ⁇ n )
  • it can perform dimensionality reduction processing on the set of weights to obtain two high-dimensional vectors
  • the electronic device determines the coordinates in the two-dimensional space of the weights in the set of weights after each iteration training.
  • the electronic device takes the boundary of the topographic map and generates a grid of equal-area cells within it. For each grid cell in the topographic map, the electronic device takes a sample and determines the weight corresponding to each sampling point. Based on the weight corresponding to each sampling point, the electronic device loads the weight into the model and performs a forward calculation to obtain the loss value corresponding to the sampling point, and then determines the three-dimensional coordinates of each sampling point in the topographic map. Finally, the topographic map is drawn based on the three-dimensional coordinates of all sampling points.
  • the electronic device can identify the rough areas in the topographic map, and increase the sampling points in the rough areas.
  • the electronic device may determine the ruggedness of each region in the first topographic map.
  • the electronic device may respectively determine the second-order derivative matrix H of the sampling points of each region in the first topographic map based on a numerical method. Then, the electronic device calculates the two eigenvalues of the second-order derivative matrix H, and determines the sum of the absolute values of the two eigenvalues, so as to obtain the ruggedness of each region in the first topographic map.
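A numerical sketch of this ruggedness estimate for an interior grid point is given below, using central finite differences on the sampled loss surface Z with grid spacings hx and hy; the indexing convention (rows = y, columns = x) and the function name are assumptions.

```python
import numpy as np

def ruggedness(Z, i, j, hx, hy):
    """Ruggedness of the region around interior grid point (i, j): sum of the
    absolute values of the two eigenvalues of the numerically estimated
    second-order derivative matrix H of the loss surface Z."""
    z_xx = (Z[i, j + 1] - 2 * Z[i, j] + Z[i, j - 1]) / hx ** 2
    z_yy = (Z[i + 1, j] - 2 * Z[i, j] + Z[i - 1, j]) / hy ** 2
    z_xy = (Z[i + 1, j + 1] - Z[i + 1, j - 1]
            - Z[i - 1, j + 1] + Z[i - 1, j - 1]) / (4 * hx * hy)
    H = np.array([[z_xx, z_xy], [z_xy, z_yy]])
    return np.abs(np.linalg.eigvalsh(H)).sum()
```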
  • after determining the ruggedness of each area in the first topographic map, the electronic device adds sampling points in the first topographic map according to the ruggedness of each area, so as to update the first topographic map.
  • the density of sampling points in the area in the first topographic map has a positive correlation with the ruggedness of the area.
  • the principle for adding sampling points can be: the higher the ruggedness of an area, the more sampling points are added; the lower the ruggedness of an area, the fewer sampling points are added, or even none at all.
  • the electronic device may sort the multiple areas in the first topographic map in descending order of ruggedness, so as to obtain a sorting result of the multiple areas.
  • the region with a higher degree of ruggedness is ranked higher; the region with a lower degree of ruggedness is ranked lower.
  • based on the sorting results of the multiple areas, the electronic device sequentially adds sampling points in the multiple areas until the number of added sampling points reaches a preset threshold.
  • for example, the electronic device can divide the first topographic map into a 20×20 grid, obtaining 400 regions in total and thus 400 initial sampling points. Then, the electronic device calculates the ruggedness of the 400 regions based on the above-mentioned ruggedness calculation method, and sorts the 400 regions by ruggedness. Since the total number of sampling points in the first topographic map is 1600, and 400 sampling points have already been collected, 900 sampling points remain to be collected.
  • the electronic device can re-divide the more rugged areas, in order of the sorting results, into four smaller areas each, and add sampling points in the re-divided areas until 900 additional sampling points have been added. Finally, the remaining areas of the first topographic map that receive no additional sampling points can also be divided into four smaller areas, and the loss values of the grid points in these areas are determined by linear interpolation.
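An illustrative sketch of this refinement order follows; the accounting of three extra points per subdivided cell is only a simplification consistent with the 400 → 1600 example above, and the data layout and names are assumed.

```python
def refine(regions, budget=900):
    """Adaptive sampling sketch: regions is a list of (ruggedness, region) pairs.
    Subdivide the most rugged regions first, spending at most `budget` extra
    sampling points; the numbers follow the 20x20 -> 1600-point example above."""
    order = sorted(range(len(regions)), key=lambda k: regions[k][0], reverse=True)
    added, refined = 0, []
    for k in order:
        if added + 3 > budget:          # splitting one cell into 4 adds ~3 new points
            break
        refined.append(regions[k][1])   # mark this region for 2x2 subdivision
        added += 3
    return refined                      # remaining regions are filled by interpolation
```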
  • FIG. 11 is a schematic diagram of increasing sampling points based on the roughness of the region provided by the embodiment of the present application. As shown in Figure 11, for the more rugged area on the left side of the topographic map, sampling points are added to ensure that the topographic map can accurately describe the change trend of the loss value in the topographic map.
  • the electronic device may acquire or establish a mapping relationship between the degree of roughness and the number of sampling points in advance.
  • the electronic device can determine the sampling points that need to be increased for each area based on the mapping relationship between the roughness and the number of sampling points, so as to increase the sampling points.
  • in the mapping relationship between ruggedness and the number of sampling points, the higher the ruggedness of an area, the larger the corresponding number of sampling points; the lower the ruggedness of an area, the smaller the corresponding number of sampling points.
  • the mapping relationship between ruggedness and the number of sampling points can be determined comprehensively according to actual needs, and the specific mapping relationship is not limited here.
  • the topographic map is first sampled at relatively sparse equal intervals, and then, based on the ruggedness of each sampled area, sampling points are added in the areas with higher ruggedness. This achieves a visual effect close to full sampling with fewer sampling points, and improves the drawing efficiency of the topographic map while ensuring that it accurately describes the change trend of the loss values.
  • FIG. 12 is a schematic diagram of the architecture of a training process visual analysis system provided by an embodiment of the present application.
  • the visual analysis system for the training process can read the hyperparameter configuration file required by the neural network model from the server or host directory, and configure the hyperparameters of the neural network model based on the hyperparameter configuration file.
  • the computing acceleration module is used to draw topographic maps, and the topographic maps corresponding to each set of hyperparameters are obtained.
  • the topographic map is analyzed based on the quantitative analysis module to determine the hyperparameters that need to be selected.
  • the drawn topographic map and the determined hyperparameters can be returned to the front-end service based on the interaction request of the front-end service, so as to render and display the topographic map and hyperparameters on the front-end.
  • FIG. 13 is a schematic workflow diagram of a training process visual analysis system provided by an embodiment of the present application. As shown in FIG. 13 , the workflow of the training process visual analysis system includes the following steps 1 to 5.
  • Step 1: select different values of the hyperparameters (such as hyperparameter 1 and hyperparameter 2) and an appropriate data subset, conduct small-scale training of the neural network model, and record the weights of the neural network model after each iteration of training. For example, based on the data subset, the neural network model is trained for 5 epochs, where 1 epoch means that the model has been trained once on all the data in the training set.
  • Step 2 based on the selected dimensionality reduction method (such as PCA method), determine the dimensionality reduction direction and sampling range of the weights belonging to the high-dimensional parameter space.
  • Step 3: perform the topographic map rendering calculation in combination with the computation acceleration module.
  • FIG. 14 is a schematic workflow diagram of a calculation acceleration module provided by an embodiment of the present application.
  • the computation acceleration module adopts an adaptive sampling granularity method: it first performs relatively sparse equidistant sampling of the topographic map and then, based on the ruggedness of each sampled area, adds sampling points in areas with higher ruggedness, achieving a visual effect close to full sampling with fewer sampling points.
  • the computing acceleration module uses multiple computing cards (chips) to perform parallel sampling to improve computing efficiency.
  • the calculation acceleration module also uses the above-mentioned twin parameter parallel method to train the model to improve the training efficiency of the model.
  • Step 4 After drawing the topographic map, analyze the topographic map to select a better hyperparameter configuration.
  • Step 5 Perform large-scale training on the obtained hyperparameters on the full data set to obtain the final model.
  • the Imagenet data set includes 1.28 million data
  • the data subset may include 50,000 data, for example.
  • three initial learning rates of 0.1, 0.4 and 0.8 are selected, and 5 epochs are trained on the data subset.
  • FIG. 15 is a comparative schematic diagram of topographic maps with different learning rates provided by the embodiment of the present application.
  • the model is trained for a long time on the full data set (that is, 1.28 million data) to obtain the final model.
  • the learning rate of 0.8 selected by the model hyperparameter selection method provided by this embodiment has the lowest flatness index (that is, the smallest contour length per unit area, corresponding to the flattest topographic map).
  • the final model trained based on the learning rate of 0.8 also achieved the highest test accuracy. That is to say, the hyperparameters selected based on the model hyperparameter selection method provided by the embodiment of the present application can guarantee the accuracy of the trained model.
  • the model hyperparameter selection method provided by the embodiment of the present application can increase the hyperparameter selection efficiency by about 3 times.
  • Experimental scenario: the user needs to determine an initial learning rate for the model so that the model can achieve higher generalization accuracy. Assume that the size of the full data set used to train the model is N, and that the time consumed to train a model on the full data set is T.
  • in the solution of this embodiment, the user randomly and uniformly selects a small data subset from the full data set and performs short-term training of the model on that subset. After the short training, the topographic map corresponding to each learning rate is drawn, and the final learning rate is determined from the topographic maps. Assuming that the time consumed by the short training on the data subset is T_s and the time consumed to draw one topographic map is T_m, the total time consumed to draw topographic maps for P initial learning rates is P(T_s + T_m).
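As a purely illustrative comparison (it assumes a baseline that trains each of the P candidate learning rates on the full data set at a cost of roughly P·T, which is an assumption rather than a figure stated here):

$$\text{speedup} \approx \frac{P\,T}{P\,(T_s + T_m)} = \frac{T}{T_s + T_m}\,; \qquad \text{e.g. } T_s + T_m \approx \tfrac{1}{3}T \;\Rightarrow\; \text{speedup} \approx 3,$$

which is consistent with the roughly 3× improvement reported in this embodiment.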
  • in this experiment, the ResNet50 network is trained based on the ImageNet data set, and a data subset is uniformly selected from it; for example, the cifar10 data set, which includes 50,000 samples, may be regarded as the selected data subset.
  • Table 2 The information for performing training based on the cifar10 dataset is shown in Table 2 below.
  • Epoch represents the number of iterations of training
  • 1 epoch represents a complete training of the model using all the data in the training set.
  • Batch means to use a part of the samples in the training set to perform a backpropagation parameter update on the model weight. This small part of the samples is called "a batch of data", that is, batch.
  • Batch_size is the size of the batch data. Steps per epoch indicates the number of steps included in an epoch, and each step is sent to batch_size data for training.
  • the method for selecting model hyperparameters provided by the embodiment of the present application can improve the hyperparameter selection efficiency by about 3 times.
  • the above description takes the learning rate as an example of the hyperparameter to be selected and introduces the process of selecting the learning rate based on the flatness of the topographic map.
  • the following takes the Batch Normalization layer as an example of a model hyperparameter and introduces the process of choosing, based on the flatness of the topographic map, whether to add a Batch Normalization layer.
  • the applicant's research found that adding a Batch Normalization layer to the neural network model can make the optimization space for neural network model training smoother, which is more conducive to gradient descent and makes the model converge faster and more smoothly. Conversely, if the Batch Normalization layer is not added to the neural network model, the convergence of the neural network model will be slower and the convergence process will be more tortuous. Specifically, in the VGG type neural network model, if the Batch Normalization layer is not added, there will be obvious bumps and rough shapes around the training track of the neural network model.
  • the electronic device can identify the training path of the loss function in the topographic map.
  • the electronic device can generate corresponding prompt information to instruct the user to select a specific hyperparameter, that is, to instruct the user to select the Batch Normalization layer as a hyperparameter of the neural network model.
  • FIG. 16 is a schematic diagram of a topographic map-based guidance model structure provided by an embodiment of the present application.
  • after the electronic device obtains the topographic map, if it recognizes that the terrain along the training path in the map is gentle and has no obvious bulge, it may not generate prompt information, and there is no need to recommend that the user select the Batch Normalization layer as a model hyperparameter; if the electronic device recognizes an obvious raised area with large loss values around the training path in the topographic map, it can generate prompt information to remind the user to check whether the model includes a Batch Normalization layer, that is, to prompt the user to select the Batch Normalization layer as a model hyperparameter.
  • the shape of the training path in the topographic map can be used to represent the flatness of the topographic map, and the specific shape of the training path in the topographic map can be considered as a specific flatness.
  • each point on the training path has a corresponding position coordinate point in the two-dimensional space, and each point on the training path also has a corresponding loss value, so based on the position coordinates of each point in the two-dimensional space Points and loss values, the training path can be drawn in 3D space.
  • the smaller the change in the loss function, the higher the flatness of the topographic map.
  • when the electronic device recognizes that the training path in the topographic map has a specific shape, it may consider that the topographic map has a specific flatness, and thereby generate prompt information to instruct the user to select a specific hyperparameter.
  • in this experiment, a VGG (Visual Geometry Group) network is used, and the purpose of the experiment is to determine how to speed up the convergence of model training.
  • Fig. 17 is a topographic map of the VGG model provided by the embodiment of the present application without adding the Batch Normalization layer.
  • FIG. 18 is a topographic map of the VGG model with the Batch Normalization layer added according to the embodiment of the present application. As shown in Figure 18, after adding the Batch Normalization layer, the shape of the training trajectory in the topographic map corresponding to the VGG model has been significantly improved.
  • FIG. 19 is a schematic diagram of the effect comparison before and after adding the Batch Normalization layer to the VGG model provided by the embodiment of the present application. It can be seen from Figure 19 that after the Batch Normalization layer is added to the VGG model, the convergence of the VGG model becomes faster and more stable during the training process.
  • an electronic device provided in an embodiment of the present application includes: an acquisition unit 2001 and a processing unit 2002 .
  • the acquiring unit 2001 is configured to acquire multiple sets of hyperparameters of the neural network model;
  • the processing unit 2002 is configured to perform multiple iterations of training on the model based on the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process; the multiple sets of weights are in one-to-one correspondence with the multiple sets of hyperparameters, and each set of weights includes the weights obtained by multiple iterations of training;
  • the processing unit 2002 is also used to perform the drawing of multiple topographic maps, wherein each topographic map is drawn based on one of the multiple sets of weights and each topographic map is used to represent the change trend of the loss function of the model during the training process; the processing unit 2002 is also used to obtain the target hyperparameters, which are the set of hyperparameters corresponding to the target topographic map and are used to train the model, the target topographic map being the topographic map with the highest flatness among the multiple topographic maps.
  • in a possible implementation, the flatness of each topographic map is related to the area of the topographic map and the sum of its contour-line lengths.
  • each topographic map includes the same number of contour lines, and the points on a contour line correspond to the same loss value.
  • the acquiring unit 2001 is further configured to acquire a training subset, and the training subset includes part of the training data in the training set of the model; the processing unit 2002 is also configured to adopt The training subset is to perform multiple iterative training on the model based on the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process, and each set of weights in the multiple sets of weights includes the The set of weights of the above model after each iteration of training.
  • the processing unit 2002 is further configured to perform dimension reduction processing on the first set of weights to obtain two high-dimensional vectors as projection directions in a two-dimensional space, and the first set of weights is A set of weights in the multiple sets of weights, the first set of weights includes a weight set of the model after each iteration of training; the processing unit 2002 is further configured to based on the first set of weights and the Two high-dimensional vectors, determining weights corresponding to multiple sampling points in the first topographic map, the multiple topographic maps including the first topographic map; the processing unit 2002 is further configured to The weights corresponding to the sampling points are used to determine the loss value corresponding to the model to draw the first topographic map.
  • the processing unit 2002 is further configured to construct multiple sub-models based on the model and weights corresponding to multiple sampling points in the first topographic map, and the multiple sub-models There is a one-to-one correspondence with the weights corresponding to the multiple sampling points, and the structure of the multiple sub-models is the same as the structure of the model; the processing unit 2002 is also configured to input the same training data to the multiple sub-models respectively. sub-models to obtain the loss values corresponding to the multiple sampling points.
  • the processing unit 2002 is further configured to determine the roughness of each region in the first topographic map, and the roughness is used to represent the concentration of contour lines in each region degree, the loss values corresponding to the points on the contour line are the same; the processing unit 2002 is further configured to add sampling points in the first terrain map according to the degree of roughness, so as to update the first terrain Figure; wherein, the density of sampling points in the region in the first topographic map has a positive correlation with the roughness of the region.
  • the processing unit 2002 is further configured to sort the multiple areas in the first topographic map in descending order of ruggedness, so as to obtain the multiple areas The sorting results of the multiple areas; the processing unit 2002 is further configured to sequentially add sampling points in the multiple areas based on the sorting results of the multiple areas until the number of added sampling points reaches a preset threshold.
  • the processing unit 2002 is further configured to respectively determine the second-order derivative matrix of the sampling points in each region in the first topographic map; the processing unit 2002 is also configured to calculate the Two eigenvalues of the second-order derivative matrix, and determine the sum of the absolute values of the two eigenvalues, so as to obtain the ruggedness of each region in the first topographic map.
  • FIG. 21 is a schematic structural diagram of an execution device provided by an embodiment of the present application. The execution device 2100 may be, for example, a smart wearable device or a server; its specific form is not limited here.
  • the data processing apparatus described in the embodiment corresponding to FIG. 21 may be deployed on the execution device 2100 to realize the data processing function in the embodiment corresponding to FIG. 21 .
  • the execution device 2100 includes: a receiver 2101, a transmitter 2102, a processor 2103, and a memory 2104 (the number of processors 2103 in the execution device 2100 may be one or more, and one processor is taken as an example in FIG. 21).
  • the processor 2103 may include an application processor 21031 and a communication processor 21032 .
  • the receiver 2101, the transmitter 2102, the processor 2103, and the memory 2104 may be connected through a bus or in other ways.
  • the memory 2104 may include read-only memory and random-access memory, and provides instructions and data to the processor 2103 .
  • a part of the memory 2104 may also include a non-volatile random access memory (non-volatile random access memory, NVRAM).
  • the memory 2104 stores operating instructions executable by the processor, executable modules or data structures, or a subset or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • the processor 2103 controls the operations of the execution device.
  • various components of the execution device are coupled together through a bus system, where the bus system may include not only a data bus, but also a power bus, a control bus, and a status signal bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 2103 or implemented by the processor 2103 .
  • the processor 2103 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 2103 or instructions in the form of software.
  • the above-mentioned processor 2103 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 2103 may implement or execute various methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 2104, and the processor 2103 reads the information in the memory 2104, and completes the steps of the above method in combination with its hardware.
  • the receiver 2101 can be used to receive input digital or character information, and generate signal input related to performing device related settings and function control.
  • the transmitter 2102 can be used to output digital or character information through the first interface; the transmitter 2102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 2102 can also include a display device such as a display screen .
  • the processor 2103 is configured to execute the method for selecting model hyperparameters executed by the execution device in the embodiment corresponding to FIG. 6 .
  • the execution device, training device or electronic device provided in the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits etc.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chip in the execution device executes the method for selecting model hyperparameters described in the above embodiments, or makes the chip in the training device execute the model hyperparameters described in the above embodiments. Parameter selection method.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
  • FIG. 22 is a schematic structural diagram of a chip provided by the embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 2200; the NPU 2200 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
  • the core part of the NPU is the operation circuit 2203, and the controller 2204 controls the operation circuit 2203 to extract matrix data in the memory and perform multiplication operations.
  • the operation circuit 2203 includes multiple processing units (Process Engine, PE). In some implementations, the operation circuit 2203 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 2203 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 2202, and caches it in each PE in the operation circuit.
  • the operation circuit takes the data of matrix A from the input memory 2201 and performs matrix operation with matrix B, and the obtained partial or final results of the matrix are stored in the accumulator (accumulator) 2208 .
  • the unified memory 2206 is used to store input data and output data.
  • the weight data is transferred to the weight memory 2202 through the direct memory access controller (Direct Memory Access Controller, DMAC) 2205.
  • the input data is also transferred to the unified memory 2206 through the DMAC.
  • the BIU is the Bus Interface Unit, that is, the bus interface unit 2213, which is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 2209.
  • the bus interface unit 2213 (Bus Interface Unit, BIU for short), is used for the instruction fetch memory 2209 to obtain instructions from the external memory, and is also used for the storage unit access controller 2205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to move the input data in the external memory DDR to the unified memory 2206 , move the weight data to the weight memory 2202 , or move the input data to the input memory 2201 .
  • the vector computing unit 2207 includes a plurality of computing processing units, and if necessary, further processes the output of the computing circuit 2203, such as vector multiplication, vector addition, exponent operation, logarithmic operation, size comparison and so on. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as Batch Normalization (batch normalization), pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 2207 can store the vector of the processed output to unified memory 2206 .
  • the vector calculation unit 2207 can apply a linear function or a nonlinear function to the output of the operation circuit 2203, for example performing linear interpolation on the feature planes extracted by the convolution layers, or accumulating vectors of values to generate activation values.
  • the vector computation unit 2207 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as an activation input to operational circuitry 2203, eg, for use in subsequent layers in a neural network.
  • An instruction fetch buffer (instruction fetch buffer) 2209 connected to the controller 2204 is used to store instructions used by the controller 2204;
  • the unified memory 2206, the input memory 2201, the weight memory 2202 and the fetch memory 2209 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned above can be a general-purpose central processing unit, microprocessor, ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
  • FIG. 23 is a schematic structural diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the present application also provides a computer-readable storage medium.
  • the method disclosed in FIG. 6 above may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium, or encoded on another non-transitory medium or article of manufacture.
  • Figure 23 schematically illustrates a conceptual partial view of an example computer-readable storage medium comprising a computer program for executing a computer process on a computing device arranged in accordance with at least some embodiments presented herein.
  • the computer-readable storage medium 2300 is provided using a signal-bearing medium 2301.
  • the signal-bearing medium 2301 may include one or more program instructions 2302 which, when executed by one or more processors, may provide the functions or part of the functions described above with respect to FIG. 6 .
  • program instructions 2302 in FIG. 23 also describe example instructions.
  • signal bearing media 2301 may include computer readable media 2303 such as, but not limited to, a hard drive, compact disc (CD), digital video disc (DVD), digital tape, memory, ROM or RAM, and the like.
  • signal bearing media 2301 may comprise computer recordable media 2304 such as, but not limited to, memory, read/write (R/W) CDs, R/W DVDs, and the like.
  • signal bearing media 2301 may include communication media 2305, such as, but not limited to, digital and/or analog communication media (eg, fiber optic cables, waveguides, wired communication links, wireless communication links, etc.).
  • the signal-bearing medium 2301 may be conveyed by a wireless form of communication medium 2305 (for example, a wireless communication medium that complies with the IEEE 802.11 standard or another transmission protocol).
  • One or more program instructions 2302 may be, for example, computer-executable instructions or logic-implemented instructions.
  • the computing device may be configured to respond to program instructions 2302 communicated to the computing device via one or more of computer-readable media 2303 , computer-recordable media 2304 , and/or communication media 2305 , providing various operations, functions, or actions.
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk or an optical disc, and includes several instructions to make a computer device (which may be a personal computer, a training device, or a network device, etc.) execute the methods described in the various embodiments of the present application.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • software When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a training device or a data center integrating one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.


Abstract

The present application discloses a method for selecting model hyperparameters, applied to the field of artificial intelligence. The method includes: obtaining multiple sets of hyperparameters of a neural network model; performing multiple rounds of iterative training on the model based on the multiple sets of hyperparameters respectively, so as to obtain multiple sets of weights of the model during training, where the multiple sets of weights correspond one-to-one to the multiple sets of hyperparameters and each set of weights includes the weights obtained by multiple iterations of training; drawing multiple topographic maps, where each topographic map is drawn based on one of the multiple sets of weights; and obtaining target hyperparameters, the target hyperparameters being the set of hyperparameters corresponding to the target topographic map, which is the topographic map with the highest flatness among the multiple topographic maps. In this solution, the topographic maps are drawn from the weights of the model during only part of the training process, so the model does not need to be trained to convergence, which saves the time spent selecting model hyperparameters and improves the selection efficiency.

Description

A method for selecting model hyperparameters and a related apparatus
This application claims priority to Chinese Patent Application No. 202110722986.3, filed with the China National Intellectual Property Administration on June 28, 2021 and entitled "A method for selecting model hyperparameters and a related apparatus", which is incorporated herein by reference in its entirety.
Technical Field
本申请涉及人工智能技术领域,尤其涉及一种模型超参数的选择方法及相关装置。
Background Art
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
深度学习方法,是近年来人工智能领域发展的一个关键推动力,目前广泛应用于复杂数据的特征提取及推理预测。例如,深度学习方法应用于计算机视觉任务,能够实现图像增强;又例如,深度学习方法应用于数据分类任务,能够对文本或语音等数据进行分类。
深度学习方法通常是通过训练得到的神经网络模型对数据进行处理。在神经网络模型的训练过程中,用户为模型所设置的超参数(例如学习率或优化器)能够影响模型的泛化精度。因此,为了尽可能地提高模型的泛化精度,人们通常需要在模型训练前选择合理的超参数。
At present, hyperparameter selection methods mainly pick part of the training data from the model's training set, train models configured with different hyperparameters on that partial data until convergence, and then select the corresponding hyperparameters by comparing the trained models. Because this approach requires training each model to convergence, it usually takes a long time, so the efficiency of hyperparameter selection is low.
Summary of the Invention
The present application provides a method for selecting model hyperparameters, which can improve the efficiency of selecting model hyperparameters.
A first aspect of the present application provides a method for selecting model hyperparameters, applied to the field of artificial intelligence. The method includes the following steps. An electronic device obtains multiple sets of hyperparameters of a neural network model. A hyperparameter is a parameter used to control the model training process; it is set before the training process starts rather than obtained through training. Each of the multiple sets of hyperparameters includes one or more hyperparameters, the types of hyperparameters included in each set are the same, and for any two sets, the value of at least one hyperparameter differs. For example, each of the sets of hyperparameters obtained by the electronic device may include two hyperparameters, namely the learning rate and the optimizer.
Then, the electronic device performs multiple iterations of training on the model based on each of the multiple sets of hyperparameters, so as to obtain multiple sets of weights of the model during the training process. The multiple sets of weights correspond one-to-one to the multiple sets of hyperparameters, and each set of weights includes the weights obtained by multiple iterations of training. Specifically, the electronic device may train the neural network model repeatedly, once per set of hyperparameters, with the same number of training iterations and the same training data in every run.
Next, the electronic device draws multiple topographic maps, where each topographic map is drawn based on one of the multiple sets of weights and each topographic map represents the change trend of the model's loss function during training. In each topographic map, the position of the model's loss function is determined based on the set of weights corresponding to that map. Since each set of weights includes the weights corresponding to multiple training iterations, and one loss value can be obtained from the weights of each iteration, each set of weights has multiple corresponding loss values. Based on the multiple sets of weights, the electronic device can draw one topographic map per set of weights, thereby obtaining multiple topographic maps in one-to-one correspondence with the multiple sets of weights. For example, if the electronic device obtains 5 sets of weights, it can draw 5 topographic maps, each based on one set of weights.
Finally, the electronic device obtains the target hyperparameters. The target hyperparameters are the set of hyperparameters corresponding to the target topographic map and are used to train the model; the target topographic map is the topographic map with the highest flatness among the multiple topographic maps, where flatness represents the degree of variation of the model's loss function within the map. The larger the variation of the loss function in a topographic map, the lower its flatness; the smaller the variation, the higher its flatness.
In this solution, models configured with different hyperparameters are trained, and topographic maps are drawn from the weights collected during training; such a map can represent the change trend of the loss function during training. The hyperparameters finally used for model training are selected by comparing the flatness of the topographic maps corresponding to the different hyperparameters. Because the topographic maps are drawn from weights collected during only part of the training process, the model does not need to be trained to convergence, which saves the time needed to select model hyperparameters and improves selection efficiency.
在一种可能的实现方式中,所述每个地形图的平整程度与所述多个地形图中每个地形图的面积和所述等值线长度之和有关,所述多个地形图中的每个地形图均包括相同数量的等值线,所述等值线上的点对应的损失值相同。
具体地,电子设备可以确定所述多个地形图中每个地形图的等值线长度之和,所述多个地形图中的每个地形图均包括相同数量的等值线,所述等值线上的点对应的损失值相同。基于所述多个地形图中每个地形图的面积和所述等值线长度之和,电子设备确定所述每个地形图的平整程度。其中,地形图的等值线长度之和是指地形图中的所有等值线的长度之和,损失值是指损失函数的值。
本方案中,通过求取地形图中的等值线长度,来确定各个地形图对应的平整程度,能够基于定量的方式来实现确定地形图的平整程度,从而便于进一步确定平整程度最高的地形图,提高了方案的可行性。
在一种可能的实现方式中,所述电子设备基于所述多组超参数,分别对所述模型进行训练,包括:电子设备获取训练子集,所述训练子集包括所述模型的训练集中的部分训练数据;电子设备采用所述训练子集,基于所述多组超参数分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重中的每组权重包括所述模型在每次迭代训练后的权重集合。
简单来说,电子设备基于任意一组超参数设置了所述神经网络模型的超参数之后,电子设备采用所述训练子集对所述神经网络模型进行多次迭代训练。在对神经网络模型所进行的每一次迭代训练过程中,神经网络模型中的权重都会进行更新。因此,在对所述神经网络模型进行多次迭代训练所得到的一组权重中包括多个权重集合,每个权重集合分别包括神经网络模型在每次迭代训练后的权重。
本方案中,通过采用训练子集分别对多组超参数进行多次迭代训练,来得到用于绘制地形图的权重,避免了基于整个训练集将模型训练至收敛,因此能够节省选择模型超参数的时间,提高模型超参数的选择效率。
在一种可能的实现方式中,所述电子设备基于所述多组权重,分别执行地形图的绘制,以得到多个地形图,包括:电子设备对第一组权重进行降维处理,得到作为二维空间中投影方向的两个高维向量,所述第一组权重为所述多组权重中的一组权重,所述第一组权重包括所述模型在每次迭代训练后的权重集合。其中,所述两个高维向量的维度与所述模型中所包括的权重的数量相同。例如,假设所述模型中共包括N个权重,在对所述模型执行5次迭代训练的情况下,所述第一组权重中包括5个权重集合,每个权重集合中包括N个权重参数。在对第一组权重进行降维处理后,可以得到2个高维向量,每个高维向量的维度为N维,即每个高维向量包括N个元素。
然后,电子设备基于所述第一组权重和所述两个高维向量,确定第一地形图中的多个采样点所对应的权重,所述多个地形图包括所述第一地形图。并且,电子设备基于所述多个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到第一地形图。
具体地,电子设备可以基于所述第一组权重和所述两个高维向量,确定所述第一组权重中每次迭代训练后的权重在二维空间的坐标;电子设备基于所述每次迭代训练后的权重在二维空间的坐标,确定第一地形图的边界;电子设备根据所述第一地形图的边界,将所述第一地形图划分为多个区域,以得到所述多个区域中每个区域的采样点;电子设备基于所述第一地形图中每个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到所述第一地形图,所述多个地形图包括所述第一地形图。
在一种可能的实现方式中,所述电子设备基于所述多个采样点所对应的权重,确定所述模型对应的损失值,包括:电子设备基于所述模型和所述第一地形图中的多个采样点对应的权重,构建得到多个子模型,所述多个子模型与所述多个采样点对应的权重一一对应,所述多个子模型的结构均与所述模型的结构相同。电子设备将相同的训练数据分别输入至所述多个子模型,以得到所述多个采样点对应的损失值。
也就是说,对于电子设备所构建得到的多个子模型,所述多个子模型的结构相同,但所述多个子模型中的神经单元的权重分别对应于所述多个采样点对应的权重。这样,在得到多个子模型之后,电子设备可以基于相同的输入数据,在同一个硬件上同时并行地对多个子模型进行推理运算,从而得到多个采样点对应的损失值。
通过基于多个采样点对应的权重,构建得到多个子模型,能够并行地对多个子模型进 行推理运算,从而能够同时得到多个采样点对应的损失值,提高了地形图的绘制效率。
在一种可能的实现方式中,所述方法还包括:电子设备确定所述第一地形图中每个区域的崎岖程度,所述崎岖程度用于表示每个区域内的等值线的密集程度,所述等值线上的点对应的损失值相同;电子设备根据所述崎岖程度,在所述第一地形图中增加采样点,以更新所述第一地形图;其中,所述第一地形图中区域的采样点密集程度与所述区域的崎岖程度具有正相关关系。
本方案中,通过先对地形图进行较为稀疏的等距采样,然后基于各个采样区域的崎岖程度,在崎岖程度较高的区域增加采样点,可以基于较少的采样点近似达到全量采样的地形图可视效果。在保证地形图能够准确地描绘损失值在地形图中的变化趋势的同时,提高了地形图的绘制效率。
在一种可能的实现方式中,所述电子设备根据所述崎岖程度,在所述第一地形图中增加采样点,包括:按照崎岖程度从高到低的顺序,电子设备对所述第一地形图中的多个区域进行排序,以得到所述多个区域的排序结果;电子设备基于所述多个区域的排序结果,依次在所述多个区域中增加采样点,直至所增加的采样点的数量达到预设阈值。
在一种可能的实现方式中,所述确定所述第一地形图中每个区域的崎岖程度,包括:电子设备分别确定所述第一地形图中每个区域的采样点的二阶导数矩阵;电子设备计算所述二阶导数矩阵的两个特征值,并确定所述两个特征值的绝对值之和,以得到所述第一地形图中每个区域的崎岖程度。
本申请第二方面提供一种电子设备,包括获取单元和处理单元。所述获取单元,用于获取神经网络模型的多组超参数;所述处理单元,用于基于所述多组超参数,分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重与所述多组超参数一一对应,所述多组权重中的每组权重包括多次迭代训练得到的权重;所述处理单元,还用于执行多个地形图的绘制,其中,所述多个地形图中的每个地形图基于所述多组权重中的一组绘制,所述多个地形图中的每个地形图均用于表示所述模型的损失函数在训练过程中的变化趋势;所述处理单元,还用于得到目标超参数,所述目标超参数为目标地形图所对应的一组超参数,所述目标超参数用于训练所述模型,所述目标地形图为所述多个地形图中平整程度最高的地形图,所述平整程度用于表示所述模型的损失函数在地形图中的变化程度。
在一种可能的实现方式中,所述每个地形图的平整程度与所述多个地形图中每个地形图的面积和等值线长度之和有关,所述多个地形图中的每个地形图均包括相同数量的等值线,所述等值线上的点对应的损失值相同。
在一种可能的实现方式中,所述获取单元,还用于获取训练子集,所述训练子集包括所述模型的训练集中的部分训练数据;所述处理单元,还用于采用所述训练子集,基于所述多组超参数分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权 重,所述多组权重中的每组权重包括所述模型在每次迭代训练后的权重集合。
在一种可能的实现方式中,所述处理单元,还用于对第一组权重进行降维处理,得到作为二维空间中投影方向的两个高维向量,所述第一组权重为所述多组权重中的一组权重,所述第一组权重包括所述模型在每次迭代训练后的权重集合;所述处理单元,还用于基于所述第一组权重和所述两个高维向量,确定第一地形图中的多个采样点所对应的权重,所述多个地形图包括所述第一地形图;所述处理单元,还用于基于所述多个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到第一地形图。
在一种可能的实现方式中,所述处理单元,还用于基于所述模型和所述第一地形图中的多个采样点对应的权重,构建得到多个子模型,所述多个子模型与所述多个采样点对应的权重一一对应,所述多个子模型的结构均与所述模型的结构相同;所述处理单元,还用于将相同的训练数据分别输入至所述多个子模型,以得到所述多个采样点对应的损失值。
在一种可能的实现方式中,所述处理单元,还用于确定所述第一地形图中每个区域的崎岖程度,所述崎岖程度用于表示每个区域内的等值线的密集程度,所述等值线上的点对应的损失值相同;所述处理单元,还用于根据所述崎岖程度,在所述第一地形图中增加采样点,以更新所述第一地形图;其中,所述第一地形图中区域的采样点密集程度与所述区域的崎岖程度具有正相关关系。
在一种可能的实现方式中,所述处理单元,还用于按照崎岖程度从高到低的顺序,对所述第一地形图中的多个区域进行排序,以得到所述多个区域的排序结果;所述处理单元,还用于基于所述多个区域的排序结果,依次在所述多个区域中增加采样点,直至所增加的采样点的数量达到预设阈值。
在一种可能的实现方式中,所述处理单元,还用于分别确定所述第一地形图中每个区域的采样点的二阶导数矩阵;所述处理单元,还用于计算所述二阶导数矩阵的两个特征值,并确定所述两个特征值的绝对值之和,以得到所述第一地形图中每个区域的崎岖程度。
本申请第三方面提供了一种电子设备,可以包括处理器,处理器和存储器耦合,存储器存储有程序指令,当存储器存储的程序指令被处理器执行时实现上述第一方面所述的方法。对于处理器执行第一方面的各个可能实现方式中的步骤,具体均可以参阅第一方面,此处不再赘述。
本申请第四方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
本申请第五方面提供了一种电路系统,所述电路系统包括处理电路,所述处理电路配置为执行上述第一方面所述的方法。
本申请第六方面提供了一种计算机程序产品,当其在计算机上运行时,使得计算机执行上述第一方面所述的方法。
本申请第七方面提供了一种芯片系统,该芯片系统包括处理器,用于支持服务器或门限值获取装置实现上述第一方面中所涉及的功能,例如,发送或处理上述方法中所涉及的数据和/或信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保 存服务器或通信设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包括芯片和其他分立器件。
附图说明
图1为人工智能主体框架的一种结构示意图;
图2为本申请实施例提供的卷积神经网络的示意图;
图3为本申请实施例提供的卷积神经网络的示意图;
图4为本申请实施例提供的一种系统架构的示意图;
图5为现有技术中用于选择超参数的一种网格搜索方法的示意图;
图6为本申请实施例提供的一种模型超参数的选择方法的流程示意图;
图7为本申请实施例提供的两个地形图的对比示意图;
图8为本申请实施例提供的一种孪生参数并行方法的示意图;
图9为本申请实施例提供的一种绘制地形图的总流程示意图;
图10为本申请实施例提供的一种基于采样点绘制得到地形图的流程示意图;
图11为本申请实施例提供的一种基于区域的崎岖程度增加采样点的示意图;
图12为本申请实施例提供的一种训练过程可视分析系统的架构示意图;
图13为本申请实施例提供的一种训练过程可视分析系统的工作流程示意图;
图14为本申请实施例提供的一种计算加速模块的工作流程示意图;
图15为本申请实施例提供的不同学习率的地形图的对比示意图;
图16为本申请实施例提供的一种基于地形图指导模型结构的示意图;
图17为本申请实施例所提供的未添加Batch Normalization层的VGG模型的地形图;
图18为本申请实施例所提供的添加Batch Normalization层的VGG模型的地形图;
图19为本申请实施例提供的VGG模型添加Batch Normalization层前后的效果对比示意图;
图20为本申请实施例提供的一种电子设备的结构示意图;
图21为本申请实施例提供的执行设备的一种结构示意图;
图22为本申请实施例提供的芯片的一种结构示意图;
图23为本申请实施例提供的一种计算机可读存储介质的结构示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包 含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。
首先对人工智能系统总体工作流程进行描述,请参见图1,图1示出的为人工智能主体框架的一种结构示意图,下面从“智能信息链”(水平轴)和“IT价值链”(垂直轴)两个维度对上述人工智能主体框架进行阐述。其中,“智能信息链”反映从数据的获取到处理的一系列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施。
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据。
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理。
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力。
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用。
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能电子设 备、智能交通、智能医疗、自动驾驶、智慧城市等。
下面从模型训练侧和模型应用侧对本申请提供的方法进行描述:
本申请实施例提供的模型训练方法,具体可以应用于数据训练、机器学习、深度学习等数据处理方法,对训练数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等,最终得到训练好的神经网络模型(如本申请实施例中的目标神经网络模型);并且目标神经网络模型可以用于进行模型推理,具体可以将输入数据输入到目标神经网络模型中,得到输出数据。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络。
神经网络可以是由神经单元组成的,神经单元可以是指以x_s(即输入数据)和截距1为输入的运算单元,该运算单元的输出可以为:
h_{W,b}(x) = f(W^T x) = f(Σ_{s=1}^{n} W_s·x_s + b)
其中,s=1、2、……n,n为大于1的自然数,W_s为x_s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入,激活函数可以是sigmoid函数。神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)卷积神经网络(Convosutionas Neuras Network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层(例如本实施例中的第一卷积层、第二卷积层)。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,我们都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
具体的,如图2所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。
其中,卷积层/池化层120以及神经网络层130组成的结构可以为本申请中所描述的第 一卷积层以及第二卷积层,输入层110和卷积层/池化层120连接,卷积层/池化层120连接与神经网络层130连接,神经网络层130的输出可以输入至激活层,激活层可以对神经网络层130的输出进行非线性化处理。
卷积层/池化层120。卷积层:如图2所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图2中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。
神经网络层130:在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图2所示的131、132至13n)以及输出层140, 该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图2由110至140的传播为前向传播)完成,反向传播(如图2由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图3所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
(3)深度神经网络。
深度神经网络(Deep Neural Network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:y = α(W·x + b)。其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多,则系数W和偏移向量b的数量也就很多了。
这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W^3_24,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W^L_jk。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的过程也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(4)损失函数。
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值 或与真正想要的目标值非常接近的值。
因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。也就是说,训练过程实际上是基于训练得到的损失函数,不断调整深度神经网络中的权重向量,从而使得所得到的损失函数不断变小。
(5)反向传播算法。
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
图4是本申请实施例提供的一种系统架构的示意图,在图4中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据。
在执行设备120对输入数据进行预处理,或者在执行设备120的计算模块111执行计算等相关的处理(比如进行本申请中神经网络的功能实现)过程中,执行设备120可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果返回给客户设备140,从而提供给用户。
可选地,客户设备140,例如可以是自动驾驶系统中的控制单元、手机中的功能算法模块,例如该功能算法模块可以用于实现相关的任务。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则(例如本实施例中的目标神经网络模型),该相应的目标模型/规则即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图4中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图4仅是本申请实施例提供的一种系统架构的示意图,图中所示设 备、器件、模块等之间的位置关系不构成任何限制,例如,在图4中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
本申请实施例所提供的模型超参数的选择方法可以应用于电子设备上。示例性地,该电子设备例如可以是服务器、监控摄像装置、手机(mobile phone)、个人电脑(personal computer,PC)、笔记本电脑、平板电脑、智慧电视、移动互联网设备(mobile internet device,MID)、可穿戴设备,虚拟现实(virtual reality,VR)设备、增强现实(augmented reality,AR)设备、工业控制(industrial control)中的无线终端、无人驾驶(self driving)中的无线终端、远程手术(remote medical surgery)中的无线终端、智能电网(smart grid)中的无线终端、运输安全(transportation safety)中的无线终端、智慧城市(smart city)中的无线终端、智慧家庭(smart home)中的无线终端等。
可以参阅图5,图5为现有技术中用于选择超参数的一种网格搜索方法的示意图。如图5所示,神经网络模型中存在两个待设置的超参数,分别为学习率(Learning rate,LR)和优化器。其中,学习率的取值范围包括0.2、0.4和0.6;优化器的选择范围则包括随机梯度下降(Stochastic gradient descent,SGD)优化器和Adam优化器。其中,优化器指的是基于损失函数调整神经网络模型权重的优化方式,不同的优化器采用不同的优化方式来调整神经网络模型的权重。因此,基于3个不同的学习率和2个不同的优化器,一共可以得到6组超参数。这6组超参数分别为:(0.2,Adam)、(0.2,SGD)、(0.4,Adam)、(0.4,SGD)、(0.6,Adam)和(0.6,SGD)。因此,用户需要基于这6组超参数,选择部分训练数据对模型进行6次训练,且每次训练都必须收敛,最终选出一组合适的超参数。由于对于每一组超参数,都需要将模型训练至收敛,因此往往需要进行的长时间训练,导致超参数的选择效率较低。
有鉴于此,本申请实施例提供了一种模型超参数的选择方法,通过对采用不同超参数的模型进行训练,并且基于模型在训练过程中的权重绘制地形图,该地形图能够表示模型的损失函数在训练过程中的变化趋势。通过对比不同超参数所对应的地形图的平整程度,来选择最终用于模型训练的超参数。由于本方案是选择模型在部分训练过程中的权重来绘制地形图,无需将模型训练至收敛,因此能够节省选择模型超参数的时间,提高模型超参数的选择效率。
可以参阅图6,图6为本申请实施例提供的一种模型超参数的选择方法的流程示意图。如图6所示,该模型超参数的选择方法包括以下的步骤601-604。
步骤601,获取神经网络模型的多组超参数。
本实施例中,在基于训练集中的所有数据(即全量训练集)对神经网络模型执行正式的训练之前,电子设备可以先获取神经网络模型的多组超参数,并从多组超参数中选择一组合适的超参数用于神经网络模型的训练。其中,超参数又称为超参,是一种用于控制模型训练过程的参数。超参数是在开始模型训练过程之前所设置的参数,而不是通过训练得 到的参数数据。在大部分的模型训练情况下,需要对超参数进行优化,给模型训练选择一组最优的超参数,以提高训练得到的模型的性能。
具体地,在电子设备所获取的多组超参数中,每组超参数均包括一个或多个超参数,且每组超参数所包括的超参数的类型是相同的。对于任意两组超参数,这两组超参数中至少有一个超参数的取值是不相同的。例如,电子设备获取到的多组超参数中均包括两个超参数,这两个超参数分别为学习率和优化器。并且,学习率的取值范围包括0.2、0.4和0.6;优化器的选择范围则包括SGD优化器和Adam优化器。因此,基于3个不同的学习率和2个不同的优化器,电子设备可以是获取到6组超参数。这6组超参数分别为:(0.2,Adam)、(0.2,SGD)、(0.4,Adam)、(0.4,SGD)、(0.6,Adam)和(0.6,SGD)。
在模型训练过程中,优化器用于控制模型适应目标问题所改变的方向;学习率用于控制模型适应目标问题所改变的幅度。简单来说,优化器指示了模型在每次迭代训练后如何调整模型中的权重,学习率则指示了模型在每次迭代训练后权重调整的幅度。需要说明的是,除了上述的学习率和优化器之外,电子设备所获取到的多组超参数中也可以是包括其他的超参数,本实施例并不对超参数的类型做具体限定。
可选的,电子设备可以是通过获取用户所设置的多组超参数的取值,来获取神经网络模型的多组超参数。例如,用户在电子设备上分别输入每组超参数中所包括的超参数以及每组内超参数的取值,从而使得电子设备能够获取到上述的多组超参数。
电子设备也可以是通过获取用户所设置的多个超参数的取值范围,自动生成神经网络模型的多组超参数。例如,用户在电子设备上输入神经网络模型的超参数包括学习率和优化器,且学习率具有三个不同的取值,优化器则具有两个不同的取值。这样,电子设备可以基于学习率和优化器这两个超参数的取值范围,自动生成学习率和优化器之间的不同组合,从而得到六组超参数。
步骤602,基于所述多组超参数,分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重与所述多组超参数一一对应,所述多组权重中的每组权重包括多次迭代训练得到的权重。
本实施例中,电子设备可以基于多组超参数,对神经网络模型进行多组重复的训练。在每一组对神经网络模型的训练中,神经网络模型的训练次数和训练数据均相同。具体地,电子设备可以基于多组超参数中的某一组超参数,设置神经网络模型在训练过程中的超参数;然后,电子设备再基于训练数据对神经网络模型执行一定次数的迭代训练,从而得到一组权重。以此类推,电子设备基于多组超参数依次设置神经网络模型在训练过程中的超参数,并且在设置神经网络模型的超参数后对神经网络模型执行相同次数的迭代训练,从而得到多组权重。其中,多组权重中的每组权重都是基于相同的训练数据训练得到的,且多组权重与多组超参数一一对应。
其中,多组权重中的每一组权重都包括所述神经网络模型在多次迭代训练中每次迭代训练后所得到的权重。神经网络模型在一次迭代训练后所得到的权重为一个模型中的所有权重参数,是一个高维向量。一个神经网络模型中的所有权重参数是指模型中的所有神经单元的权重参数。通常,一个神经网络模型中包括几十万至几百万的神经单元,因此神经 网络模型在一次迭代训练后所得到的权重实际上包括了几十万至几百万的神经单元的权重参数,即一次迭代训练后所得到的权重是一个几十万维至几百万维的高维向量。
示例性地,电子设备可以先获取训练子集,所述训练子集包括所述神经网络模型的训练集中的部分训练数据。其中,所述神经网络模型的训练集包括模型正式训练过程中需要用到的全部训练数据,训练子集则是仅包括从训练集中选择出来的部分训练数据。例如,训练集可以是包括128万个训练数据,训练子集中则是包括5万个训练数据。
然后,基于所述多组超参数,电子设备分别采用所述训练子集对所述神经网络模型进行多次迭代训练,以得到所述神经网络模型在训练过程中的多组权重,所述多组权重中的每组权重包括所述神经网络模型在每次迭代训练后的权重集合。
简单来说,电子设备基于任意一组超参数设置了所述神经网络模型的超参数之后,电子设备采用所述训练子集对所述神经网络模型进行多次迭代训练。在对神经网络模型所进行的每一次迭代训练过程中,神经网络模型中的权重都会进行更新。因此,在对所述神经网络模型进行多次迭代训练所得到的一组权重中包括多个权重集合,每个权重集合分别包括神经网络模型在每次迭代训练后的权重。
例如,假设神经网络模型中共包括m个神经单元,在对神经网络模型进行一次迭代训练后得到权重θ_i。其中,权重θ_i∈R^m是该神经网络模型中的所有权重,权重θ_i是一个m维的高维向量。那么,基于一组超参数,对神经网络模型进行n次迭代训练后,得到一组权重,该组权重可以表示为(θ_1, θ_2, ..., θ_n)。其中,θ_1表示对神经网络模型进行第1次迭代训练后得到的权重,θ_2表示对神经网络模型进行第2次迭代训练后得到的权重,θ_n表示对神经网络模型进行第n次迭代训练后得到的权重。
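为了便于理解上述训练与权重记录的过程,下面给出一个示意性的代码草图(基于PyTorch;其中build_model、train_subset、超参数组的组织形式均为为说明而假设的接口,此处以每个epoch记录一次权重为例,并非本申请限定的实现方式):

```python
import torch

def collect_weight_trajectories(build_model, train_subset, hparam_groups, num_epochs=5):
    """对每组超参数用相同的训练子集训练同一模型,记录每个epoch结束后的全部权重。
    build_model: 返回一个新的模型实例;hparam_groups: 形如 [{'lr': 0.2, 'optimizer': 'SGD'}, ...]。"""
    loss_fn = torch.nn.CrossEntropyLoss()
    trajectories = []
    for hp in hparam_groups:
        model = build_model()
        if hp['optimizer'] == 'SGD':
            opt = torch.optim.SGD(model.parameters(), lr=hp['lr'])
        else:
            opt = torch.optim.Adam(model.parameters(), lr=hp['lr'])
        weights_per_epoch = []
        for _ in range(num_epochs):
            for x, y in train_subset:            # 每组超参数使用相同的训练数据
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            # 将模型的全部权重展平为一个高维向量,作为一次迭代训练后的权重 θ_i
            theta = torch.cat([p.detach().reshape(-1) for p in model.parameters()])
            weights_per_epoch.append(theta.clone())
        trajectories.append(weights_per_epoch)   # 一组权重 (θ_1, ..., θ_n)
    return trajectories
```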
步骤603,执行多个地形图的绘制,其中,所述多个地形图中的每个地形图基于所述多组权重中的一组绘制,所述多个地形图中的每个地形图均用于表示所述模型的损失函数在训练过程中的变化趋势。
本实施例中,地形图是指损失函数的地形图,能够将成百上千的损失函数的值映射到权重空间中。基于所述多组权重,电子设备可以绘制每一组权重所对应的地形图,从而得到与所述多组权重一一对应的多个地形图。在每个地形图中,所述模型的损失函数的位置是基于该地形图对应的一组权重确定的。由于每组权重包括多次迭代训练所对应的权重,且基于每一次迭代训练所得到的权重可以获取到一个损失函数的值,因此每组权重均具有对应的多个损失函数的值。
例如,在电子设备获得5组权重的情况下,电子设备可以绘制得到5个地形图,每一个地形图都是基于一组权重绘制得到的。
由于每一组权重中包括多次迭代训练得到的权重,而每次迭代训练得到的权重实际上包括了几十万甚至几百万的权重参数,即迭代训练得到的权重为一个高维向量。因此,为了便于展示模型的优化地形,在绘制地形图的过程中,电子设备可以将每一组权重都进行降维处理。然后,电子设备再基于降维后的参数与损失函数之间的关系,绘制得到地形图,从而展示神经网络模型的损失函数在训练过程中的变化趋势。
步骤604,得到目标超参数,所述目标超参数为目标地形图所对应的一组超参数,所 述目标超参数用于训练所述模型,所述目标地形图为所述多个地形图中平整程度最高的地形图,所述平整程度用于表示所述模型的损失函数在地形图中的变化程度。
在得到多组权重中每一组权重所对应的地形图之后,电子设备可以确定每个地形图的平整程度。具体地,地形图中的等值线排布越整齐稀疏,则表示模型的损失函数在地形图中的变化越小,地形图的平整程度越高;地形图中的等值线排布越密集且杂乱无章,则表示模型的损失函数在地形图中的变化越大,地形图的陡峭程度越高,地形图的平整程度越低。
此外,经过申请人研究发现,不同的超参数在模型的训练初期会使得梯度协方差矩阵的性质产生较大的差异,从而改变训练过程中损失函数在高维优化空间的运行轨迹,导致损失函数进入到不同的优化空间区域。在长时间训练后,基于不同超参数的模型最终训练得到的损失值基本相同,但是模型的泛化精度会存在较大的差别。并且,在模型的初期训练过程中,不同的超参数对应于不同的地形图平整程度。超参数所对应的地形图的平整程度越高,则基于该超参数训练得到的模型的泛化精度越高;超参数所对应的地形图的平整程度越低,则基于该超参数训练得到的模型的泛化精度较低。其中,模型的泛化精度是指模型拟合新数据的能力。
示例性地,可以参阅图7,图7为本申请实施例提供的两个地形图的对比示意图。如图7所示,图7中左边的地形图为平整程度较高的地形图,图7中右边的地形图为平整程度较低的地形图。由图7可以看出,平整程度较高的地形图中的等值线较为平滑以及规整,且损失函数对应的轨迹附近的等值线之间呈近似平行排布;平整程度较低的地形图中的等值线较为崎岖且杂乱无章,并且等值线还存在很多的闭环,导致损失函数的轨迹可能存在局部最优点。
基于此,在电子设备确定多个地形图的平整程度之后,电子设备可以选择平整程度最高的目标地形图所对应的一组超参数作为目标超参数,以基于该目标超参数来执行神经网络模型的正式训练。
本实施例中,通过对采用不同超参数的模型进行训练,并且基于模型在训练过程中的权重绘制地形图,该地形图能够表示模型的损失函数在训练过程中的变化趋势。通过对比不同超参数所对应的地形图的平整程度,来选择最终用于模型训练的超参数。由于本方案是选择模型在部分训练过程中的权重来绘制地形图,无需将模型训练至收敛,因此能够节省选择模型超参数的时间,提高模型超参数的选择效率。
在一个可能的实施例中,电子设备可以是通过求取地形图中的等值线长度,来确定各个地形图对应的平整程度。
示例性地,在绘制地形图之前,可以先设定地形图中的等值线的数量。其中,等值线是指制图对象某一数量指标值相等的各点连成的平滑曲线。同一条等值线上的各个点所对应的损失函数的值(即损失值)相同。例如,设定地形图的等值线的数量为100。这样,在电子设备绘制地形图的过程中,电子设备可以基于所设定的数量在各个地形图中绘制相应的等值线。这样,在电子设备绘制得到的多个地形图中,每个地形图均包括相同数量的 等值线。
然后,电子设备确定所述多个地形图中每个地形图的等值线长度之和。其中,地形图的等值线长度之和是指地形图中的所有等值线的长度之和。例如,电子设备可以基于以下的公式1来确定地形图中的等值线长度之和。
L = Σ_{j=1}^{m} Σ_{i=2}^{k} ||P_i - P_{i-1}||      公式1
其中,L表示地形图中的等值线长度之和;m表示地形图中的等值线的数量;k表示每条等值线中,等间距采集k个节点;P_i表示一条等值线中的某个点,P_{i-1}表示等值线中位于P_i之前的一个点。
最后,基于所述多个地形图中每个地形图的面积和所述等值线长度之和,确定所述每个地形图的平整程度。由于每个地形图的面积可能是不一样的,因此电子设备可以通过将地形图中的等值线长度之和除以地形图的面积和,得到该地形图中单位面积的等值线长度。这样一来,电子设备最终可以得到多个地形图中每个地形图的单位面积的等值线长度,从而确定地形图的平整程度。具体地,地形图的平整程度与地形图的单位面积的等值线长度成负相关相关。地形图中单位面积的等值线长度越大,则地形图的平整程度越小;地形图中单位面积的等值线长度越小,则地形图的平整程度越大。
示例性地,地形图的平整程度与地形图的单位面积的等值线长度可以是呈反比关系。在确定地形图的单位面积的等值线长度之后,可以通过求取地形图的单位面积的等值线长度的倒数,来确定地形图的平整程度。
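作为一种可行的定量实现思路,下述示意代码利用matplotlib提取等值线并累加其长度,再以单位面积等值线长度的倒数近似表示平整程度(X、Y、Z为采样得到的坐标与损失值网格,均为假设的输入;等值线的具体提取方式并不限于此):

```python
import numpy as np
import matplotlib.pyplot as plt

def flatness_of_landscape(X, Y, Z, num_levels=100):
    """X, Y: 采样点坐标网格;Z: 对应的损失值网格。
    返回单位面积等值线长度的倒数,数值越大表示地形图越平整。"""
    fig = plt.figure()
    cs = plt.contour(X, Y, Z, levels=num_levels)   # 每个地形图取(约)相同数量的等值线
    total_len = 0.0
    for level_segs in cs.allsegs:                  # 每个损失值水平对应的若干条等值线
        for seg in level_segs:                     # seg: 等值线上的节点坐标数组,形状 (k, 2)
            total_len += np.sum(np.linalg.norm(np.diff(seg, axis=0), axis=1))
    plt.close(fig)
    area = (X.max() - X.min()) * (Y.max() - Y.min())
    return area / total_len
```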
以上详细介绍了电子设备确定地形图的平整程度的过程,为了便于理解,以下将详细介绍基于得到的多组权重,绘制地形图的过程。具体地,以下将以多组权重中的一组权重为例,详细介绍基于该组权重绘制得到对应的地形图的过程。
示例性地,电子设备首先对第一组权重进行降维处理,得到作为二维空间中投影方向的两个高维向量,以实现将作为高维向量的权重投影到二维空间中。其中,所述第一组权重为所述多组权重中的一组权重,所述第一组权重包括所述模型在每次迭代训练后的权重。例如,第一组权重可以表示为(θ_1, θ_2, ..., θ_n)。θ_1表示对神经网络模型进行第1次迭代训练后得到的权重,θ_2表示对神经网络模型进行第2次迭代训练后得到的权重,θ_n表示对神经网络模型进行第n次迭代训练后得到的权重。权重θ_i∈R^m是该神经网络模型中的所有权重,权重θ_i是一个m维的高维向量,m表示该神经网络模型中所有神经单元的数量。降维处理后所得到的两个高维向量例如可以表示为δ和η。也就是说,向量δ和向量η均为m维的高维向量。
其中,降维是指将数据从原始所在的高维空间降至低维空间,同时在低维空间的表示保持某些高维原始数据的有意义的特征。电子设备对第一组权重进行降维处理的方法可以是采用常用的高维降维方法,例如主成分分析(Principal component analysis,PCA)方法。采用PCA方法作为降维方法,能够有效地保留权重数据中的大部分方差,即尽可能地保留权重数据本身的特征。
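以PCA为例,对一组权重求取两个投影方向的过程可以参考如下示意片段(使用sklearn;trajectory的组织方式、以θ_n为中心等均为与上文一致的假设,并非唯一实现):

```python
import numpy as np
from sklearn.decomposition import PCA

def projection_directions(trajectory):
    """trajectory: 形状 (n, m) 的数组,每行是一次迭代训练后展平的全部权重 θ_i。
    返回两个 m 维的高维向量 delta、eta,作为二维空间中的投影方向。"""
    theta = np.asarray(trajectory)
    center = theta[-1]                  # 以最后一次迭代训练后的权重 θ_n 作为中心点
    pca = PCA(n_components=2)
    pca.fit(theta[:-1] - center)        # 对相对中心点的权重差做主成分分析
    delta, eta = pca.components_        # 两个主成分方向,各为 m 维向量
    return delta, eta
```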
然后,电子设备基于所述第一组权重和所述两个高维向量,确定所述第一组权重中每次迭代训练后的权重在二维空间的坐标。具体地,假设第一组权重中的某一次迭代训练后的权重位于二维空间的中心点(0,0),则可以通过建立方程组来确定第一组权重中的其余每次迭代训练后的权重在二维空间中的坐标。
示例性地,假设第一组权重可以表示为(θ_1, θ_2, ..., θ_n),θ_n位于中心点(0,0),则可以基于向量δ和向量η确定第一组权重中每次迭代训练后的权重在二维空间中的坐标。具体地,可以建立如下所示的方程组。
θ_i - θ_n = α_i·δ + β_i·η,  i = 1, 2, ..., n-1
其中,上述方程组中的(α_i, β_i)为未知系数,α_i和β_i分别用于表示(θ_1, θ_2, ..., θ_(n-1))在二维空间中相应的横坐标和纵坐标。由于上述方程组中(θ_1, θ_2, ..., θ_n)和向量δ、η是已知的,因此可以基于最小二乘法来计算出方程组中的(α_i, β_i),从而确定第一组权重中每次迭代训练后的权重(θ_1, θ_2, ..., θ_n)在二维空间中的坐标。
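上述最小二乘求解过程的一个示意实现如下(numpy;delta、eta沿用上文假设的两个投影方向记号):

```python
import numpy as np

def project_to_2d(trajectory, delta, eta):
    """用最小二乘求解 θ_i - θ_n ≈ α_i·delta + β_i·eta,返回各权重点的二维坐标 (α_i, β_i)。"""
    theta = np.asarray(trajectory)
    center = theta[-1]                       # θ_n 作为中心点,其坐标为 (0, 0)
    A = np.stack([delta, eta], axis=1)       # 形状 (m, 2) 的系数矩阵
    coords = []
    for theta_i in theta:
        sol, *_ = np.linalg.lstsq(A, theta_i - center, rcond=None)
        coords.append(sol)
    return np.asarray(coords)                # 形状 (n, 2)
```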
其次,在确定权重在二维空间的坐标之后,电子设备可以进一步基于所述每次迭代训练后的权重在二维空间的坐标,确定第一地形图的边界。示例性地,电子设备可以基于所得到的多个坐标(即每次迭代训练后的权重在二维空间的坐标),确定多个坐标分别在x轴和y轴上的最大值和最小值,从而确定第一地形图的边界。
例如,假设第一组权重(θ 1,θ 2,...,θ 5)在二维空间上的坐标分别为(-0.7,0.15),(-0.4,0.1),(0.5,-0.2),(-0.1,0.15),(0,0),则可以确定该多个坐标在x轴上的最大值为0.5,最小值为-0.7;该多个坐标在y轴上的最大值为0.15,最小值为-0.2。这样一来,第一地形图的边界则可以是基于坐标点(0.5,0.15),(0.5,-0.2),(-0.7,0.15),(-0.7,-0.2)来确定。此外,为了便于观察,第一地形图的边界可以是在上述的四个坐标点的基础上适当延伸,以避免第一组权重对应的坐标过于靠近第一地形图的边界。
在电子设备确定第一地形图的边界后,根据所述第一地形图的边界,将所述第一地形图划分为多个区域,以得到所述多个区域中每个区域的采样点。简单来说,在确定第一地形图的边界之后,电子设备可以在第一地形图等间距地进行区域划分,从而将第一地形图划分为多个面积相等的区域。这样,电子设备以第一地形图中的每个区域作为采样单位,确定每个区域中的采样点。例如,在电子设备将第一地形图划分为多个正方形区域的情况下,电子设备分别确定多个正方形区域的中心点或者边界点为采样点。
示例性地,电子设备可以是基于以下的公式2来实现第一地形图中的区域划分。
x=i(max(α)-min(α))/N
y=i(max(β)-min(β))/N      公式2
其中,x表示划分得到的区域中的采样点的横坐标,y表示划分得到的区域中的采样点的纵坐标,i的取值为(0,1,…N),N表示第一地形图在x轴和y轴上的划分次数,max(α)表示第一组权重在二维空间上的坐标中x轴的最大值,min(α)表示第一组权重在二维空间上的坐标中x轴的最小值,max(β)表示第一组权重在二维空间上的坐标中y轴的最大值,min(β)表示第一组权重在二维空间上的坐标中y轴的最小值。
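与公式2对应的等间距采样点生成过程,可以参考如下示意片段(其中向外延伸边界的margin为可选的假设性处理,对应上文“适当延伸”的说明):

```python
import numpy as np

def grid_sample_points(coords, N=20, margin=0.1):
    """coords: 各次迭代训练后的权重在二维空间中的坐标,形状 (n, 2)。
    按公式2的思路在地形图边界内等间距生成 (N+1) x (N+1) 个采样点坐标。"""
    alpha, beta = coords[:, 0], coords[:, 1]
    x_min, x_max = alpha.min(), alpha.max()
    y_min, y_max = beta.min(), beta.max()
    # 可选:将边界向外适当延伸,避免训练轨迹过于贴近地形图边界
    dx, dy = (x_max - x_min) * margin, (y_max - y_min) * margin
    xs = np.linspace(x_min - dx, x_max + dx, N + 1)
    ys = np.linspace(y_min - dy, y_max + dy, N + 1)
    X, Y = np.meshgrid(xs, ys)   # 每个网格点 (X[i, j], Y[i, j]) 即一个采样点
    return X, Y
```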
最后,电子设备可以基于所述第一地形图中每个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到所述第一地形图。其中,所述多个地形图包括所述第一地形图,所述第一地形图是基于多组权重中的第一组权重所绘制得到的一个地形图。
具体地,由于每个采样点在二维空间中的坐标是确定的,因此基于类似于上述方程组中的公式,即可确定每个采样点对应的权重。然后,将每个采样点的权重传入模型中,对模型进行正向计算,即可得到模型对应的损失值,从而得到每个采样点对应的三维坐标。基于每个采样点对应的三维坐标,电子设备可以实现地形图的绘制。
示例性地,对于任意一个采样点(x,y),可以将该采样点对应的权重传入模型中,并对模型进行正向推理计算,得到损失值Z,构成地形图中的一个三维坐标(x,y,z)。例如,对于采样点(x,y),可以基于以下的公式3来表示该采样点的损失值Z。
Z = L(θ_n + x·δ + y·η)      公式3
其中,Z表示采样点的损失值,L(·)表示基于该采样点对应的权重进行模型的正向推理计算,θ_n + x·δ + y·η表示该采样点(x, y)对应的权重。
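将采样点坐标还原为模型权重并正向计算损失值Z的过程,可以参考如下示意代码(基于PyTorch;其中权重的展平与写回方式仅为一种示意实现):

```python
import torch

def load_flat_weights(model, flat):
    """把展平的一维权重向量按顺序写回模型的各个参数(示意实现)。"""
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(flat[offset:offset + n].view_as(p))
            offset += n

def loss_at_point(model, theta_n, delta, eta, x, y, data, target, loss_fn):
    """theta_n、delta、eta 为展平后的一维张量;返回采样点 (x, y) 对应的损失值 Z。"""
    theta = theta_n + x * delta + y * eta      # 该采样点对应的权重,对应公式3中的自变量
    load_flat_weights(model, theta)
    model.eval()
    with torch.no_grad():
        return loss_fn(model(data), target).item()
```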
在一个可能的实施例中,在电子设备基于所述第一地形图中每个采样点所对应的权重,确定所述模型对应的损失值的过程中,电子设备可以采用孪生参数并行的方式来求取各个采样点所对应的损失值,以提高地形图的绘制效率。
具体地,电子设备可以是基于同一个模型,给模型中的神经单元设置不同的权重参数,从而得到多个结构相同但权重不同的子模型。然后,电子设备基于相同的输入数据,在同一个硬件(例如同一个GPU)上并行地执行模型的推理运算,从而分别得到多个子模型的损失值。这多个子模型的损失值则分别对应于多个采样点的损失值。
示例性地,电子设备基于所述模型和所述第一地形图中的多个采样点对应的权重,构建得到多个子模型,所述多个子模型与所述多个采样点对应的权重一一对应,所述多个子模型的结构均与所述模型的结构相同。然后,电子设备将相同的训练数据分别输入至所述多个子模型,以得到所述多个采样点对应的损失值。其中,用于输入至多个子模型中的训练数据可以为训练集中的部分数据。
也就是说,对于电子设备所构建得到的多个子模型,所述多个子模型的结构相同,但所述多个子模型中的神经单元的权重分别对应于所述多个采样点对应的权重。例如,电子设备计算得到采样点1对应的权重1,以及采样点对应的权重2。电子设备可以基于权重1和权重2和神经网络模型,构建得到子模型1和子模型2;其中,子模型1和子模型2的结构相同,子模型1中神经单元的权重为采样点1对应的权重1,子模型2中神经单元的权重为采样点2对应的权重2。
这样,在得到多个子模型之后,电子设备可以基于相同的输入数据,在同一个硬件上同时并行地对多个子模型进行推理运算,从而得到多个采样点对应的损失值。通过基于多个采样点对应的权重,构建得到多个子模型,能够并行地对多个子模型进行推理运算,从而能够同时得到多个采样点对应的损失值,提高了地形图的绘制效率。
可以参阅图8,图8为本申请实施例提供的一种孪生参数并行方法的示意图。如图8 所示,电子设备可以基于2组权重参数(分别为权重参数1和权重参数2)和神经网络模型,构建得到孪生模型。该孪生模型中包括结构相同的子模型1和子模型2。
在进行推理运算之前,孪生模型需要读取该孪生模型中神经单元的权重参数,以便于执行后续的推理运算。一般来说,在神经网络模型框架中,神经网络模型通过唯一的变量名来识别和读取权重参数。因此,本实施例中,电子设备可以基于孪生模型中的子模型的名称,对应修改权重参数的名称,以确保各个子模型能够读取和识别到对应的权重参数。例如,假设子模型1和子模型2的名称分别为Net1和Net2,则对应于子模型1的权重参数1的名称修改为Net1.param,对应于子模型2的权重参数2的名称则修改为Net2.param。
具体地,在推理运算过程中,孪生模型读取修改名称后的权重参数,从而使得孪生模型中的子模型1配置对应的权重参数1,子模型2配置对应的权重参数2。然后,向孪生模型输入一批训练数据,基于相同的一批训练数据,分别对孪生模型中的子模型1和子模型2进行推理运算,并采用拼接(concat)算子拼接孪生模型中多个子模型的损失函数,得到孪生模型的输出。最后,通过分割(split)算子将孪生模型中的各个子模型的损失函数拆开,生成各个子模型对应的损失函数,从而获得多组权重对应的损失值,即获得多个采样点对应的损失值。
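下面给出孪生参数并行思想的一个极简示意(基于PyTorch;仅演示“结构相同、权重不同的多个子模型在同一批数据上前向并拼接损失”的控制逻辑,原文中的变量重命名、concat/split算子以及底层并行加速细节在此省略,属于假设性的简化):

```python
import copy
import torch
import torch.nn as nn

class TwinModel(nn.Module):
    """把若干个结构相同、权重不同的子模型封装为一个模块,共享同一批输入数据。"""
    def __init__(self, base_model, flat_weights_list, load_flat_weights):
        super().__init__()
        self.subs = nn.ModuleList()
        for flat in flat_weights_list:            # 每个采样点对应的权重 -> 一个子模型
            sub = copy.deepcopy(base_model)
            load_flat_weights(sub, flat)
            self.subs.append(sub)

    def forward(self, x, target, loss_fn):
        # 在同一批数据上分别前向各子模型,并把损失拼接为一个向量输出
        losses = [loss_fn(sub(x), target).reshape(1) for sub in self.subs]
        return torch.cat(losses)

# 用法示意:一次前向即可得到多个采样点对应的损失值
# twin = TwinModel(base_model, [theta_a, theta_b], load_flat_weights)
# with torch.no_grad():
#     z_a, z_b = twin(batch_x, batch_y, nn.CrossEntropyLoss())
```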
以上介绍了电子设备采用孪生参数并行的方式来求取各个采样点所对应的损失值,以绘制地形图的过程。以下将结合附图介绍绘制地形图的具体流程。
可以参阅图9和图10,图9为本申请实施例提供的一种绘制地形图的总流程示意图;图10为本申请实施例提供的一种基于采样点绘制得到地形图的流程示意图。如图9所示,电子设备在得到任意一组权重(θ_1, θ_2, ..., θ_n)之后,可以对该组权重进行降维处理,得到两个高维向量δ和η。
然后,电子设备基于最小二乘法,确定该组权重中每次迭代训练后的权重在二维空间中的坐标。基于求得的坐标,电子设备取地形图的边界,并生成地形图中等面积的网格。对于地形图中各个网格,电子设备对各个网格进行采样,确定每个采样点对应的权重。基于每个采样点对应的权重,电子设备在模型中传入权重,并进行正向计算,以得到每个采样点对应的损失值,进而确定每个采样点在地形图中的三维坐标。最后,基于每个采样点在地形图中的三维坐标,绘制得到地形图。
以上介绍了在绘制得到第一地形图之后,均匀地对第一地形图进行多区域的划分,以实现第一地形图的采样。然而,在第一地形图中,不同的区域之间的崎岖程度可以是不一样的,越崎岖的区域,损失值变化得越快;越平坦的区域,损失值则变化得越慢。因此,为了尽可能地详细描绘损失值在地形图中的变化趋势,电子设备可以识别地形图中的崎岖区域,并且增加在崎岖区域的采样点。
在一个可能的实施例中,在对第一地形图进行多区域划分之后,电子设备可以确定所述第一地形图中每个区域的崎岖程度。示例性地,电子设备可以基于数值方法分别确定所述第一地形图中每个区域的采样点的二阶导数矩阵H。然后,电子设备再计算所述二阶导数矩阵H的两个特征值,并确定所述两个特征值的绝对值之和,以得到所述第一地形图中每个区域的崎岖程度。
例如,假设电子设备计算得到的二阶导数矩阵H的两个特征值分别为λ_1和λ_2,则电子设备可以通过将这两个特征值的绝对值相加,得到区域的崎岖程度f = |λ_1| + |λ_2|。
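崎岖程度f = |λ_1| + |λ_2|可以在损失值网格上用中心差分近似二阶导数矩阵后计算,示意如下(差分步长取网格间距,属于一种常见的数值近似方式,并非唯一实现):

```python
import numpy as np

def ruggedness(Z, i, j, hx, hy):
    """Z: 采样点的损失值网格;(i, j): 某区域采样点的网格索引(需为内部点);
    hx, hy: x、y 方向的网格间距。返回二阶导数矩阵H两个特征值的绝对值之和。"""
    d2x = (Z[i, j + 1] - 2 * Z[i, j] + Z[i, j - 1]) / hx ** 2
    d2y = (Z[i + 1, j] - 2 * Z[i, j] + Z[i - 1, j]) / hy ** 2
    dxy = (Z[i + 1, j + 1] - Z[i + 1, j - 1]
           - Z[i - 1, j + 1] + Z[i - 1, j - 1]) / (4 * hx * hy)
    H = np.array([[d2x, dxy], [dxy, d2y]])
    lam1, lam2 = np.linalg.eigvalsh(H)     # 对称矩阵H的两个特征值 λ1、λ2
    return abs(lam1) + abs(lam2)           # f = |λ1| + |λ2|
```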
在确定第一地形图中每个区域的崎岖程度之后,电子设备根据每个区域的崎岖程度,在所述第一地形图中增加采样点,以更新所述第一地形图。其中,所述第一地形图中区域的采样点密集程度与所述区域的崎岖程度具有正相关关系。简单来说,电子设备增加采样点的原则可以为:区域的崎岖程度越高,则增加的采样点越多;区域的崎岖程度越低,则增加的采样点越少,甚至是不增加采样点。
在一个可能的示例中,电子设备可以是按照崎岖程度从高到低的顺序,对所述第一地形图中的多个区域进行排序,以得到所述多个区域的排序结果。在该多个区域的排序结果中,崎岖程度越高的区域,排序越靠前;崎岖程度越低的区域,排序越靠后。基于所述多个区域的排序结果,电子设备依次在所述多个区域中增加采样点,直至所增加的采样点的数量达到预设阈值。
例如,假设第一地形图被设定为总的采样点为1600个,则电子设备可以基于20*20的方式来对第一地形图进行多区域划分,共划分得到400个区域,并且得到初始的400个采样点。然后,电子设备基于上述的崎岖程度计算方法,计算得到该400个区域对应的崎岖程度,并基于区域的崎岖程度对该400个区域进行排序。由于第一地形图中总的采样点为1600个,而已进行采样的采样点为400个,因此还剩余900个采样点没有采集。基于此,电子设备可以基于多个区域的排序结果,将崎岖程度较高的区域重新划分为四个较小的区域,并且重新划分得到的区域内增加采样点,直至所增加的采样点达到900个。最后,对于第一地形图中剩余没有增加采样点的区域,同样可以将这些区域划分为四个较小的区域,采用线性插值的方式确定这些区域中的网格点的损失值。
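按崎岖程度从高到低依次细分区域并增加采样点、直至达到预设阈值的控制流程,可以概括为如下示意代码(compute_ruggedness、split_region、evaluate_loss以及sample_point属性均为假设的接口,仅表示逻辑顺序):

```python
def adaptive_refine(regions, budget, compute_ruggedness, split_region, evaluate_loss):
    """regions: 初始等距划分得到的区域列表;budget: 还允许增加的采样点数量上限。"""
    # 按崎岖程度从高到低对区域排序
    ranked = sorted(regions, key=compute_ruggedness, reverse=True)
    added = 0
    for region in ranked:
        if added >= budget:
            break
        for sub in split_region(region):       # 将崎岖程度较高的区域细分为更小的区域
            if added >= budget:
                break
            evaluate_loss(sub.sample_point)    # 在新区域中增加采样点并计算其损失值
            added += 1
    return added                               # 实际增加的采样点数量
```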
示例性地,可以参阅图11,图11为本申请实施例提供的一种基于区域的崎岖程度增加采样点的示意图。如图11所示,对于地形图中左侧较为崎岖的区域,增加了采样点,保证了地形图能够准确地描绘损失值在地形图中的变化趋势。
在另一个可能的示例中,电子设备可以提前获取或者建立崎岖程度与采样点数量之间的映射关系。这样,在确定第一地形图中每个区域的崎岖程度之后,电子设备可以基于崎岖程度与采样点数量之间的映射关系,确定每个区域需要增加的采样点,从而实现采样点的增加。其中,对于崎岖程度与采样点数量之间的映射关系来说,区域的崎岖程度越高,则对应的采样点数量越大;区域的崎岖程度越低,则对应的采样点数量越小。在实际应用中,可以是基于电子设备的算力以及地形图的精确度要求,综合确定崎岖程度与采样点数量之间具体的映射关系,在此并不限定崎岖程度与采样点数量之间具体的映射关系。
本实施例中,先对地形图进行较为稀疏的等距采样,然后基于各个采样区域的崎岖程度,在崎岖程度较高的区域增加采样点,可以基于较少的采样点近似达到全量采样的地形图可视效果。在保证地形图能够准确地描绘损失值在地形图中的变化趋势的同时,提高了地形图的绘制效率。
为便于理解,以下将结合例子介绍本申请实施例提供的模型超参数的选择方法的实现方式。
可以参阅图12,图12为本申请实施例提供的一种训练过程可视分析系统的架构示意图。如图12所示,训练过程可视分析系统可以从服务器或者主机目录中读取神经网络模型所需的超参数配置文件,并基于该超参数配置文件对神经网络模型的超参数进行配置。基于配置了超参数的神经网络模型,利用计算加速模块进行地形图的绘制,得到每组超参数对应的地形图。最后,基于定量分析模块对地形图进行分析,确定需要选择的超参数。此外,绘制得到的地形图以及所确定的超参数可以基于前端服务的交互请求,而返回前端服务中,以实现在前端渲染展示地形图以及超参数。
可以参阅图13,图13为本申请实施例提供的一种训练过程可视分析系统的工作流程示意图。如图13所示,训练过程可视分析系统的工作流程包括以下的步骤1至步骤5。
步骤1,选择超参数的不同取值(如超参数1和超参数2)与合适的数据子集,对神经网络模型进行小规模的训练,得到神经网络模型每次迭代训练后的权重。例如,基于数据子集,对神经网络模型进行5个epoch的训练。其中,1个epoch表示使用训练集的全部数据对模型进行了一次完整的训练。
步骤2,基于选定的降维方法(例如PCA方法),确定属于高维参数空间的权重的降维方向和采样范围。
步骤3,结合计算加速模块进行地形图绘制计算。可以参阅图14,图14为本申请实施例提供的一种计算加速模块的工作流程示意图。如图14所示,计算加速模块采用了自适应采样粒度方法,能够先对地形图进行较为稀疏的等距采样,然后基于各个采样区域的崎岖程度,在崎岖程度较高的区域增加采样点,可以实现了基于较少的采样点近似达到全量采样的地形图可视效果。此外,计算加速模块在执行采样的过程中,例如计算采样点所在位置时,采用多张运算卡(chip)来并列采样,以提高计算效率。另外,在计算各个采样点对应的损失值的过程中,计算加速模块还采用了上述的孪生参数并行方法对模型进行训练,以提高模型的训练效率。
步骤4,在绘制得到地形图之后,通过对地形图进行分析,从而选择更优的超参数配置。
步骤5,将得到的超参数在全量数据集上进行大规模训练,得到最终模型。
为了便于理解本申请实施例所提供的模型超参数的选择方法的有益效果,以下将结合具体的实验例子详细描述本申请实施例所提供的模型超参数的选择方法相较于现有技术的提升。
实验描述:选择ResNet50网络作为神经网络模型,使用Imagenet数据集在GPU上训练ResNet50网络,并且确定如何设置ResNet50网络的学习率才能够获得更好的泛化精度。
实验步骤:
(1)从Imagenet数据集的所有数据中(Imagenet数据集中包括128万个数据),按类别随机均匀地选取一个数据子集,数据子集中例如可以包括5万个数据。此外,选择0.1、0.4 和0.8这三个初始的学习率,在数据子集上进行5个epoch的训练。
(2)采用PCA降维方法,绘制得到三个学习率分别对应的地形图,并且在地形图中可视化训练轨迹,得到如图15所示的地形图。此外,基于等值线长度度量方法得到每个地形图的平整度指数(flatness),指数越小,代表地形图的平整程度越高。如图15所示,图15为本申请实施例提供的不同学习率的地形图的对比示意图。
(3)通过分析每个地形图的平整程度,可以确定当学习率为0.8时,地形图的平整程度最高。因此,选择学习率为0.8作为模型全数据集长时间训练的学习率。
(4)基于学习率为0.8的超参数,对模型进行全数据集(即128万个数据)的长时间训练,得到最终的模型。
基于上述的实验步骤,采用不同学习率对模型进行全数据集的长时间训练,得到不同学习率下的长时间训练结果对比。具体地,不同学习率下的长时间训练结果对比如表1所示。
表1
| 网络名称 | 学习率 | 测试精度 | 平整度指数 |
| ResNet50 | 0.8 | 0.76965 | 644 |
| ResNet50 | 0.4 | 0.74217 | 3937 |
| ResNet50 | 0.1 | 0.74257 | 1659 |
由表1可知,通过本申请实施例提供的模型超参数的选择方法所选择的学习率0.8,具有最低的平整度指数。并且,基于学习率0.8所训练得到的最终模型也达到了最高的测试精度。也就是说,基于本申请实施例提供的模型超参数的选择方法所选择的超参数能够保证训练得到的模型的精度。
以下将介绍本申请实施例提供的模型超参数的选择方法相较于现有技术的效率提升情况。
整体结论:相较于现有技术,本申请实施例提供的模型超参数的选择方法能够提升约3倍的超参数选择效率。
实验场景:用户需要确定模型的一个初始学习率,以使得模型能够具有更高的泛化精度。假设,用于训练模型的全量数据集的数量为N,基于全量数据集训练一次模型所消耗的时间为T。
现有技术:用户在全量数据集中随机、均匀地选取一个小的数据子集,并基于数据子集来执行模型的训练。假设,使用该数据子集训练模型直至收敛所消耗的时间为T_0,用户需要在P个初始学习率中确定一个学习率,则基于P个初始学习率训练模型直至收敛所消耗的总时间为P·T_0。
本申请实施例的方案:用户在全量数据集中随机、均匀地选取一个小的数据子集,并基于数据子集来执行模型的短时间训练。在短时间训练后,绘制各个学习率对应的地形图,从而根据地形图确定最终的学习率。假设,使用该数据子集短时间训练模型所消耗的时间为T_s,绘制一个地形图所消耗的时间为T_m,则基于P个初始学习率实现地形图绘制所消耗的总时间为P·(T_s+T_m)。
那么,本申请实施例的方案相对于现有技术的效率提升比率为:
P·T_0 / (P·(T_s + T_m)) = T_0 / (T_s + T_m)
以上述的实验为例,即基于Imagenet数据集训练ResNet50网络,首先在Imagenet数据集中均匀选取一个数据子集。示例性地,cifar10数据集包括5万个数据,因此可以将cifar10数据集看作所选取的一个数据子集。基于cifar10数据集执行训练的信息如下表2所示。
表2:基于cifar10数据集执行训练的配置信息(包括Epoch、Batch、Batch_size、Steps per epoch等;原表格为图片,具体数值未能还原)。
其中,Epoch表示训练的迭代轮数,1个epoch表示使用训练集的全部数据对模型进行了一次完整的训练。Batch表示使用训练集中的一部分样本对模型权重进行一次反向传播的参数更新,这一小部分样本被称为“一批数据”,即batch。Batch_size即表示批数据的大小。Steps per epoch表示一个epoch包含的步数,每一步是送入batch_size个数据来进行训练。
基于表2,可以确定上述的T_0=20.2。即使用cifar10数据集训练模型直至收敛所消耗的时间为20.2分钟。
此外,在表2所示资源的场景下,绘制一张ResNet50对应的地形图,所需的时间约为5.2分钟,即上述的T_m=5.2。使用cifar10数据集短时间训练模型(即模型训练的epoch为5)所消耗的时间T_s=1.3(即20.2*5/90=1.3)。
基于上述的具体数据,本申请实施例的方案相对于现有技术的效率提升比率可以如公式4所示。
T_0 / (T_s + T_m) = 20.2 / (1.3 + 5.2) ≈ 3.1      公式4
也就是说,相较于现有技术,本申请实施例提供的模型超参数的选择方法能够提升约3倍的超参数选择效率。
以上是以学习率为待选择的超参数为例,介绍了基于地形图的平整程度来选择相应的学习率的过程。以下将以批标准化(Batch Normalization)层为模型的超参数为例,介绍基于地形图的平整程度来选择是否添加Batch Normalization层的过程。经申请人研究发现,在神经网络模型中添加Batch Normalization层,可以使得神经网络模型训练的优化空间更加平滑,从而更有利于梯度下降,使模型收敛更快、更平稳。反之,神经网络模型中没有添加Batch Normalization层,将会导致神经网络模型的收敛变慢且收敛过程更为曲折。具体地,在VGG类型的神经网络模型中,如果没有添加Batch Normalization层,则神经网络模型的训练轨迹周边会存在明显的凸起和崎岖形状。
基于此,电子设备在绘制得到神经网络模型的地形图之后,电子设备可以对地形图中损失函数的训练路径进行识别。在电子设备识别到地形图具有特定的平整程度时,例如地形图中的训练路径具有特定的崎岖形状,电子设备则可以生成相应的提示信息,以指示用户选择特定的超参数,即指示用户在神经网络模型中选择超参数--Batch Normalization层。
示例性地,可以参阅图16,图16为本申请实施例提供的一种基于地形图指导模型结构的示意图。如图16所示,电子设备在得到地形图之后,如果识别到地形图中训练路径的地形平缓且无明显凸起,则可以是不生成提示信息,无需建议用户为模型选择超参数--Batch Normalization层;如果电子设备识别到地形图中训练路径周边存在损失值过大的明显凸起区域,则电子设备可以生成提示信息,以提示用户检查模型是否已添加Batch Normalization层,即提示用户为模型选择超参数--Batch Normalization层。
其中,地形图中训练路径的形状可以用于表示地形图的平整程度,地形图中训练路径所具有的特定形状则可以认为是特定的平整程度。示例性地,训练路径上的每个点在二维空间中具有对应的位置坐标点,且训练路径上的每个点还具有对应的损失值,因此基于每个点在二维空间的位置坐标点以及损失值,可以在三维空间中绘制训练路径。例如,以训练路径上的每个点在二维空间的位置坐标点为x轴和y轴的坐标,以每个点对应的损失值为z轴的坐标,即可得到训练路径上的每个点在三维空间中的坐标点,从而绘制得到三维地形图中的训练路径。这样一来,基于三维地形图中的训练路径的形状即可以确定地形图的平整程度。例如,训练路径在z轴方向上变化越大,则代表地形图中的损失函数变化程度越大,地形图的平整程度越小;训练路径在z轴方向上变化越小,则代表地形图中的损失函数变化程度越小,地形图的平整程度越高。
在实际应用中,电子设备在识别到地形图中训练路径具有特定的形状时,则可以认为地形图具有特定的平整程度,从而生成提示信息,以指示用户选择特定的超参数。
为验证上述通过识别地形图中特定的崎岖形状来选择神经网络模型结构的效果,以下将结合具体的实验来详细说明。
实验描述:使用cifar10数据集训练VGG(Visual Geometry Group)类型的神经网络模型(以下简称VGG模型),该神经网络模型用于进行图片分类。实验目的为:如何加快模型训练的收敛速度。
实验步骤:
(1)对VGG模型进行初始阶段的训练,例如对VGG模型进行5个epoch的训练。
(2)采用PCA降维方法,绘制得到VGG模型对应的地形图,并且在地形图中可视化训练轨迹,得到如图17所示的地形图。图17为本申请实施例所提供的未添加Batch Normalization层的VGG模型的地形图。
(3)通过识别如图17所示的地形图,发现地形图中训练轨迹周边存在明显的凸起及崎岖形状。基于对地形图的识别结果,生成提示信息,以提示用户在VGG模型中添加Batch Normalization层。
(4)用户在VGG模型中添加Batch Normalization层后,再次对添加了Batch Normalization层的VGG模型进行训练,并绘制相应的地形图,得到如图18所示的地形图。图18为本申请实施例所提供的添加Batch Normalization层的VGG模型的地形图。如图18所示,在添加了Batch Normalization层后,VGG模型对应的地形图中的训练轨迹形状得到了明显的改善。
此外,可以参阅图19,图19为本申请实施例提供的VGG模型添加Batch Normalization层前后的效果对比示意图。由图19可以看出,在VGG模型添加了Batch Normalization层之后,VGG模型在训练过程中收敛变得更快且更为平稳。
可以参阅图20,图20为本申请实施例提供的一种电子设备的结构示意图。如图20所示,本申请实施例提供的一种电子设备,包括:获取单元2001和处理单元2002。所述获取单元2001,用于获取神经网络模型的多组超参数;所述处理单元2002,用于基于所述多组超参数,分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重与所述多组超参数一一对应,所述多组权重中的每组权重包括多次迭代训练得到的权重;所述处理单元2002,还用于执行多个地形图的绘制,其中,所述多个地形图中的每个地形图基于所述多组权重中的一组绘制,所述多个地形图中的每个地形图均用于表示所述模型的损失函数在训练过程中的变化趋势;所述处理单元2002,还用于得到目标超参数,所述目标超参数为目标地形图所对应的一组超参数,所述目标超参数用于训练所述模型,所述目标地形图为所述多个地形图中平整程度最高的地形图,所述平整程度用于表示所述模型的损失函数在地形图中的变化程度。
在一种可能的实现方式中,所述每个地形图的平整程度与所述多个地形图中每个地形图的面积和等值线长度之和有关,所述多个地形图中的每个地形图均包括相同数量的等值线,所述等值线上的点对应的损失值相同。
在一种可能的实现方式中,所述获取单元2001,还用于获取训练子集,所述训练子集包括所述模型的训练集中的部分训练数据;所述处理单元2002,还用于采用所述训练子集,基于所述多组超参数分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重中的每组权重包括所述模型在每次迭代训练后的权重集合。
在一种可能的实现方式中,所述处理单元2002,还用于对第一组权重进行降维处理,得到作为二维空间中投影方向的两个高维向量,所述第一组权重为所述多组权重中的一组权重,所述第一组权重包括所述模型在每次迭代训练后的权重集合;所述处理单元2002,还用于基于所述第一组权重和所述两个高维向量,确定第一地形图中的多个采样点所对应的权重,所述多个地形图包括所述第一地形图;所述处理单元2002,还用于基于所述多个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到第一地形图。
在一种可能的实现方式中,所述处理单元2002,还用于基于所述模型和所述第一地形 图中的多个采样点对应的权重,构建得到多个子模型,所述多个子模型与所述多个采样点对应的权重一一对应,所述多个子模型的结构均与所述模型的结构相同;所述处理单元2002,还用于将相同的训练数据分别输入至所述多个子模型,以得到所述多个采样点对应的损失值。
在一种可能的实现方式中,所述处理单元2002,还用于确定所述第一地形图中每个区域的崎岖程度,所述崎岖程度用于表示每个区域内的等值线的密集程度,所述等值线上的点对应的损失值相同;所述处理单元2002,还用于根据所述崎岖程度,在所述第一地形图中增加采样点,以更新所述第一地形图;其中,所述第一地形图中区域的采样点密集程度与所述区域的崎岖程度具有正相关关系。
在一种可能的实现方式中,所述处理单元2002,还用于按照崎岖程度从高到低的顺序,对所述第一地形图中的多个区域进行排序,以得到所述多个区域的排序结果;所述处理单元2002,还用于基于所述多个区域的排序结果,依次在所述多个区域中增加采样点,直至所增加的采样点的数量达到预设阈值。
在一种可能的实现方式中,所述处理单元2002,还用于分别确定所述第一地形图中每个区域的采样点的二阶导数矩阵;所述处理单元2002,还用于计算所述二阶导数矩阵的两个特征值,并确定所述两个特征值的绝对值之和,以得到所述第一地形图中每个区域的崎岖程度。
接下来介绍本申请实施例提供的一种执行设备,请参阅图21,图21为本申请实施例提供的执行设备的一种结构示意图,执行设备2100具体可以表现为手机、平板、笔记本电脑、智能穿戴设备、服务器等,此处不做限定。其中,执行设备2100上可以部署有图21对应实施例中所描述的数据处理装置,用于实现图21对应实施例中数据处理的功能。具体的,执行设备2100包括:接收器2101、发射器2102、处理器2103和存储器2104(其中执行设备2100中的处理器2103的数量可以一个或多个,图21中以一个处理器为例),其中,处理器2103可以包括应用处理器21031和通信处理器21032。在本申请的一些实施例中,接收器2101、发射器2102、处理器2103和存储器2104可通过总线或其它方式连接。
存储器2104可以包括只读存储器和随机存取存储器,并向处理器2103提供指令和数据。存储器2104的一部分还可以包括非易失性随机存取存储器(non-volatile random access memory,NVRAM)。存储器2104存储有处理器和操作指令、可执行模块或者数据结构,或者它们的子集,或者它们的扩展集,其中,操作指令可包括各种操作指令,用于实现各种操作。
处理器2103控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。
上述本申请实施例揭示的方法可以应用于处理器2103中,或者由处理器2103实现。处理器2103可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器2103中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处 理器2103可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器2103可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器2104,处理器2103读取存储器2104中的信息,结合其硬件完成上述方法的步骤。
接收器2101可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器2102可用于通过第一接口输出数字或字符信息;发射器2102还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器2102还可以包括显示屏等显示设备。
本申请实施例中,在一种情况下,处理器2103,用于执行图6对应实施例中的执行设备执行的模型超参数的选择方法。
本申请实施例提供的执行设备、训练设备或电子设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的模型超参数的选择方法,或者,以使训练设备内的芯片执行上述实施例描述的模型超参数的选择方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
具体的,请参阅图22,图22为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 2200,NPU 2200作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2203,通过控制器2204控制运算电路2203提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路2203内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路2203是二维脉动阵列。运算电路2203还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2203是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2202中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2201中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2208中。
统一存储器2206用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)2205,DMAC被搬运到权重存储器2202中。输入数据也通过DMAC被搬运到统一存储器2206中。
BIU为Bus Interface Unit,即总线接口单元2213,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)2209的交互。
总线接口单元2213(Bus Interface Unit,简称BIU),用于取指存储器2209从外部存储器获取指令,还用于存储单元访问控制器2205从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器2206或将权重数据搬运到权重存储器2202中或将输入数据搬运到输入存储器2201中。
向量计算单元2207包括多个运算处理单元,在需要的情况下,对运算电路2203的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。
在一些实现中,向量计算单元2207能将经处理的输出的向量存储到统一存储器2206。例如,向量计算单元2207可以将线性函数;或,非线性函数应用到运算电路2203的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2207生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2203的激活输入,例如用于在神经网络中的后续层中的使用。
控制器2204连接的取指存储器(instruction fetch buffer)2209,用于存储控制器2204使用的指令;
统一存储器2206,输入存储器2201,权重存储器2202以及取指存储器2209均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。
可以参阅图23,图23为本申请实施例提供的一种计算机可读存储介质的结构示意图。本申请还提供了一种计算机可读存储介质,在一些实施例中,上述图6所公开的方法可以实施为以机器可读格式被编码在计算机可读存储介质上或者被编码在其它非瞬时性介质或者制品上的计算机程序指令。
图23示意性地示出根据这里展示的至少一些实施例而布置的示例计算机可读存储介质的概念性局部视图,示例计算机可读存储介质包括用于在计算设备上执行计算机进程的计算机程序。
在一个实施例中,计算机可读存储介质2300是使用信号承载介质2301来提供的。信号承载介质2301可以包括一个或多个程序指令2302,其当被一个或多个处理器运行时可以提供以上针对图6描述的功能或者部分功能。因此,例如,参考图6中所示的实施例, 步骤601-604的一个或多个特征可以由与信号承载介质2301相关联的一个或多个指令来承担。此外,图23中的程序指令2302也描述示例指令。
在一些示例中,信号承载介质2301可以包含计算机可读介质2303,诸如但不限于,硬盘驱动器、紧密盘(CD)、数字视频光盘(DVD)、数字磁带、存储器、ROM或RAM等等。
在一些实施方式中,信号承载介质2301可以包含计算机可记录介质2304,诸如但不限于,存储器、读/写(R/W)CD、R/W DVD、等等。在一些实施方式中,信号承载介质2301可以包含通信介质2305,诸如但不限于,数字和/或模拟通信介质(例如,光纤电缆、波导、有线通信链路、无线通信链路、等等)。因此,例如,信号承载介质2301可以由无线形式的通信介质2305(例如,遵守IEEE 802.23标准或者其它传输协议的无线通信介质)来传达。
一个或多个程序指令2302可以是,例如,计算机可执行指令或者逻辑实施指令。在一些示例中,计算设备的计算设备可以被配置为,响应于通过计算机可读介质2303、计算机可记录介质2304、和/或通信介质2305中的一个或多个传达到计算设备的程序指令2302,提供各种操作、功能、或者动作。
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线 (例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。

Claims (11)

  1. 一种模型超参数的选择方法,其特征在于,包括:
    获取神经网络模型的多组超参数;
    基于所述多组超参数,分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重与所述多组超参数一一对应,所述多组权重中的每组权重包括多次迭代训练得到的权重;
    执行多个地形图的绘制,其中,所述多个地形图中的每个地形图基于所述多组权重中的一组绘制,所述多个地形图中的每个地形图均用于表示所述模型的损失函数在训练过程中的变化趋势;
    得到目标超参数,所述目标超参数为目标地形图所对应的一组超参数,所述目标超参数用于训练所述模型,所述目标地形图为所述多个地形图中平整程度最高的地形图,所述平整程度用于表示所述模型的损失函数在地形图中的变化程度。
  2. 根据权利要求1所述的方法,其特征在于,
    所述每个地形图的平整程度与所述多个地形图中每个地形图的面积和等值线长度之和有关,所述多个地形图中的每个地形图均包括相同数量的等值线,所述等值线上的点对应的损失值相同。
  3. 根据权利要求1或2所述的方法,其特征在于,所述基于所述多组超参数,分别对所述模型进行训练,包括:
    获取训练子集,所述训练子集包括所述模型的训练集中的部分训练数据;
    采用所述训练子集,基于所述多组超参数分别对所述模型进行多次迭代训练,以得到所述模型在训练过程中的多组权重,所述多组权重中的每组权重包括所述模型在每次迭代训练后的权重集合。
  4. 根据权利要求1-3任意一项所述的方法,其特征在于,所述基于所述多组权重,分别执行地形图的绘制,以得到多个地形图,包括:
    对第一组权重进行降维处理,得到作为二维空间中投影方向的两个高维向量,所述第一组权重为所述多组权重中的一组权重,所述第一组权重包括所述模型在每次迭代训练后的权重集合;
    基于所述第一组权重和所述两个高维向量,确定第一地形图中的多个采样点所对应的权重,所述多个地形图包括所述第一地形图;
    基于所述多个采样点所对应的权重,确定所述模型对应的损失值,以绘制得到第一地形图。
  5. 根据权利要求4所述的方法,其特征在于,所述基于所述多个采样点所对应的权重,确定所述模型对应的损失值,包括:
    基于所述模型和所述第一地形图中的多个采样点对应的权重,构建得到多个子模型,所述多个子模型与所述多个采样点对应的权重一一对应,所述多个子模型的结构均与所述模型的结构相同;
    将相同的训练数据分别输入至所述多个子模型,以得到所述多个采样点对应的损失值。
  6. 根据权利要求4或5所述的方法,其特征在于,所述方法还包括:
    确定所述第一地形图中每个区域的崎岖程度,所述崎岖程度用于表示每个区域内的等值线的密集程度,所述等值线上的点对应的损失值相同;
    根据所述崎岖程度,在所述第一地形图中增加采样点,以更新所述第一地形图;
    其中,所述第一地形图中区域的采样点密集程度与所述区域的崎岖程度具有正相关关系。
  7. 根据权利要求6所述的方法,其特征在于,所述根据所述崎岖程度,在所述第一地形图中增加采样点,包括:
    按照崎岖程度从高到低的顺序,对所述第一地形图中的多个区域进行排序,以得到所述多个区域的排序结果;
    基于所述多个区域的排序结果,依次在所述多个区域中增加采样点,直至所增加的采样点的数量达到预设阈值。
  8. 根据权利要求6或7所述的方法,其特征在于,所述确定所述第一地形图中每个区域的崎岖程度,包括:
    分别确定所述第一地形图中每个区域的采样点的二阶导数矩阵;
    计算所述二阶导数矩阵的两个特征值,并确定所述两个特征值的绝对值之和,以得到所述第一地形图中每个区域的崎岖程度。
  9. 一种电子设备,其特征在于,包括存储器和处理器;所述存储器存储有代码,所述处理器被配置为执行所述代码,当所述代码被执行时,所述电子设备执行如权利要求1至8任意一项所述的方法。
  10. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至8任意一项所述的方法。
  11. 一种计算机程序产品,其特征在于,所述计算机程序产品存储有指令,所述指令在由计算机执行时使得所述计算机实施权利要求1至8任意一项所述的方法。
PCT/CN2022/099779 2021-06-28 2022-06-20 一种模型超参数的选择方法及相关装置 WO2023273934A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110722986.3 2021-06-28
CN202110722986.3A CN115601513A (zh) 2021-06-28 2021-06-28 一种模型超参数的选择方法及相关装置

Publications (1)

Publication Number Publication Date
WO2023273934A1 true WO2023273934A1 (zh) 2023-01-05

Family

ID=84690037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099779 WO2023273934A1 (zh) 2021-06-28 2022-06-20 一种模型超参数的选择方法及相关装置

Country Status (2)

Country Link
CN (1) CN115601513A (zh)
WO (1) WO2023273934A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237952A (zh) * 2023-11-15 2023-12-15 山东大学 基于免疫地形图的染色病理切片细胞分布标注方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109799533A (zh) * 2018-12-28 2019-05-24 中国石油化工股份有限公司 一种基于双向循环神经网络的储层预测方法
CN110163713A (zh) * 2019-01-28 2019-08-23 腾讯科技(深圳)有限公司 一种业务数据处理方法、装置以及相关设备
US20200364553A1 (en) * 2019-05-17 2020-11-19 Robert Bosch Gmbh Neural network including a neural network layer
CN112699833A (zh) * 2021-01-12 2021-04-23 上海海事大学 基于卷积神经网络复杂光照环境下的舰船目标识别方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109799533A (zh) * 2018-12-28 2019-05-24 中国石油化工股份有限公司 一种基于双向循环神经网络的储层预测方法
CN110163713A (zh) * 2019-01-28 2019-08-23 腾讯科技(深圳)有限公司 一种业务数据处理方法、装置以及相关设备
US20200364553A1 (en) * 2019-05-17 2020-11-19 Robert Bosch Gmbh Neural network including a neural network layer
CN112699833A (zh) * 2021-01-12 2021-04-23 上海海事大学 基于卷积神经网络复杂光照环境下的舰船目标识别方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237952A (zh) * 2023-11-15 2023-12-15 山东大学 基于免疫地形图的染色病理切片细胞分布标注方法及系统
CN117237952B (zh) * 2023-11-15 2024-02-09 山东大学 基于免疫地形图的染色病理切片细胞分布标注方法及系统

Also Published As

Publication number Publication date
CN115601513A (zh) 2023-01-13

Similar Documents

Publication Publication Date Title
WO2022083536A1 (zh) 一种神经网络构建方法以及装置
WO2021043193A1 (zh) 神经网络结构的搜索方法、图像处理方法和装置
WO2022042713A1 (zh) 一种用于计算设备的深度学习训练方法和装置
WO2022068623A1 (zh) 一种模型训练方法及相关设备
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
WO2022052601A1 (zh) 神经网络模型的训练方法、图像处理方法及装置
WO2021218517A1 (zh) 获取神经网络模型的方法、图像处理方法及装置
EP4145351A1 (en) Neural network construction method and system
WO2022179492A1 (zh) 一种卷积神经网络的剪枝处理方法、数据处理方法及设备
WO2023221928A1 (zh) 一种推荐方法、训练方法以及装置
CN113705769A (zh) 一种神经网络训练方法以及装置
WO2021218470A1 (zh) 一种神经网络优化方法以及装置
WO2022111617A1 (zh) 一种模型训练方法及装置
CN110222718B (zh) 图像处理的方法及装置
WO2022012668A1 (zh) 一种训练集处理方法和装置
WO2022007867A1 (zh) 神经网络的构建方法和装置
CN112801265A (zh) 一种机器学习方法以及装置
WO2021129668A1 (zh) 训练神经网络的方法和装置
WO2022179586A1 (zh) 一种模型训练方法及其相关联设备
WO2023185925A1 (zh) 一种数据处理方法及相关装置
CN113536970A (zh) 一种视频分类模型的训练方法及相关装置
US20220327835A1 (en) Video processing method and apparatus
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
CN113407820B (zh) 利用模型进行数据处理的方法及相关系统、存储介质
WO2023273934A1 (zh) 一种模型超参数的选择方法及相关装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831752

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE