CN112052626B - Automatic design system and method for neural network - Google Patents
- Publication number
- CN112052626B (application CN202010818278.5A)
- Authority
- CN
- China
- Prior art keywords
- module
- neural network
- model
- video
- block module
- Prior art date: 2020-08-14 (priority date)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
Abstract
The invention discloses an automatic neural network design system and method that use NAS to automatically design a model for the supervised VO task, improving both the NAS field and the VO field. For NAS, a more general framework is provided that can process spatial and temporal information simultaneously, making it suitable for video vision tasks. For VO, the VONAS algorithm of the invention searches out a network model with better performance and lighter weight.
Description
Technical Field
The invention belongs to the field of visual odometry within computer vision, and particularly relates to an automatic design system and method for a neural network.
Background
Visual Odometry (VO) is a critical task in autonomous driving and robotics, aimed at estimating the camera pose from successive frames. The conventional VO pipeline is a typical geometric task: the pose is obtained by rigorous computation from matches between feature points or pixels. With the rapid development of CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks) in visual tasks, more and more end-to-end network frameworks are being applied to the VO task.
In deep-learning-based frameworks, the VO task is a video regression task, distinct from semantics-based visual tasks (e.g., image classification, object detection). First, the VO task predicts a 6-DoF (degrees of freedom) camera pose and focuses on the geometric feature stream rather than on semantic features; camera motion cannot be computed accurately from semantic information alone, such as by merely detecting or identifying objects in the image. Second, the VO task must process at least two images simultaneously to compute the relative pose, which places demands on its ability to extract temporal features and makes it sensitive to the input order of the images: a different input order yields a different prediction.
Using neural architecture search (Neural Architecture Search, abbreviated as NAS) to automatically design a model for the VO task, that is, to select a lightweight model suited to extracting both geometric and temporal features, is an innovative and very challenging attempt. However, as described above, while the requirement of geometric feature extraction can be met by NAS-based automatic model design, existing NAS frameworks cannot process temporal information.
Disclosure of Invention
The technical problem the invention aims to solve is that existing NAS cannot process temporal information.
To solve this problem, the present invention provides an automatic design system and method for a neural network.
The technical scheme adopted by the invention is as follows:
An automatic neural network design method, characterized by a super-network structure and a controller model, comprises the following steps:
s1, preparing a video sequence containing video data and real camera pose data;
s2, extracting a video segment V1 from the video sequence of the S1, forming training batch data by the video segment V1, uniformly sampling each block operation of the super-network structure, selecting the operation of the training batch, forming a path after the selection is completed, wherein the path is a sub-network model, sequentially inputting two adjacent frames of images in the V1 into the sub-network model according to time sequence to obtain a pose sequence between the image frames, calculating an error by using a loss function, and updating network parameters until the loss function is not reduced;
s3, outputting operands selected by each block by using a controller model to generate codes of sub-models, extracting video segments v2 and corresponding real camera pose data from the video sequence of the S1 according to video time sequence by adopting the network parameters iterated in the S2, inputting v2 into the sub-models to obtain predicted poses, comparing the predicted poses with the real camera poses, calculating to obtain segment evaluation indexes, repeating the operation of the S3 until the complete video sequence is extracted, and calculating all segment evaluation indexes to obtain final evaluation indexes of the sub-models;
s4, carrying out parameter updating on the controller model parameters by utilizing the final evaluation index of the submodel obtained in the S3, and repeating the S3 until the set iteration number is reached or the performance of the submodel is not improved;
s5, outputting n submodels by using a controller, and picking out the submodel with the best final evaluation index as a final output result.
Preferably, in step S2, the loss function is:
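(A plausible reconstruction; the formula is assumed to be the general adaptive robust loss with shape parameter a and scale parameter c:)

$$f(x; a, c) = \frac{|a-2|}{a}\left[\left(\frac{(x/c)^{2}}{|a-2|} + 1\right)^{a/2} - 1\right]$$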
where x is the Euclidean distance between the predicted pose and the real camera pose, and a and c are parameters controlling the loss.
Preferably, the video sequence of S1 is divided into a training set and a verification set; S2 uses the training set and S3 uses the verification set.
Preferably, the final evaluation index of the sub-model in S3 is calculated by averaging all segment evaluation indexes.
Preferably, n in S5 is an integer of 10 or less.
Preferably, in S3, the final evaluation index is recorded.
An automatic neural network design system comprises a super-network structure and a controller model, wherein the super-network structure comprises a Stem module, a Convolution Block module, a Reduction Block module, a Sequential Block module and a Tail module; the Stem module is used for processing the input, which is two stacked RGB images; the Tail module is used for processing temporal information; the Convolution Block module comprises a parallel combination of convolution operations with different parameters; the Reduction Block module comprises a parallel combination of convolution-based downsampling operations; and the Sequential Block module is used for integrating the temporal information of the input.
Preferably, the Sequential Block module employs one of 4 operations: ConvLSTMs 3x3, ConvLSTMs 5x5, ConvGRUs 3x3, or ConvGRUs 5x5.
Compared with the prior art, the invention has the following advantages and effects:
the invention utilizes the NAS to automatically design the model with the supervision of VO tasks, and improves the NAS field and the VO field. For the NAS field, a more general NAS framework is provided, which can process space and time sequence information simultaneously so as to be suitable for video visual tasks. In the VO aspect, by utilizing our VONAS algorithm, a network model with better performance and lighter weight is obtained by searching.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of a super network structure according to the present invention;
FIG. 2 is a schematic diagram of the Sequential Block module;
FIG. 3 is an operational diagram of the Convolution Block module and the Reduction Block module;
FIG. 4 is a schematic diagram of a controller model;
FIG. 5 is a schematic diagram of the output results of the present invention;
FIG. 6 is a comparison of the pose estimation results of the present invention (VONAS-A and VONAS-B) with other algorithms;
FIG. 7 is a comparison of the complexity and performance of the present invention against existing advanced network architectures for pose estimation.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1:
An automatic neural network design system comprises a super-network structure and a controller model. As shown in FIG. 1, the super-network structure comprises a Stem initial processing module, a plurality of block modules, and a Tail module. The Stem module contains a convolution operation with kernel size 7x7 and stride 2. The Tail module contains a standard ConvLSTM layer for the final processing of temporal information.

The blocks come in 3 types. First, the Convolution Block comprises a parallel combination of several convolution operations with different parameters; the feature map it outputs matches the input in both size and channel count (the number of convolution kernels in each layer is unchanged). Second, the Reduction Block comprises a parallel combination of convolution-based downsampling operations; the feature map it outputs is half the size of the input, and the channel count is doubled. Third, the Sequential Block comprises different convolution-based ConvRNN operations that integrate the temporal information of the input sequence; the size and channel count of its output feature map are unchanged.

As shown in FIG. 2, the Sequential Block offers two kinds of module. The first, named ConvLSTMs (simple), removes one gate operation from the ConvLSTM, thereby reducing model complexity, eliminates the tanh operation, and adds normalization. The second, named ConvGRUs (simple), removes the tanh operation from the ConvGRU and adds normalization. Each Sequential Block contains both operations with convolution kernel sizes of 3 and 5, so the number of candidate operations in a Sequential Block is 4. FIG. 3 shows the candidate operations contained in the Convolution Block and the Reduction Block.
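To make the simplified recurrent cell concrete, the following is a minimal PyTorch sketch of what a ConvLSTMs (simple) cell could look like. The patent does not specify which gate is removed or which normalization is added, so this sketch assumes the output gate is dropped (the hidden state equals the cell state) and uses GroupNorm; all names and signatures are illustrative, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ConvLSTMSimple(nn.Module):
    """Sketch of a simplified ConvLSTM cell ("ConvLSTMs"): one gate fewer than a
    standard ConvLSTM, no tanh, with normalization added. Assumptions (not in
    the patent): the output gate is the removed one; GroupNorm is the added norm."""

    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        pad = kernel // 2
        # 3 gate maps instead of 4: input gate, forget gate, candidate state
        self.gates = nn.Conv2d(in_ch + hid_ch, 3 * hid_ch, kernel, padding=pad)
        self.norm = nn.GroupNorm(3, 3 * hid_ch)  # normalize gate pre-activations
        self.hid_ch = hid_ch

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:
            zeros = x.new_zeros(b, self.hid_ch, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        z = self.norm(self.gates(torch.cat([x, h_prev], dim=1)))
        i, f, g = torch.chunk(z, 3, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * g  # no tanh on g
        h_new = c  # output gate removed: hidden state equals cell state
        return h_new, (h_new, c)
```

The ConvGRUs (simple) variant would follow the same pattern starting from a ConvGRU cell, with the tanh removed and normalization added.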
As shown in FIGS. 1-5, the automatic neural network design method includes the following steps.
Step 1, prepare a video sequence containing video data and real camera pose data, such as the KITTI autonomous driving dataset, which contains video data captured by a vehicle-mounted camera together with the real camera pose data provided with the dataset. Step 2 may then be performed directly, or the whole video sequence may first be divided into a training set and a validation set before performing step 2.
Step 2 is divided into 5 sub-steps:
2.1, extract a continuous video segment V1 from the video sequence or training set of step 1, preferably comprising 5-10 frames of data, to form the training data of the current batch.
2.2, sample the structure of a sub-network model: uniformly sample one candidate operation for each block of the super-network as the operation for the current training batch; after all blocks have been selected, the selections form a path from input to output in the super-network, and this path is a sub-network model.
2.3, input adjacent image pairs from V1 into the sub-model sequentially in video temporal order to obtain the pose sequence between the image frames.
2.4, calculate the error with the Adaptive loss and back-propagate to update the network parameters, where the Adaptive loss is defined as follows:
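(Again assuming the general adaptive robust loss form:)

$$f(x; a, c) = \frac{|a-2|}{a}\left[\left(\frac{(x/c)^{2}}{|a-2|} + 1\right)^{a/2} - 1\right]$$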
where x is the Euclidean distance between the predicted pose and the real pose, and a and c are parameters controlling the loss, which can be adjusted through the network's gradient feedback.
2.5, repeat the training iterations 2.1 to 2.4 until the loss function no longer decreases, so that all candidate operations of each block in the super-network are sufficiently trained. A minimal sketch of this training loop follows.
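The sketch below illustrates steps 2.1-2.5 in PyTorch under assumed interfaces: supernet.sample_path() for the uniform sampling of 2.2, a video_sequence object with len() and slice(), and an adaptive_loss callable for 2.4. None of these names come from the patent; they are illustrative only.

```python
import random
import torch

def train_supernet(supernet, video_sequence, optimizer, adaptive_loss,
                   segment_len=7, patience=5):
    """Step 2 sketch: uniform path sampling plus pose regression.
    The optimizer holds the super-network's parameters; the sampled path
    shares them, so updating the path updates the super-network."""
    best_loss, stall = float("inf"), 0
    while stall < patience:                      # 2.5: stop when loss plateaus
        # 2.1: extract a continuous 5-10 frame segment as the training batch
        start = random.randrange(len(video_sequence) - segment_len)
        frames, gt_poses = video_sequence.slice(start, segment_len)
        # 2.2: uniformly sample one candidate op per block -> one sub-network
        subnet = supernet.sample_path()
        # 2.3: feed adjacent frame pairs in temporal order
        pred_poses = [subnet(frames[i], frames[i + 1])
                      for i in range(segment_len - 1)]
        # 2.4: adaptive loss on the Euclidean pose error, then backprop
        # (gt_poses is assumed to hold the segment_len - 1 relative poses)
        loss = sum(adaptive_loss(torch.dist(p, g))
                   for p, g in zip(pred_poses, gt_poses))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        stall = 0 if loss.item() < best_loss else stall + 1
        best_loss = min(best_loss, loss.item())
    return supernet
```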
Step 3, perform the model search using the verification set of step 1, comprising the following steps:
and 3.1, outputting operands selected by each block in time sequentially by using the controller model, and generating a code of a sub-model after outputting. The code is a series of sequences, each value in the sequence being represented as a candidate operation number selected in each block. The code uniquely corresponds to a sub-model. A schematic of the process is shown in fig. 4.
3.2, once the network structure of the sub-model is determined, its parameters are inherited (directly copied) from the network parameters of the corresponding structure in the super-network trained in step 2; no retraining is needed, which avoids the huge time cost of retraining each sub-model.
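One plausible way to realize this parameter inheritance, assuming the sub-model reuses the super-network's parameter names (a sketch, not the patent's implementation):

```python
import torch.nn as nn

def inherit_weights(submodel: nn.Module, supernet: nn.Module) -> nn.Module:
    """Sketch of 3.2: copy the trained super-network parameters used by the
    chosen path into the sub-model; nothing is retrained. Assumes matching
    parameter names between the two modules."""
    super_state = supernet.state_dict()
    shared = {k: v for k, v in super_state.items() if k in submodel.state_dict()}
    submodel.load_state_dict(shared, strict=False)
    return submodel
```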
3.3, with the structure and parameters of the sub-model determined through 3.1 and 3.2, extract video segments V2 from the video sequence or verification set in video temporal order and perform pose prediction: input V2 into the sub-model to obtain the predicted poses, compare them with the real poses, and calculate a segment evaluation index. Average the evaluation indexes of the prediction results over all verification-set video sequences to obtain the final evaluation index of the sub-model, which evaluates its performance. Step 4 may then be performed directly, or after recording the sub-model structure and its evaluation index.
Step 4: steps 3.1-3.3 determine and evaluate one complete sub-model; the resulting evaluation index is used to update the parameters of the controller model, specifically by the policy gradient method from reinforcement learning, so that the performance of the sub-network models output by the controller gradually improves. Steps 3.1-3.3 are then repeated for further training iterations; the training of the controller model is complete when the preset number of training iterations is reached or the performance of the sub-networks output by the controller no longer improves (i.e., the accuracy no longer increases). A minimal sketch of this update follows.
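The sketch below shows the policy-gradient (REINFORCE) update of step 4, reusing the hypothetical Controller.sample() interface sketched earlier. The reward definition (the negated evaluation index, since lower is better) and the moving-average baseline are assumptions the patent does not state.

```python
def controller_step(controller, ctrl_optimizer, evaluate, baseline=None, beta=0.9):
    """Step 4 sketch: one REINFORCE update of the controller from one sub-model
    evaluation. `evaluate(code)` is assumed to run 3.2-3.3 and return the final
    evaluation index as a float (lower is better)."""
    code, log_prob = controller.sample()          # 3.1: sample a sub-model code
    reward = -evaluate(code)                      # lower index -> higher reward
    # Exponential moving-average baseline to reduce variance (assumption)
    baseline = reward if baseline is None else beta * baseline + (1 - beta) * reward
    loss = -(reward - baseline) * log_prob        # policy gradient objective
    ctrl_optimizer.zero_grad()
    loss.backward()
    ctrl_optimizer.step()
    return baseline
```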
Step 5, use the controller model trained in step 4 to output n sub-models (preferably no more than 10), and select the sub-model with the best pose prediction evaluation index (i.e., the smallest evaluation value) as the final search result.
As shown in FIGS. 6-7, the average pose estimation errors of the present invention (Avg terr and Avg rerr) are lower than those of other advanced network models, i.e., its performance is better.
The foregoing description covers the preferred embodiments of the invention and is not intended to limit the invention to the precise form disclosed; any modifications, equivalent substitutions, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
Claims (8)
1. An automatic neural network design method, characterized by comprising a super-network structure and a controller model, wherein the super-network structure comprises a Stem module, a Convolution Block module, a Reduction Block module, a Sequential Block module and a Tail module; the Stem module is used for processing the input, which is two stacked RGB images; the Tail module is used for processing temporal information; the Convolution Block module comprises a parallel combination of convolution operations with different parameters; the Reduction Block module comprises a parallel combination of convolution-based downsampling operations; and the Sequential Block module is used for integrating the temporal information of the input; the automatic neural network design method comprises the following steps:
s1, preparing a video sequence containing video data and real camera pose data;
s2, extracting a video segment V1 from the video sequence of the S1, forming training batch data by the video segment V1, uniformly sampling each block operation of the super-network structure, selecting the operation of the training batch, forming a path after the selection is completed, wherein the path is a sub-network model, sequentially inputting two adjacent frames of images in the V1 into the sub-network model according to time sequence to obtain a pose sequence between the image frames, calculating an error by using a loss function, and updating network parameters until the loss function is not reduced;
s3, outputting operands selected by each block by using a controller model to generate codes of sub-models, extracting video segments v2 and corresponding real camera pose data from the video sequence of the S1 according to video time sequence by adopting the network parameters iterated in the S2, inputting v2 into the sub-models to obtain predicted poses, comparing the predicted poses with the real camera poses, calculating to obtain segment evaluation indexes, repeating the operation of the S3 until the complete video sequence is extracted, and calculating the segment evaluation indexes to obtain final evaluation indexes of the sub-models;
s4, carrying out parameter updating on the controller model parameters by utilizing the final evaluation index of the submodel obtained in the S3, and repeating the S3 until the set iteration number is reached or the performance of the submodel is not improved;
s5, outputting n submodels by using a controller, and picking out the submodel with the best final evaluation index as a final output result.
2. The automatic neural network design method according to claim 1, wherein in step S2 the loss function is:
where x is the Euclidean distance between the predicted pose and the real camera pose, and a and c are parameters controlling the loss.
3. The automatic neural network design method according to claim 1, wherein the video sequence of S1 is divided into a training set and a verification set; S2 uses the training set and S3 uses the verification set.
4. The automatic neural network design method according to claim 1, wherein the final evaluation index of the sub-model in S3 is calculated by averaging all segment evaluation indexes.
5. The automatic neural network design method according to claim 1, wherein in S3 the final evaluation index is recorded.
6. The automatic neural network design method according to claim 1, wherein n in S5 is an integer no greater than 10.
7. An automatic neural network design system for implementing the automatic neural network design method according to claim 1, comprising a super-network structure and a controller model, wherein the super-network structure comprises a Stem module, a Convolution Block module, a Reduction Block module, a Sequential Block module and a Tail module; the Stem module is used for processing the input, which is two stacked RGB images; the Tail module is used for processing temporal information; the Convolution Block module comprises a parallel combination of convolution operations with different parameters; the Reduction Block module comprises a parallel combination of convolution-based downsampling operations; and the Sequential Block module is used for integrating the temporal information of the input.
8. The automatic neural network design system of claim 7, wherein the Sequential Block module employs one of 4 operations: ConvLSTMs 3x3, ConvLSTMs 5x5, ConvGRUs 3x3, or ConvGRUs 5x5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010818278.5A CN112052626B (en) | 2020-08-14 | 2020-08-14 | Automatic design system and method for neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010818278.5A CN112052626B (en) | 2020-08-14 | 2020-08-14 | Automatic design system and method for neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112052626A CN112052626A (en) | 2020-12-08 |
CN112052626B (en) | 2024-01-19
Family
ID=73600419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010818278.5A Active CN112052626B (en) | 2020-08-14 | 2020-08-14 | Automatic design system and method for neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112052626B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114492570A (en) * | 2021-12-21 | 2022-05-13 | 绍兴市北大信息技术科创中心 | Key point extraction network construction method and system of neural network architecture |
- 2020-08-14 CN application CN202010818278.5A filed; patent CN112052626B, status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018052875A1 (en) * | 2016-09-15 | 2018-03-22 | Google Llc | Image depth prediction neural networks |
WO2019213459A1 (en) * | 2018-05-04 | 2019-11-07 | Northeastern University | System and method for generating image landmarks |
CN111028282A (en) * | 2019-11-29 | 2020-04-17 | 浙江省北大信息技术高等研究院 | Unsupervised pose and depth calculation method and system |
CN111182292A (en) * | 2020-01-05 | 2020-05-19 | 西安电子科技大学 | No-reference video quality evaluation method and system, video receiver and intelligent terminal |
CN111369608A (en) * | 2020-05-29 | 2020-07-03 | 南京晓庄学院 | Visual odometer method based on image depth estimation |
Non-Patent Citations (1)
Title |
---|
Residual neural networks and their applications in medical image processing; Zhou Tao; Huo Bingqiang; Lu Huiling; Ren Hailing; Acta Electronica Sinica (07); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112052626A (en) | 2020-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||