CN111738267A - Visual perception method and visual perception device based on linear multi-step residual learning - Google Patents

Visual perception method and visual perception device based on linear multi-step residual learning

Info

Publication number
CN111738267A
CN111738267A (application CN202010473221.6A)
Authority
CN
China
Prior art keywords
visual perception
information
linear multi-step
model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010473221.6A
Other languages
Chinese (zh)
Other versions
CN111738267B (en)
Inventor
张寒波
邵文泽
李海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202010473221.6A priority Critical patent/CN111738267B/en
Publication of CN111738267A publication Critical patent/CN111738267A/en
Application granted granted Critical
Publication of CN111738267B publication Critical patent/CN111738267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual perception method and a visual perception device based on linear multi-step residual learning. The method comprises the following steps: acquiring a real-time image; performing visual perception on the image with a pre-established visual perception model to obtain semantic information or distance parameters in the image. The visual perception model is established by performing depth convolution on the acquired image and then applying linear multi-step residual learning; it takes data in a training set as training samples, takes the original shared features and task features of an image as input, and outputs a semantic segmentation map or a depth map. The method and the device greatly reduce the parameter count of the model and improve the running efficiency of the perception model while maintaining calculation accuracy.

Description

Visual perception method and visual perception device based on linear multi-step residual learning
Technical Field
The invention relates to a visual perception method and a visual perception device based on linear multi-step residual learning, and belongs to the field of computer vision and image processing.
Background
During automatic driving, an unmanned vehicle not only needs to judge what the surrounding objects are based on its visual perception system, but also needs to quickly judge the distance between those objects and the vehicle so that correct decisions can be made. The quality of the visual perception system directly affects the safety and reliability of the unmanned vehicle, and the maturity of visual perception technology directly determines the viability and subsequent development of automatic driving. The sensors mainly adopted in current automatic driving perception systems include cameras, lidar, GPS and other technologies. In industry there are currently two main ways to design a visual perception system. The first realizes recognition and ranging of surrounding objects through active scanning by lidar; the other is a purely visual perception scheme, in which a camera collects information about the environment around the unmanned vehicle and a vision algorithm analyzes the semantic information and distance parameters of surrounding objects.
Compared with a lidar perception scheme, the purely visual perception scheme is cheap and easy to deploy, making it an attractive solution. However, designing an efficient visual perception algorithm remains a difficult problem. Although the development of computer vision and deep learning has produced many highly accurate perception algorithms in academia, in industrial deployment scenarios the limited computing capability of unmanned-vehicle computing platforms means that both the accuracy of the perception algorithm and its efficiency and parameter count must be considered. Therefore, how to maximize algorithm accuracy on the unmanned-vehicle platform while ensuring high execution efficiency is the core problem of perception algorithm design.
In recent years, many efforts have been made on automatic driving visual perception algorithms. For example, MultiNet [Real-time Joint Semantic Reasoning for Autonomous Driving] proposes a unified architecture for real-time classification, detection and semantic segmentation. Cross-stitch Networks for Multi-task Learning studies the influence of network weight sharing at different levels on multi-task learning, and proposes a cross-stitch unit that automatically learns an optimal network sharing structure. UberNet [Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory] proposes a general-purpose network as a "Swiss knife" solution for vision tasks, which can jointly process low-, mid- and high-level vision tasks using diverse datasets and limited memory. Although these models have high accuracy, their parameter counts are too large, making computation complex and inefficient. In particular, when applied to an unmanned vehicle, the computational complexity and efficiency of the model become factors that prevent the vehicle from making correct decisions quickly.
Disclosure of Invention
The invention provides a visual perception method based on linear multi-step residual learning, which greatly reduces the parameter count of the model and improves the running efficiency of the perception model while maintaining calculation accuracy.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows: a visual perception method based on linear multi-step residual error learning comprises the following steps: acquiring a real-time image; performing visual perception on the image by adopting a pre-established visual perception model to acquire semantic information or distance parameters in the image; the visual perception model is established by performing depth convolution on the acquired image and based on linear multi-step residual learning; the visual perception model takes data in a training set as a training sample, inputs original shared features and task features of an image, and outputs a semantic segmentation map or a depth map.
Further, the training set is obtained by applying random rotation and horizontal flipping to the original Cityscapes training set.
Further, after the deep convolution is performed on the acquired image, the establishing of the visual perception model based on the linear multi-step residual learning comprises the following steps: carrying out shared feature extraction on the acquired real-time image to obtain global scene information; extracting task features of the global scene information to obtain semantic segmentation information or depth estimation information; and performing depth convolution on the global scene information and the semantic segmentation information or the depth estimation information, and then performing linear multi-step residual learning to obtain a visual perception model.
Further, the depth convolution is a depthwise separable convolution comprising a depthwise convolution and a pointwise (point-by-point) convolution.
Further, the visual perception model is calculated by formula (1):
x_{n+1} = k_n · (b_n(x_n) + x_n) + b_n(b_n(x_n) + x_n) + (1 − k_n) · x_n    (1)
where k_n is the learnable parameter of the n-th module in the visual perception model, b_n is the n-th processing unit, x_n is the original information input to the n-th processing unit, and x_{n+1} is the processed information.
A visual perception device based on linear multi-step residual learning comprises: a data set establishing module for establishing a training set and a validation set; a model establishing module for establishing a visual perception model; and a model training module for training the visual perception model with the training set.
Optionally, the model establishing module includes: a shared feature extraction module for extracting shared features to obtain global scene information, performing depth convolution on the global scene information, and then performing linear multi-step residual learning; a semantic segmentation module for extracting semantic segmentation information from the global scene information, performing depth convolution on it, and then performing linear multi-step residual learning; and a depth estimation module for extracting depth estimation information from the global scene information, performing depth convolution on it, and then performing linear multi-step residual learning.
According to the invention, depth convolution is performed on the acquired image and the visual perception model is established based on linear multi-step residual learning, so that the parameter count of the model is greatly reduced while calculation accuracy is maintained, and the running efficiency of the perception model is improved.
Drawings
Fig. 1 is a block diagram of a visual perception method based on linear multi-step residual error learning according to an embodiment of the present invention;
fig. 2 is a block diagram of a visual perception apparatus based on linear multi-step residual learning according to an embodiment of the present invention.
Detailed Description
For a better understanding of the nature of the invention, its description is further set forth below in connection with the specific embodiments and the drawings.
The invention discloses a visual perception method based on linear multi-step residual learning, which is particularly suitable for visual perception in automatic driving. As shown in Fig. 1, the method specifically comprises the following steps:
Step one: establish a training set and a validation set.
The original Cityscapes training set is randomly rotated and horizontally flipped to form the training set; the original Cityscapes validation set is used as the validation set. Both sets contain the following image categories: RGB images, ground-truth semantic segmentation maps and depth maps. The training set and validation set are saved in the .npy data format as input to the visual perception model.
A total of 2975 images are selected for the training set and 500 images for the validation set. The semantic segmentation and depth estimation labels in the Cityscapes validation set are grouped into 7 categories, as follows:
[1] flat: corresponding to road, sidewalk
[2] construction: corresponding to building, wall, fence
[3] object: corresponding to pole, traffic light, traffic sign
[4] nature: corresponding to vegetation, terrain
[5] sky: corresponding to sky
[6] human: corresponding to person, rider
[7] vehicle: corresponding to car, truck, bus, caravan, trailer, train, motorcycle
Saving the training set and the validation set in the .npy format reduces the required data storage space.
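As an illustration only (not part of the patent text), the data preparation described above can be sketched in PyTorch as follows; the file paths, the rotation range and the flip probability are assumptions, since they are not specified here.

import random
import numpy as np
from PIL import Image
import torchvision.transforms.functional as TF

def augment_sample(rgb, seg, depth, max_deg=10):
    # Apply the same random rotation and horizontal flip to the RGB image,
    # its semantic segmentation map and its depth map.
    angle = random.uniform(-max_deg, max_deg)
    rgb, seg, depth = (TF.rotate(x, angle) for x in (rgb, seg, depth))
    if random.random() < 0.5:
        rgb, seg, depth = (TF.hflip(x) for x in (rgb, seg, depth))
    return rgb, seg, depth

# Augment one (hypothetical) Cityscapes training sample and save it as .npy.
rgb = Image.open("cityscapes/train/rgb_0001.png")
seg = Image.open("cityscapes/train/seg_0001.png")
depth = Image.open("cityscapes/train/depth_0001.png")
rgb, seg, depth = augment_sample(rgb, seg, depth)
np.save("train/rgb_0001.npy", np.asarray(rgb))
np.save("train/seg_0001.npy", np.asarray(seg))
np.save("train/depth_0001.npy", np.asarray(depth))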
Step two: perform depth convolution on the acquired image and establish the visual perception model based on linear multi-step residual learning.
S1. An image of height H and width W (denoted H × W) is acquired in real time, and global scene information of the image is extracted at different scales.
S2. Task features are extracted from the global scene information to obtain semantic segmentation information or depth estimation information.
S3. Depthwise separable convolution is applied to the global scene information and to the semantic segmentation information or the depth estimation information. The depthwise separable convolution consists of two parts: a depthwise convolution and a pointwise convolution.
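A minimal PyTorch sketch of the depthwise separable convolution used in step S3 is given below; the channel counts and kernel size are illustrative assumptions.

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise convolution: one filter per input channel (groups=in_ch),
        # so spatial regions are processed per channel.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise (1x1) convolution: mixes information across channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 128, 256)           # N x C x H x W feature map
y = DepthwiseSeparableConv(64, 128)(x)     # -> shape (1, 128, 128, 256)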
S4. Linear multi-step residual learning is performed on the global scene information and the semantic segmentation information or the depth estimation information to obtain the visual perception model:
x_{n+1} = k_n · (b_n(x_n) + x_n) + b_n(b_n(x_n) + x_n) + (1 − k_n) · x_n    (1)
where k_n is the learnable parameter of the n-th module in the visual perception model, b_n is the n-th processing unit, x_n is the original information input to the n-th processing unit, and x_{n+1} is the processed information.
By using depthwise separable convolution, the spatial regions and channels of the image are treated separately, which reduces the number of model parameters. Meanwhile, the linear multi-step scheme can mine deep-level associated information and extract task-specific features, so the accuracy and efficiency of the model are both improved.
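A minimal sketch of the linear multi-step residual update of formula (1) is given below. The internal structure chosen for the processing unit b_n (a depthwise separable convolution with batch normalization and ReLU) and the initialization of k_n are assumptions for illustration; only the update formula itself is taken from the description above.

import torch
import torch.nn as nn

class LinearMultiStepBlock(nn.Module):
    # Implements x_{n+1} = k_n*(b_n(x_n)+x_n) + b_n(b_n(x_n)+x_n) + (1-k_n)*x_n
    def __init__(self, channels):
        super().__init__()
        self.b = nn.Sequential(                       # processing unit b_n
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.k = nn.Parameter(torch.zeros(1))         # learnable parameter k_n

    def forward(self, x):
        y = self.b(x) + x                             # b_n(x_n) + x_n
        return self.k * y + self.b(y) + (1 - self.k) * x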
Step three: train the visual perception model.
The learnable parameters k_n of the visual perception model are configured using the PyTorch deep learning framework, and the training parameters of the visual perception model are configured at the same time: the optimization function is set to the Adam algorithm, the base learning rate is set to 5e-3, batch_size is set to 2, and the total number of iterations is set to 200. The data in the training set are then imported into the visual perception model for iterative training.
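The training configuration above can be reproduced with a few lines of PyTorch. The tiny stand-in network, heads and random tensors below are illustrative assumptions that merely make the sketch self-contained; only the optimizer, learning rate, batch size and iteration count come from the text.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the real multi-task model and the npy-format Cityscapes data.
backbone = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
seg_head = nn.Conv2d(8, 7, 1)      # 7 semantic categories (see step one)
depth_head = nn.Conv2d(8, 1, 1)

params = list(backbone.parameters()) + list(seg_head.parameters()) + list(depth_head.parameters())
optimizer = torch.optim.Adam(params, lr=5e-3)          # Adam, base learning rate 5e-3

rgb = torch.randn(8, 3, 64, 128)                       # random stand-in images
seg = torch.randint(0, 7, (8, 64, 128))
depth = torch.rand(8, 1, 64, 128)
loader = DataLoader(TensorDataset(rgb, seg, depth), batch_size=2, shuffle=True)

seg_loss, depth_loss = nn.CrossEntropyLoss(), nn.L1Loss()
for it in range(200):                                  # total number of iterations: 200
    for x, seg_gt, depth_gt in loader:
        feats = backbone(x)
        loss = seg_loss(seg_head(feats), seg_gt) + depth_loss(depth_head(feats), depth_gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()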
Step four: verify the visual perception model.
The visual perception model is verified on the validation set using the learnable parameters k_n and the base learning rate saved in step three.
The semantic segmentation map and depth map output by the proposed model are compared with those output by other models using the semantic segmentation evaluation indicators Pixel Accuracy (PA) and mean Intersection over Union (mIoU), and the depth estimation indicators Absolute Error and Relative Error.
1. Pixel accuracy (PA):
PA = Σ_{i=0}^{k} P_ii / Σ_{i=0}^{k} Σ_{j=0}^{k} P_ij
where k is the number of target classes, P_ii is the total number of pixels that belong to class i and are predicted as class i, and P_ij is the total number of pixels that belong to class i and are predicted as class j.
2. Mean intersection over union (mIoU):
mIoU = (1 / (k + 1)) · Σ_{i=0}^{k} P_ii / (Σ_{j=0}^{k} P_ij + Σ_{j=0}^{k} P_ji − P_ii)
where k + 1 is the k target classes plus one background class, and P_ji is the total number of pixels that belong to class j and are predicted as class i.
3. Absolute error (Abs Err):
Abs Err = (1 / (m · n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} |Y(i, j) − Ŷ(i, j)|
where Y(i, j) is the true depth value, Ŷ(i, j) is the predicted depth value, m is the height of the image and n is the width of the image.
4. Relative error (Rel Err):
Rel Err = (1 / (m · n)) · Σ_{i=1}^{m} Σ_{j=1}^{n} |Y(i, j) − Ŷ(i, j)| / Y(i, j)
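A minimal NumPy sketch of these four indicators is given below; the brute-force confusion-matrix construction and the handling of classes absent from an image (ignored via nanmean) are simplifying assumptions.

import numpy as np

def seg_metrics(pred, gt, num_classes):
    # Pixel accuracy (PA) and mean intersection over union (mIoU).
    conf = np.zeros((num_classes, num_classes), dtype=np.float64)
    for i in range(num_classes):
        for j in range(num_classes):
            conf[i, j] = np.sum((gt == i) & (pred == j))   # P_ij
    pa = np.diag(conf).sum() / conf.sum()
    iou = np.diag(conf) / (conf.sum(1) + conf.sum(0) - np.diag(conf))
    return pa, np.nanmean(iou)

def depth_errors(pred, gt):
    # Absolute and relative depth errors averaged over all m x n pixels.
    abs_err = np.mean(np.abs(gt - pred))
    rel_err = np.mean(np.abs(gt - pred) / gt)
    return abs_err, rel_err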
the invention is compared and verified with MTAN model and Dense model. The MTAN model is mentioned in Shikun Liu, Edward Jons et al. The evaluation parameter calculation is respectively carried out on the three models by utilizing 7 types of semantic segmentation and depth estimation results in the CityScaps verification set, and the calculation results are shown in Table 1
TABLE 1
(Table 1 is provided as an image in the original publication and is not reproduced here.)
As can be seen from Table 1, the invention greatly reduces the model parameter count without degrading the mean intersection over union, pixel accuracy, relative error or absolute error, thereby improving the efficiency of the visual perception model and making the model lightweight.
The invention also provides a visual perception device based on linear multi-step residual learning, which comprises a data set establishing module for establishing a training set and a validation set, a model establishing module for establishing a visual perception model, and a model training module for training the visual perception model with the training set.
The model establishing module comprises: a shared feature extraction module for extracting shared features to obtain global scene information and performing linear multi-step residual learning after depth convolution of the global scene information; a semantic segmentation module for extracting semantic segmentation information from the global scene information, performing depth convolution on it, and then performing linear multi-step residual learning; and a depth estimation module for extracting depth estimation information from the global scene information, performing depth convolution on it, and then performing linear multi-step residual learning.
1. Shared feature module
The shared feature module is used for extracting global scene information and transmitting it to the depth estimation module or the semantic segmentation module.
The shared feature module mainly adopts an encoder-decoder architecture. The encoding layer has five sequentially connected modules: a first, second, third, fourth and fifth feature encoding block. Each encoding block is followed by a max pooling layer that down-samples the image by a factor of 2. The decoding layer has five modules for decoding information, connected in sequence: a first, second, third, fourth and fifth feature decoding block. A max unpooling layer is placed before each decoding block to up-sample the image by a factor of 2.
The first feature encoding block of the shared feature module uses two basic convolution blocks, each consisting mainly of a standard 3×3 convolution, batch normalization and the ReLU activation function. The other feature encoding blocks perform linear multi-step residual learning on the global scene information.
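One encoder stage of the shared feature module can be sketched as below: two basic convolution blocks (standard 3×3 convolution, batch normalization, ReLU) followed by 2× max pooling whose indices can be reused by the max unpooling layers of the decoder. The channel counts are illustrative assumptions.

import torch.nn as nn

def basic_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

first_encoding_block = nn.Sequential(basic_block(3, 64), basic_block(64, 64))
pool = nn.MaxPool2d(kernel_size=2, return_indices=True)   # indices feed nn.MaxUnpool2d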
2. Depth estimation module
The depth estimation module is used for extracting depth estimation information from the global scene information, performing depth convolution on it and then performing linear multi-step residual learning.
The depth estimation module also adopts an encoder-decoder architecture. The encoding layer is mainly used to encode the global scene information from the shared feature module and to extract depth estimation information from it. The encoding layer comprises five modules: a first, second, third, fourth and fifth task feature encoding block. Each encoding block is followed by a max pooling layer that down-samples the image by a factor of 2. The input of the first task feature encoding block comes from the original image; the input of the other four blocks comes from two parts, the down-sampled output of the previous task block and the down-sampled output of the corresponding feature encoding block of the shared feature module. The decoding layer also contains five modules and is mainly used to decode the shared features as well as the features oriented toward estimating image depth. The decoding layer comprises a first, second, third, fourth and fifth task feature decoding block. A max unpooling layer is placed immediately before each decoding block to up-sample the image by a factor of 2. The input of each task feature decoding block comes from two parts, the up-sampled output of the previous task feature decoding block and the up-sampled output of the corresponding decoding block of the shared module.
The first task feature encoding block of the depth estimation module performs depth convolution on the depth estimation information using two standard 3×3 convolutions. The other task feature encoding blocks perform linear multi-step residual learning on the depth estimation information.
The input of the depth estimation module mainly comes from two parts: the information in the corresponding shared feature module and the information from the previous layer of the module. The two parts are concatenated along their channels (the cat operation), a lightweight convolution is applied after the concatenation, and the geometric information of object surfaces that benefits depth estimation is extracted. The final output of the depth estimation module is fed into a task prediction module that estimates the depth map using two layers of standard 3×3 convolution, allowing accurate predictions to be made for each pixel in the image.
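The fusion and prediction described above can be sketched as follows: the shared feature and the previous-layer task feature are concatenated along the channel dimension (torch.cat), reduced by a lightweight 1×1 convolution, and the task prediction module applies two standard 3×3 convolutions to produce a per-pixel depth map. All channel sizes are assumptions.

import torch
import torch.nn as nn

shared_feat = torch.randn(1, 64, 64, 128)   # from the corresponding shared-module block
task_feat = torch.randn(1, 64, 64, 128)     # from the previous layer of the depth branch

fuse = nn.Conv2d(128, 64, kernel_size=1)    # lightweight convolution after concatenation
predict_depth = nn.Sequential(              # task prediction module
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(64, 1, 3, padding=1),
)

fused = fuse(torch.cat([shared_feat, task_feat], dim=1))
depth_map = predict_depth(fused)            # shape (1, 1, 64, 128): per-pixel depth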
3. Semantic segmentation module
The semantic segmentation module is used for extracting semantic segmentation information from the global scene information, performing depth convolution on it and then performing linear multi-step residual learning.
The semantic segmentation module adopts the same encoder-decoder architecture as the depth estimation module. The first task feature encoding block of the semantic segmentation module performs depth convolution on the semantic segmentation information using two standard 3×3 convolutions. The other task feature encoding blocks perform linear multi-step residual learning on the semantic segmentation information.
The input of the semantic segmentation module mainly comes from two parts: the information in the corresponding shared feature module and the information from the previous layer of the module. The two parts are concatenated along their channels (the cat operation), a lightweight convolution is applied after the concatenation, and the object semantic information that benefits semantic segmentation is extracted. The final output of the semantic segmentation module is fed into a task prediction module that estimates the segmentation map using two layers of standard 3×3 convolution, allowing accurate predictions to be made for each pixel in the image.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The present invention is not limited to the above embodiments; any modifications, equivalent replacements or improvements made within the spirit and principles of the present invention fall within the scope of the claims of the present application.

Claims (7)

1. A visual perception method based on linear multi-step residual learning is characterized by comprising the following steps:
acquiring a real-time image;
performing visual perception on the image by adopting a pre-established visual perception model to acquire semantic information or distance parameters in the image;
the visual perception model is established by performing depth convolution on the acquired image and based on linear multi-step residual learning;
the visual perception model takes data in a training set as a training sample, inputs original shared features and task features of an image, and outputs a semantic segmentation map or a depth map.
2. The visual perception method based on linear multi-step residual learning according to claim 1, characterized in that: the training set is a data set obtained from the original Cityscapes training set after random rotation and horizontal flipping.
3. The visual perception method based on linear multi-step residual learning according to claim 1, characterized in that: performing depth convolution on the acquired image and establishing the visual perception model based on linear multi-step residual learning comprise the following steps:
carrying out shared feature extraction on the acquired real-time image to obtain global scene information;
extracting task features of the global scene information to obtain semantic segmentation information or depth estimation information;
and performing depth convolution on the global scene information and the semantic segmentation information or the depth estimation information, and then performing linear multi-step residual learning to obtain a visual perception model.
4. The visual perception method based on linear multi-step residual learning according to claim 3, characterized in that: the depth convolution is a depthwise separable convolution comprising a depthwise convolution and a pointwise (point-by-point) convolution.
5. The visual perception method based on linear multi-step residual learning according to claim 1, characterized in that: the visual perception model is calculated by formula (1):
x_{n+1} = k_n · (b_n(x_n) + x_n) + b_n(b_n(x_n) + x_n) + (1 − k_n) · x_n    (1)
where k_n is the learnable parameter of the n-th module in the visual perception model, b_n is the n-th processing unit, x_n is the original information input to the n-th processing unit, and x_{n+1} is the processed information.
6. A visual perception device based on linear multi-step residual learning, characterized by comprising:
The data set establishing module is used for establishing a training set and a validation set;
the model establishing module is used for establishing a visual perception model;
and the model training module is used for training the visual perception model by a training set.
7. The visual perception device based on linear multi-step residual learning according to claim 6, wherein: the model building module comprises:
the shared feature extraction module is used for extracting shared features to obtain global scene information, performing deep convolution on the global scene information and then performing linear multi-step residual error learning;
the semantic segmentation module is used for extracting semantic segmentation information from global scene information, performing deep convolution on the semantic segmentation information and then performing linear multi-step residual error learning;
and the depth estimation module is used for extracting depth estimation information from the global scene information, performing depth convolution on the depth estimation information and then performing linear multi-step residual error learning.
CN202010473221.6A 2020-05-29 2020-05-29 Visual perception method and visual perception device based on linear multi-step residual learning Active CN111738267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010473221.6A CN111738267B (en) 2020-05-29 2020-05-29 Visual perception method and visual perception device based on linear multi-step residual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010473221.6A CN111738267B (en) 2020-05-29 2020-05-29 Visual perception method and visual perception device based on linear multi-step residual learning

Publications (2)

Publication Number Publication Date
CN111738267A true CN111738267A (en) 2020-10-02
CN111738267B CN111738267B (en) 2023-04-18

Family

ID=72647974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010473221.6A Active CN111738267B (en) 2020-05-29 2020-05-29 Visual perception method and visual perception device based on linear multi-step residual learning

Country Status (1)

Country Link
CN (1) CN111738267B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180075581A1 (en) * 2016-09-15 2018-03-15 Twitter, Inc. Super resolution using a generative adversarial network
CN108280814A (en) * 2018-02-08 2018-07-13 重庆邮电大学 Light field image angle super-resolution rate method for reconstructing based on perception loss
CN108764112A (en) * 2018-05-23 2018-11-06 上海理工大学 A kind of Remote Sensing Target object detecting method and equipment
CN108876737A (en) * 2018-06-06 2018-11-23 武汉大学 A kind of image de-noising method of joint residual error study and structural similarity
US20190147335A1 (en) * 2017-11-15 2019-05-16 Uber Technologies, Inc. Continuous Convolution and Fusion in Neural Networks


Also Published As

Publication number Publication date
CN111738267B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111860227B (en) Method, apparatus and computer storage medium for training trajectory planning model
CN110879994A (en) Three-dimensional visual inspection detection method, system and device based on shape attention mechanism
SE541962C2 (en) Method and apparatus for detecting vehicle contour based on point cloud data
CN111209780A (en) Lane line attribute detection method and device, electronic device and readable storage medium
CN112183482A (en) Dangerous driving behavior recognition method, device and system and readable storage medium
CN108288047A (en) A kind of pedestrian/vehicle checking method
CN111860072A (en) Parking control method and device, computer equipment and computer readable storage medium
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN111091023A (en) Vehicle detection method and device and electronic equipment
CN112613434A (en) Road target detection method, device and storage medium
US20230326055A1 (en) System and method for self-supervised monocular ground-plane extraction
CN114037640A (en) Image generation method and device
CN115147598A (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN115249321A (en) Method for training neural network, system for training neural network and neural network
Ouyang et al. PV-EncoNet: Fast object detection based on colored point cloud
CN115861601A (en) Multi-sensor fusion sensing method and device
CN114048536A (en) Road structure prediction and target detection method based on multitask neural network
CN114067142A (en) Method for realizing scene structure prediction, target detection and lane level positioning
CN112654998B (en) Lane line detection method and device
CN111738267B (en) Visual perception method and visual perception device based on linear multi-step residual learning
CN117115690A (en) Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement
CN109657556B (en) Method and system for classifying road and surrounding ground objects thereof
CN116824317A (en) Water infrared target detection method based on multi-scale feature self-adaptive fusion
CN111144361A (en) Road lane detection method based on binaryzation CGAN network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant