CN113793341B - Automatic driving scene semantic segmentation method, electronic equipment and readable medium - Google Patents

Automatic driving scene semantic segmentation method, electronic equipment and readable medium

Info

Publication number
CN113793341B
CN113793341B CN202111086495.0A
Authority
CN
China
Prior art keywords
network
knowledge
teacher
student
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111086495.0A
Other languages
Chinese (zh)
Other versions
CN113793341A (en)
Inventor
周彦
袁指天
王冬丽
李云燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202111086495.0A priority Critical patent/CN113793341B/en
Publication of CN113793341A publication Critical patent/CN113793341A/en
Application granted granted Critical
Publication of CN113793341B publication Critical patent/CN113793341B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an automatic driving scene semantic segmentation method, electronic equipment and a readable medium. During knowledge distillation, the high-confidence knowledge of the teacher network is refined and the knowledge of the teacher network's intermediate layer is extracted through four branches: two branches use groups of two-dimensional convolutions to extract horizontal and vertical spatial structure knowledge respectively; the third branch uses two layers of dilated convolution to acquire long-range spatial dependency knowledge; the fourth branch uses a multi-layer perceptron to extract the channel distribution knowledge learned by the teacher network. The refined knowledge is taken as the real sample and the intermediate feature map of the student network as the fake sample for adversarial training. The teacher network thereby transfers the refined spatial and channel knowledge efficiently to the student network, improving the performance of the student model finally obtained after knowledge distillation. The refinement module is optimized only during training and takes no part in inference, so the parameter count of the student model is not increased.

Description

Automatic driving scene semantic segmentation method, electronic equipment and readable medium
Technical Field
The invention belongs to the field of deep learning and machine vision, and particularly relates to an automatic driving scene semantic segmentation method, electronic equipment and a readable medium.
Background
Applications such as object detection and semantic segmentation are currently developing at a remarkable pace with the support of deep neural networks (DNNs), and introducing more parameters generally improves model accuracy. Semantic segmentation is an important task in computer vision. At present, most advanced semantic segmentation methods require large amounts of computing resources to achieve accurate segmentation. Although the performance of current DNNs has improved significantly, efficiency remains essential for semantic segmentation: the huge memory cost and computation of these deep networks make it difficult to deploy trained networks directly in real-time applications such as embedded systems and autonomous vehicles. Model compression techniques have emerged to address these issues, including lightweight network design, pruning, quantization, and knowledge distillation. Among these, knowledge distillation has proven to be an effective way to obtain a lightweight model; it simplifies the training of deep neural networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output, and the resulting lightweight models maintain a high level of accuracy. Knowledge distillation is an attractive approach to network compression, inspired by the transfer of knowledge from a teacher to a student: a compact student model is trained to approximate an over-parameterized teacher model. The student model can thus achieve significant performance improvements, occasionally even exceeding the teacher, and replacing the over-parameterized teacher with a compact student achieves compression of the large model. For example, the document "FITNETS: HINTS FOR THIN DEEP NETS" aligns the feature maps of the student and teacher networks. The approach proposed in "PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER" forces the student model to mimic the attention maps of a powerful teacher model. The document "Structured Knowledge Distillation for Dense Prediction" proposes point-wise and local pair-wise distillation strategies, with various loss functions to optimize the student network. The document "Intra-class Feature Variation Distillation for Semantic Segmentation" considers that the teacher model typically learns more robust intra-class feature representations than the student model, and therefore has the student mimic the intra-class feature variation of the teacher's feature distribution, allowing the student to better imitate the teacher and improving segmentation accuracy.
Defects of the prior art: many knowledge distillation methods are limited because the student network is forced only to imitate the output probability distribution of the teacher network in order to transfer the knowledge embedded in the soft targets. Because of the performance gap between teacher and student networks, it is difficult for a lightweight, shallow student network to learn the high-level distilled knowledge output by an over-parameterized, redundant teacher network. Typically, the feature maps of the teacher network contain redundant and noisy information that is detrimental to supervising the student. The document "FITNETS: HINTS FOR THIN DEEP NETS" learns intermediate representations by directly aligning feature maps, but ignores the large difference in capacity between the teacher model and the compact student model. The Attention Transfer (AT) method of "PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER" mimics attention maps between the student and teacher models, so that the sum of feature maps across the channel dimension represents the attention distribution for an image classification task; this may not suit the pixel-level segmentation task, because different channels represent different types of activation. Furthermore, the feature distillation method of "Structured Knowledge Distillation for Dense Prediction" includes point-wise distillation and local pair-wise distillation; however, tightly fitting point classification scores or feature activations between a teacher network and a compact student network may impose overly tight constraints and lead to suboptimal solutions, and the local pair-wise distillation has a fixed receptive field, so the teacher network cannot transfer the ability to capture long-range context to the student network. The method of "Intra-class Feature Variation Distillation for Semantic Segmentation" transfers knowledge of intra-class feature variation, which considers only high-dimensional features and ignores the learning of intermediate-layer features. In general, current knowledge distillation methods for semantic segmentation have two problems. The first is that, when the student network imitates the teacher network, it simply imitates the teacher's feature or score maps; without focusing on the most significant regions, i.e., the information most useful for the semantic segmentation task, the student also learns the teacher's over-parameterized, redundant and useless information, resulting in inefficient knowledge distillation. The second is that the outputs of the teacher and student networks differ because of the difference in model capacity: if the student network blindly imitates the teacher's score map or final-layer feature map, the teacher network provides no supervised learning for the student's intermediate-layer features, so the knowledge the student can learn is limited and its training reaches only limited accuracy.
Disclosure of Invention
Aiming at the technical problem that, in existing knowledge distillation methods for semantic segmentation, the teacher network lacks effective supervision over the generation of the student network's intermediate feature maps, the invention provides an automatic driving scene semantic segmentation method, electronic equipment and a readable medium that enable the teacher network to supervise the generation of the student network's intermediate features.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
an automatic driving scene semantic segmentation method comprises the following steps:
step 1, establishing a generative adversarial network structure, which comprises a teacher network, a student network, a classifier, a refinement module and a self-attention discriminator;
step 2, taking a feature map generated by the teacher network from an original image of the training set as the input of the refinement module, inputting the feature map output by the refinement module into the classifier to obtain a semantic segmentation prediction map, and training the refinement module according to the semantic segmentation prediction map and the label map of the training set;
step 3, taking the feature map output by the refinement module as the real sample and the feature map generated by the student network as the fake sample, and inputting the real and fake samples together into the self-attention discriminator for adversarial training, wherein the student network is optimized in the adversarial training and the teacher network is not optimized;
step 4, inputting a scene picture dataset collected from automatic driving scenes into the student network of the generative adversarial network structure, and training the student network until training is finished;
and step 5, inputting the real automatic driving scene picture to be semantically segmented into the trained student model to obtain the segmentation result.
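For illustration, the five steps above can be organized as the training-loop skeleton below (a minimal PyTorch-style sketch; the names teacher, student, refine_module, classifier and discriminator, and the helper penultimate_features, are assumptions standing in for the networks of step 1, not part of the claimed method):

```python
# Hypothetical training skeleton for steps 1-5; all identifiers are illustrative assumptions.
import torch

def distillation_training(teacher, student, refine_module, classifier,
                          discriminator, data_loader, num_epochs):
    teacher.eval()                                   # the teacher is pre-trained and frozen
    for epoch in range(num_epochs):
        for image, label in data_loader:             # step 4: automatic driving scene pictures
            with torch.no_grad():
                feat_t = teacher.penultimate_features(image)   # 512-channel teacher features
            feat_s = student.penultimate_features(image)       # 128-channel student features

            refined = refine_module(feat_t)          # step 2: refine the teacher knowledge
            prediction = classifier(refined)         # per-pixel semantic segmentation scores
            # step 2: cross-entropy against the label map trains the refinement module
            # step 3: adversarial training with the discriminator, refined = real, feat_s = fake
            # (the individual loss terms are sketched after the corresponding formulas below)
    return student                                   # step 5: the trained student is deployed
```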
In the method, in the step 2, the teacher network is a neural network which is already trained in advance.
In the method, in the step 2, the label map in the training set is obtained by manually annotating the original image.
In the method, in the step 1, the refinement module first applies a convolution with a 3×3 kernel to the input feature map to obtain a 256-channel feature map; the result then passes through four branches, three of which extract the spatial structure knowledge of the teacher network: the first two branches each have two layers of two-dimensional convolution, with first-layer kernels of 1×7 and 7×1 and second-layer kernels of 7×1 and 1×7, extracting horizontal and vertical spatial structure knowledge respectively; the third branch uses two dilated convolutions with 3×3 kernels and dilation rate 4 to obtain long-range spatial dependency knowledge; the fourth branch uses a multi-layer perceptron to extract the channel distribution knowledge learned by the teacher network. The feature maps from the first three branches are then concatenated and reduced by a 1×1 convolution, multiplied element-by-element with the channel distribution knowledge from the fourth branch, and mapped to (0, 1) by a Sigmoid function, giving a 128-channel feature map; this mapped feature map is multiplied element-by-element with the 128-channel feature map obtained by applying a 1×1 convolution dimension reduction to the teacher network's penultimate-layer feature map, and the result is added element-by-element to that feature map to obtain the processing result of the refinement module.
In the method, in the step 2, the input of the refinement module is a feature map with 512 channels generated by the penultimate layer of the teacher network; in the step 3, the feature map generated by the student network is a feature map with 128 channels generated by the penultimate layer of the student network.
In the method, in the step 2, training the refinement module includes:
training the neural network model with an adaptive learning rate method, and taking the cross-entropy loss as the loss function, wherein the adaptive learning rate current_rate is expressed as:
current_rate = base_rate × (1 − current_step / max_step)^power
where current_rate is the current learning rate, base_rate is the initial learning rate, current_step is the current iteration step, max_step is the maximum number of iteration steps, the constant power is 0.9, and the initial learning rate is set to 0.01;
the loss function Loss is expressed as:
Loss = −(1/n) Σ_w [ y·ln(y′) + (1 − y)·ln(1 − y′) ]
where y = y_truth and y′ = y_pred, y_truth denoting the label map and y_pred the prediction map, w indexing the samples, and n being the total number of samples.
In the method, in the step 3, the adversarial training is as follows:
adversarial training is performed following the Wasserstein generative adversarial network approach, which defines the basic distance and difference between the distribution of the refined feature map of the teacher's intermediate layer and that of the student's intermediate layer; the Wasserstein generative adversarial loss function W_Loss is expressed as:
W_Loss = E_{F_S∼P_S(F_S)}[ D(F_S | F_I) ] − E_{F_T∼P_T(F_T)}[ D(F_T | F_I) ]
where F_I is the input to the teacher and student networks, the feature map F_S predicted by the student network is regarded as the fake sample, the feature map F_T obtained by refining the teacher network's feature map through the refinement module is regarded as the real sample, P_S(F_S) and P_T(F_T) are the feature distributions of the fake and real samples respectively, E_{F_S∼P_S(F_S)}[·] and E_{F_T∼P_T(F_T)}[·] are the expectations obtained by feeding F_S and F_T to the discriminator, and D is the embedded discriminator network.
An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
A computer readable medium having stored thereon a computer program which when executed by a processor implements the method.
The invention has the technical effects that:
(1) The invention adds teacher-network supervision of the student network's intermediate features during knowledge distillation; adding supervision at the student network's intermediate layer benefits the training of the student network, so the student network achieves a better result.
(2) The invention adds a knowledge refinement module based on DNNs, which refines the relevant useful knowledge of the teacher and converts it into knowledge that the student network can understand and learn more easily, removes the redundant, noisy, over-parameterized information contained in the teacher network's high-dimensional feature maps that is unhelpful for supervising the student, and then transfers the refined, easier-to-learn knowledge from the teacher network to the student network.
(3) The intermediate feature distillation of the present invention aligns each channel of the student feature map with the corresponding channel of the teacher network's feature map through adversarial learning. It completes the intermediate-layer feature knowledge of the teacher network, and by performing adversarial training with a generative adversarial network structure, the teacher network can supervise the student network's learning at the intermediate layer, so that the intermediate feature map generated by the student network approaches the intermediate features generated by the teacher network.
(4) The invention uses two-dimensional convolutions to extract horizontal and vertical spatial structure knowledge respectively, two layers of dilated convolution to obtain long-range spatial dependency knowledge, and a multi-layer perceptron to extract the channel distribution knowledge learned by the teacher, and then uses a residual learning scheme to transfer the refined knowledge, channel-wise and spatially, to the intermediate layer of the student network. During training, the teacher network is allowed to supervise multiple layers of the student network, so that the student learns not only feature information but also how to extract useful knowledge during knowledge distillation.
Drawings
Fig. 1 is a detailed module structure diagram of the present invention.
Fig. 2 is a general flow chart of the present invention.
Fig. 3 (a) is an original drawing of the picture a.
Fig. 3 (b) is a label diagram of the picture a.
Fig. 3 (c) is a segmentation diagram of picture a based on the current most advanced method IFVD.
Fig. 3 (d) is a segmentation diagram of the student network under the knowledge distillation method of the present invention of picture a.
Fig. 3 (e) is an original drawing of the picture B.
Fig. 3 (f) is a label diagram of the picture B.
Fig. 3 (g) is a segmentation diagram of picture B based on the current most advanced method IFVD.
Fig. 3 (h) is a segmentation diagram of the student network under the knowledge distillation method of the present invention of picture B.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. The technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
As shown in fig. 1, the detailed module structure of the present invention comprises, in order, a 3×3 convolution, a 7×1 convolution, a 1×7 convolution, a 3×3 dilated convolution with dilation rate 4, a 1×1 convolution, a Sigmoid function, a multi-layer perceptron, global average pooling, a classifier, and a discriminator.
As shown in fig. 2, the overall flow of the present invention mainly comprises the following steps: 1) collecting semantic segmentation data; 2) building the teacher and student network models; 3) designing the feature refinement module; 4) setting the network model parameters; 5) obtaining the segmentation maps and the mean intersection-over-union comparison, and analysing the results. To achieve the above purpose, the invention specifically relates to a semantic segmentation method for automatic driving based on adversarial-learning knowledge distillation of the intermediate-layer feature maps of neural networks, with the following specific steps.
S1, collecting semantic segmentation data sets required by training a student network.
S1.1, as shown in fig. 3 (a) and 3 (e), the training input image is a real scene RGB photograph.
S1.2, the dataset label map divides the image into 19 categories according to the road scene, namely road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle. Each category is marked with its corresponding colour, and the remaining, unsegmented regions are marked as black; the RGB values are, in order: (128,64,128), (244,35,232), (70,70,70), (102,102,156), (190,153,153), (153,153,153), (250,170,30), (220,220,0), (107,142,35), (152,251,152), (70,130,180), (220,20,60), (255,0,0), (0,0,142), (0,0,70), (0,60,100), (0,80,100), (0,0,230), (119,11,32). The marked picture is the truth map.
S1.3, the colours in the truth map are then mapped to labels 0 to 18 in category order, producing the final label map containing 19 categories, as shown in fig. 3 (b) and 3 (f).
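For illustration, a sketch of the conversion from the colour truth map of S1.2 to the 0-18 label map of S1.3 is given below (Python/NumPy; the palette follows the RGB values listed above, while the function name and the use of 255 for unlabelled pixels are assumptions):

```python
import numpy as np

# Palette in the class order listed above (road ... bicycle).
PALETTE = [
    (128, 64, 128), (244, 35, 232), (70, 70, 70), (102, 102, 156),
    (190, 153, 153), (153, 153, 153), (250, 170, 30), (220, 220, 0),
    (107, 142, 35), (152, 251, 152), (70, 130, 180), (220, 20, 60),
    (255, 0, 0), (0, 0, 142), (0, 0, 70), (0, 60, 100),
    (0, 80, 100), (0, 0, 230), (119, 11, 32),
]

def truth_to_label(truth_rgb: np.ndarray, ignore_index: int = 255) -> np.ndarray:
    """Map an H x W x 3 colour truth map to an H x W label map with ids 0..18."""
    label = np.full(truth_rgb.shape[:2], ignore_index, dtype=np.uint8)
    for class_id, colour in enumerate(PALETTE):
        mask = np.all(truth_rgb == np.array(colour, dtype=truth_rgb.dtype), axis=-1)
        label[mask] = class_id
    return label
```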
S2, building the teacher and student networks. The teacher network in this embodiment is based on PSPNet with a ResNet101 backbone, and the student network is also based on PSPNet, with a ResNet18 backbone; other neural networks may be used in practice. The PSPNet used in this embodiment aggregates context information from different regions, improving the ability to acquire global information. It can embed scene information that is otherwise difficult to analyse into a convolutional-neural-network-based prediction framework; its pooling module provides a hierarchical global prior containing information at different scales over different sub-regions, and an effective optimization strategy is established on the basis of a deeply supervised loss on ResNet. The embodiment also comprises a self-attention discriminator composed of fully convolutional layers, which puts the feature map refined from the teacher network's intermediate-layer knowledge and the student network's intermediate-layer feature map into adversarial training; two attention modules are inserted between its last three blocks to capture structural information.
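A minimal PyTorch-style sketch of such a self-attention discriminator is given below for illustration; the text only states that it is fully convolutional with two attention modules inserted between the last three blocks, so the SAGAN-style attention block, the channel widths and the strides are assumptions:

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal self-attention block (an assumption; only the presence of two
    attention modules between the last three blocks is stated in the text)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # B x HW x C/8
        k = self.key(x).flatten(2)                          # B x C/8 x HW
        attn = torch.softmax(q @ k, dim=-1)                 # B x HW x HW
        v = self.value(x).flatten(2)                        # B x C x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x

class SelfAttentionDiscriminator(nn.Module):
    """Fully convolutional discriminator; channel widths are illustrative assumptions."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            SelfAttention(128),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            SelfAttention(256),
            nn.Conv2d(256, 1, 3, padding=1),   # patch-level real/fake score map
        )

    def forward(self, feature_map):
        return self.net(feature_map)
```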
S3, adding the designed feature refinement module to the teacher-student knowledge distillation training process. The feature refinement module of this embodiment takes as input the feature map output by the penultimate layer of the teacher network and passes it through four branches. Two of the branches extract horizontal and vertical spatial structure knowledge through two-dimensional convolution layers, and the third branch uses two dilated convolutions to obtain long-range spatial dependency knowledge, so that the spatial structure knowledge of the teacher network is extracted by these three branches. The fourth branch uses a multi-layer perceptron to extract the channel distribution knowledge learned by the teacher. A residual learning scheme is applied over the four branches, so that the important spatial and channel knowledge is extracted before the teacher network's features are reduced in dimension to match the scale of the student network. Because the inputs to the student and teacher networks are the same but their intermediate feature maps and final results differ owing to the performance gap between them, the purpose of extracting this knowledge is to pass it to the student network so as to narrow the performance gap between the student and teacher networks. The 128-channel feature map generated by the penultimate layer of the student network is then taken as the fake sample, the 128-channel feature map obtained by passing the teacher's penultimate-layer feature map through the refinement module as the real sample, and the real and fake samples are input into the self-attention discriminator for adversarial training.
The step S3 specifically includes the following:
feature map F generated by the penultimate layer of the teacher's network T The method is characterized in that 512×H×W, wherein 512 is the number of channels, H is the height of the feature map, W is the width of the feature map, the feature map is used as input of a thinning module, the input is firstly carried out through two branches, and the first branch directly reduces the dimension of the feature map with the number of channels being 512 into the feature map with the same height and width as those of the last layer of the student network, wherein the number of channels is the same as 128 channels through a convolution kernel with the core size of 1×01. The second branch first reduces the 512 channels to 256 channels through a convolution kernel with a kernel size of 3 x 13 and a fill value of 1, with the feature map height width unchanged. This branch will then go through four branches to refine the spatial and channel knowledge, respectively. Three of the branches belong to a space refinement module, and the fourth branch belongs to a channel refinement module. The first branch of the spatial refinement module firstly uses a two-dimensional convolution with a kernel size of 7×21 and a filling value of 3 to extract the horizontal spatial structure knowledge in the W dimension, and simultaneously reduces the channel number from 256 to 128, and the size of the feature map is unchanged. A two-dimensional convolution with a kernel size of 1 x 37 and a fill value of 3 is then used to extract the vertical spatial structure knowledge in the H dimension while reducing the number of channels from 128 to 1. The first branch first uses a two-dimensional convolution with a kernel size of 1 x 47 with a fill value of 3 to extract knowledge of the vertical spatial structure in the H dimension while reducing the number of channels from 256 to 128, with the feature map size unchanged. A two-dimensional convolution with a kernel size of 7 x 51 with a fill value of 3 is then used to extract the horizontal spatial structure knowledge in the W dimension while reducing the number of channels from 128 to 1. The third branch is subjected to two kernel-size 3 x 3 hole convolutions with 4 padding values of 4 to extract the remote dependent knowledge, the number of channels is reduced from 256 to 128 and then to 1. And then cascading the characteristic graphs with the size H multiplied by W and the number of 3 channels to obtain the characteristic graph with the size H multiplied by W and the number of channels is reduced to 1 through convolution with the size of 1 multiplied by 1 by a convolution kernel to obtain the final spatial structuring knowledge. The refinement module converts the 256 XH XW feature map into 128 channels through global average pooling and then through a multi-layer perceptron to compress the channel knowledge to obtain the channel distribution knowledge learned by the teacher. Then 1 XH WThe space structuring knowledge is multiplied by the channel distribution knowledge of 128 multiplied by 1, and then the result is a Sigmoid function, so that the space structuring knowledge and the channel distribution knowledge extracted from the teacher network intermediate feature map with 512 channel numbers can be finally obtained, and the space structuring knowledge and the channel distribution knowledge are mapped between (0 and 1) and have 128 channel numbers. 
The refined knowledge is then transferred, in a residual learning manner, to the 128-channel teacher intermediate feature map that was obtained directly through the 1×1 convolution and by itself lacks the spatial structure knowledge and channel distribution knowledge, finally obtaining the refined feature map F'_T. Each pixel of this result is classified by the classifier to obtain a semantic segmentation prediction map, and the refinement module is trained using this prediction map and the truth map in the dataset. The 128-channel feature map generated by the penultimate layer of the student network is taken as the fake sample, the refined 128-channel feature map from the teacher network's intermediate layer as the real sample, and the real and fake samples are input into the self-attention discriminator for adversarial training; during adversarial training the parameters of the teacher network are fixed and only the refinement module and the student network are optimized.
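A PyTorch-style sketch consistent with the refinement module just described is given below for illustration; the channel numbers (512, 256, 128) and kernel sizes follow the text, while the class name, the padding tuples and the hidden width of the multi-layer perceptron are assumptions:

```python
import torch
import torch.nn as nn

class RefinementModule(nn.Module):
    """Sketch of the knowledge refinement module; channel numbers follow the text,
    padding tuples and the MLP hidden width are illustrative assumptions."""
    def __init__(self, t_channels=512, mid=256, s_channels=128):
        super().__init__()
        # direct 1x1 projection of the teacher feature map to the student width
        self.project = nn.Conv2d(t_channels, s_channels, kernel_size=1)
        # shared 3x3 reduction before the four branches
        self.reduce = nn.Conv2d(t_channels, mid, kernel_size=3, padding=1)
        # spatial branch 1: 7x1 then 1x7
        self.branch1 = nn.Sequential(
            nn.Conv2d(mid, s_channels, (7, 1), padding=(3, 0)),
            nn.Conv2d(s_channels, 1, (1, 7), padding=(0, 3)))
        # spatial branch 2: 1x7 then 7x1
        self.branch2 = nn.Sequential(
            nn.Conv2d(mid, s_channels, (1, 7), padding=(0, 3)),
            nn.Conv2d(s_channels, 1, (7, 1), padding=(3, 0)))
        # spatial branch 3: two 3x3 dilated convolutions with dilation rate 4
        self.branch3 = nn.Sequential(
            nn.Conv2d(mid, s_channels, 3, padding=4, dilation=4),
            nn.Conv2d(s_channels, 1, 3, padding=4, dilation=4))
        # fuse the three single-channel spatial maps into one
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)
        # channel branch: global average pooling + multi-layer perceptron
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(mid, mid // 2), nn.ReLU(inplace=True),
            nn.Linear(mid // 2, s_channels))

    def forward(self, feat_t):                        # feat_t: B x 512 x H x W
        direct = self.project(feat_t)                 # B x 128 x H x W
        x = self.reduce(feat_t)                       # B x 256 x H x W
        spatial = self.fuse(torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x)], dim=1))        # B x 1 x H x W
        channel = self.mlp(self.pool(x).flatten(1)).unsqueeze(-1).unsqueeze(-1)  # B x 128 x 1 x 1
        attention = torch.sigmoid(spatial * channel)  # B x 128 x H x W, values in (0, 1)
        return direct + direct * attention            # residual transfer of refined knowledge
```

The refined output has 128 channels and the same spatial size as the student's penultimate-layer feature map, so it can be fed both to the classifier and to the discriminator.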
S4, setting network model parameters.
The GPUs used in the invention are two NVIDIA 1080Ti cards with 11 GB of video memory each.
In the initial stage of training, the neural network model is far from the optimum, so a relatively large learning rate is generally set, which allows the model to approach the optimum quickly; in the middle and later stages of training the model is already near the optimum and converging, so a smaller learning rate is used, since a large learning rate in these stages easily causes oscillation around the true optimum and prevents convergence. The neural network model is therefore trained with an adaptive learning rate method, whose mathematical expression is:
the current_step is the Current learning rate, the base_rate is the initial learning rate, the current_step is the Current iteration step number, the max_step is the maximum iteration step number, the power constant is 0.9, and the initial learning rate is set to be 0.01;
The loss function used when training the refinement module with the prediction map and the truth map in the dataset is:
Loss = −(1/n) Σ_w [ y·ln(y′) + (1 − y)·ln(1 − y′) ]
where y = y_truth and y′ = y_pred, y_truth denoting the label map and y_pred the prediction map, w indexing the samples, and n being the total number of samples.
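As a sketch of how this cross-entropy loss is applied per pixel when training the refinement module (PyTorch; the use of ignore_index=255 for unlabelled pixels is an assumption):

```python
import torch.nn as nn

# Per-pixel cross-entropy between the classifier's prediction on the refined
# teacher features and the label map; 255 marks unlabelled pixels (an assumption).
criterion = nn.CrossEntropyLoss(ignore_index=255)

def refinement_loss(pred_logits, label_map):
    """pred_logits: B x 19 x H x W scores; label_map: B x H x W integer labels 0..18."""
    return criterion(pred_logits, label_map)
```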
The Wasserstein generative adversarial loss used for adversarial training, which defines the basic distance and difference between the distribution of the feature map refined from the teacher network's intermediate layer and that of the student network's intermediate-layer feature map, is expressed as:
W_Loss = E_{F_S∼P_S(F_S)}[ D(F_S | F_I) ] − E_{F_T∼P_T(F_T)}[ D(F_T | F_I) ]
where F_I is the input to the teacher and student networks, the feature map F_S predicted by the student network is regarded as the fake sample, the feature map F_T obtained by refining the teacher network's feature map through the refinement module is the real sample, P_S(F_S) and P_T(F_T) are the feature distributions of the two feature maps, E_{F_S∼P_S(F_S)}[·] and E_{F_T∼P_T(F_T)}[·] are the expectations obtained by feeding F_S and F_T to the discriminator, and D is the embedded discriminator network.
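A minimal sketch of the two sides of this Wasserstein adversarial loss is given below (PyTorch; the discriminator here scores a feature map directly, and Lipschitz handling such as weight clipping or a gradient penalty is not specified in the text, so it is omitted):

```python
import torch

def discriminator_loss(discriminator, refined_teacher_feat, student_feat):
    """Critic loss: the refined teacher feature map is the real sample,
    the student feature map is the fake sample (minimise E[D(fake)] - E[D(real)])."""
    return (discriminator(student_feat.detach()).mean()
            - discriminator(refined_teacher_feat.detach()).mean())

def student_adversarial_loss(discriminator, student_feat):
    """Generator-side loss pushing the student's intermediate feature map
    towards the refined teacher feature map (maximise E[D(fake)])."""
    return -discriminator(student_feat).mean()
```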
The segmentation results of the trained student network are evaluated with the intersection-over-union (IoU) score, computed for each class as the ratio between the intersection and the union of the truth-map mask and the prediction-map mask. The average IoU over all classes (mIoU) is used to study the effectiveness of distillation; pixel accuracy is the ratio of pixels with the correct semantic label to the total number of pixels:
mIoU = (1/k) Σ_{i=1}^{k} |P_i ∩ G_i| / |P_i ∪ G_i|
where k is the number of categories, P is the prediction of the student network, and G is the ground truth.
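For illustration, the mIoU metric above can be computed as follows (NumPy sketch; skipping classes absent from both prediction and ground truth, and the ignore value 255, are assumptions):

```python
import numpy as np

def mean_iou(pred, target, num_classes=19, ignore_index=255):
    """mIoU = (1/k) * sum_i |P_i ∩ G_i| / |P_i ∪ G_i| over the k classes."""
    valid = target != ignore_index
    ious = []
    for cls in range(num_classes):
        p = (pred == cls) & valid
        g = (target == cls) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent from both maps: skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```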
S5, analysing the obtained segmentation maps and mean intersection-over-union results.
Fig. 3 (a) is the original image of picture A, fig. 3 (b) is the label map of picture A, and fig. 3 (c) is the segmentation map of picture A obtained with the method of the current state-of-the-art document "Intra-class Feature Variation Distillation for Semantic Segmentation" (IFVD); fig. 3 (d) is the segmentation map of picture A produced by the student network trained with the knowledge distillation method of the present invention; fig. 3 (e) is the original image of picture B; fig. 3 (f) is the label map of picture B; fig. 3 (g) is the segmentation map of picture B obtained with the IFVD method; fig. 3 (h) is the segmentation map of picture B produced by the student network trained with the knowledge distillation method of the present invention. In terms of mean intersection-over-union, the method of this embodiment raises the 66.63% mIoU obtained on the Cityscapes dataset by the student network trained with the IFVD method to 68.57%, an improvement of 1.94 percentage points. The segmentation effect maps also show that the student network obtained with the knowledge distillation method of this embodiment segments more accurately, indicating that the new knowledge distillation method adopted in this embodiment can further improve the distillation effect and yield a student model with higher accuracy.
According to an embodiment of the invention, the invention further provides an electronic device and a computer readable medium.
Wherein the electronic device comprises:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
In specific use, a user can interact, over a network, with a server serving as the electronic device through a terminal device, so as to receive or send messages and perform similar functions. Terminal devices are typically electronic devices provided with a display and used through a human-machine interface, including but not limited to smartphones, tablet computers, notebook computers and desktop computers. Various application software can be installed on the terminal device as required, including but not limited to web browsers, instant messaging software, social platform software, shopping software and the like.
The server is a network server providing various services, for example a background server that processes autonomous driving pictures received from the terminal device: it trains the model, performs semantic segmentation on the received picture data, and returns the final semantic segmentation result to the terminal device.
The semantic segmentation method provided in this embodiment is generally executed by the server; in practical applications, the terminal device may also execute the semantic segmentation directly, provided the requirements are met.
Similarly, the computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a semantic segmentation method of embodiments of the present invention.
The invention adds teacher-network supervision of the student network's intermediate features during knowledge distillation; adding supervision at the student network's intermediate layer benefits the training of the student network, so the student network achieves a better result. The added knowledge refinement module refines the relevant useful knowledge of the teacher and converts it into knowledge the student network can understand and learn more easily, removing the redundant, noisy, over-parameterized information contained in the teacher network's high-dimensional feature maps that is unhelpful for supervising the student; the teacher network supervises the student network's learning at the intermediate layer, and adversarial training with a generative adversarial network structure makes the intermediate feature map generated by the student network gradually approach the intermediate features generated by the teacher network. During training, the teacher network is allowed to supervise multiple layers of the student network, so that the student learns not only feature information but also how to extract useful knowledge during knowledge distillation.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art who, within the technical scope disclosed by the present invention, makes equivalent substitutions or modifications according to the technical scheme of the present invention and its inventive concept shall be covered by the scope of protection of the present invention.

Claims (7)

1. An automatic driving scene semantic segmentation method is characterized by comprising the following steps:
step 1, establishing a generative adversarial network structure, which comprises a teacher network, a student network, a classifier, a refinement module and a self-attention discriminator;
step 2, taking a feature map generated by the teacher network from an original image of the training set as the input of the refinement module, inputting the feature map output by the refinement module into the classifier to obtain a semantic segmentation prediction map, and training the refinement module according to the semantic segmentation prediction map and the label map of the training set;
step 3, taking the feature map output by the refinement module as the real sample and the feature map generated by the student network as the fake sample, and inputting the real and fake samples together into the self-attention discriminator for adversarial training, wherein the student network is optimized in the adversarial training and the teacher network is not optimized;
step 4, inputting a scene picture dataset collected from automatic driving scenes into the student network of the generative adversarial network structure, and training the student network until training is finished;
step 5, inputting the real automatic driving scene picture to be semantically segmented into the trained student model to obtain the segmentation result;
in the step 1, the refinement module performs a convolution operation on the input feature map with a convolution kernel of size 3×3 to obtain a 256-channel feature map; the result then passes through four branches, of which the first two branches each have two layers of two-dimensional convolution, the first-layer convolution kernels being 1×7 and 7×1 and the second-layer convolution kernels being 7×1 and 1×7, so as to extract horizontal and vertical spatial structure knowledge respectively; the third branch uses two dilated convolutions with 3×3 kernels and dilation rate 4 to obtain long-range spatial dependency knowledge; the fourth branch uses a multi-layer perceptron to extract the channel distribution knowledge learned by the teacher network; the feature maps obtained by the first three branches are then concatenated and reduced by a 1×1 convolution, multiplied element-by-element with the channel distribution knowledge obtained by the fourth branch, and mapped to (0, 1) by a Sigmoid function to give a 128-channel feature map; the mapped feature map is multiplied element-by-element with the 128-channel feature map obtained by applying a 1×1 convolution dimension reduction to the penultimate-layer feature map of the teacher network, and the result is added element-by-element to that feature map to obtain the processing result of the refinement module;
in the step 3, the adversarial training is as follows:
adversarial training is performed following the Wasserstein generative adversarial network approach, which defines the basic distance and difference between the distribution of the refined feature map of the teacher's intermediate layer and that of the student's intermediate layer, the Wasserstein generative adversarial loss function W_Loss being expressed as:
W_Loss = E_{F_S∼P_S(F_S)}[ D(F_S | F_I) ] − E_{F_T∼P_T(F_T)}[ D(F_T | F_I) ]
where F_I is the input to the teacher and student networks, the feature map F_S predicted by the student network is regarded as the fake sample, the feature map F_T obtained by refining the teacher network's feature map through the refinement module is regarded as the real sample, P_S(F_S) and P_T(F_T) are the feature distributions of the fake and real samples respectively, E_{F_S∼P_S(F_S)}[·] and E_{F_T∼P_T(F_T)}[·] are the expectations obtained by feeding F_S and F_T to the discriminator, and D is the embedded discriminator network.
2. The method of claim 1, wherein in step 2, the teacher network is a neural network that has been pre-trained.
3. The method of claim 1, wherein in step 2, the label graph in the training set is formed by manually marking an original graph.
4. The method according to claim 1, wherein in the step 2, the input of the refinement module is a feature map having 512 channels generated by a second-last layer of the teacher's network; in the step 3, the feature map generated by the student network is a feature map with 128 channels generated by the penultimate layer of the student network.
5. The method according to claim 1, wherein in step 2, training the refinement module includes:
training the neural network model with an adaptive learning rate method, and taking the cross-entropy loss as the loss function, wherein the adaptive learning rate current_rate is expressed as:
current_rate = base_rate × (1 − current_step / max_step)^power
where current_rate is the current learning rate, base_rate is the initial learning rate, current_step is the current iteration step, max_step is the maximum number of iteration steps, the constant power is 0.9, and the initial learning rate is set to 0.01;
the loss function Loss is expressed as:
Loss = −(1/n) Σ_w [ y·ln(y′) + (1 − y)·ln(1 − y′) ]
where y = y_truth and y′ = y_pred, y_truth denoting the label map and y_pred the prediction map, w indexing the samples, and n being the total number of samples.
6. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
when executed by the one or more processors, causes the one or more processors to implement the method of any of claims 1-5.
7. A computer readable medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.
CN202111086495.0A 2021-09-16 2021-09-16 Automatic driving scene semantic segmentation method, electronic equipment and readable medium Active CN113793341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111086495.0A CN113793341B (en) 2021-09-16 2021-09-16 Automatic driving scene semantic segmentation method, electronic equipment and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111086495.0A CN113793341B (en) 2021-09-16 2021-09-16 Automatic driving scene semantic segmentation method, electronic equipment and readable medium

Publications (2)

Publication Number Publication Date
CN113793341A CN113793341A (en) 2021-12-14
CN113793341B true CN113793341B (en) 2024-02-06

Family

ID=79183560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111086495.0A Active CN113793341B (en) 2021-09-16 2021-09-16 Automatic driving scene semantic segmentation method, electronic equipment and readable medium

Country Status (1)

Country Link
CN (1) CN113793341B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977712A (en) * 2023-06-16 2023-10-31 江苏大学 Knowledge distillation-based road scene segmentation method, system, equipment and medium
CN116861261B (en) * 2023-09-04 2024-01-19 浪潮(北京)电子信息产业有限公司 Training method, deployment method, system, medium and equipment for automatic driving model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN112465111A (en) * 2020-11-17 2021-03-09 大连理工大学 Three-dimensional voxel image segmentation method based on knowledge distillation and countertraining

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3076424A1 (en) * 2019-03-22 2020-09-22 Royal Bank Of Canada System and method for knowledge distillation between neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN112465111A (en) * 2020-11-17 2021-03-09 大连理工大学 Three-dimensional voxel image segmentation method based on knowledge distillation and countertraining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于深度特征蒸馏的人脸识别 (Face recognition based on deep feature distillation); 葛仕明; 赵胜伟; 刘文瑜; 李晨钰; 北京交通大学学报 (Journal of Beijing Jiaotong University) (06); full text *

Also Published As

Publication number Publication date
CN113793341A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
WO2021190451A1 (en) Method and apparatus for training image processing model
CN113793341B (en) Automatic driving scene semantic segmentation method, electronic equipment and readable medium
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN111768438B (en) Image processing method, device, equipment and computer readable storage medium
CN111833360B (en) Image processing method, device, equipment and computer readable storage medium
CN113255678A (en) Road crack automatic identification method based on semantic segmentation
CN115205150A (en) Image deblurring method, device, equipment, medium and computer program product
CN115131627A (en) Construction and training method of lightweight plant disease and insect pest target detection model
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN114333062B (en) Pedestrian re-recognition model training method based on heterogeneous dual networks and feature consistency
CN117033609B (en) Text visual question-answering method, device, computer equipment and storage medium
WO2024078411A1 (en) Dada processing method and apparatus
CN113762331A (en) Relational self-distillation method, apparatus and system, and storage medium
CN115512191A (en) Question and answer combined image natural language description method
CN115273046A (en) Driver behavior identification method for intelligent video analysis
CN116863260A (en) Data processing method and device
Peng et al. Pedestrian motion recognition via Conv‐VLAD integrated spatial‐temporal‐relational network
CN116701706B (en) Data processing method, device, equipment and medium based on artificial intelligence
Gao et al. Design of Specific Scene Recognition System Based on Artificial Intelligence Vision
WO2023029704A1 (en) Data processing method, apparatus and system
CN114332516A (en) Data processing method, data processing device, model training method, model training device, data processing equipment, storage medium and product
CN116977261A (en) Image processing method, image processing apparatus, electronic device, storage medium, and program product
CN116958013A (en) Method, device, medium, equipment and product for estimating number of objects in image
CN115115871A (en) Training method, device and equipment of image recognition model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant