CN113793341A

CN113793341A - Automatic driving scene semantic segmentation method, electronic device and readable medium

Info

Publication number: CN113793341A
Application number: CN202111086495.0A
Authority: CN
Inventors: 周彦; 袁指天; 王冬丽; 李云燕
Original assignee: Xiangtan University
Current assignee: Xiangtan University
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-14
Anticipated expiration: 2041-09-16
Also published as: CN113793341B

Abstract

The invention discloses a semantic segmentation method for autonomous driving scenes, an electronic device and a readable medium. In knowledge distillation, the invention refines the high-confidence knowledge of the teacher network by extracting the knowledge of the teacher network's intermediate layer through four branches: two branches use two-dimensional convolutional layers to extract horizontal and vertical spatial structure knowledge, respectively; the third branch uses two layers of dilated (atrous) convolution to obtain long-range spatial dependency knowledge; and the fourth branch uses a multilayer perceptron to extract the channel distribution knowledge learned by the teacher network. The refined knowledge is taken as the true sample and the intermediate feature map of the student network as the false sample for adversarial training. The teacher network thereby transfers the refined spatial and channel knowledge efficiently to the student network, improving the performance of the student network finally obtained after knowledge distillation. The refining module is optimized only during training and takes no part in computation during inference, so it does not increase the parameter count of the student model.

Description

Automatic driving scene semantic segmentation method, electronic device and readable medium

Technical Field

The invention belongs to the field of deep learning and machine vision, and particularly relates to a semantic segmentation method for autonomous driving scenes, an electronic device and a readable medium.

Background

With the support of deep convolutional neural networks (DNNs), applications such as object detection and semantic segmentation are developing at an extraordinary pace. Introducing more parameters generally improves model accuracy. Semantic segmentation is an important task in computer vision, and most state-of-the-art semantic segmentation methods require a large amount of computing resources to segment accurately. Although the performance of current DNNs has improved significantly, efficiency matters greatly for semantic segmentation: the large memory footprint and heavy computation of these deep networks make it difficult to deploy trained networks directly in real-time applications such as embedded systems and autonomous vehicles. Model compression techniques have emerged to address these issues, including lightweight network design, pruning, quantization and knowledge distillation. Among these, knowledge distillation has proven to be an effective way to obtain lightweight models that maintain a high level of accuracy. It simplifies the training of deep neural networks by following a student-teacher paradigm in which the student is penalized according to a softened version of the teacher's output. Knowledge distillation is an attractive network compression method inspired by the transfer of knowledge from teacher to student; in essence, a compact student model is trained to approximate an over-parameterized teacher model. The student model can thus achieve significant performance gains, occasionally even exceeding the teacher. Replacing the over-parameterized teacher model with a compact student model yields a high compression ratio. For example, the document "FitNets: Hints for Thin Deep Nets" aligns the feature maps of the student network with those of the teacher network. The method proposed in the document "Paying More Attention to Attention" forces the student model to mimic the attention maps of a powerful teacher model. The document "Structured Knowledge Distillation for Dense Prediction" proposes point-wise distillation and local pair-wise distillation strategies, using several loss functions to optimize the student network. The document "Intra-class Feature Variation Distillation for Semantic Segmentation" observes that a teacher model generally learns a more robust intra-class feature representation than a student model, and therefore proposes intra-class feature variation distillation, in which the student mimics the feature distribution of the teacher, so that the student better imitates the teacher and segmentation accuracy improves.

The defects of the prior art are as follows. Many knowledge distillation methods are limited in that the student network is forced to mimic only the output probability distribution of the teacher network in order to transfer the knowledge embedded in the soft targets. Because of the performance gap between teacher and student networks, a lightweight, shallow student network has difficulty learning the high-level distilled knowledge output by an over-parameterized, redundant teacher network. The feature maps of the teacher network typically contain redundant and noisy information that is not conducive to supervising the student. The document "FitNets: Hints for Thin Deep Nets" learns the intermediate representation by directly aligning feature maps, but ignores the inherent difference in scale between the huge teacher model and the compact student model. The attention transfer (AT) method of the document "Paying More Attention to Attention" aims to have the student mimic the attention maps of the teacher, where the sum of the feature maps across the channel dimension represents the attention distribution for an image classification task. However, this may not suit pixel-level segmentation, since different channels represent different kinds of activation. Furthermore, the document "Structured Knowledge Distillation for Dense Prediction" describes a feature distillation method comprising point-wise distillation and local pair-wise distillation. However, strictly aligning point-wise classification scores or feature activations between the teacher network and the compact student network may impose overly strict constraints and lead to sub-optimal solutions. Local pair-wise distillation has a fixed receptive field, so the teacher network cannot transfer its ability to capture long-range context to the student network. The method of the document "Intra-class Feature Variation Distillation for Semantic Segmentation" transfers intra-class feature knowledge but considers only high-dimensional features and neglects the learning of intermediate-layer features.

Current knowledge distillation methods for semantic segmentation generally have two problems. The first is that, while mimicking the teacher network, the student simply imitates the teacher's feature maps or score maps. In doing so it also learns the teacher's over-parameterized information instead of focusing on the most meaningful regions, that is, the information most useful for the semantic segmentation task; learning redundant and useless information makes knowledge distillation inefficient. The second is that the outputs of the teacher and student networks differ because of the difference in model performance. If the student network blindly mimics the score map or last-layer feature map of the teacher network, the teacher provides no supervision of the student's intermediate-layer features, so the knowledge the student can learn is limited and the accuracy it can reach in training is limited.

Disclosure of Invention

To address the technical problem that, in existing knowledge distillation methods for semantic segmentation, the teacher network does not effectively supervise the generation of the student network's intermediate feature maps, the invention provides a semantic segmentation method for autonomous driving scenes, an electronic device and a readable medium, which enable the teacher network to supervise the generation of the student network's intermediate features.

In order to achieve the technical purpose, the technical scheme of the invention is as follows:

A semantic segmentation method for autonomous driving scenes comprises the following steps:

step 1, establishing a generative adversarial network structure comprising a teacher network, a student network, a classifier, a refining module and a self-attention discriminator;

step 2, using the feature map generated by the teacher network from an original image of the training set as the input of the refining module, inputting the feature map output by the refining module into the classifier to obtain a semantic segmentation prediction map, and training the refining module according to the prediction map and the label map of the training set;

step 3, taking the feature map output by the refining module as the true sample and the feature map generated by the student network as the false sample, inputting the true and false samples together into the self-attention discriminator for adversarial training, wherein the student network is optimized during the adversarial training and the teacher network is not;

step 4, inputting a dataset of scene images captured in autonomous driving scenes into the student network of the generative adversarial network structure, and training the student network until training is complete;

step 5, inputting a real autonomous driving scene image to be segmented into the trained student model to obtain the segmentation result.

In the method, in the step 2, the teacher network is a neural network which is trained in advance.

In the method, in step 2, the label maps of the training set are produced by manually annotating the original images.

In the method, in step 1, the refining module first applies a convolution with a 3 × 3 kernel to the input feature map to obtain a feature map with 256 channels. Knowledge of the teacher network is then extracted through four branches. Three branches extract spatial structure knowledge: the first two each have two layers of two-dimensional convolution, with first-layer kernel sizes of 1 × 7 and 7 × 1 and second-layer kernel sizes of 7 × 1 and 1 × 7, respectively, to extract horizontal and vertical spatial structure knowledge; the third branch applies two dilated convolutions with 3 × 3 kernels and a dilation rate of 4 to obtain long-range spatial dependency knowledge. The fourth branch uses a multilayer perceptron to extract the channel distribution knowledge learned by the teacher network. The feature maps from the first three branches are then concatenated, reduced in dimension by a 1 × 1 convolution, and multiplied element-wise with the channel distribution knowledge from the fourth branch; the resulting 128-channel feature map is mapped into (0,1) by a Sigmoid function. The mapped feature map is multiplied element-wise with the 128-channel feature map obtained by reducing the penultimate-layer feature map of the teacher network with a 1 × 1 convolution, and the product is added element-wise to the (dimension-reduced) penultimate-layer feature map of the teacher network to give the output of the refining module.

In the method, in step 2, the input of the refining module is the 512-channel feature map generated by the penultimate layer of the teacher network; in step 3, the feature map generated by the student network is the 128-channel feature map generated by the penultimate layer of the student network.
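The kernel, padding and dilation values given for the refining module can be checked against the standard convolution output-size formula. The following is a minimal sketch, assuming stride 1 throughout and the padding values of 3 and 4 stated in the detailed description, with the padding applied along the 7-tap axis only. The helper names `conv_out`, `effective_kernel` and `stacked_receptive_field`, and the 64 × 128 feature-map size, are illustrative and not part of the patent.

```python
def conv_out(size, kernel, padding=0, dilation=1, stride=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


def effective_kernel(kernel, dilation):
    """Span covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def stacked_receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    layers: list of (kernel, dilation) pairs, applied in order.
    """
    field = 1
    for kernel, dilation in layers:
        field += effective_kernel(kernel, dilation) - 1
    return field


H, W = 64, 128  # hypothetical feature-map height and width

# 7-tap axis with padding 3, and 1-tap axis with no padding: size is preserved.
same_w = conv_out(W, 7, padding=3)
same_h = conv_out(H, 1)

# 3 x 3 dilated convolution, dilation 4, padding 4: size is preserved.
same_dilated = conv_out(H, 3, padding=4, dilation=4)

# Two stacked dilated convolutions versus two ordinary 3 x 3 convolutions.
rf_dilated = stacked_receptive_field([(3, 4), (3, 4)])
rf_plain = stacked_receptive_field([(3, 1), (3, 1)])
```

Under these assumptions every branch preserves H × W, and the two stacked dilated convolutions cover a 17 × 17 neighborhood compared with 5 × 5 for two ordinary 3 × 3 convolutions, which is consistent with the third branch's stated role of capturing long-range dependencies.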

In the method, in the step 2, training the refinement module includes:

training the neural network model with an adaptive learning rate and using cross-entropy loss as the loss function, wherein the mathematical expression of the adaptive learning rate current_rate is as follows:

where current_rate is the current learning rate, base_rate is the initial learning rate, current_step is the current iteration step, max_step is the maximum number of iteration steps, the constant power is 0.9, and the initial learning rate is set to 0.01;
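The quantities defined above (an initial learning rate, a current step, a maximum step and a power constant of 0.9) match the widely used polynomial decay schedule, current_rate = base_rate × (1 − current_step / max_step)^power. The sketch below assumes that form; the function name `poly_learning_rate` and the value of `MAX_STEP` are hypothetical, since the text does not state the maximum number of iterations.

```python
def poly_learning_rate(current_step, max_step, base_rate=0.01, power=0.9):
    """Polynomial decay: base_rate * (1 - current_step / max_step) ** power."""
    if max_step <= 0 or not 0 <= current_step <= max_step:
        raise ValueError("require max_step > 0 and 0 <= current_step <= max_step")
    return base_rate * (1.0 - current_step / max_step) ** power


# Hypothetical schedule length; the patent does not state max_step.
MAX_STEP = 40000
schedule = [poly_learning_rate(step, MAX_STEP) for step in (0, 10000, 20000, 40000)]
```

With this form the rate starts at base_rate and falls smoothly to zero at max_step, giving large steps early in training and small steps near convergence.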

the expression of the Loss function Loss is:

where y = y_truth and y' = y_pred, y_truth denotes the label map and y_pred denotes the prediction map;

w denotes the constrained variable and n denotes the total number of samples.
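As a hedged illustration only, a cross-entropy loss over the label map and the prediction map can be written in the standard multi-class form below. This assumes a per-sample cross-entropy averaged over the n samples; the class index c and the number of classes K are introduced here for illustration, and the role of w in the patent's own expression is not reconstructed.

```latex
\mathrm{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{K} y_{i,c} \, \log y'_{i,c}
```

Here y_{i,c} is 1 when sample i belongs to class c in the label map and 0 otherwise, and y'_{i,c} is the predicted probability of class c for that sample.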

In the method, in step 3, the adversarial training is as follows:

adversarial training is performed using the Wasserstein generative adversarial network approach, in which the Wasserstein distance defines the difference between the distribution of the refined feature maps of the teacher's intermediate layer and that of the feature maps of the student's intermediate layer, wherein the mathematical expression of the Wasserstein adversarial loss function W_Loss is as follows:

where F_I is the input to the teacher and student networks; the feature map F_S predicted by the student network is regarded as the false sample; the feature map F_T, obtained after the refining module refines the feature map generated by the teacher network, is regarded as the true sample; P_S(F_S) and P_T(F_T) are the distributions of the false and true samples, respectively; $\mathbb{E}_{F_S \sim P_S(F_S)}$ and $\mathbb{E}_{F_T \sim P_T(F_T)}$ are the expectations of the discriminator output for inputs F_S and F_T, respectively; and D is the embedded discriminator network.
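The symbols listed above correspond to the terms of a standard Wasserstein critic objective. As a sketch under that assumption (the sign convention and the conditioning of D on F_I are assumptions, not confirmed by the text), W_Loss can be written as:

```latex
W\_Loss = \mathbb{E}_{F_S \sim P_S(F_S)}\!\left[ D(F_S \mid F_I) \right]
        - \mathbb{E}_{F_T \sim P_T(F_T)}\!\left[ D(F_T \mid F_I) \right]
```

Under this convention the discriminator is trained to score refined teacher features above student features, while the student network is trained to raise the score of its own features and thereby reduce the gap between the two distributions.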

An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method.

A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method.

The invention has the technical effects that:

(1) In knowledge distillation, the invention adds supervision by the teacher network over the student network's generation of intermediate features. Supervision applied to the intermediate layer of the student network helps its training, so the student network achieves better results.

(2) The invention adds a DNN-based knowledge refining module, which refines the relevant useful knowledge of the teacher and converts it into knowledge that the student network can understand and learn more easily, removing the redundant, noisy, over-parameterized information in the teacher network's high-dimensional feature maps that is unfavorable for supervising the student. The refined knowledge is then transferred from the teacher network to the student network.

(3) The intermediate feature distillation of the invention aligns each channel of the student feature map with the corresponding channel of the teacher network's feature map through adversarial learning. The feature knowledge of the teacher network's intermediate layer is refined, and a generative adversarial network structure is used to let the teacher network supervise the student network's learning at the intermediate layer through adversarial training, so that the intermediate feature map generated by the student network approaches that generated by the teacher network.

(4) Two-dimensional convolutions extract horizontal and vertical spatial structure knowledge, two layers of dilated convolution capture long-range spatial dependency knowledge, and a multilayer perceptron extracts the channel distribution knowledge learned by the teacher; the refined knowledge is transferred, across channel and spatial dimensions, to the intermediate layer of the student network using a residual learning scheme. During training, the teacher supervises the student network's learning at multiple layers, so that the student learns not only feature information but also how to extract useful knowledge in the knowledge distillation process.

Drawings

FIG. 1 is a block diagram of a refinement module of the present invention.

FIG. 2 is a schematic overall flow chart of the present invention.

Fig. 3(a) is the original image of picture A.

Fig. 3(b) is the label map of picture A.

Fig. 3(c) is the segmentation map of picture A produced by the current state-of-the-art method IFVD.

Fig. 3(d) is the segmentation map of picture A produced by the student network under the knowledge distillation method of the present invention.

Fig. 3(e) is the original image of picture B.

Fig. 3(f) is the label map of picture B.

Fig. 3(g) is the segmentation map of picture B produced by the current state-of-the-art method IFVD.

Fig. 3(h) is the segmentation map of picture B produced by the student network under the knowledge distillation method of the present invention.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. Here, the technical features involved in the respective embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Fig. 1 is a structural diagram of the refining module of the present invention. The diagram comprises, in sequence, a 3 × 3 convolution, a 7 × 1 convolution, a 1 × 7 convolution, a 3 × 3 dilated convolution with a dilation rate of 4, a 1 × 1 convolution, a Sigmoid function, a multilayer perceptron, global average pooling, a classifier and a discriminator.

Fig. 2 is a schematic flow chart of the present invention, which mainly comprises the following steps: 1) collecting semantic segmentation data; 2) building the teacher-student network model; 3) designing the feature refining module; 4) setting the network model parameters; 5) obtaining the segmentation maps and the mean intersection over union (mIoU), and analyzing the results. To this end, the invention specifically relates to a semantic segmentation method for autonomous driving based on adversarial-learning knowledge distillation of the intermediate-layer feature maps of neural networks, which comprises the following steps.

S1, collecting the semantic segmentation dataset required for training the student network.

S1.1, as shown in Fig. 3(a) and Fig. 3(e), the input images during training are RGB photographs of real scenes.

S1.2, the dataset label maps divide the images into 19 road-scene categories: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain, sky, person, rider, car, truck, bus, train, motorcycle and bicycle. Each category is marked with a corresponding color, and the remaining unclassified regions are marked black. The RGB values are, in order: (128,64,128), (244,35,232), (70,70,70), (102,102,156), (190,153,153), (153,153,153), (250,170,30), (220,220,0), (107,142,35), (152,251,152), (70,130,180), (220,20,60), (255,0,0), (0,0,142), (0,0,70), (0,60,100), (0,80,100), (0,0,230), (119,11,32). The marked image is the ground-truth map.

S1.3, the colors in the ground-truth map are labeled 0 to 18 in category order, producing the final label map containing 19 categories, as shown in Fig. 3(b) and Fig. 3(f).
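The conversion from a color-coded ground-truth map to a label map with indices 0 to 18 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the names `CLASSES`, `COLOR_TO_INDEX` and `encode_labels` are hypothetical, the image is represented as a nested list of RGB tuples to keep the example dependency-free, and the use of 255 as the index for black or otherwise unlisted pixels is an assumption (the text only says such pixels are marked black).

```python
# Category names and colors in the order given in S1.2 (index = position in list).
CLASSES = [
    ("road", (128, 64, 128)), ("sidewalk", (244, 35, 232)),
    ("building", (70, 70, 70)), ("wall", (102, 102, 156)),
    ("fence", (190, 153, 153)), ("pole", (153, 153, 153)),
    ("traffic light", (250, 170, 30)), ("traffic sign", (220, 220, 0)),
    ("vegetation", (107, 142, 35)), ("terrain", (152, 251, 152)),
    ("sky", (70, 130, 180)), ("person", (220, 20, 60)),
    ("rider", (255, 0, 0)), ("car", (0, 0, 142)),
    ("truck", (0, 0, 70)), ("bus", (0, 60, 100)),
    ("train", (0, 80, 100)), ("motorcycle", (0, 0, 230)),
    ("bicycle", (119, 11, 32)),
]

COLOR_TO_INDEX = {rgb: index for index, (_, rgb) in enumerate(CLASSES)}
IGNORE_INDEX = 255  # assumption: index for black or unlisted colors


def encode_labels(color_map):
    """Convert an H x W nested list of (R, G, B) tuples to class indices."""
    return [
        [COLOR_TO_INDEX.get(tuple(pixel), IGNORE_INDEX) for pixel in row]
        for row in color_map
    ]


# Tiny 2 x 2 example: road, unlabeled (black), bicycle, traffic sign.
example = [[(128, 64, 128), (0, 0, 0)], [(119, 11, 32), (220, 220, 0)]]
labels = encode_labels(example)
```

Keeping unlabeled pixels on a separate index lets the loss and the evaluation metric skip them instead of counting them as one of the 19 categories.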

S2, building the teacher-student network. In this embodiment the teacher network is based on PSPNet with a ResNet101 backbone, and the student network is also based on PSPNet, with a ResNet18 backbone. Other neural networks may also be used in practice. The PSPNet used in this embodiment aggregates context information from different regions, improving its ability to capture global information. PSPNet can embed scene-level features that are difficult to parse into a prediction framework based on a convolutional neural network; its pyramid pooling module provides a hierarchical global prior containing information at different scales across different sub-regions, and an effective optimization strategy for the deep ResNet is formulated on the basis of a deeply supervised loss. The embodiment further comprises a fully convolutional self-attention discriminator, which performs adversarial training on the feature map obtained by refining the teacher network's intermediate-layer knowledge and the feature map of the student network's intermediate layer; two attention modules are inserted between its last three modules to capture structural information.

S3, adding the designed feature refining module to the knowledge distillation training of the teacher-student network. The feature refining module of this embodiment takes as input the feature map output by the penultimate layer of the teacher network and processes it through four branches. Two branches consist of two-dimensional convolutional layers that extract knowledge structured in the horizontal and vertical spatial directions. The third branch uses two dilated convolutions to obtain long-range spatial dependency knowledge; these three branches together extract the spatial structure knowledge of the teacher network. The fourth branch uses a multilayer perceptron to extract the channel distribution knowledge learned by the teacher. A residual learning scheme is applied to the four branches, so that important spatial and channel knowledge is extracted before the teacher network's features are reduced in dimension to match the scale of the student network. The student and teacher networks receive the same input, but their intermediate feature maps and final results differ because of the performance gap between them; the purpose of extracting this knowledge is to pass it to the student network and narrow that gap. The 128-channel feature map generated by the penultimate layer of the student network is then taken as the false sample, the 128-channel feature map obtained by passing the teacher network's penultimate-layer output through the refining module is taken as the true sample, and both are input into the self-attention discriminator for adversarial training.

Wherein, step S3 specifically includes the following contents:

The feature map F_T generated by the penultimate layer of the teacher network is used as the input of the refining module. The input first passes through two paths. The first path uses a convolution with a 1 × 1 kernel to reduce the 512-channel feature map directly to a 128-channel feature map with the same channel count, height and width as the student network's feature map. The second path first reduces the number of channels to 256 with a convolution of kernel size 3 × 3 and padding 1, leaving the height and width of the feature map unchanged. This path then splits into four branches that refine spatial and channel knowledge: three branches belong to the spatial refining module and the fourth belongs to the channel refining module.

The first branch of the spatial refining module first uses a two-dimensional convolution with kernel size 7 × 1 and padding 3 to extract horizontal spatial structure knowledge along the W dimension, reducing the number of channels from 256 to 128 with the feature map size unchanged. It then uses a two-dimensional convolution with kernel size 1 × 7 and padding 3 to extract vertical spatial structure knowledge along the H dimension, reducing the number of channels from 128 to 1. The second branch first uses a two-dimensional convolution with kernel size 1 × 7 and padding 3 to extract vertical spatial structure knowledge along the H dimension, reducing the number of channels from 256 to 128 with the feature map size unchanged. It then uses a two-dimensional convolution with kernel size 7 × 1 and padding 3 to extract horizontal spatial structure knowledge along the W dimension, reducing the number of channels from 128 to 1. The third branch extracts long-range dependency knowledge through two dilated convolutions with kernel size 3 × 3, dilation rate 4 and padding 4, reducing the number of channels from 256 to 128 and then to 1. The three single-channel feature maps of size H × W are then concatenated into a 3-channel feature map of size H × W, and a convolution with a 1 × 1 kernel reduces the number of channels to 1, giving the final spatial structure knowledge.

In the channel branch, the refining module first converts the 256 × H × W feature map into a 256 × 1 × 1 vector by global average pooling, and a multilayer perceptron then changes the number of channels to 128, compressing the channel knowledge into the channel distribution knowledge learned by the teacher. The 1 × H × W spatial structure knowledge is multiplied element-wise by the 128 × 1 × 1 channel distribution knowledge, giving the spatial structure and channel distribution knowledge extracted from the 512-channel intermediate feature map of the teacher network, which is mapped into (0,1) by a Sigmoid function with 128 channels. Using residual learning, this knowledge is transferred to the 128-channel teacher intermediate feature map that was obtained directly by the 1 × 1 convolution and has lost spatial structure and channel distribution knowledge, finally giving the refined F'_T. A classifier classifies the result to obtain a semantic segmentation prediction map, and the refining module is trained using the prediction map and the ground-truth map in the dataset.

The 128-channel feature map generated by the penultimate layer of the student network is taken as the false sample and the 128-channel feature map refined from the teacher network's intermediate layer as the true sample; the true and false samples are input into the self-attention discriminator for adversarial training. During adversarial training the teacher network parameters are fixed, and only the refining module and the student network are optimized.
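The final fusion step of the refining module (broadcast multiplication of the spatial and channel knowledge, Sigmoid gating, and residual addition) can be illustrated on a tiny tensor. The sketch below assumes the refined output takes the form F'_T = F_r + Sigmoid(S × C) × F_r, where F_r is the teacher feature map after the 1 × 1 reduction, S is the 1 × H × W spatial knowledge and C is the channel knowledge; this reading of the residual connection is an assumption. The function names and example values are hypothetical, and nested lists stand in for tensors to keep the example dependency-free.

```python
import math


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(spatial, channel, reduced):
    """Residual fusion sketch.

    spatial: H x W nested list (the 1 x H x W spatial structure knowledge)
    channel: list of length C (the C x 1 x 1 channel distribution knowledge)
    reduced: C x H x W nested list (teacher features after the 1 x 1 reduction)
    Returns a C x H x W nested list: reduced + sigmoid(spatial * channel) * reduced.
    """
    refined = []
    for c, plane in enumerate(reduced):
        refined_plane = []
        for i, row in enumerate(plane):
            refined_row = []
            for j, value in enumerate(row):
                # Broadcasting: one spatial weight per pixel, one weight per channel.
                gate = sigmoid(spatial[i][j] * channel[c])
                refined_row.append(value + gate * value)
            refined_plane.append(refined_row)
        refined.append(refined_plane)
    return refined


# Tiny hypothetical example with C = 2 channels, H = 1, W = 2.
spatial = [[0.0, 2.0]]
channel = [1.0, -1.0]
reduced = [[[4.0, 4.0]], [[4.0, 4.0]]]
refined = fuse(spatial, channel, reduced)
```

Because the gate lies strictly between 0 and 1, each refined activation lies between the reduced teacher activation and twice its value, so the refined map re-weights the teacher features without discarding them.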

S4, setting the network model parameters.

The invention uses two NVIDIA GTX 1080 Ti GPUs, each with 11 GB of video memory.

In the initial stage of training a neural network model, the parameters are far from the extreme point, so the learning rate is generally set larger, which allows the model to approach the extreme point quickly. In the middle and later stages of training the model is close to the extreme point and about to converge, so a smaller learning rate is used in these two stages; a larger learning rate would tend to cause oscillation around the true extreme point and prevent convergence to it. The invention therefore trains the neural network model with an adaptive learning rate. The mathematical expression of the adaptive learning rate is as follows:

The loss function used when the refining module is trained with the prediction map and the ground-truth map in the dataset is as follows:

The Wasserstein adversarial loss function, which defines the distance between the distribution of the refined feature maps of the teacher network's intermediate layer and that of the feature maps of the student network's intermediate layer and is used for adversarial training, is as follows:

where F_I is the input to the teacher and student networks; the feature map F_S predicted by the student network is regarded as the false sample; the feature map F_T, obtained after the refining module refines the feature map generated by the teacher network, is the true sample; P_S(F_S) and P_T(F_T) are the feature distributions of the two kinds of feature maps, respectively; $\mathbb{E}_{F_S \sim P_S(F_S)}$ and $\mathbb{E}_{F_T \sim P_T(F_T)}$ are the expectations of the discriminator output for inputs F_S and F_T, respectively; and D is the embedded discriminator network.

The segmentation maps of the trained student network are evaluated using the intersection over union (IoU) score, which is the ratio of the intersection to the union of the ground-truth mask and the prediction mask of each class. The mean IoU (mIoU) over all classes is used to assess the effectiveness of distillation. Pixel accuracy is the ratio of pixels with the correct semantic label to the total number of pixels.

where k is the number of categories, P is the prediction of the student network, and G is the ground truth.
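With k, P and G as defined above, the mean IoU is commonly written as mIoU = (1/k) Σ |P_i ∩ G_i| / |P_i ∪ G_i|, summed over the classes i. The sketch below computes it, together with pixel accuracy, from flattened label lists via a confusion matrix. It is an illustration under stated assumptions rather than the patent's evaluation code: the function names are hypothetical, pixels whose ground truth equals `ignore_index` are skipped, and classes absent from both prediction and ground truth are left out of the mean.

```python
def confusion_matrix(pred, truth, k, ignore_index=255):
    """matrix[g][p] counts pixels with ground truth g predicted as p."""
    matrix = [[0] * k for _ in range(k)]
    for p, g in zip(pred, truth):
        if g == ignore_index:
            continue  # unlabeled pixels do not affect the score
        matrix[g][p] += 1
    return matrix


def mean_iou(pred, truth, k, ignore_index=255):
    """Mean over classes of intersection / union."""
    matrix = confusion_matrix(pred, truth, k, ignore_index)
    ious = []
    for c in range(k):
        intersection = matrix[c][c]
        truth_count = sum(matrix[c])
        pred_count = sum(row[c] for row in matrix)
        union = truth_count + pred_count - intersection
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


def pixel_accuracy(pred, truth, k, ignore_index=255):
    """Fraction of labeled pixels whose predicted class is correct."""
    matrix = confusion_matrix(pred, truth, k, ignore_index)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[c][c] for c in range(k))
    return correct / total if total else 0.0


# Tiny two-class example with four pixels.
pred = [0, 0, 1, 1]
truth = [0, 1, 1, 1]
miou = mean_iou(pred, truth, k=2)
accuracy = pixel_accuracy(pred, truth, k=2)
```

In this example class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, while three of the four pixels are labeled correctly.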

S5, analyzing the results according to the obtained segmentation maps and mean intersection over union.

Fig. 3(a) is the original image of picture A; Fig. 3(b) is the label map of picture A; Fig. 3(c) is the segmentation map of picture A produced by the current state-of-the-art method of the document "Intra-class Feature Variation Distillation for Semantic Segmentation" (IFVD); Fig. 3(d) is the segmentation map of picture A produced by the student network under the knowledge distillation method of the invention; Fig. 3(e) is the original image of picture B; Fig. 3(f) is the label map of picture B; Fig. 3(g) is the segmentation map of picture B produced by IFVD; Fig. 3(h) is the segmentation map of picture B produced by the student network under the knowledge distillation method of the invention. In terms of mIoU, the method of this embodiment raises the mIoU of the student network on the Cityscapes dataset from 66.63%, obtained with the IFVD method, to 68.57%, an improvement of 1.94 percentage points. The segmentation maps produced by the student network trained with this knowledge distillation method are more accurate, which shows that the knowledge distillation method of this embodiment further improves the distillation effect and yields a more accurate student model.

The invention also provides an electronic device and a computer readable medium according to the embodiment of the invention.

Wherein electronic equipment includes:

one or more processors;

a storage device for storing one or more programs,

which, when executed by the one or more processors, cause the one or more processors to implement the aforementioned method.

In specific use, a user can interact over a network with a server, which likewise serves as a terminal device, through an electronic device acting as the terminal device, to perform functions such as receiving or sending messages. The terminal device is generally any of various electronic devices that have a display and are operated through a human-computer interface, including but not limited to smart phones, tablet computers, notebook computers, and desktop computers. Various application software can be installed on the terminal device as needed, including but not limited to web browser software, instant messaging software, social platform software, and shopping software.

The server is a network server that provides various services, for example a background server that processes the automatic driving scene pictures received from the terminal device. The received image data are semantically segmented using the trained model, and the final semantic segmentation result is returned to the terminal device.

The semantic segmentation method provided by this embodiment is generally executed by the server; in practical applications, the terminal device can also execute semantic segmentation directly when the necessary conditions are met.

Similarly, the computer readable medium of the present invention has stored thereon a computer program which, when executed by a processor, implements a semantic segmentation method of an embodiment of the present invention.

In knowledge distillation, the invention adds teacher-network supervision to the intermediate features generated by the student network. This supervision of the student network's middle layer aids its training, so that the student network achieves better results. The knowledge refining module refines the useful knowledge of the teacher and converts it into knowledge that the student network can understand and learn more easily, removing the redundant, noisy, over-parameterized information in the teacher network's high-dimensional feature map that does not help supervise the student. A generative adversarial network structure is used so that the teacher network supervises the student network's learning at the middle layer through adversarial training, and the intermediate feature map generated by the student network gradually approaches the intermediate features generated by the teacher network. During training, the teacher supervises the student network at multiple layers, so that the student learns not only feature information but also how to extract useful knowledge during knowledge distillation.
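The adversarial supervision described above can be illustrated with a small numeric sketch. The loss expressions of the invention are not reproduced in this text, so the sketch assumes the standard binary cross-entropy GAN objective with a non-saturating loss on the student side; this is an assumption, not the formulation confirmed by the source. The function names and the discriminator scores are hypothetical.

```python
import math


def discriminator_loss(d_real: float, d_fake: float) -> float:
    """Assumed standard GAN discriminator loss.

    d_real: discriminator score in (0, 1) for the refined teacher knowledge (true sample).
    d_fake: discriminator score in (0, 1) for the student feature map (false sample).
    The loss is small when d_real is near 1 and d_fake is near 0.
    """
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def student_adversarial_loss(d_fake: float) -> float:
    """Assumed non-saturating loss for the student network.

    The student is rewarded when the discriminator scores its
    intermediate feature map as if it were refined teacher knowledge.
    """
    return -math.log(d_fake)


# Hypothetical scores early and late in training.
early = student_adversarial_loss(0.2)  # discriminator easily rejects student features
late = student_adversarial_loss(0.6)   # student features look more like teacher knowledge

# When the discriminator cannot tell the two apart, both scores sit at 0.5.
balanced = discriminator_loss(0.5, 0.5)  # equals 2 * ln 2
```

Under this assumed objective, the student loss falls as its feature maps become harder to distinguish from the refined teacher knowledge, which matches the stated goal that the student features gradually approach the teacher features.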

The above describes only the preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. An automatic driving scene semantic segmentation method is characterized by comprising the following steps:

2. The method of claim 1, wherein in step 2, the teacher network is a neural network that has been trained in advance.

3. The method of claim 1, wherein in step 2, the label maps in the training set are formed by manually labeling the original images.

4. The method according to claim 1, wherein in step 1, the refining module first performs a convolution on the input feature map with a 3 × 3 convolution kernel to obtain a feature map with 256 channels; then, the first two branches each apply two layers of two-dimensional convolution, with first-layer kernel sizes of 1 × 7 and 7 × 1 and second-layer kernel sizes of 7 × 1 and 1 × 7, respectively, to extract horizontal and vertical spatial structure knowledge; the third branch uses two dilated convolutions with 3 × 3 kernels and a dilation rate of 4 to obtain long-range spatial dependency knowledge; the fourth branch uses a multilayer perceptron to extract the channel distribution knowledge learned by the teacher network; then, the feature maps obtained by the first three branches are concatenated, reduced in dimension by a 1 × 1 convolution, and multiplied element-wise with the channel distribution knowledge obtained by the fourth branch; the resulting 128-channel feature map is mapped into the interval (0, 1) by a Sigmoid function; the mapped feature map is multiplied element-wise with the 128-channel feature map obtained by reducing the penultimate-layer feature map of the teacher network with a 1 × 1 convolution; and the result is added element-wise to the penultimate-layer feature map of the teacher network to obtain the output of the refining module.

5. The method of claim 1, wherein in step 2, the input of the refining module is the 512-channel feature map generated by the penultimate layer of the teacher network; and in step 3, the feature map generated by the student network is the 128-channel feature map generated by the penultimate layer of the student network.

6. The method of claim 1, wherein the step 2 of training the refining module comprises:

the expression of the Loss function Loss is:

7. The method of claim 1, wherein in step 3, the adversarial training is:

and

8. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

which, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.

9. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
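As an illustrative note on the branch configuration recited in claim 4, the following sketch computes how far each spatial branch of the refining module "sees" along one axis. It is a worked calculation rather than an implementation of the module: it assumes stride-1 convolutions and reads a kernel size written as a × b as height × width, neither of which is stated in the claims. The helper names are hypothetical.

```python
from typing import List, Tuple


def effective_kernel(k: int, dilation: int = 1) -> int:
    """Extent covered along one axis by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


def receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field along one axis of stacked stride-1 convolutions.

    layers: (kernel_size, dilation) pairs along that axis, in order.
    """
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf


# First branch: a 1 x 7 layer followed by a 7 x 1 layer.
branch1_height = receptive_field([(1, 1), (7, 1)])  # 7
branch1_width = receptive_field([(7, 1), (1, 1)])   # 7

# Second branch: a 7 x 1 layer followed by a 1 x 7 layer.
branch2_height = receptive_field([(7, 1), (1, 1)])  # 7
branch2_width = receptive_field([(1, 1), (7, 1)])   # 7

# Third branch: two 3 x 3 dilated convolutions with dilation rate 4.
branch3_extent = receptive_field([(3, 4), (3, 4)])  # 17 along each axis

# Weights per input/output channel pair: factorized 1 x 7 + 7 x 1 versus a full 7 x 7.
factorized_weights = 1 * 7 + 7 * 1  # 14
full_weights = 7 * 7                # 49
```

Under these assumptions, each factorized branch covers a 7 × 7 window with 14 rather than 49 weights per channel pair, and the two dilated layers cover a 17 × 17 window, which is consistent with the stated aim of capturing long-range spatial dependencies.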