CN115171029B - Unmanned-driving-based method and system for segmenting instances in urban scene - Google Patents

Unmanned-driving-based method and system for segmenting instances in urban scene

Info

Publication number
CN115171029B
CN115171029B (application CN202211098488.7A)
Authority
CN
China
Prior art keywords
offset
space
attention
frame
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211098488.7A
Other languages
Chinese (zh)
Other versions
CN115171029A (en)
Inventor
徐龙生
孙振行
庞世玺
杨继冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Kailin Environmental Protection Equipment Co ltd
Original Assignee
Shandong Kailin Environmental Protection Equipment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Kailin Environmental Protection Equipment Co ltd filed Critical Shandong Kailin Environmental Protection Equipment Co ltd
Priority to CN202211098488.7A priority Critical patent/CN115171029B/en
Publication of CN115171029A publication Critical patent/CN115171029A/en
Application granted granted Critical
Publication of CN115171029B publication Critical patent/CN115171029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an instance segmentation method and system for urban scenes in unmanned driving, belonging to the technical field of video understanding and analysis. The method comprises the following steps: acquiring an original pixel-level feature sequence from a scene video; performing space-time position encoding on the original pixel-level feature sequence; obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder; calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix; and inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and obtaining an instance segmentation result from the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features. The accuracy of instance segmentation in urban scenes is thereby improved.

Description

Unmanned-driving-based method and system for segmenting instances in urban scene
Technical Field
The invention relates to the technical field of video understanding and analysis, in particular to an instance segmentation method and system for urban scenes based on unmanned driving.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Automatic driving is mainly realized by acquiring a video of the urban scene ahead, analyzing the video, identifying and segmenting the instances in the scene, and then driving automatically according to the instance segmentation results. Most existing instance segmentation methods are based on the Mask-RCNN framework, in which the target appearance and motion information used for data matching increases the computational cost and affects the real-time performance of segmentation. Moreover, in an unmanned urban scene, the people and vehicles on the road undergo severe instance identity changes for the following reasons: (1) instances disappear and reappear due to occlusion; (2) instances leave the scene; (3) new instances enter the scene. All of these lead to inaccurate instance segmentation results.
Disclosure of Invention
In order to solve the above problems, the invention provides an instance segmentation method and system for unmanned urban scenes. A single-stage full space-time offset Transformer is used for feature extraction to obtain instance candidates (instance proxies), and a data association module designed for instance identity changes is then used for data association, so that the accuracy of instance segmentation in urban scenes is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, an instance segmentation method for unmanned urban scenes is provided, including:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
In a second aspect, an instance segmentation system for unmanned urban scenes is provided, including:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
In a third aspect, an electronic device is provided, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the above instance segmentation method for unmanned urban scenes are performed.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the above instance segmentation method for unmanned urban scenes.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses a panoptic segmentation technique based on the full space-time offset Transformer, which can effectively model long-term dependencies and historical trajectories. The offset attention mechanism alleviates the high complexity of full space-time attention, improves the running speed, accelerates model convergence and reduces the amount of computation, and the data association module for instance identity changes can effectively recognize instance identity changes, so the method quickly adapts to the complex environment of unmanned urban scenes.
2. After the instance identity features are obtained through the Transformer encoder-decoder, the association matrix of the identity features of two adjacent frames is calculated and the instance identity features are screened according to the association matrix, which deeply mines the spatio-temporal dependency of the instances in the images; the instances are then segmented according to the screened instance identity features, improving the accuracy of instance segmentation.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method disclosed in embodiment 1;
FIG. 2 is an illustration of the data association from frame 1 to frame 30;
FIG. 3 is a block diagram of the method disclosed in embodiment 1.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment 1
In this embodiment, an instance segmentation method for unmanned urban scenes is disclosed. As shown in figs. 1 and 3, the method includes:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
Specifically, the urban scene video is divided into a video frame sequence.
The feature extraction network comprises a backbone network ResNet101. Multi-scale feature extraction is performed on the video frame sequence through the backbone network ResNet101 to obtain a first feature map sequence F3, a second feature map sequence F4 and a third feature map sequence F5. Preferably, the resolutions of F3, F4 and F5 relative to the input video frames are 1/32, 1/16 and 1/8, respectively, and the number of channels of each is 256.
The first feature map sequence F3 is upsampled and then spliced with the second feature map sequence F4 to obtain a fourth feature map sequence F6. Preferably, F3 is upsampled by a factor of 2, and the number of channels of F6 is 512.
The fourth feature map sequence F6 is upsampled and then spliced with the third feature map sequence F5 to obtain a fifth feature map sequence F8. Preferably, F6 is upsampled by a factor of 2 and its number of channels is reduced to 256 before splicing with F5, so that the number of channels of F8 becomes 512.
The fifth feature map sequence F8 is convolved and then compressed into one dimension to obtain the original pixel-level feature sequence. Preferably, a 1x1 convolution layer reduces the number of channels of F8 to d = 256, and the time T, height H and width W dimensions of F8 are compressed into one dimension, i.e., the d x T x H x W feature map obtained in the previous step is reshaped into d x n, where n = T x H x W.
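For illustration, the following Python (PyTorch) sketch outlines the fusion and flattening steps described above. It assumes the backbone outputs have already been projected to 256 channels and that bilinear interpolation is used for upsampling; these details, the module name and the toy input sizes are not fixed by the text.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelFeatureFlattener(nn.Module):
        """Fuses F3 (1/32), F4 (1/16) and F5 (1/8) and flattens T, H, W into one axis."""

        def __init__(self, channels: int = 256):
            super().__init__()
            # F6 has 512 channels after concatenation and is reduced back to 256 before the next splice.
            self.reduce_f6 = nn.Conv2d(2 * channels, channels, kernel_size=1)
            # 1x1 convolution that reduces F8 (512 channels) to the embedding dimension d = 256.
            self.reduce_f8 = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, f3, f4, f5):
            # f3: (T, 256, H/32, W/32), f4: (T, 256, H/16, W/16), f5: (T, 256, H/8, W/8)
            up3 = F.interpolate(f3, scale_factor=2, mode="bilinear", align_corners=False)
            f6 = self.reduce_f6(torch.cat([up3, f4], dim=1))          # (T, 256, H/16, W/16)
            up6 = F.interpolate(f6, scale_factor=2, mode="bilinear", align_corners=False)
            f8 = self.reduce_f8(torch.cat([up6, f5], dim=1))          # (T, d=256, H/8, W/8)
            t, d, h, w = f8.shape
            # Compress T, H and W into a single axis: (d, n) with n = T*H*W.
            return f8.permute(1, 0, 2, 3).reshape(d, t * h * w)

    # Toy usage with a 4-frame clip at 256x256 input resolution.
    flattener = PixelFeatureFlattener()
    f3 = torch.randn(4, 256, 8, 8)
    f4 = torch.randn(4, 256, 16, 16)
    f5 = torch.randn(4, 256, 32, 32)
    print(flattener(f3, f4, f5).shape)  # torch.Size([256, 4096])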
Space-time position encoding is performed on the original pixel-level feature sequence using sine and cosine functions of different frequencies to obtain the space-time position encoding result:

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),

where pos is the position of the element in the sequence and i is the dimension index. d must be divisible by 3. The position encoding is supplied once at the Transformer encoder input and is also added at the attention layer of each encoding block.
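The following sketch illustrates one way to realize such a sinusoidal space-time encoding, assuming the d channels are split evenly across the temporal, height and width axes (hence d divisible by 3); the exact split and the function names are illustrative assumptions.

    import math
    import torch

    def sincos_1d(length: int, dim: int) -> torch.Tensor:
        """Standard 1-D sine/cosine encoding of shape (length, dim); dim must be even."""
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)      # (length, 1)
        i = torch.arange(0, dim, 2, dtype=torch.float32)                  # even channel indices
        div = torch.exp(-math.log(10000.0) * i / dim)                     # different frequencies
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def spatiotemporal_encoding(t: int, h: int, w: int, d: int) -> torch.Tensor:
        """Returns a (d, t*h*w) encoding laid out like the flattened pixel feature sequence."""
        assert d % 3 == 0, "d must be divisible by 3 so each axis gets d/3 channels"
        dt = dh = dw = d // 3
        pe_t = sincos_1d(t, dt).view(t, 1, 1, dt).expand(t, h, w, dt)
        pe_h = sincos_1d(h, dh).view(1, h, 1, dh).expand(t, h, w, dh)
        pe_w = sincos_1d(w, dw).view(1, 1, w, dw).expand(t, h, w, dw)
        pe = torch.cat([pe_t, pe_h, pe_w], dim=-1)                        # (t, h, w, d)
        return pe.reshape(t * h * w, d).transpose(0, 1)                   # (d, n)

    print(spatiotemporal_encoding(4, 32, 32, 252).shape)  # torch.Size([252, 4096])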
The original pixel-level feature sequence and the space-time position encoding result are input into the full space-time offset Transformer encoder-decoder to obtain the instance identity features of each frame.
The full space-time offset Transformer encoder-decoder introduces an offset attention mechanism and comprises three basic components: a multi-head offset attention module, a feed-forward neural network and a regularization layer. The multi-head offset attention module uses multiple offset attention modules in parallel, and each offset attention module decomposes its input into three vectors: a query vector Q, a key vector K and a value vector V. The aim is to obtain a weighted sum over the value vector, with weights computed from the local query vector and the local key vector, and to perform offset sampling in a decoupled manner, which reduces the high complexity of the full space-time attention mechanism, focuses attention on a local region of interest and yields more discriminative local features.
The offset attention module LocalAttention is expressed as:

LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,

where P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function. The outputs of the parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
The feed-forward neural network FFN consists of a 3-layer perceptron with ReLU activations, comprising hidden layers and a linear output layer. The regularization layer performs normalization in units of channels using Layer Normalization (LN).
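A simplified sketch of the offset attention module described above is given below: each query attends only to a small set of keys sampled at learned offsets around its own position in the flattened space-time sequence. The rounded 1-D offsets and all layer names are simplifications for illustration and do not reproduce the exact sampling scheme.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OffsetAttention(nn.Module):
        """Each query attends to a few keys sampled at learned offsets around its position."""

        def __init__(self, dim: int = 256, num_points: int = 4):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            # Each query predicts `num_points` sampling offsets from its own content.
            self.offset_pred = nn.Linear(dim, num_points)
            self.scale = dim ** -0.5

        def forward(self, x):
            # x: (n, dim) flattened space-time features
            n, _ = x.shape
            q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
            # Reference position of each query plus learned offsets (rounded for simplicity;
            # rounding is not differentiable, so a real implementation would interpolate).
            ref = torch.arange(n, device=x.device, dtype=torch.float32).unsqueeze(1)  # (n, 1)
            idx = (ref + self.offset_pred(x)).round().clamp(0, n - 1).long()          # (n, P)
            k_loc, v_loc = k[idx], v[idx]                                             # (n, P, dim)
            # Attention weights over the P sampled locations only, not the full sequence.
            attn = F.softmax((q.unsqueeze(1) * k_loc).sum(-1) * self.scale, dim=-1)   # (n, P)
            return (attn.unsqueeze(-1) * v_loc).sum(dim=1)                            # (n, dim)

    out = OffsetAttention()(torch.randn(4096, 256))
    print(out.shape)  # torch.Size([4096, 256])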
The full space-time offset Transformer encoder consists of 8 encoding blocks, each of which consists of a multi-head offset attention module, a regularization layer, an FFN and a regularization layer. The full space-time offset Transformer decoder consists of 8 decoding blocks, each of which consists of a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN and a regularization layer. The encoder and the decoder are symmetrical in structure. The input of the encoder is the original pixel-level feature sequence, the output of each encoding block is the input of the next encoding block, and the output of the encoder plus the space-time position encoding result forms part of the input of each decoding block. The output of each decoding block is input into the next decoding block. The Transformer encoder-decoder directly outputs N different instance identity features per frame, where N is much larger than the number of all IDs in the panoptic result.
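The following sketch shows the structure of one encoding block (multi-head offset attention, regularization layer, FFN, regularization layer) stacked 8 times. nn.MultiheadAttention stands in for the multi-head offset attention purely to keep the sketch self-contained, and the residual connections and hidden width are assumptions.

    import torch
    import torch.nn as nn

    class EncodingBlock(nn.Module):
        """Multi-head (offset) attention -> LayerNorm -> FFN -> LayerNorm."""

        def __init__(self, dim: int = 256, heads: int = 8, hidden: int = 1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in attention
            self.norm1 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(                    # 3-layer perceptron with ReLU activations
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, dim),
            )
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x, pos):
            # The position encoding is added at the attention input of every block.
            q = k = x + pos
            x = self.norm1(x + self.attn(q, k, x)[0])
            return self.norm2(x + self.ffn(x))

    # An 8-block encoder over the flattened (n, d) pixel features.
    encoder = nn.ModuleList(EncodingBlock() for _ in range(8))
    x = torch.randn(1, 1024, 256)
    pos = torch.randn(1, 1024, 256)
    for block in encoder:
        x = block(x, pos)
    print(x.shape)  # torch.Size([1, 1024, 256])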
A data association module performs historical trajectory data association on the instance identity features of each frame: the association matrix of the instance identity features of two adjacent frames is calculated, and the instance identity features of the two adjacent frames are screened according to the association matrix.
Specifically, the data association module for instance identity changes performs historical trajectory data association on the instance identity features of each frame output by the full space-time offset Transformer, so that the learned instance identity features correspond one-to-one to the real instances. F_t and F_{t-n}, the instance identity features of the t-th frame and the (t-n)-th frame output by the Transformer, are combined into a feature vector Ψ(t-n, t) of size N x N x 1024. The feature vector Ψ(t-n, t) is then mapped by a compression network into association features of size N x N; after processing by the Softmax function, values greater than 0.5 are set to 1 and values less than 0.5 are set to 0, yielding the association matrix M.
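A hedged sketch of this pairwise association step follows. The pairwise concatenation used to build Ψ(t-n, t) and the channel widths of the compression network are assumptions for illustration; only the 1x1 convolutions (which keep adjacent cells from interacting), the Softmax and the 0.5 threshold follow the description above.

    import torch
    import torch.nn as nn

    class AssociationHead(nn.Module):
        """Builds the N x N association matrix M between two frames' identity features."""

        def __init__(self, feat_dim: int = 512, channels=(256, 64, 1)):
            super().__init__()
            layers, in_ch = [], 2 * feat_dim            # concatenated pair -> 1024 channels
            for out_ch in channels:
                layers += [nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU()]
                in_ch = out_ch
            layers.pop()                                # no ReLU after the final 1x1 convolution
            self.compress = nn.Sequential(*layers)      # 1x1 kernels: adjacent cells do not interact

        def forward(self, feat_prev, feat_cur):
            # feat_prev, feat_cur: (N, feat_dim) identity features of frames t-n and t
            n = feat_prev.shape[0]
            psi = torch.cat([feat_prev.unsqueeze(1).expand(n, n, -1),
                             feat_cur.unsqueeze(0).expand(n, n, -1)], dim=-1)     # (N, N, 1024)
            logits = self.compress(psi.permute(2, 0, 1).unsqueeze(0)).squeeze()   # (N, N)
            probs = torch.softmax(logits, dim=-1)       # Softmax over each row
            return (probs > 0.5).float()                # binarize at 0.5 to obtain M

    m = AssociationHead()(torch.randn(5, 512), torch.randn(5, 512))
    print(m.shape)  # torch.Size([5, 5])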
As shown in fig. 2, which illustrates the data association from frame 1 to frame 30, frames 1 and 30 contain at most 5 instances, i.e., N = 5. The column indices of the matrix represent the instances in frame 1, the row indices represent the instances in frame 30, and equal indices denote the same instance. A value of 1 indicates that the instance is present in both frame 1 and frame 30, otherwise the value is 0, and the X padding indicates an absent instance. The right-hand side of the figure indicates instances entering and leaving the video frames. For example, a 1 in the last row indicates that object 5 enters at frame 30, and a 1 in the last column indicates that instance 4 exists in frame 1 but has left by frame 30.
The compression network uses convolution kernels to progressively reduce the dimensions along the depth of the input tensor, without allowing adjacent elements of the feature map to interact. However, the association matrix M does not account for instance objects that enter or leave the video between the two input frames. To handle these objects, an extra column and an extra row are added to the association matrix M to form the matrices M_1 and M_2, respectively. The added column vector and row vector represent, respectively, the probability of an instance leaving the video and of an instance entering the video when the t-th frame is associated with the instances of the (t-n)-th frame. Next, a Softmax operation is performed on M_1 in units of rows to obtain the probability matrix A_1, which expresses in probabilistic form the association between the identity feature predictions of the different instances of the t-th frame and the (t-n)-th frame. A Softmax operation is then performed on M_2 in units of columns to obtain the probability matrix A_2, which gives the similarity probabilities corresponding to each column. Finally, A_1 and A_2 are compared with the true association matrix L_{t-n,t} between the objects in the video frames to obtain the matching loss.
Here, L_{t-n,t} denotes a binary data association matrix representing the correspondence between the instance objects detected in the instance identity features of the (t-n)-th frame and the t-th frame. For example, if instance object 1 in the (t-n)-th frame corresponds to the n-th instance object in the t-th frame, then the n-th element of the first row of L_{t-n,t} is 1.
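The following sketch shows how the extra column and row can be appended to form M_1 and M_2 and how the row-wise and column-wise Softmax operations yield A_1 and A_2; filling the added entries with zeros is an assumption, since the text does not state how they are initialized.

    import torch

    def association_probabilities(m_logits: torch.Tensor):
        """m_logits: (N, N) raw association scores between frames t-n and t."""
        n = m_logits.shape[0]
        leave_col = torch.zeros(n, 1)                   # probability column: instance leaves the video
        enter_row = torch.zeros(1, n)                   # probability row: instance newly enters
        m1 = torch.cat([m_logits, leave_col], dim=1)    # M1: (N, N+1)
        m2 = torch.cat([m_logits, enter_row], dim=0)    # M2: (N+1, N)
        a1 = torch.softmax(m1, dim=1)                   # row-wise Softmax -> A1
        a2 = torch.softmax(m2, dim=0)                   # column-wise Softmax -> A2
        return a1, a2

    a1, a2 = association_probabilities(torch.randn(5, 5))
    print(a1.shape, a2.shape)  # torch.Size([5, 6]) torch.Size([6, 5])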
Based on the above analysis, the association process can be supervised with a cycle-consistency constraint comprising a forward loss L_f and a reverse loss L_b. The forward loss ensures that instances are correctly associated from the (t-n)-th frame to the t-th frame, and the reverse loss ensures that instances are correctly associated from the t-th frame back to the (t-n)-th frame. At the same time, to suppress the association between non-maximum-similarity instances, a non-maximum loss L_a is added so that the true instance association probabilities are maximized. The final matching loss is the average of these three components, i.e. (L_f + L_b + L_a) / 3,
where L_1 and L_2 are the pruned matrices obtained from L_{t-n,t} by deleting the last row and the last column, respectively, L_3 is the pruned matrix obtained from L_{t-n,t} by deleting both the last row and the last column, ⊙ denotes the Hadamard product, and the matrices obtained from A_1 and A_2 by removing the last column and the last row, respectively, are used together with L_3; the expressions for L_f, L_b and L_a combine these pruned ground-truth matrices with the corresponding predicted probability matrices through the Hadamard product.
After the association matrix M of two adjacent frames is calculated, M is summed row by row to obtain an N x 1 sum vector, and the instance identity features whose entries in the sum vector are greater than 1 are retained to obtain the screened instance identity features. That is, the row indices whose values in the sum vector are greater than 1 are found, screening is performed according to the found row indices, and the instance identity features of the corresponding rows are retained as the screened instance identity features.
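A minimal sketch of this screening step, as described above:

    import torch

    def screen_identities(identities: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        """identities: (N, d) instance identity features of a frame; m: (N, N) binary matrix M."""
        row_sum = m.sum(dim=1)       # N x 1 sum vector, one entry per row
        keep = row_sum > 1           # row indices whose value in the sum vector is greater than 1
        return identities[keep]      # retained (screened) instance identity features

    kept = screen_identities(torch.randn(5, 256), torch.randint(0, 2, (5, 5)).float())
    print(kept.shape)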
The screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame are input into a self-attention module (self-attention) to obtain an initial attention map, and the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are spliced and fused to obtain the instance segmentation result.
Specifically, the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are fused to obtain the prediction result of each frame, which comprises the mask, category and confidence score of each ID; the prediction results with confidence scores higher than a first set value are selected as the instance segmentation result.
The initial attention map is spliced and fused with the original pixel-level feature sequence of the corresponding frame and the output of the Transformer encoder, and the prediction result of each frame is output through one 3D convolution and three parallel branches; the prediction result comprises the mask, category and confidence score of each ID. The first branch is a deformable convolution layer that outputs a mask m for each ID of the different frames; the second branch is a convolution layer and activation function that outputs the category c of each ID; the third branch is a convolution layer and activation function that outputs a confidence score s.
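A hedged sketch of the three parallel prediction branches follows. A regular 3x3x3 convolution stands in for the deformable convolution of the mask branch, and the channel sizes, the number of classes and what each branch consumes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PredictionHeads(nn.Module):
        """One shared 3D convolution followed by mask, category and confidence branches."""

        def __init__(self, in_ch: int = 256, num_ids: int = 10, num_classes: int = 8):
            super().__init__()
            self.shared = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1)         # one 3D convolution
            self.mask_branch = nn.Conv3d(in_ch, num_ids, kernel_size=3, padding=1)  # stand-in for deformable conv
            self.cls_branch = nn.Linear(in_ch, num_classes)                         # category c per ID
            self.score_branch = nn.Linear(in_ch, 1)                                 # confidence s per ID

        def forward(self, fused, id_feats):
            # fused: (1, C, T, H, W) fused features; id_feats: (num_ids, C) screened identity features
            x = torch.relu(self.shared(fused))
            masks = self.mask_branch(x)                                      # (1, num_ids, T, H, W)
            classes = self.cls_branch(id_feats).softmax(dim=-1)              # (num_ids, num_classes)
            scores = torch.sigmoid(self.score_branch(id_feats)).squeeze(-1)  # (num_ids,)
            return masks, classes, scores

    heads = PredictionHeads()
    masks, classes, scores = heads(torch.randn(1, 256, 4, 32, 32), torch.randn(10, 256))
    print(masks.shape, classes.shape, scores.shape)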
Given the predicted category c, confidence score s and predicted mask m, a semantic mask SemMsk and an instance-ID mask IdMsk are output, assigning a category label and an instance ID to each pixel. Specifically, SemMsk and IdMsk are first initialized to zero. The prediction results are then sorted in descending order of confidence score, and the sorted predicted masks are filled into SemMsk and IdMsk. Results with confidence scores below the first set value (thrcls) are discarded, and overlapping portions with lower confidence (above the first set value but below a second set value) are deleted, so as to produce a panoptic result without overlap. Finally, the category labels and instance IDs are combined to obtain the instance segmentation result. Here, to constrain the output categories and masks, a loss function is added to the instance segmentation module:
This loss combines the losses of the three branches: the category branch uses the Focal Loss, the mask branch uses a cross-entropy loss, and the confidence branch uses a log-likelihood function.
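For illustration, the following sketch shows the mask fusion step described above in simplified form: predictions are sorted by confidence, results below the first set value are discarded, and higher-confidence masks claim pixels first so that SemMsk and IdMsk contain no overlap (the second threshold for partially overlapping masks is omitted). The threshold value and toy inputs are assumptions.

    import torch

    def fuse_predictions(masks, classes, scores, thr_cls: float = 0.5):
        """masks: (K, H, W) binary masks; classes: (K,) labels; scores: (K,) confidences."""
        h, w = masks.shape[1:]
        sem_msk = torch.zeros(h, w, dtype=torch.long)    # SemMsk, initialized to zero
        id_msk = torch.zeros(h, w, dtype=torch.long)     # IdMsk, initialized to zero
        order = torch.argsort(scores, descending=True)   # sort by descending confidence score
        for inst_id, k in enumerate(order.tolist(), start=1):
            if scores[k] < thr_cls:                      # discard results below the first set value
                break
            free = (id_msk == 0) & masks[k].bool()       # keep only pixels not already claimed
            sem_msk[free] = int(classes[k])
            id_msk[free] = inst_id
        return sem_msk, id_msk

    sem, ids = fuse_predictions((torch.rand(3, 32, 32) > 0.5).float(),
                                torch.tensor([1, 2, 1]), torch.tensor([0.9, 0.7, 0.3]))
    print(sem.shape, ids.unique())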
In the above instance segmentation method for unmanned urban scenes, a single-stage full space-time offset Transformer is used for feature extraction to obtain the instance identity features (instance identities) of each frame, a data association module for instance identity changes performs data association of the instance identity features of two adjacent frames, and the spatio-temporal dependency of the instances in the images is deeply mined based on the similarity of the images in the video. The method uses a panoptic segmentation technique based on the full space-time offset Transformer, which can effectively model long-term dependencies and historical trajectories; the offset attention mechanism alleviates the high complexity of full space-time attention, improves the running speed, accelerates model convergence and reduces the amount of computation. The data association module for instance identity changes can effectively recognize instance identity changes and quickly adapt to the complex environment of unmanned urban scenes.
Embodiment 2
In this embodiment, an instance segmentation system for unmanned urban scenes is disclosed, comprising:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
Embodiment 3
In this embodiment, an electronic device is disclosed, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the instance segmentation method for unmanned urban scenes disclosed in embodiment 1 are performed.
Embodiment 4
In this embodiment, a computer-readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned urban scenes disclosed in embodiment 1.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which are to be covered by the claims.

Claims (8)

1. An instance segmentation method for unmanned-driving urban scenes, characterized by comprising:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result;
wherein the encoder of the full space-time offset Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block being the input of the next encoding block, and each encoding block comprises a multi-head offset attention module, a regularization layer, an FFN and a regularization layer which are connected in sequence; the decoder of the full space-time offset Transformer encoder-decoder comprises a plurality of decoding blocks, and each decoding block comprises a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN and a regularization layer which are connected in sequence; the output of the encoder plus the space-time position encoding result forms part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the instance identity features of each frame;
the multi-head offset attention module comprises a plurality of offset attention modules, each of which decomposes its input into three vectors: a query vector Q, a key vector K and a value vector V;
the offset attention module LocalAttention is expressed as:
LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,
wherein P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function;
and the outputs of the plurality of parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
2. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein multi-scale feature extraction is performed on the video through a backbone network to obtain a first feature map sequence, a second feature map sequence and a third feature map sequence;
the first feature map sequence is upsampled and then spliced with the second feature map sequence to obtain a fourth feature map sequence;
the fourth feature map sequence is upsampled and then spliced with the third feature map sequence to obtain a fifth feature map sequence;
and the fifth feature map sequence is compressed into one dimension to obtain the original pixel-level feature sequence.
3. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein the instance identity features of two adjacent frames are combined into a feature vector, the feature vector is compressed to obtain an association matrix M, the association matrix M is summed row by row to obtain an N x 1 sum vector, and the instance identity features whose entries in the sum vector are greater than 1 are retained to obtain the screened instance identity features.
4. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein an extra column and an extra row are respectively added to the association matrix to obtain a matrix M_1 and a matrix M_2; a Softmax operation is performed on M_1 in units of rows to obtain a probability matrix A_1, and a Softmax operation is performed on M_2 in units of columns to obtain a probability matrix A_2; A_1 and A_2 are respectively compared with the true matrix L_{t-n,t} to obtain the matching loss, wherein L_{t-n,t} is a binary data association matrix representing the correspondence between the instance objects detected in the instance identity features of the (t-n)-th frame and the t-th frame.
5. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are fused to obtain a prediction result of each frame, the prediction result comprising a predicted mask, a predicted category and a confidence score, and the prediction results with confidence scores higher than a first set value are selected as the instance segmentation result.
6. An instance segmentation system for unmanned-driving urban scenes, characterized by comprising:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result;
the encoder of the full-space-time migration Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block is the input of the next encoding block, each encoding block comprises a multi-head migration attention module, a regularization layer, an FFN (fringe field noise) and a regularization layer which are sequentially connected, the decoder of the full-space-time migration Transformer encoder-decoder comprises a plurality of decoding blocks, each decoding block comprises a multi-head migration attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN (fringe field noise) and a regularization layer which are sequentially connected, the output of the encoder and the space-time position coding result are added to serve as a part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the example identity characteristics of each frame;
the multi-headed offset attention module includes a plurality of offset attention modules, each of which decomposes the input into three vectors: query vector Q, key vector K and value vector V;
the offset attention module LocalAttention is expressed as:
LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,
wherein P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function;
and the outputs of the plurality of parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
7. An electronic device, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the instance segmentation method for unmanned-driving urban scenes of any one of claims 1-5 are performed.
8. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes of any one of claims 1-5.
CN202211098488.7A 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene Active CN115171029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211098488.7A CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211098488.7A CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Publications (2)

Publication Number Publication Date
CN115171029A CN115171029A (en) 2022-10-11
CN115171029B (en) 2022-12-30

Family

ID=83482387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211098488.7A Active CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Country Status (1)

Country Link
CN (1) CN115171029B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893933B (en) * 2024-03-14 2024-05-24 国网上海市电力公司 Unmanned inspection fault detection method and system for power transmission and transformation equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5915044A (en) * 1995-09-29 1999-06-22 Intel Corporation Encoding video images using foreground/background segmentation
WO2021136528A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Instance segmentation method and apparatus
CN113177940A (en) * 2021-05-26 2021-07-27 复旦大学附属中山医院 Gastroscope video part identification network structure based on Transformer
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114049362A (en) * 2021-11-09 2022-02-15 中国石油大学(华东) Transform-based point cloud instance segmentation method
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114842394A (en) * 2022-05-17 2022-08-02 西安邮电大学 Swin transform-based automatic identification method for surgical video flow
CN114898243A (en) * 2022-03-23 2022-08-12 超级视线科技有限公司 Traffic scene analysis method and device based on video stream
CN114998592A (en) * 2022-06-18 2022-09-02 脸萌有限公司 Method, apparatus, device and storage medium for instance partitioning
CN114998815A (en) * 2022-08-04 2022-09-02 江苏三棱智慧物联发展股份有限公司 Traffic vehicle identification tracking method and system based on video analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184780A (en) * 2020-10-13 2021-01-05 武汉斌果科技有限公司 Moving object instance segmentation method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5915044A (en) * 1995-09-29 1999-06-22 Intel Corporation Encoding video images using foreground/background segmentation
WO2021136528A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Instance segmentation method and apparatus
CN113177940A (en) * 2021-05-26 2021-07-27 复旦大学附属中山医院 Gastroscope video part identification network structure based on Transformer
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114049362A (en) * 2021-11-09 2022-02-15 中国石油大学(华东) Transform-based point cloud instance segmentation method
CN114898243A (en) * 2022-03-23 2022-08-12 超级视线科技有限公司 Traffic scene analysis method and device based on video stream
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114842394A (en) * 2022-05-17 2022-08-02 西安邮电大学 Swin transform-based automatic identification method for surgical video flow
CN114998592A (en) * 2022-06-18 2022-09-02 脸萌有限公司 Method, apparatus, device and storage medium for instance partitioning
CN114998815A (en) * 2022-08-04 2022-09-02 江苏三棱智慧物联发展股份有限公司 Traffic vehicle identification tracking method and system based on video analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dynamic Convolution for 3D Point Cloud Instance Segmentation; Tong He et al.; arXiv:2107.08392v2; 2022-08-14; pp. 1-15 *
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; Ze Liu et al.; Computer Vision Foundation; 2021; pp. 10012-10022 *
Multi-scale Transformer for 3D object detection in LiDAR point clouds; 孙刘杰 et al.; Computer Engineering and Applications (《计算机工程与应用》); 2021-11-17; pp. 1-14 *

Also Published As

Publication number Publication date
CN115171029A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110399850B (en) Continuous sign language recognition method based on deep neural network
WO2022083335A1 (en) Self-attention mechanism-based behavior recognition method
CN108509880A (en) A kind of video personage behavior method for recognizing semantics
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN116309725A (en) Multi-target tracking method based on multi-scale deformable attention mechanism
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN115171029B (en) Unmanned-driving-based method and system for segmenting instances in urban scene
CN111696136A (en) Target tracking method based on coding and decoding structure
CN117171582A (en) Vehicle track prediction method and system based on space-time attention mechanism
CN116863384A (en) CNN-Transfomer-based self-supervision video segmentation method and system
CN117315293A (en) Transformer-based space-time context target tracking method and system
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN114693577A (en) Infrared polarization image fusion method based on Transformer
CN117935088A (en) Unmanned aerial vehicle image target detection method, system and storage medium based on full-scale feature perception and feature reconstruction
CN117876679A (en) Remote sensing image scene segmentation method based on convolutional neural network
CN116994264A (en) Text recognition method, chip and terminal
Hu et al. Lightweight asymmetric dilation network for real-time semantic segmentation
CN115761229A (en) Image semantic segmentation method based on multiple classifiers
Park et al. Rainunet for super-resolution rain movie prediction under spatio-temporal shifts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant