CN115171029B - Unmanned-driving-based method and system for segmenting instances in urban scene - Google Patents

Unmanned-driving-based method and system for segmenting instances in urban scene

Info

Publication number
CN115171029B
CN115171029B (application CN202211098488.7A)
Authority
CN
China
Prior art keywords
offset
space
attention
frame
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211098488.7A
Other languages
Chinese (zh)
Other versions
CN115171029A (en)
Inventor
徐龙生
孙振行
庞世玺
杨继冲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Kailin Environmental Protection Equipment Co ltd
Original Assignee
Shandong Kailin Environmental Protection Equipment Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Kailin Environmental Protection Equipment Co ltd filed Critical Shandong Kailin Environmental Protection Equipment Co ltd
Priority to CN202211098488.7A priority Critical patent/CN115171029B/en
Publication of CN115171029A publication Critical patent/CN115171029A/en
Application granted granted Critical
Publication of CN115171029B publication Critical patent/CN115171029B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses an instance segmentation method and system for urban scenes in unmanned driving, belonging to the technical field of video understanding and analysis. The method comprises the following steps: acquiring an original pixel-level feature sequence from a scene video; performing space-time position encoding on the original pixel-level feature sequence; obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder; calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix; and inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and obtaining an instance segmentation result from the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features. The accuracy of instance segmentation in urban scenes is thereby improved.

Description

Unmanned-driving-based method and system for segmenting instances in urban scene
Technical Field
The invention relates to the technical field of video understanding and analysis, in particular to an instance segmentation method and system for urban scenes based on unmanned driving.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Automatic driving is mainly realized by acquiring a video of the urban scene ahead, analyzing the video, identifying and segmenting the instances in the scene, and then driving automatically according to the instance segmentation results. Most existing instance segmentation methods are based on the Mask-RCNN framework, in which the target appearance and motion information used for data matching increases the computational cost and affects the real-time performance of segmentation. Moreover, in an unmanned urban scene, the people and vehicles on the road undergo severe instance identity changes for the following reasons: (1) instances disappear and reappear due to occlusion; (2) instances leave the scene; (3) new instances enter the scene. All of these lead to inaccurate instance segmentation results.
Disclosure of Invention
In order to solve the above problems, the invention provides an instance segmentation method and system for unmanned urban scenes. A single-stage full space-time offset Transformer is used for feature extraction to obtain instance candidates (instance proxies), and a data association module designed for instance identity changes is then used for data association, so that the accuracy of instance segmentation in urban scenes is improved.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, an instance segmentation method for unmanned urban scenes is provided, including:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
In a second aspect, an instance segmentation system for unmanned urban scenes is provided, including:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
In a third aspect, an electronic device is provided, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the above instance segmentation method for unmanned urban scenes are performed.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the above instance segmentation method for unmanned urban scenes.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses a panoptic segmentation technique based on the full space-time offset Transformer, which can effectively model long-term dependencies and historical trajectories. The offset attention mechanism alleviates the high complexity of full space-time attention, improves the running speed, accelerates model convergence and reduces the amount of computation, and the data association module for instance identity changes can effectively recognize instance identity changes, so the method quickly adapts to the complex environment of unmanned urban scenes.
2. After the instance identity features are obtained through the Transformer encoder-decoder, the association matrix of the identity features of two adjacent frames is calculated and the instance identity features are screened according to the association matrix, which deeply mines the spatio-temporal dependency of the instances in the images; the instances are then segmented according to the screened instance identity features, improving the accuracy of instance segmentation.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method disclosed in embodiment 1;
FIG. 2 is an illustration of the data association from frame 1 to frame 30;
FIG. 3 is a block diagram of the method disclosed in embodiment 1.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment 1
In this embodiment, an instance segmentation method for unmanned urban scenes is disclosed. As shown in figs. 1 and 3, the method includes:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
Specifically, the urban scene video is divided into a video frame sequence.
The feature extraction network comprises a backbone network ResNet101. Multi-scale feature extraction is performed on the video frame sequence through the backbone network ResNet101 to obtain a first feature map sequence F3, a second feature map sequence F4 and a third feature map sequence F5. Preferably, the resolutions of F3, F4 and F5 relative to the input video frames are 1/32, 1/16 and 1/8, respectively, and the number of channels of each is 256.
The first feature map sequence F3 is upsampled and then spliced with the second feature map sequence F4 to obtain a fourth feature map sequence F6. Preferably, F3 is upsampled by a factor of 2, and the number of channels of F6 is 512.
The fourth feature map sequence F6 is upsampled and then spliced with the third feature map sequence F5 to obtain a fifth feature map sequence F8. Preferably, F6 is upsampled by a factor of 2 and its number of channels is reduced to 256 before splicing with F5, so that the number of channels of F8 becomes 512.
The fifth feature map sequence F8 is convolved and then compressed into one dimension to obtain the original pixel-level feature sequence. Preferably, a 1x1 convolution layer reduces the number of channels of F8 to d = 256, and the time T, height H and width W dimensions of F8 are compressed into one dimension, i.e., the d x T x H x W feature map obtained in the previous step is reshaped into d x n, where n = T x H x W.
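For illustration, the following Python (PyTorch) sketch outlines the fusion and flattening steps described above. It assumes the backbone outputs have already been projected to 256 channels and that bilinear interpolation is used for upsampling; these details, the module name and the toy input sizes are not fixed by the text.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelFeatureFlattener(nn.Module):
        """Fuses F3 (1/32), F4 (1/16) and F5 (1/8) and flattens T, H, W into one axis."""

        def __init__(self, channels: int = 256):
            super().__init__()
            # F6 has 512 channels after concatenation and is reduced back to 256 before the next splice.
            self.reduce_f6 = nn.Conv2d(2 * channels, channels, kernel_size=1)
            # 1x1 convolution that reduces F8 (512 channels) to the embedding dimension d = 256.
            self.reduce_f8 = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, f3, f4, f5):
            # f3: (T, 256, H/32, W/32), f4: (T, 256, H/16, W/16), f5: (T, 256, H/8, W/8)
            up3 = F.interpolate(f3, scale_factor=2, mode="bilinear", align_corners=False)
            f6 = self.reduce_f6(torch.cat([up3, f4], dim=1))          # (T, 256, H/16, W/16)
            up6 = F.interpolate(f6, scale_factor=2, mode="bilinear", align_corners=False)
            f8 = self.reduce_f8(torch.cat([up6, f5], dim=1))          # (T, d=256, H/8, W/8)
            t, d, h, w = f8.shape
            # Compress T, H and W into a single axis: (d, n) with n = T*H*W.
            return f8.permute(1, 0, 2, 3).reshape(d, t * h * w)

    # Toy usage with a 4-frame clip at 256x256 input resolution.
    flattener = PixelFeatureFlattener()
    f3 = torch.randn(4, 256, 8, 8)
    f4 = torch.randn(4, 256, 16, 16)
    f5 = torch.randn(4, 256, 32, 32)
    print(flattener(f3, f4, f5).shape)  # torch.Size([256, 4096])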
Space-time position encoding is performed on the original pixel-level feature sequence using sine and cosine functions of different frequencies to obtain the space-time position encoding result:

PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)),

where pos is the position of the element in the sequence and i is the dimension index. d must be divisible by 3. The position encoding is supplied once at the Transformer encoder input and is also added at the attention layer of each encoding block.
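The following sketch illustrates one way to realize such a sinusoidal space-time encoding, assuming the d channels are split evenly across the temporal, height and width axes (hence d divisible by 3); the exact split and the function names are illustrative assumptions.

    import math
    import torch

    def sincos_1d(length: int, dim: int) -> torch.Tensor:
        """Standard 1-D sine/cosine encoding of shape (length, dim); dim must be even."""
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)      # (length, 1)
        i = torch.arange(0, dim, 2, dtype=torch.float32)                  # even channel indices
        div = torch.exp(-math.log(10000.0) * i / dim)                     # different frequencies
        pe = torch.zeros(length, dim)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def spatiotemporal_encoding(t: int, h: int, w: int, d: int) -> torch.Tensor:
        """Returns a (d, t*h*w) encoding laid out like the flattened pixel feature sequence."""
        assert d % 3 == 0, "d must be divisible by 3 so each axis gets d/3 channels"
        dt = dh = dw = d // 3
        pe_t = sincos_1d(t, dt).view(t, 1, 1, dt).expand(t, h, w, dt)
        pe_h = sincos_1d(h, dh).view(1, h, 1, dh).expand(t, h, w, dh)
        pe_w = sincos_1d(w, dw).view(1, 1, w, dw).expand(t, h, w, dw)
        pe = torch.cat([pe_t, pe_h, pe_w], dim=-1)                        # (t, h, w, d)
        return pe.reshape(t * h * w, d).transpose(0, 1)                   # (d, n)

    print(spatiotemporal_encoding(4, 32, 32, 252).shape)  # torch.Size([252, 4096])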
The original pixel-level feature sequence and the space-time position encoding result are input into the full space-time offset Transformer encoder-decoder to obtain the instance identity features of each frame.
The full space-time offset Transformer encoder-decoder introduces an offset attention mechanism and comprises three basic components: a multi-head offset attention module, a feed-forward neural network and a regularization layer. The multi-head offset attention module uses multiple offset attention modules in parallel, and each offset attention module decomposes its input into three vectors: a query vector Q, a key vector K and a value vector V. The aim is to obtain a weighted sum over the value vector, with weights computed from the local query vector and the local key vector, and to perform offset sampling in a decoupled manner, which reduces the high complexity of the full space-time attention mechanism, focuses attention on a local region of interest and yields more discriminative local features.
The offset attention module LocalAttention is expressed as:

LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,

where P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function. The outputs of the parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
The feed-forward neural network FFN consists of a 3-layer perceptron with ReLU activations, comprising hidden layers and a linear output layer. The regularization layer performs normalization in units of channels using Layer Normalization (LN).
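A simplified sketch of the offset attention module described above is given below: each query attends only to a small set of keys sampled at learned offsets around its own position in the flattened space-time sequence. The rounded 1-D offsets and all layer names are simplifications for illustration and do not reproduce the exact sampling scheme.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class OffsetAttention(nn.Module):
        """Each query attends to a few keys sampled at learned offsets around its position."""

        def __init__(self, dim: int = 256, num_points: int = 4):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.k_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            # Each query predicts `num_points` sampling offsets from its own content.
            self.offset_pred = nn.Linear(dim, num_points)
            self.scale = dim ** -0.5

        def forward(self, x):
            # x: (n, dim) flattened space-time features
            n, _ = x.shape
            q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
            # Reference position of each query plus learned offsets (rounded for simplicity;
            # rounding is not differentiable, so a real implementation would interpolate).
            ref = torch.arange(n, device=x.device, dtype=torch.float32).unsqueeze(1)  # (n, 1)
            idx = (ref + self.offset_pred(x)).round().clamp(0, n - 1).long()          # (n, P)
            k_loc, v_loc = k[idx], v[idx]                                             # (n, P, dim)
            # Attention weights over the P sampled locations only, not the full sequence.
            attn = F.softmax((q.unsqueeze(1) * k_loc).sum(-1) * self.scale, dim=-1)   # (n, P)
            return (attn.unsqueeze(-1) * v_loc).sum(dim=1)                            # (n, dim)

    out = OffsetAttention()(torch.randn(4096, 256))
    print(out.shape)  # torch.Size([4096, 256])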
The full space-time offset Transformer encoder consists of 8 encoding blocks, each of which consists of a multi-head offset attention module, a regularization layer, an FFN and a regularization layer. The full space-time offset Transformer decoder consists of 8 decoding blocks, each of which consists of a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN and a regularization layer. The encoder and the decoder are symmetrical in structure. The input of the encoder is the original pixel-level feature sequence, the output of each encoding block is the input of the next encoding block, and the output of the encoder plus the space-time position encoding result forms part of the input of each decoding block. The output of each decoding block is input into the next decoding block. The Transformer encoder-decoder directly outputs N different instance identity features per frame, where N is much larger than the number of all IDs in the panoptic result.
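The following sketch shows the structure of one encoding block (multi-head offset attention, regularization layer, FFN, regularization layer) stacked 8 times. nn.MultiheadAttention stands in for the multi-head offset attention purely to keep the sketch self-contained, and the residual connections and hidden width are assumptions.

    import torch
    import torch.nn as nn

    class EncodingBlock(nn.Module):
        """Multi-head (offset) attention -> LayerNorm -> FFN -> LayerNorm."""

        def __init__(self, dim: int = 256, heads: int = 8, hidden: int = 1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in attention
            self.norm1 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(                    # 3-layer perceptron with ReLU activations
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, dim),
            )
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x, pos):
            # The position encoding is added at the attention input of every block.
            q = k = x + pos
            x = self.norm1(x + self.attn(q, k, x)[0])
            return self.norm2(x + self.ffn(x))

    # An 8-block encoder over the flattened (n, d) pixel features.
    encoder = nn.ModuleList(EncodingBlock() for _ in range(8))
    x = torch.randn(1, 1024, 256)
    pos = torch.randn(1, 1024, 256)
    for block in encoder:
        x = block(x, pos)
    print(x.shape)  # torch.Size([1, 1024, 256])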
A data association module performs historical trajectory data association on the instance identity features of each frame: the association matrix of the instance identity features of two adjacent frames is calculated, and the instance identity features of the two adjacent frames are screened according to the association matrix.
Specifically, the data association module for instance identity changes performs historical trajectory data association on the instance identity features of each frame output by the full space-time offset Transformer, so that the learned instance identity features correspond one-to-one to the real instances. F_t and F_{t-n}, the instance identity features of the t-th frame and the (t-n)-th frame output by the Transformer, are combined into a feature vector Ψ(t-n, t) of size N x N x 1024. The feature vector Ψ(t-n, t) is then mapped by a compression network into association features of size N x N; after processing by the Softmax function, values greater than 0.5 are set to 1 and values less than 0.5 are set to 0, yielding the association matrix M.
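A hedged sketch of this pairwise association step follows. The pairwise concatenation used to build Ψ(t-n, t) and the channel widths of the compression network are assumptions for illustration; only the 1x1 convolutions (which keep adjacent cells from interacting), the Softmax and the 0.5 threshold follow the description above.

    import torch
    import torch.nn as nn

    class AssociationHead(nn.Module):
        """Builds the N x N association matrix M between two frames' identity features."""

        def __init__(self, feat_dim: int = 512, channels=(256, 64, 1)):
            super().__init__()
            layers, in_ch = [], 2 * feat_dim            # concatenated pair -> 1024 channels
            for out_ch in channels:
                layers += [nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.ReLU()]
                in_ch = out_ch
            layers.pop()                                # no ReLU after the final 1x1 convolution
            self.compress = nn.Sequential(*layers)      # 1x1 kernels: adjacent cells do not interact

        def forward(self, feat_prev, feat_cur):
            # feat_prev, feat_cur: (N, feat_dim) identity features of frames t-n and t
            n = feat_prev.shape[0]
            psi = torch.cat([feat_prev.unsqueeze(1).expand(n, n, -1),
                             feat_cur.unsqueeze(0).expand(n, n, -1)], dim=-1)     # (N, N, 1024)
            logits = self.compress(psi.permute(2, 0, 1).unsqueeze(0)).squeeze()   # (N, N)
            probs = torch.softmax(logits, dim=-1)       # Softmax over each row
            return (probs > 0.5).float()                # binarize at 0.5 to obtain M

    m = AssociationHead()(torch.randn(5, 512), torch.randn(5, 512))
    print(m.shape)  # torch.Size([5, 5])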
As shown in fig. 2, which illustrates the data association from frame 1 to frame 30, frames 1 and 30 contain at most 5 instances, i.e., N = 5. The column indices of the matrix represent the instances in frame 1, the row indices represent the instances in frame 30, and equal indices denote the same instance. A value of 1 indicates that the instance is present in both frame 1 and frame 30, otherwise the value is 0, and the X padding indicates an absent instance. The right-hand side of the figure indicates instances entering and leaving the video frames. For example, a 1 in the last row indicates that object 5 enters at frame 30, and a 1 in the last column indicates that instance 4 exists in frame 1 but has left by frame 30.
The compression network uses convolution kernels to progressively reduce the dimensions along the depth of the input tensor, without allowing adjacent elements of the feature map to interact. However, the association matrix M does not account for instance objects that enter or leave the video between the two input frames. To handle these objects, an extra column and an extra row are added to the association matrix M to form the matrices M_1 and M_2, respectively. The added column vector and row vector represent, respectively, the probability of an instance leaving the video and of an instance entering the video when the t-th frame is associated with the instances of the (t-n)-th frame. Next, a Softmax operation is performed on M_1 in units of rows to obtain the probability matrix A_1, which expresses in probabilistic form the association between the identity feature predictions of the different instances of the t-th frame and the (t-n)-th frame. A Softmax operation is then performed on M_2 in units of columns to obtain the probability matrix A_2, which gives the similarity probabilities corresponding to each column. Finally, A_1 and A_2 are compared with the true association matrix L_{t-n,t} between the objects in the video frames to obtain the matching loss.
Here, L_{t-n,t} denotes a binary data association matrix representing the correspondence between the instance objects detected in the instance identity features of the (t-n)-th frame and the t-th frame. For example, if instance object 1 in the (t-n)-th frame corresponds to the n-th instance object in the t-th frame, then the n-th element of the first row of L_{t-n,t} is 1.
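The following sketch shows how the extra column and row can be appended to form M_1 and M_2 and how the row-wise and column-wise Softmax operations yield A_1 and A_2; filling the added entries with zeros is an assumption, since the text does not state how they are initialized.

    import torch

    def association_probabilities(m_logits: torch.Tensor):
        """m_logits: (N, N) raw association scores between frames t-n and t."""
        n = m_logits.shape[0]
        leave_col = torch.zeros(n, 1)                   # probability column: instance leaves the video
        enter_row = torch.zeros(1, n)                   # probability row: instance newly enters
        m1 = torch.cat([m_logits, leave_col], dim=1)    # M1: (N, N+1)
        m2 = torch.cat([m_logits, enter_row], dim=0)    # M2: (N+1, N)
        a1 = torch.softmax(m1, dim=1)                   # row-wise Softmax -> A1
        a2 = torch.softmax(m2, dim=0)                   # column-wise Softmax -> A2
        return a1, a2

    a1, a2 = association_probabilities(torch.randn(5, 5))
    print(a1.shape, a2.shape)  # torch.Size([5, 6]) torch.Size([6, 5])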
Based on the above analysis, the association process can be supervised with a cycle-consistency constraint comprising a forward loss L_f and a reverse loss L_b. The forward loss ensures that instances are correctly associated from the (t-n)-th frame to the t-th frame, and the reverse loss ensures that instances are correctly associated from the t-th frame back to the (t-n)-th frame. At the same time, to suppress the association between non-maximum-similarity instances, a non-maximum loss L_a is added so that the true instance association probabilities are maximized. The final matching loss is the average of these three components, i.e. (L_f + L_b + L_a) / 3,
where L_1 and L_2 are the pruned matrices obtained from L_{t-n,t} by deleting the last row and the last column, respectively, L_3 is the pruned matrix obtained from L_{t-n,t} by deleting both the last row and the last column, ⊙ denotes the Hadamard product, and the matrices obtained from A_1 and A_2 by removing the last column and the last row, respectively, are used together with L_3; the expressions for L_f, L_b and L_a combine these pruned ground-truth matrices with the corresponding predicted probability matrices through the Hadamard product.
After the association matrix M of two adjacent frames is calculated, M is summed row by row to obtain an N x 1 sum vector, and the instance identity features whose entries in the sum vector are greater than 1 are retained to obtain the screened instance identity features. That is, the row indices whose values in the sum vector are greater than 1 are found, screening is performed according to the found row indices, and the instance identity features of the corresponding rows are retained as the screened instance identity features.
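A minimal sketch of this screening step, as described above:

    import torch

    def screen_identities(identities: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        """identities: (N, d) instance identity features of a frame; m: (N, N) binary matrix M."""
        row_sum = m.sum(dim=1)       # N x 1 sum vector, one entry per row
        keep = row_sum > 1           # row indices whose value in the sum vector is greater than 1
        return identities[keep]      # retained (screened) instance identity features

    kept = screen_identities(torch.randn(5, 256), torch.randint(0, 2, (5, 5)).float())
    print(kept.shape)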
The screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame are input into a self-attention module (self-attention) to obtain an initial attention map, and the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are spliced and fused to obtain the instance segmentation result.
Specifically, the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are fused to obtain the prediction result of each frame, which comprises the mask, category and confidence score of each ID; the prediction results with confidence scores higher than a first set value are selected as the instance segmentation result.
The initial attention map is spliced and fused with the original pixel-level feature sequence of the corresponding frame and the output of the Transformer encoder, and the prediction result of each frame is output through one 3D convolution and three parallel branches; the prediction result comprises the mask, category and confidence score of each ID. The first branch is a deformable convolution layer that outputs a mask m for each ID of the different frames; the second branch is a convolution layer and activation function that outputs the category c of each ID; the third branch is a convolution layer and activation function that outputs a confidence score s.
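A hedged sketch of the three parallel prediction branches follows. A regular 3x3x3 convolution stands in for the deformable convolution of the mask branch, and the channel sizes, the number of classes and what each branch consumes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class PredictionHeads(nn.Module):
        """One shared 3D convolution followed by mask, category and confidence branches."""

        def __init__(self, in_ch: int = 256, num_ids: int = 10, num_classes: int = 8):
            super().__init__()
            self.shared = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1)         # one 3D convolution
            self.mask_branch = nn.Conv3d(in_ch, num_ids, kernel_size=3, padding=1)  # stand-in for deformable conv
            self.cls_branch = nn.Linear(in_ch, num_classes)                         # category c per ID
            self.score_branch = nn.Linear(in_ch, 1)                                 # confidence s per ID

        def forward(self, fused, id_feats):
            # fused: (1, C, T, H, W) fused features; id_feats: (num_ids, C) screened identity features
            x = torch.relu(self.shared(fused))
            masks = self.mask_branch(x)                                      # (1, num_ids, T, H, W)
            classes = self.cls_branch(id_feats).softmax(dim=-1)              # (num_ids, num_classes)
            scores = torch.sigmoid(self.score_branch(id_feats)).squeeze(-1)  # (num_ids,)
            return masks, classes, scores

    heads = PredictionHeads()
    masks, classes, scores = heads(torch.randn(1, 256, 4, 32, 32), torch.randn(10, 256))
    print(masks.shape, classes.shape, scores.shape)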
Given the predicted category c, confidence score s and predicted mask m, a semantic mask SemMsk and an instance-ID mask IdMsk are output, assigning a category label and an instance ID to each pixel. Specifically, SemMsk and IdMsk are first initialized to zero. The prediction results are then sorted in descending order of confidence score, and the sorted predicted masks are filled into SemMsk and IdMsk. Results with confidence scores below the first set value (thrcls) are discarded, and overlapping portions with lower confidence (above the first set value but below a second set value) are deleted, so as to produce a panoptic result without overlap. Finally, the category labels and instance IDs are combined to obtain the instance segmentation result. Here, to constrain the output categories and masks, a loss function is added to the instance segmentation module:
This loss combines the losses of the three branches: the category branch uses the Focal Loss, the mask branch uses a cross-entropy loss, and the confidence branch uses a log-likelihood function.
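For illustration, the following sketch shows the mask fusion step described above in simplified form: predictions are sorted by confidence, results below the first set value are discarded, and higher-confidence masks claim pixels first so that SemMsk and IdMsk contain no overlap (the second threshold for partially overlapping masks is omitted). The threshold value and toy inputs are assumptions.

    import torch

    def fuse_predictions(masks, classes, scores, thr_cls: float = 0.5):
        """masks: (K, H, W) binary masks; classes: (K,) labels; scores: (K,) confidences."""
        h, w = masks.shape[1:]
        sem_msk = torch.zeros(h, w, dtype=torch.long)    # SemMsk, initialized to zero
        id_msk = torch.zeros(h, w, dtype=torch.long)     # IdMsk, initialized to zero
        order = torch.argsort(scores, descending=True)   # sort by descending confidence score
        for inst_id, k in enumerate(order.tolist(), start=1):
            if scores[k] < thr_cls:                      # discard results below the first set value
                break
            free = (id_msk == 0) & masks[k].bool()       # keep only pixels not already claimed
            sem_msk[free] = int(classes[k])
            id_msk[free] = inst_id
        return sem_msk, id_msk

    sem, ids = fuse_predictions((torch.rand(3, 32, 32) > 0.5).float(),
                                torch.tensor([1, 2, 1]), torch.tensor([0.9, 0.7, 0.3]))
    print(sem.shape, ids.unique())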
In the above instance segmentation method for unmanned urban scenes, a single-stage full space-time offset Transformer is used for feature extraction to obtain the instance identity features (instance identities) of each frame, a data association module for instance identity changes performs data association of the instance identity features of two adjacent frames, and the spatio-temporal dependency of the instances in the images is deeply mined based on the similarity of the images in the video. The method uses a panoptic segmentation technique based on the full space-time offset Transformer, which can effectively model long-term dependencies and historical trajectories; the offset attention mechanism alleviates the high complexity of full space-time attention, improves the running speed, accelerates model convergence and reduces the amount of computation. The data association module for instance identity changes can effectively recognize instance identity changes and quickly adapt to the complex environment of unmanned urban scenes.
Embodiment 2
In this embodiment, an instance segmentation system for unmanned urban scenes is disclosed, comprising:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result.
Embodiment 3
In this embodiment, an electronic device is disclosed, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the instance segmentation method for unmanned urban scenes disclosed in embodiment 1 are performed.
Embodiment 4
In this embodiment, a computer-readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned urban scenes disclosed in embodiment 1.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which are to be covered by the claims.

Claims (8)

1. An instance segmentation method for unmanned-driving urban scenes, characterized by comprising:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result;
wherein the encoder of the full space-time offset Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block being the input of the next encoding block, and each encoding block comprises a multi-head offset attention module, a regularization layer, an FFN and a regularization layer which are connected in sequence; the decoder of the full space-time offset Transformer encoder-decoder comprises a plurality of decoding blocks, and each decoding block comprises a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN and a regularization layer which are connected in sequence; the output of the encoder plus the space-time position encoding result forms part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the instance identity features of each frame;
the multi-head offset attention module comprises a plurality of offset attention modules, each of which decomposes its input into three vectors: a query vector Q, a key vector K and a value vector V;
the offset attention module LocalAttention is expressed as:
LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,
wherein P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function;
and the outputs of the plurality of parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
2. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein multi-scale feature extraction is performed on the video through a backbone network to obtain a first feature map sequence, a second feature map sequence and a third feature map sequence;
the first feature map sequence is upsampled and then spliced with the second feature map sequence to obtain a fourth feature map sequence;
the fourth feature map sequence is upsampled and then spliced with the third feature map sequence to obtain a fifth feature map sequence;
and the fifth feature map sequence is compressed into one dimension to obtain the original pixel-level feature sequence.
3. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein the instance identity features of two adjacent frames are combined into a feature vector, the feature vector is compressed to obtain an association matrix M, the association matrix M is summed row by row to obtain an N x 1 sum vector, and the instance identity features whose entries in the sum vector are greater than 1 are retained to obtain the screened instance identity features.
4. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein an extra column and an extra row are respectively added to the association matrix to obtain a matrix M_1 and a matrix M_2; a Softmax operation is performed on M_1 in units of rows to obtain a probability matrix A_1, and a Softmax operation is performed on M_2 in units of columns to obtain a probability matrix A_2; A_1 and A_2 are respectively compared with the true matrix L_{t-n,t} to obtain the matching loss, wherein L_{t-n,t} is a binary data association matrix representing the correspondence between the instance objects detected in the instance identity features of the (t-n)-th frame and the t-th frame.
5. The instance segmentation method for unmanned-driving urban scenes according to claim 1, wherein the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are fused to obtain a prediction result of each frame, the prediction result comprising a predicted mask, a predicted category and a confidence score, and the prediction results with confidence scores higher than a first set value are selected as the instance segmentation result.
6. An instance segmentation system for unmanned-driving urban scenes, characterized by comprising:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video and a feature extraction network;
a space-time position encoding module, used for performing space-time position encoding on the original pixel-level feature sequence to obtain a space-time position encoding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position encoding result and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result;
the encoder of the full-space-time migration Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block is the input of the next encoding block, each encoding block comprises a multi-head migration attention module, a regularization layer, an FFN (fringe field noise) and a regularization layer which are sequentially connected, the decoder of the full-space-time migration Transformer encoder-decoder comprises a plurality of decoding blocks, each decoding block comprises a multi-head migration attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, an FFN (fringe field noise) and a regularization layer which are sequentially connected, the output of the encoder and the space-time position coding result are added to serve as a part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the example identity characteristics of each frame;
the multi-headed offset attention module includes a plurality of offset attention modules, each of which decomposes the input into three vectors: query vector Q, key vector K and value vector V;
the offset attention module LocalAttention is expressed as:
LocalAttention(Q, K, V) = Softmax((P_Q + Δp) K_{P_Q}^T / sqrt(d_{K_{P_Q}})) V,
wherein P_Q is a local sampling region from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_{K_{P_Q}} is the dimension of K_{P_Q}, and Softmax() is the activation function;
and the outputs of the plurality of parallel offset attention modules are spliced to obtain the output of the multi-head offset attention module.
7. An electronic device, comprising a memory, a processor and computer instructions stored in the memory and executed on the processor, wherein, when the computer instructions are executed by the processor, the steps of the instance segmentation method for unmanned-driving urban scenes of any one of claims 1-5 are performed.
8. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes of any one of claims 1-5.
CN202211098488.7A 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene Active CN115171029B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211098488.7A CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211098488.7A CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Publications (2)

Publication Number Publication Date
CN115171029A CN115171029A (en) 2022-10-11
CN115171029B (en) 2022-12-30

Family

ID=83482387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211098488.7A Active CN115171029B (en) 2022-09-09 2022-09-09 Unmanned-driving-based method and system for segmenting instances in urban scene

Country Status (1)

Country Link
CN (1) CN115171029B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117893933B (en) * 2024-03-14 2024-05-24 国网上海市电力公司 Unmanned inspection fault detection method and system for power transmission and transformation equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5915044A (en) * 1995-09-29 1999-06-22 Intel Corporation Encoding video images using foreground/background segmentation
WO2021136528A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Instance segmentation method and apparatus
CN113177940A (en) * 2021-05-26 2021-07-27 复旦大学附属中山医院 Gastroscope video part identification network structure based on Transformer
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114049362A (en) * 2021-11-09 2022-02-15 中国石油大学(华东) Transform-based point cloud instance segmentation method
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114842394A (en) * 2022-05-17 2022-08-02 西安邮电大学 Swin transform-based automatic identification method for surgical video flow
CN114898243A (en) * 2022-03-23 2022-08-12 超级视线科技有限公司 Traffic scene analysis method and device based on video stream
CN114998592A (en) * 2022-06-18 2022-09-02 脸萌有限公司 Method, apparatus, device and storage medium for instance partitioning
CN114998815A (en) * 2022-08-04 2022-09-02 江苏三棱智慧物联发展股份有限公司 Traffic vehicle identification tracking method and system based on video analysis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112184780A (en) * 2020-10-13 2021-01-05 武汉斌果科技有限公司 Moving object instance segmentation method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5915044A (en) * 1995-09-29 1999-06-22 Intel Corporation Encoding video images using foreground/background segmentation
WO2021136528A1 (en) * 2019-12-31 2021-07-08 华为技术有限公司 Instance segmentation method and apparatus
CN113177940A (en) * 2021-05-26 2021-07-27 复旦大学附属中山医院 Gastroscope video part identification network structure based on Transformer
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer
CN114049362A (en) * 2021-11-09 2022-02-15 中国石油大学(华东) Transform-based point cloud instance segmentation method
CN114898243A (en) * 2022-03-23 2022-08-12 超级视线科技有限公司 Traffic scene analysis method and device based on video stream
CN114743020A (en) * 2022-04-02 2022-07-12 华南理工大学 Food identification method combining tag semantic embedding and attention fusion
CN114842394A (en) * 2022-05-17 2022-08-02 西安邮电大学 Swin transform-based automatic identification method for surgical video flow
CN114998592A (en) * 2022-06-18 2022-09-02 脸萌有限公司 Method, apparatus, device and storage medium for instance partitioning
CN114998815A (en) * 2022-08-04 2022-09-02 江苏三棱智慧物联发展股份有限公司 Traffic vehicle identification tracking method and system based on video analysis

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Dynamic Convolution for 3D Point Cloud Instance Segmentation; Tong He et al.; arXiv:2107.08392v2; 2022-08-14; pp. 1-15 *
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; Ze Liu et al.; Computer Vision Foundation; 2021; pp. 10012-10022 *
Multi-scale Transformer for 3D object detection in LiDAR point clouds; 孙刘杰 et al.; Computer Engineering and Applications (《计算机工程与应用》); 2021-11-17; pp. 1-14 *

Also Published As

Publication number Publication date
CN115171029A (en) 2022-10-11

Similar Documents

Publication Publication Date Title
CN111210443B (en) Deformable convolution mixing task cascading semantic segmentation method based on embedding balance
CN110399850B (en) Continuous sign language recognition method based on deep neural network
WO2022083335A1 (en) Self-attention mechanism-based behavior recognition method
CN108509880A (en) A kind of video personage behavior method for recognizing semantics
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
CN111369565A (en) Digital pathological image segmentation and classification method based on graph convolution network
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112801068B (en) Video multi-target tracking and segmenting system and method
CN116309725A (en) Multi-target tracking method based on multi-scale deformable attention mechanism
CN114360067A (en) Dynamic gesture recognition method based on deep learning
CN116469100A (en) Dual-band image semantic segmentation method based on Transformer
CN115171029B (en) Unmanned-driving-based method and system for segmenting instances in urban scene
CN111696136A (en) Target tracking method based on coding and decoding structure
CN117171582A (en) Vehicle track prediction method and system based on space-time attention mechanism
CN116863384A (en) CNN-Transfomer-based self-supervision video segmentation method and system
CN117315293A (en) Transformer-based space-time context target tracking method and system
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN114693577A (en) Infrared polarization image fusion method based on Transformer
CN117935088A (en) Unmanned aerial vehicle image target detection method, system and storage medium based on full-scale feature perception and feature reconstruction
CN117876679A (en) Remote sensing image scene segmentation method based on convolutional neural network
CN116994264A (en) Text recognition method, chip and terminal
Hu et al. Lightweight asymmetric dilation network for real-time semantic segmentation
CN115761229A (en) Image semantic segmentation method based on multiple classifiers
Park et al. Rainunet for super-resolution rain movie prediction under spatio-temporal shifts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant