CN115171029B - Unmanned-driving-based method and system for segmenting instances in urban scene - Google Patents
- Publication number
- CN115171029B CN115171029B CN202211098488.7A CN202211098488A CN115171029B CN 115171029 B CN115171029 B CN 115171029B CN 202211098488 A CN202211098488 A CN 202211098488A CN 115171029 B CN115171029 B CN 115171029B
- Authority
- CN
- China
- Prior art keywords
- offset
- space
- attention
- frame
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 29
- 230000011218 segmentation Effects 0.000 claims abstract description 33
- 238000013507 mapping Methods 0.000 claims abstract description 20
- 238000012216 screening Methods 0.000 claims abstract description 12
- 239000013598 vector Substances 0.000 claims description 28
- 239000011159 matrix material Substances 0.000 claims description 26
- 238000000605 extraction Methods 0.000 claims description 14
- 238000010586 diagram Methods 0.000 claims description 12
- 230000006870 function Effects 0.000 claims description 10
- 230000005012 migration Effects 0.000 claims description 10
- 238000013508 migration Methods 0.000 claims description 10
- 238000005070 sampling Methods 0.000 claims description 9
- 238000000638 solvent extraction Methods 0.000 claims description 7
- 230000004913 activation Effects 0.000 claims description 6
- 230000000717 retained effect Effects 0.000 claims 1
- 238000004458 analytical method Methods 0.000 abstract description 3
- 230000008859 change Effects 0.000 description 7
- 230000007246 mechanism Effects 0.000 description 5
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000006835 compression Effects 0.000 description 2
- 238000007906 compression Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000007774 longterm Effects 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000012163 sequencing technique Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Image Analysis (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses an instance segmentation method and system for unmanned-driving urban scenes, belonging to the technical field of video understanding and analysis, and comprising the following steps: acquiring an original pixel-level feature sequence from a scene video; performing space-time position coding on the original pixel-level feature sequence; obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder; calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix; and inputting the screened instance identity features of each frame, together with the output of the Transformer decoder for the corresponding frame, into a self-attention module to obtain an initial attention map, and obtaining an instance segmentation result from the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features. The accuracy of instance segmentation in urban scenes is thereby improved.
Description
Technical Field
The invention relates to the technical field of video understanding and analysis, and in particular to an instance segmentation method and system for unmanned-driving urban scenes.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Automatic driving is mainly realized by acquiring video of the urban scene ahead, analyzing it, identifying and segmenting the instances in the scene, and then driving according to the instance segmentation results. Most existing instance segmentation methods are based on the Mask-RCNN framework, in which matching target appearance and motion information across frames increases the computational cost and degrades the real-time performance of segmentation. Moreover, in an unmanned-driving urban scene, the people and vehicles on the road undergo extremely severe instance identity changes, for the following reasons: (1) instances disappear and reappear due to occlusion; (2) instances leave the scene; (3) new instances enter the scene. All of these lead to inaccurate instance segmentation results.
Disclosure of Invention
In order to solve these problems, the invention provides an instance segmentation method and system for unmanned-driving urban scenes. A single-stage full space-time offset Transformer is used for feature extraction to obtain instance candidates (instance proxies), and a data association module designed for instance identity changes is then used for data association, thereby improving the accuracy of instance segmentation in urban scenes.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, an instance segmentation method for unmanned-driving urban scenes is provided, including:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video using a feature extraction network;
performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features to obtain an instance segmentation result.
In a second aspect, an instance segmentation system for unmanned-driving urban scenes is provided, including:
a video acquisition module for acquiring an urban scene video;
a feature extraction module for acquiring an original pixel-level feature sequence from the urban scene video using a feature extraction network;
a space-time position coding module for performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
an instance prediction module for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
and an instance segmentation result acquisition module for inputting the screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features to obtain an instance segmentation result.
In a third aspect, an electronic device is provided, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes.
In a fourth aspect, a computer-readable storage medium is provided for storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes.
Compared with the prior art, the invention has the following beneficial effects:
1. The method uses a panoptic segmentation technique based on the full space-time offset Transformer, which can effectively model long-term dependencies and historical trajectories. The offset attention mechanism alleviates the high complexity of full space-time attention, increases processing speed, accelerates model convergence, and reduces the amount of computation. The data association module for instance identity changes can effectively recognize identity changes, so the method adapts quickly to the complex environment of unmanned-driving urban scenes.
2. After the instance identity features are obtained by the Transformer encoder-decoder, the association matrix of the identity features of two adjacent frames is calculated and the instance identity features are screened according to it, deeply mining the spatio-temporal dependencies of the instances in the images; segmenting instances from the screened identity features therefore improves the accuracy of instance segmentation.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flow chart of the method disclosed in Embodiment 1;
FIG. 2 is a data association illustration from frame 1 to frame 30;
FIG. 3 is a block diagram of the method disclosed in Embodiment 1.
Detailed Description
The invention is further described with reference to the following figures and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Embodiment 1
In this embodiment, an instance segmentation method for unmanned-driving urban scenes is disclosed. As shown in FIG. 1 and FIG. 3, the method includes:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video using a feature extraction network;
performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features to obtain an instance segmentation result.
Specifically, the urban scene video is divided into a sequence of video frames.
The feature extraction network comprises the backbone network ResNet101, which performs multi-scale feature extraction on the video frame sequence to obtain a first feature map sequence F3, a second feature map sequence F4, and a third feature map sequence F5. Preferably, F3, F4 and F5 are at 1/32, 1/16 and 1/8 of the input video frame resolution, respectively, and each has 256 channels.
The first feature map sequence F3 is upsampled and spliced with the second feature map sequence F4 to obtain a fourth feature map sequence F6. Preferably, F3 is upsampled by a factor of 2, and F6 has 512 channels.
The fourth feature map sequence F6 is upsampled and spliced with the third feature map sequence F5 to obtain a fifth feature map sequence F8. Preferably, F6 is upsampled by a factor of 2 and its channel count is reduced to 256 before splicing with F5, so that F8 has 512 channels.
The fifth feature map sequence F8 is convolved and then compressed into one dimension to obtain the original pixel-level feature sequence. Preferably, a 1x1 convolution layer reduces the channels of F8 to 256. The dimensions of F8, namely time T, height H and width W, are then compressed into one dimension: the d x T x H x W feature map obtained in the previous step is reshaped into d x n, where n = T x H x W.
Space-time position coding is performed on the original pixel-level feature sequence using sine and cosine functions of different frequencies, giving the space-time position coding result:
PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the position of the element in the sequence and i is the dimension index. d must be divisible by 3 so that the channels can be divided among the temporal, height, and width positions. The position code is passed once at the Transformer encoder input, and an attention layer is added on each encoding block.
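A minimal sketch of such a space-time position code, assuming the d channels are split into three equal chunks for the temporal, height, and width positions (the split itself is an assumption; the text only states that d must be divisible by 3):

```python
import numpy as np

def spacetime_pos_encoding(T, H, W, d):
    """Sinusoidal space-time position code of shape (d, T, H, W).

    Assumed layout: d is split into three d/3 chunks, one per axis
    (time, height, width), which is why d must be divisible by 3.
    """
    assert d % 3 == 0, "channel count must be divisible by 3"
    d3 = d // 3

    def encode(positions, dm):
        i = np.arange(dm)                              # dimension index
        freqs = 1.0 / (10000 ** (2 * (i // 2) / dm))   # per-pair frequency
        ang = positions[:, None] * freqs[None, :]
        # even channels get sin, odd channels get cos
        return np.where(i % 2 == 0, np.sin(ang), np.cos(ang))

    pe_t = encode(np.arange(T), d3)                    # (T, d3)
    pe_h = encode(np.arange(H), d3)                    # (H, d3)
    pe_w = encode(np.arange(W), d3)                    # (W, d3)

    out = np.zeros((d, T, H, W))
    out[:d3] = pe_t.T[:, :, None, None]                # temporal chunk
    out[d3:2 * d3] = pe_h.T[:, None, :, None]          # height chunk
    out[2 * d3:] = pe_w.T[:, None, None, :]            # width chunk
    return out

pe = spacetime_pos_encoding(2, 4, 4, 6)
```

Each position in the T x H x W grid thereby receives a unique d-dimensional code that can simply be added to the flattened feature tokens.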
The original pixel-level feature sequence and the space-time position coding result are input into the full space-time offset Transformer encoder-decoder to obtain the instance identity features of each frame.
The full space-time offset Transformer encoder-decoder introduces an offset attention mechanism and is built from three basic components: a multi-head offset attention module, a feed-forward network, and a regularization layer. The multi-head offset attention module uses several offset attention modules in parallel, each of which decomposes its input into three vectors: a query vector Q, a key vector K, and a value vector V. The goal is to compute a weighted sum over the value vectors, with weights calculated from the local query and key vectors; offset sampling is performed in a decoupled manner, which reduces the high complexity of the full space-time attention mechanism, focuses attention on a local region of interest, and yields more discriminative local features.
The offset attention module LocalAttention is expressed as:
LocalAttention(Q) = Softmax(Q * K_(PQ+Δp)^T) * V_(PQ+Δp)
where P_Q is the local sampling region of Q, Δp is the learned offset of the sampling points, K_(PQ+Δp) is the corresponding local key vector, V_(PQ+Δp) is the corresponding local value vector, and Softmax() is the activation function. The outputs of the parallel offset attention modules are spliced to give the output of the multi-head offset attention module.
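A single-head sketch of this idea, with integer sampling offsets standing in for the learned fractional offsets Δp (the real module would predict fractional offsets and interpolate). Each query attends to only k sampled positions, so the cost is O(n·k) rather than the O(n²) of full space-time attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def offset_attention(Q, K, V, offsets):
    """Single-head local offset attention (illustrative sketch).

    Q, K, V : (n, d) query/key/value vectors for n space-time positions.
    offsets : (n, k) integer index offsets per query -- a stand-in for
              the learned sampling-point offsets.
    """
    n, d = Q.shape
    idx = (np.arange(n)[:, None] + offsets) % n   # (n, k) sampled positions
    K_loc, V_loc = K[idx], V[idx]                 # (n, k, d) local keys/values
    scores = np.einsum('nd,nkd->nk', Q, K_loc) / np.sqrt(d)
    w = softmax(scores, axis=-1)                  # local attention weights
    return np.einsum('nk,nkd->nd', w, V_loc)      # weighted sum of values

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
Q, K, V = rng.normal(size=(3, n, d))
offsets = rng.integers(-2, 3, size=(n, k))        # small local offsets
out = offset_attention(Q, K, V, offsets)
```

A multi-head version would run several such heads in parallel, each with its own offsets, and splice their outputs as the text describes.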
The feed-forward network FFN is a 3-layer perceptron with ReLU activations in its hidden layers followed by a linear layer. The regularization layer uses Layer Normalization (LN), normalizing over the channel dimension.
The full space-time offset Transformer encoder consists of 8 encoding blocks, each composed of a multi-head offset attention module + regularization layer + FFN + regularization layer. The full space-time offset Transformer decoder consists of 8 decoding blocks, each composed of a multi-head offset attention module + regularization layer + a deformable multi-head attention layer + regularization layer + FFN + regularization layer. The encoder and decoder are symmetric in structure. The encoder input is the original pixel-level feature sequence, the output of each encoding block is the input of the next encoding block, and the sum of the encoder output and the space-time position coding result forms part of the input of every decoding block. The output of each decoding block is input into the next decoding block. The Transformer encoder-decoder directly outputs N distinct instance identity features per frame, where N is much larger than the number of all instance IDs in the panorama.
The data association module performs historical-trajectory data association on the instance identity features of each frame: the association matrix of the instance identity features of two adjacent frames is calculated, and the instance identity features of the two frames are screened according to it.
A data association module targeting instance identity changes performs historical-trajectory data association on the instance identity features of each frame output by the full space-time offset Transformer, so that the learned instance identity features correspond one-to-one to real instances. F_t and F_(t-n), the instance identity features of the t-th and (t-n)-th frames output by the Transformer, are combined into a feature tensor Ψ(t-n, t) of size N x N x 1024. The tensor Ψ(t-n, t) is then mapped by a compression network into association features of size N x N; after processing by the Softmax function, values greater than 0.5 are set to 1 and values less than 0.5 are set to 0, yielding the association matrix M.
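A sketch of forming Ψ(t−n, t) and binarizing it into M, with a single fixed projection vector standing in for the compression network (an assumption; a per-frame feature size of 512 is likewise assumed so that concatenated pairs are 1024-dimensional):

```python
import numpy as np

def association_matrix(F_prev, F_cur, w):
    """Binarised N x N association between two frames' identity features.

    F_prev, F_cur : (N, 512) instance identity features of frames t-n and t
                    (512 per frame is assumed, so each concatenated pair is
                    1024-dimensional as in Psi above).
    w             : (1024,) fixed projection standing in for the compression
                    network, mapping each pair feature to a scalar.
    """
    N, d = F_cur.shape
    # Psi[i, j] = concat(F_prev[i], F_cur[j])  ->  (N, N, 2d)
    Psi = np.concatenate(
        [np.broadcast_to(F_prev[:, None, :], (N, N, d)),
         np.broadcast_to(F_cur[None, :, :], (N, N, d))], axis=-1)
    scores = Psi @ w                                   # (N, N) scores
    scores = scores - scores.max(axis=1, keepdims=True)  # stable row softmax
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return (probs > 0.5).astype(int)                   # threshold at 0.5

rng = np.random.default_rng(1)
F_prev, F_cur = rng.normal(size=(2, 5, 512))
M = association_matrix(F_prev, F_cur, rng.normal(size=1024))
```

Because each row of the softmax sums to 1, at most one entry per row can exceed 0.5, so each identity feature is matched to at most one counterpart.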
As shown in FIG. 2, which illustrates data association from frame 1 to frame 30, the two frames contain at most 5 instances, i.e. N = 5. The column indices of the matrix represent the instances in frame 1, and the row indices represent the instances in frame 30; equal numbers denote the same instance. A value of 1 means the instance is present in both frame 1 and frame 30, otherwise the value is 0, and X padding marks an absent instance. As shown on the right-hand side of the figure, the extra row and column indicate instances entering and leaving the video. For example, a 1 in the last row indicates that object 5 entered at frame 30, and a 1 in the last column indicates that instance 4 exists in frame 1 but has left by frame 30.
The compression network uses convolution kernels to progressively reduce the dimensions along the depth of the input tensor, without letting adjacent elements of the feature map interact. However, the association matrix M does not account for instance objects that enter or leave the video between the two input frames. To handle these objects, an extra column and an extra row are added to M, forming matrices M_1 and M_2 respectively: the added column represents the probability of an instance leaving the video, and the added row the probability of an instance entering the video, when the t-th frame is associated with the instances of the (t-n)-th frame. Next, a row-wise Softmax is applied to M_1 to obtain the probability matrix A_1, which expresses in probabilistic form the association between the instance identity feature predictions of the t-th and (t-n)-th frames. A column-wise Softmax is then applied to M_2 to obtain the probability matrix A_2, giving the similarity probability for each column. Finally, A_1 and A_2 are compared with the ground-truth association matrix L_(t-n,t) between objects in the video frames to obtain the matching loss.
Here, L_(t-n,t) is a binary data association matrix representing the correspondence between the instance objects detected in the (t-n)-th frame and those in the instance identity features of the t-th frame. For example, if instance object 1 in the (t-n)-th frame corresponds to the n-th instance object in the t-th frame, then the n-th element of the first row of L_(t-n,t) is 1.
Based on the above analysis, the association process can be supervised using cycle consistency, comprising a forward loss L_f and a backward loss L_b. The forward loss ensures that instances are correctly associated from the (t-n)-th frame to the t-th frame, and the backward loss ensures that instances are correctly associated from the t-th frame back to the (t-n)-th frame. At the same time, to suppress associations between non-maximum-similarity instances, a non-maximum loss L_a is added, which maximizes the true instance association probabilities. The final matching loss is the average of these three components: L_match = (L_f + L_b + L_a) / 3.
Here, L_1 and L_2 are L_(t-n,t) with the last row and the last column deleted, respectively; L_3 is L_(t-n,t) with both the last row and the last column deleted; element-wise multiplication is the Hadamard product; and the pruned probability matrices are A_1 and A_2 with the last column and the last row removed, respectively.
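The loss formulas themselves appear only as images in the original publication; a plausible cross-entropy form consistent with the surrounding text (forward, backward, and non-maximum terms, averaged) might look like:

```python
import numpy as np

def match_loss(A1, A2, L):
    """Assumed cross-entropy form of the matching loss (the exact formulas
    are not reproduced in the text; only the structure below is).

    A1 : (N, N+1) row-softmax association probabilities (extra column:
         instance leaves the video),
    A2 : (N+1, N) column-softmax probabilities (extra row: instance enters),
    L  : (N+1, N+1) binary ground-truth association matrix.
    """
    eps = 1e-9
    L1 = L[:-1, :]            # L with last row deleted    -> pairs with A1
    L2 = L[:, :-1]            # L with last column deleted -> pairs with A2
    L3 = L[:-1, :-1]          # both deleted               -> non-maximum term
    A1h = A1[:, :-1]          # A1 without its last column
    A2h = A2[:-1, :]          # A2 without its last row
    Lf = -(L1 * np.log(A1 + eps)).sum() / max(L1.sum(), 1)   # forward
    Lb = -(L2 * np.log(A2 + eps)).sum() / max(L2.sum(), 1)   # backward
    La = -(L3 * np.log(A1h * A2h + eps)).sum() / max(L3.sum(), 1)
    return (Lf + Lb + La) / 3                                # average of three

# Toy case, N = 2: a near-perfect identity association gives a tiny loss.
P = np.exp(10 * np.eye(3))
A1 = (P / P.sum(axis=1, keepdims=True))[:2]       # row-softmax probabilities
A2 = (P / P.sum(axis=0, keepdims=True))[:, :2]    # column-softmax probabilities
loss = match_loss(A1, A2, np.eye(3))
```

The Hadamard product A1h * A2h in the non-maximum term rewards associations that are confident in both the forward and backward directions at once.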
After the association matrix M of two adjacent frames is calculated, it is summed row by row to obtain an N x 1 sum vector, and the instance identity features whose entries in the sum vector are at least 1 are retained as the screened instance identity features. That is, the row indices whose value in the sum vector is at least 1 are found, and the instance identity features of the corresponding rows are kept as the screened instance identity features.
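The row-sum screening step can be sketched directly (reading the threshold as "at least one match per row", since M is binary):

```python
import numpy as np

# Toy binarised association matrix M for N = 5 identity features.
M = np.array([[1, 0, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 0, 0],     # unmatched identity feature -> dropped
              [0, 0, 0, 1, 0],
              [0, 0, 1, 0, 0]])

features = np.arange(20).reshape(5, 4)   # toy (N, d) identity features

row_sum = M.sum(axis=1)                  # the N x 1 sum vector
keep = row_sum >= 1                      # rows with at least one match
screened = features[keep]                # screened instance identity features
```

Only the identity features that found a counterpart in the adjacent frame survive; unmatched candidates are discarded before segmentation.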
The screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame are input into a self-attention module to obtain an initial attention map, and the initial attention map is spliced and fused with the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain the instance segmentation result.
Specifically: the initial attention map is fused with the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain the prediction results of each frame, comprising a mask, a category, and a confidence score for each ID; prediction results whose confidence score exceeds a first set value are selected as the instance segmentation result.
The initial attention map is spliced and fused with the original pixel-level feature sequence of the corresponding frame and the output of the Transformer encoder, and the prediction results of each frame are output through one 3D convolution and three parallel branches, comprising the mask, category, and confidence score of each ID. The first branch is a deformable convolution layer outputting a mask m for each ID in different frames; the second branch is a convolution layer and activation function outputting the category c of each ID; the third branch is a convolution layer and activation function outputting a confidence score s.
From the predicted category c, confidence score s, and predicted mask m, a semantic mask SemMsk and an instance-ID mask IdMsk are output, assigning a category label and an instance ID to each pixel. Specifically, SemMsk and IdMsk are first initialized to zero. The prediction results are then sorted in descending order of confidence score, and the sorted prediction masks are filled into SemMsk and IdMsk. Results with confidence scores below the first set value (thr_cls) are discarded, and overlapping portions with lower confidence (above the first set value but below a second set value) are deleted, producing a full-view result without overlap. Finally, the category labels and instance IDs are combined to obtain the instance segmentation result. To constrain the output categories and masks, the loss function of the instance segmentation module is added as follows:
of class branchingUsing Focal local, masked branchingFor cross-entropy loss, confidence branchingIs a log-likelihood function.
In this instance segmentation method for unmanned-driving urban scenes, a single-stage full space-time offset Transformer performs feature extraction to obtain the instance identity features of each frame; a data association module for instance identity changes then associates the instance identity features of two adjacent frames, deeply mining the spatio-temporal dependencies of instances based on the similarity of images within a video. The panoptic segmentation technique based on the full space-time offset Transformer effectively models long-term dependencies and historical trajectories, while the offset attention mechanism alleviates the high complexity of full space-time attention, increases processing speed, accelerates model convergence, and reduces computation. The data association module for instance identity changes effectively recognizes identity changes and adapts quickly to the complex environment of unmanned-driving urban scenes.
Embodiment 2
In this embodiment, an instance segmentation system for unmanned-driving urban scenes is disclosed, comprising:
a video acquisition module for acquiring an urban scene video;
a feature extraction module for acquiring an original pixel-level feature sequence from the urban scene video using a feature extraction network;
a space-time position coding module for performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
an instance prediction module for obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module for calculating the association matrix of the instance identity features of two adjacent frames and screening the instance identity features of the two adjacent frames according to the association matrix;
and an instance segmentation result acquisition module for inputting the screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features to obtain an instance segmentation result.
Embodiment 3
In this embodiment, an electronic device is disclosed, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes disclosed in Embodiment 1.
Embodiment 4
In this embodiment, a computer-readable storage medium is disclosed for storing computer instructions which, when executed by a processor, perform the steps of the instance segmentation method for unmanned-driving urban scenes disclosed in Embodiment 1.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (8)
1. A method for segmenting instances in urban scenes based on unmanned driving, characterized by comprising the following steps:
acquiring an urban scene video;
acquiring an original pixel-level feature sequence from the urban scene video using a feature extraction network;
performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
obtaining the instance identity features of each frame from the original pixel-level feature sequence, the space-time position coding result, and a full space-time offset Transformer encoder-decoder;
calculating the association matrix of the instance identity features of two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrix;
inputting the screened instance identity features of each frame and the output of the Transformer decoder for the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame, and the screened instance identity features to obtain an instance segmentation result;
the encoder of the full space-time offset Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block being the input of the next encoding block; each encoding block comprises a multi-head offset attention module, a regularization layer, a feed-forward network (FFN) and a regularization layer connected in sequence; the decoder of the full space-time offset Transformer encoder-decoder comprises a plurality of decoding blocks, each decoding block comprising a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, a feed-forward network (FFN) and a regularization layer connected in sequence; the sum of the encoder output and the space-time position coding result serves as part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the instance identity features of each frame;
the multi-headed offset attention module includes a plurality of offset attention modules, each of which decomposes the input into three vectors: query vector Q, key vector K and value vector V;
wherein P_Q is a local sampling region derived from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_k is the dimension of K, and σ is an activation function;
and splicing the outputs of the plurality of parallel offset attention modules to obtain the output of the multi-head offset attention module.
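The multi-head offset attention of claim 1 can be illustrated with a minimal numerical sketch. This is an assumption-laden toy, not the patented implementation: the window size, integer offsets and zero-initialized offsets are all illustrative stand-ins for the learned sampling-point offsets, and the function names are invented for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def offset_attention(Q, K, V, offsets, window=3):
    """Single-head offset attention sketch: each query attends only to a
    local window of keys whose center is shifted by a (here integer)
    offset standing in for the claim's learned sampling-point offset.
    Q, K, V: (n, d) arrays; offsets: (n,) integer shifts."""
    n, d = Q.shape
    out = np.zeros_like(V)
    half = window // 2
    for i in range(n):
        # local sampling region P_Q around position i, shifted by the offset
        centre = int(np.clip(i + offsets[i], 0, n - 1))
        lo, hi = max(0, centre - half), min(n, centre + half + 1)
        k_loc, v_loc = K[lo:hi], V[lo:hi]            # local keys / values
        attn = softmax(Q[i] @ k_loc.T / np.sqrt(d))  # scaled dot-product
        out[i] = attn @ v_loc
    return out

def multi_head_offset_attention(x, heads=2, window=3):
    """Split channels into heads, run offset attention per head,
    then splice (concatenate) the head outputs, as in the claim."""
    n, d = x.shape
    dh = d // heads
    outs = []
    for h in range(heads):
        sl = x[:, h * dh:(h + 1) * dh]
        offsets = np.zeros(n, dtype=int)  # zero offsets for the sketch
        outs.append(offset_attention(sl, sl, sl, offsets, window))
    return np.concatenate(outs, axis=1)
```

The splicing in the last line mirrors the claim's concatenation of the parallel offset attention modules' outputs.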
2. The method according to claim 1, characterized in that multi-scale feature extraction is performed on the video through a backbone network to obtain a first feature map sequence, a second feature map sequence and a third feature map sequence;
the first characteristic diagram sequence is subjected to up-sampling and then spliced with the second characteristic diagram to obtain a fourth characteristic diagram;
after the fourth characteristic diagram is subjected to up-sampling, the fourth characteristic diagram is spliced with the third characteristic diagram to obtain a fifth characteristic diagram;
and compressing the fifth feature map into one dimension to obtain an original pixel-level feature sequence.
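A minimal sketch of the fusion in claim 2, for a single frame. The 2x scale ratio, nearest-neighbour upsampling and channel-wise concatenation are assumptions; the patent only specifies upsample-splice-flatten.

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x spatial upsampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fuse_pyramid(f1, f2, f3):
    """Upsample the coarsest map, splice (concatenate along channels) with
    the next scale, repeat, then flatten to a pixel-level feature sequence.
    Illustrative shapes: f1 (C, H/4, W/4), f2 (C, H/2, W/2), f3 (C, H, W)."""
    f4 = np.concatenate([upsample2x(f1), f2], axis=0)  # fourth feature map
    f5 = np.concatenate([upsample2x(f4), f3], axis=0)  # fifth feature map
    c, h, w = f5.shape
    return f5.reshape(c, h * w)  # compressed to a one-dimensional sequence
```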
3. The unmanned-driving-based method for segmenting instances in an urban scene according to claim 1, characterized in that the instance identity features of two adjacent frames are combined into a feature vector, the feature vector is compressed to obtain an association matrix M, the association matrix M is summed row by row to obtain an N×1 sum vector, and the instance identity features whose entries in the sum vector are greater than 1 are retained to obtain the screened instance identity features.
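The screening rule of claim 3 reduces to a row sum and a threshold. A sketch, assuming the association matrix is already computed and that `feats` holds one identity feature per row:

```python
import numpy as np

def screen_identities(assoc, feats):
    """Sum the N x N association matrix row-wise into an N x 1 vector and
    keep only the instance identity features whose row sum exceeds 1,
    i.e. instances matched across the two adjacent frames."""
    row_sum = assoc.sum(axis=1)  # N x 1 sum vector
    keep = row_sum > 1
    return feats[keep], keep
```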
4. The method according to claim 1, characterized in that an additional column and an additional row are respectively added to the association matrix to obtain a matrix M1 and a matrix M2; a Softmax operation is performed on M1 row by row to obtain a probability matrix A1, and a Softmax operation is performed on M2 column by column to obtain a probability matrix A2; A1 and A2 are respectively compared with the ground-truth matrix L(t-n,t) to obtain the matching loss, wherein L(t-n,t) is a binary association matrix representing the correspondence between the instance objects detected in the instance identity features of the t-n-th frame and the t-th frame.
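The padded-softmax matching loss of claim 4 can be sketched as follows. The zero padding for the extra "no match" column/row and the cross-entropy form of the comparison with the ground-truth matrix are assumptions; the patent does not state the exact loss formula here.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def matching_loss(M, L):
    """M: N x N association matrix between the instance identities of
    frames t-n and t; L: binary N x N ground-truth association matrix.
    An extra column (M1) / row (M2) models unmatched instances; A1 is the
    row-wise softmax of M1, A2 the column-wise softmax of M2, and both are
    scored against the (correspondingly padded) ground truth."""
    n = M.shape[0]
    M1 = np.concatenate([M, np.zeros((n, 1))], axis=1)  # extra column
    M2 = np.concatenate([M, np.zeros((1, n))], axis=0)  # extra row
    A1 = softmax(M1, axis=1)  # row-wise probabilities
    A2 = softmax(M2, axis=0)  # column-wise probabilities
    L1 = np.concatenate([L, 1 - L.sum(1, keepdims=True)], axis=1)
    L2 = np.concatenate([L, 1 - L.sum(0, keepdims=True)], axis=0)
    eps = 1e-9  # numerical floor for the logarithm
    return -(L1 * np.log(A1 + eps)).mean() - (L2 * np.log(A2 + eps)).mean()
```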
5. The unmanned-driving-based method for segmenting instances in an urban scene according to claim 1, characterized in that the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features are fused to obtain a prediction result for each frame, the prediction result comprising a predicted mask, a predicted category and a confidence score, and the prediction results whose confidence scores are higher than a first set value are selected as the instance segmentation result.
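The final selection in claim 5 is a simple confidence threshold. A sketch, assuming each prediction is a dict with `mask`, `category` and `score` keys; the threshold value 0.5 is an illustrative default since the claim leaves the "first set value" open:

```python
def filter_predictions(preds, threshold=0.5):
    """Keep per-frame predictions (mask, category, confidence score)
    whose score exceeds the first set value."""
    return [p for p in preds if p["score"] > threshold]
```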
6. An unmanned-driving-based system for segmenting instances in an urban scene, characterized by comprising:
a video acquisition module, used for acquiring an urban scene video;
a feature extraction module, used for acquiring an original pixel-level feature sequence from the urban scene video through a feature extraction network;
a space-time position coding module, used for performing space-time position coding on the original pixel-level feature sequence to obtain a space-time position coding result;
an instance prediction module, used for obtaining the instance identity features of each frame from the original pixel-level feature sequence and the space-time position coding result through a full space-time offset Transformer encoder-decoder;
an instance identity feature screening module, used for calculating association matrices of the instance identity features of every two adjacent frames, and screening the instance identity features of the two adjacent frames according to the association matrices;
an instance segmentation result acquisition module, used for inputting the screened instance identity features of each frame and the output of the Transformer decoder of the corresponding frame into a self-attention module to obtain an initial attention map, and fusing the initial attention map, the original pixel-level feature sequence of the corresponding frame and the screened instance identity features to obtain an instance segmentation result;
the encoder of the full space-time offset Transformer encoder-decoder comprises a plurality of encoding blocks, the output of each encoding block being the input of the next encoding block; each encoding block comprises a multi-head offset attention module, a regularization layer, a feed-forward network (FFN) and a regularization layer connected in sequence; the decoder of the full space-time offset Transformer encoder-decoder comprises a plurality of decoding blocks, each decoding block comprising a multi-head offset attention module, a regularization layer, a deformable multi-head attention layer, a regularization layer, a feed-forward network (FFN) and a regularization layer connected in sequence; the sum of the encoder output and the space-time position coding result serves as part of the input of each decoding block, the output of each decoding block is input into the next decoding block, and the Transformer encoder-decoder directly outputs the instance identity features of each frame;
the multi-headed offset attention module includes a plurality of offset attention modules, each of which decomposes the input into three vectors: query vector Q, key vector K and value vector V;
wherein P_Q is a local sampling region derived from Q, Δp is the learned offset of the sampling points, K_{P_Q} is the local key vector corresponding to P_Q, d_k is the dimension of K, and σ is an activation function;
and splicing the outputs of the plurality of parallel offset attention modules to obtain the output of the multi-head offset attention module.
7. An electronic device, comprising a memory, a processor and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the unmanned-driving-based method for segmenting instances in an urban scene according to any one of claims 1-5.
8. A computer-readable storage medium, storing computer instructions which, when executed by a processor, perform the steps of the unmanned-driving-based method for segmenting instances in an urban scene according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211098488.7A CN115171029B (en) | 2022-09-09 | 2022-09-09 | Unmanned-driving-based method and system for segmenting instances in urban scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211098488.7A CN115171029B (en) | 2022-09-09 | 2022-09-09 | Unmanned-driving-based method and system for segmenting instances in urban scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115171029A CN115171029A (en) | 2022-10-11 |
CN115171029B true CN115171029B (en) | 2022-12-30 |
Family
ID=83482387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211098488.7A Active CN115171029B (en) | 2022-09-09 | 2022-09-09 | Unmanned-driving-based method and system for segmenting instances in urban scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115171029B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117893933B (en) * | 2024-03-14 | 2024-05-24 | 国网上海市电力公司 | Unmanned inspection fault detection method and system for power transmission and transformation equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5915044A (en) * | 1995-09-29 | 1999-06-22 | Intel Corporation | Encoding video images using foreground/background segmentation |
WO2021136528A1 (en) * | 2019-12-31 | 2021-07-08 | 华为技术有限公司 | Instance segmentation method and apparatus |
CN113177940A (en) * | 2021-05-26 | 2021-07-27 | 复旦大学附属中山医院 | Gastroscope video part identification network structure based on Transformer |
CN113673489A (en) * | 2021-10-21 | 2021-11-19 | 之江实验室 | Video group behavior identification method based on cascade Transformer |
CN114049362A (en) * | 2021-11-09 | 2022-02-15 | 中国石油大学(华东) | Transform-based point cloud instance segmentation method |
CN114743020A (en) * | 2022-04-02 | 2022-07-12 | 华南理工大学 | Food identification method combining tag semantic embedding and attention fusion |
CN114842394A (en) * | 2022-05-17 | 2022-08-02 | 西安邮电大学 | Swin transform-based automatic identification method for surgical video flow |
CN114898243A (en) * | 2022-03-23 | 2022-08-12 | 超级视线科技有限公司 | Traffic scene analysis method and device based on video stream |
CN114998592A (en) * | 2022-06-18 | 2022-09-02 | 脸萌有限公司 | Method, apparatus, device and storage medium for instance partitioning |
CN114998815A (en) * | 2022-08-04 | 2022-09-02 | 江苏三棱智慧物联发展股份有限公司 | Traffic vehicle identification tracking method and system based on video analysis |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112184780A (en) * | 2020-10-13 | 2021-01-05 | 武汉斌果科技有限公司 | Moving object instance segmentation method |
-
2022
- 2022-09-09 CN CN202211098488.7A patent/CN115171029B/en active Active
Non-Patent Citations (3)
Title |
---|
Dynamic Convolution for 3D Point Cloud Instance Segmentation;Tong He et al.;《arXiv:2107.08392v2》;20220814;pp. 1-15 *
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows;Ze Liu et al.;《Computer Vision Foundation》;20211231;第10012-10022页 * |
Multi-scale Transformer LiDAR Point Cloud 3D Object Detection;Sun Liujie et al.;《Computer Engineering and Applications》;20211117;pp. 1-14 *
Also Published As
Publication number | Publication date |
---|---|
CN115171029A (en) | 2022-10-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
CN110399850B (en) | Continuous sign language recognition method based on deep neural network | |
WO2022083335A1 (en) | Self-attention mechanism-based behavior recognition method | |
CN108509880A (en) | A video character behavior semantic recognition method | |
CN111968150B (en) | Weak surveillance video target segmentation method based on full convolution neural network | |
CN113591968A (en) | Infrared weak and small target detection method based on asymmetric attention feature fusion | |
CN111369565A (en) | Digital pathological image segmentation and classification method based on graph convolution network | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN112801068B (en) | Video multi-target tracking and segmenting system and method | |
CN116309725A (en) | Multi-target tracking method based on multi-scale deformable attention mechanism | |
CN114360067A (en) | Dynamic gesture recognition method based on deep learning | |
CN116469100A (en) | Dual-band image semantic segmentation method based on Transformer | |
CN115171029B (en) | Unmanned-driving-based method and system for segmenting instances in urban scene | |
CN111696136A (en) | Target tracking method based on coding and decoding structure | |
CN117171582A (en) | Vehicle track prediction method and system based on space-time attention mechanism | |
CN116863384A (en) | CNN-Transfomer-based self-supervision video segmentation method and system | |
CN117315293A (en) | Transformer-based space-time context target tracking method and system | |
CN115239765A (en) | Infrared image target tracking system and method based on multi-scale deformable attention | |
CN114693577A (en) | Infrared polarization image fusion method based on Transformer | |
CN117935088A (en) | Unmanned aerial vehicle image target detection method, system and storage medium based on full-scale feature perception and feature reconstruction | |
CN117876679A (en) | Remote sensing image scene segmentation method based on convolutional neural network | |
CN116994264A (en) | Text recognition method, chip and terminal | |
Hu et al. | Lightweight asymmetric dilation network for real-time semantic segmentation | |
CN115761229A (en) | Image semantic segmentation method based on multiple classifiers | |
Park et al. | Rainunet for super-resolution rain movie prediction under spatio-temporal shifts |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |