CN112561001A - Video target detection method based on space-time feature deformable convolution fusion - Google Patents
- Publication number
- CN112561001A (application CN202110196121.8A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- feature
- space
- characteristic
- convolution
- Prior art date
- 2021-02-22
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video target detection method based on space-time feature deformable convolution fusion, relating to the technical field of image processing. The method comprises: step S1, selecting images from a video as input to the network and obtaining feature maps of those images; step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules; step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width. The invention makes full use of the temporal context information in video sequence images, designs a deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on those feature maps to obtain the final detection result. The method is suitable for video target detection scenarios.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a video target detection method based on space-time feature deformable convolution fusion.
Background
Target detection is an important task in the field of computer vision and has wide application in real life. With the rapid development of internet and communication technology, video data has become an indispensable part of daily life, so video target detection is becoming increasingly important.
Current research on target detection is mostly based on static images; research on video sequence images is comparatively scarce. Compared with static images, targets in video sequence images may, for various reasons, appear at different sizes, in different postures, under viewpoint changes, or even with non-rigid deformation. Because target detection algorithms based on static images mostly rely on the conventional convolution structure and lack sufficient capability for modeling geometric deformation, these problems can prevent the feature extraction network from extracting image features effectively, so video sequence images are not handled well. Meanwhile, video sequence images contain extremely rich temporal context information that static-image target detection algorithms cannot exploit, so the detection accuracy fails to meet practical requirements.
Disclosure of Invention
The main purpose of the invention is to provide a video target detection method based on space-time feature deformable convolution fusion that makes full use of the temporal context information in video sequence images, designs a deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on these feature maps to obtain the final detection result, offering good prospects for application.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video target detection method based on space-time feature deformable convolution fusion comprises:
step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width.
In the foregoing video target detection method based on space-time feature deformable convolution fusion, step S1 specifically comprises the following steps:
step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
step S1.2, inputting I_{t-1}, I_t and I_{t+1} respectively into the feature extraction network to extract features, obtaining their feature maps F_{t-1}, F_t and F_{t+1}.
In the foregoing video target detection method based on space-time feature deformable convolution fusion, step S2 specifically comprises the following steps:
step S2.1, inputting the feature maps F_{t-1} (preceding frame) and F_t (current frame) together into one deformable convolution space-time fusion module, and inputting F_{t+1} (following frame) and F_t into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution.
The foregoing video target detection method based on spatio-temporal feature deformable convolution fusion further comprises, after step S2.2, the following steps:
step S2.3, calculating the similarity between the aligned feature F'_{t-1} and the current-frame feature F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and likewise calculating the similarity between F'_{t+1} and F_t, applying SoftMax to the resulting weight and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
step S2.4, performing point-wise addition of F''_{t-1}, F_t and F''_{t+1} to fuse the features, obtaining the feature fusion result of the deformable convolution space-time fusion modules.
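The weighting and fusion of steps S2.3 and S2.4 can be illustrated with a minimal sketch in Python. The patent does not spell out whether SoftMax is normalized jointly over the two neighbouring frames or within each module, so this sketch assumes joint normalization of the two cosine-similarity scores at a single spatial position; the function and variable names are illustrative, not from the patent.

```python
import math

def cosine(a, b):
    # cos(A, B) = (A . B) / (|A| * |B|), for two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse_pixel(f_prev, f_t, f_next):
    """Steps S2.3-S2.4 at one spatial position: similarity-weight the
    aligned neighbour features, then point-add them with the current frame."""
    s_prev = cosine(f_prev, f_t)          # similarity weight for F'_{t-1}
    s_next = cosine(f_next, f_t)          # similarity weight for F'_{t+1}
    e_prev, e_next = math.exp(s_prev), math.exp(s_next)
    w_prev = e_prev / (e_prev + e_next)   # SoftMax over the two weights
    w_next = e_next / (e_prev + e_next)
    return [w_prev * p + c + w_next * n   # point-wise addition (step S2.4)
            for p, c, n in zip(f_prev, f_t, f_next)]

fused = fuse_pixel([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])
```

With these symmetric inputs the two neighbours receive equal weight 0.5, so each output channel is 0.5 · 1 + 1 + 0.5 · 0 = 1.5.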
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the deformable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the sampling region involved in the deformable convolution calculation, p_0 is the coordinate of the center point of R, p_n enumerates each position in R, Δp_n is the coordinate offset output by the deformable convolution layer for position p_n, w(p_n) is the convolution kernel weight at p_n, and x(·) is the feature value of the corresponding pixel.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the value of x(p_0 + p_n + Δp_n) is derived by bilinear interpolation, since the offset sampling position is generally fractional.
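The formula above, together with the bilinear interpolation it requires, can be sketched as follows. This is a generic single-position deformable sampling under the assumption of a 3x3 grid R with given per-position offsets and kernel weights; the names and the linear toy feature map are illustrative, not from the patent.

```python
import math

def bilinear(feat, x, y):
    # Feature value at fractional (x, y), from the four nearest grid points.
    x1, y1 = int(math.floor(x)), int(math.floor(y))
    x2, y2 = x1 + 1, y1 + 1
    r1 = (x2 - x) * feat[y1][x1] + (x - x1) * feat[y1][x2]  # along abscissa at y1
    r2 = (x2 - x) * feat[y2][x1] + (x - x1) * feat[y2][x2]  # along abscissa at y2
    return (y2 - y) * r1 + (y - y1) * r2                    # along ordinate

def deform_sample(feat, p0, offsets, weights):
    """y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n + dp_n),
    with R a 3x3 grid and dp_n the learned offset for each position."""
    grid = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # the region R
    out = 0.0
    for (gx, gy), (ox, oy), w in zip(grid, offsets, weights):
        out += w * bilinear(feat, p0[0] + gx + ox, p0[1] + gy + oy)
    return out

# 5x5 map with value x + 10*y; zero offsets reduce to an ordinary 3x3 average.
feat = [[x + 10 * y for x in range(5)] for y in range(5)]
y0 = deform_sample(feat, (2, 2), [(0.0, 0.0)] * 9, [1 / 9] * 9)
```

On this linear toy map the averaged sample equals the center value (22.0), and a uniform fractional offset of (0.5, 0) shifts every sample half a pixel to the right, adding exactly 0.5.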
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, in step S2.3 the similarity between each offset-aligned neighbouring feature map and the current-frame feature map F_t is calculated according to the cosine similarity measurement formula.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

wherein A and B are the two vectors whose similarity is to be calculated; the two vectors have the same dimension.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the spatio-temporal fusion offsets of each module are respectively obtained through a 3-layer deformable convolution calculation in step S2.2.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the specific process of bilinear interpolation is as follows.

Let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points, denoted Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22).

First, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2); the specific calculation formulas are as follows:

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

Then one single-direction interpolation is performed along the ordinate between the temporary points R_1 and R_2, obtaining the feature value f(P) of point P; the specific calculation formula is as follows:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
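As a concrete check of the two-stage procedure above, the following snippet interpolates at the midpoint of a unit cell; the corner values 1, 2, 3, 4 are arbitrary illustration values, not from the patent.

```python
def bilinear_two_stage(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """f(P) at P = (x, y) inside the cell with corners Q11 = (x1, y1),
    Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2)."""
    # Stage 1: two single-direction interpolations along the abscissa.
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21  # f(R1) at (x, y1)
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22  # f(R2) at (x, y2)
    # Stage 2: one single-direction interpolation along the ordinate.
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2  # f(P)

# Midpoint of the unit cell: the result is the mean of the four corner values.
val = bilinear_two_stage(0.5, 0.5, 0, 0, 1, 1, 1.0, 2.0, 3.0, 4.0)
```

At the midpoint the result is (1 + 2 + 3 + 4) / 4 = 2.5, and at a corner the formula returns that corner's value exactly.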
the invention has the beneficial technical effects that:
the invention fully utilizes the time context information in the video sequence image, designs the variable convolution space-time fusion module to overcome the difficulty of extracting the characteristics of the video sequence image, obtains the characteristic diagram strengthened by space-time fusion, and finally carries out target detection on the characteristic diagram to obtain the final detection result, thereby being applicable to the video target detection scene and having the following advantages:
(1) aiming at the condition of target deformation of a video sequence image, establishing geometric modeling of a target through a plurality of layers of variable convolution layers to ensure accurate extraction of image characteristics;
(2) aiming at the condition of low quality of video sequence images, the target characteristics are enhanced through spatio-temporal context characteristic fusion, and the detection of subsequent networks is promoted.
Drawings
FIG. 1 is a flow chart of a method for video target detection based on spatiotemporal feature deformable convolution fusion in accordance with the present invention;
FIG. 2 is a flow diagram of a variable convolution spatiotemporal fusion module in accordance with the present invention;
FIG. 3 is a flow chart of feature fusion in accordance with the present invention.
Detailed Description
In order to make the technical solutions of the present invention more clear and definite for those skilled in the art, the present invention is further described in detail below with reference to the examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1-3, the method for detecting a video target based on spatio-temporal feature deformable convolution fusion provided by this embodiment includes
Step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into the deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width.
In the present embodiment, as shown in fig. 1, step S1 specifically includes the following steps
Step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
step S1.2, inputting I_{t-1}, I_t and I_{t+1} respectively into the feature extraction network to extract features, obtaining their feature maps F_{t-1}, F_t and F_{t+1}.
in the present embodiment, as shown in fig. 1, step S2 specifically includes the following steps
Step S2.1, inputting F_{t-1} and F_t together into one deformable convolution space-time fusion module, and inputting F_{t+1} and F_t into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution; the two modules perform their calculations simultaneously, in no particular order.
In this embodiment, in step S2.2 the space-time fusion offsets are obtained through 3 layers of deformable convolution; that is, each of the two modules obtains its own space-time fusion offsets through its own 3-layer deformable convolution calculation.
In this embodiment, as shown in fig. 1, the step s2.2 is followed by the following steps
Step S2.3, calculating the similarity between F'_{t-1} and the current-frame feature F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and calculating the similarity between F'_{t+1} and F_t to obtain a similarity weight, applying SoftMax to it and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
In this embodiment, the SoftMax calculation formula in step S2.3 is specifically:

softmax(x_i) = e^{x_i} / Σ_{j=1}^{K} e^{x_j}

wherein X = {x_1, ..., x_K} is the collection of all numbers requiring the SoftMax calculation, x_i is the i-th number in X, and K is the number of values over which SoftMax is calculated.
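The SoftMax formula above can be written directly in Python; subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(xs):
    # softmax(x_i) = e^{x_i} / sum over j = 1..K of e^{x_j}
    m = max(xs)                             # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

weights = softmax([1.0, 2.0, 3.0])          # sums to 1; the largest input dominates
```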
Step S2.4, performing point-wise addition of F''_{t-1}, F_t and F''_{t+1} to fuse the features, obtaining the feature fusion result of the deformable convolution space-time fusion modules.
In this embodiment, as shown in fig. 1, the deformable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the sampling region involved in the deformable convolution calculation, p_0 is the coordinate of the center point of R, p_n enumerates each position in R, Δp_n is the coordinate offset output by the deformable convolution layer for position p_n, w(p_n) is the convolution kernel weight at p_n, and x(·) is the feature value of the corresponding pixel.
In this embodiment, as shown in fig. 1, the value of x(p_0 + p_n + Δp_n) is obtained by bilinear interpolation; the specific process is as follows.

Let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points, denoted Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22).

First, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2); the specific calculation formulas are as follows:

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

Then one single-direction interpolation is performed along the ordinate between the temporary points R_1 and R_2, obtaining the feature value f(P) of point P; the specific calculation formula is as follows:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
in this embodiment, as shown in fig. 1, in step s2.3, the cosine similarity measure formula is calculatedAndthe similarity of (c).
In this embodiment, as shown in fig. 1, the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

wherein A and B are the two vectors whose similarity is to be calculated; the two vectors have the same dimension.
In summary, the invention makes full use of the temporal context information in video sequence images, designs the deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on these feature maps to obtain the final detection result. It is suitable for video target detection scenarios and has the following advantages:
(1) for targets that deform across a video sequence, geometric modeling of the target is established through several deformable convolution layers to ensure accurate extraction of image features;
(2) for low-quality video sequence images, the target features are enhanced through space-time context feature fusion, which benefits detection by the subsequent network.
The above description is only intended to illustrate the present invention and not to limit its scope; any person skilled in the art may make substitutions or changes to the technical solution and its conception within the scope of the present invention.
Claims (10)
1. A video target detection method based on space-time feature deformable convolution fusion, characterized in that it comprises:
step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
2. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 1, characterized in that: step S1 specifically includes the following steps
Step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
3. the method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 2, characterized in that: step S2 specifically includes the following steps
Step S2.1, inputting the feature maps F_{t-1} and F_t of the preceding and current frames together into one deformable convolution space-time fusion module, and inputting the feature maps F_{t+1} and F_t of the following and current frames into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution.
4. the method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that: also included after step s2.2 is the following
Step S2.3, calculating the similarity between F'_{t-1} and F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and calculating the similarity between F'_{t+1} and F_t to obtain a similarity weight, applying SoftMax to it and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
5. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 4, characterized in that:
the variable convolution calculation formula in step s2.2 is specifically:
wherein the content of the first and second substances,the regions involved in the calculation for the variable convolution are,is composed ofThe coordinates of the center point of (a),is a pair ofThe enumeration of each of the positions in (a),the amount of coordinate shift given to the variable convolution layer,the characteristic value of the corresponding pixel point.
8. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 7, characterized in that: the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)
9. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that: in step S2.2, the space-time fusion offsets of each module are respectively obtained through a 3-layer deformable convolution calculation.
10. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 6, characterized in that: the specific process of the bilinear interpolation is as follows:
let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22);
first, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2):

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

then one single-direction interpolation is performed along the ordinate between R_1 and R_2, obtaining the feature value f(P) of point P:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110196121.8A | 2021-02-22 | 2021-02-22 | Video target detection method based on space-time feature deformable convolution fusion |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN112561001A | 2021-03-26 |
Family
ID=75036042
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110196121.8A (Pending) | Video target detection method based on space-time feature deformable convolution fusion | 2021-02-22 | 2021-02-22 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112561001A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113822172A | 2021-08-30 | 2021-12-21 | 中国科学院上海微系统与信息技术研究所 (Shanghai Institute of Microsystem and Information Technology, CAS) | Video spatiotemporal behavior detection method |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110287826A | 2019-06-11 | 2019-09-27 | 北京工业大学 (Beijing University of Technology) | A video object detection method based on attention mechanism |
Non-Patent Citations (2)

- BRANDON懂你: "图像处理之双线性插值" ("Bilinear interpolation in image processing"), CSDN blog, https://blog.csdn.net/qq_37577735/article/details/80041586
- Gedas Bertasius et al.: "Object Detection in Video with Spatiotemporal Sampling Networks", https://arxiv.org/pdf/1803.05549.pdf
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210326 |