CN112561001A - Video target detection method based on space-time feature deformable convolution fusion - Google Patents


Info

Publication number
CN112561001A
Authority
CN
China
Prior art keywords
fusion
feature
space
characteristic
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110196121.8A
Other languages
Chinese (zh)
Inventor
吴泽彬
詹天明
邓伟诗
陆威
徐洋
盛杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhiliansen Information Technology Co ltd
Original Assignee
Nanjing Zhiliansen Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhiliansen Information Technology Co ltd
Priority to CN202110196121.8A
Publication of CN112561001A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/251 Fusion techniques of input or preprocessed data
    • G06F18/253 Fusion techniques of extracted features
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target detection method based on space-time feature deformable convolution fusion, relating to the technical field of image processing, which comprises the following steps: step S1, selecting images from a video as network input, and obtaining a feature map for each input image; step S2, inputting the feature maps into variable convolution space-time fusion modules to obtain feature maps strengthened by the modules; step S3, detecting targets on the feature map obtained in step S2 to obtain the center point coordinates (x, y) and the length and width (w, h) of each target. The invention fully utilizes the temporal context information in the video sequence images, designs a variable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains a feature map strengthened by space-time fusion, and finally performs target detection on the feature map to obtain the final detection result, making the method suitable for video target detection scenes.

Description

Video target detection method based on space-time feature deformable convolution fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a video target detection method based on space-time feature deformable convolution fusion.
Background
Target detection is an important task in the field of computer vision and has wide application in real life. With the rapid development of internet and communication technology, video data has become an indispensable part of daily life, so video target detection is becoming increasingly important.
Current research on target detection mostly addresses static images; comparatively little work addresses video sequence images. Compared with static images, objects in video sequence images may, for various reasons, exhibit different sizes, postures and viewing angles, and even non-rigid deformation. Because target detection algorithms for static images are mostly built on conventional convolution structures and lack sufficient capability to model geometric deformation, these problems can prevent the feature extraction network from extracting image features effectively, so video sequence images cannot be handled well. Meanwhile, video sequence images contain extremely rich temporal context information that static-image detection algorithms cannot exploit, so the detection accuracy cannot meet practical requirements.
Disclosure of Invention
The invention mainly aims to provide a video target detection method based on space-time feature deformable convolution fusion, which fully utilizes the temporal context information in video sequence images, designs a variable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains a feature map strengthened by space-time fusion, and finally performs target detection on the feature map to obtain the final detection result, thereby offering good application prospects.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video target detection method based on space-time feature deformable convolution fusion comprises
Step S1, selecting an image as an input network from the video, and obtaining a characteristic diagram of the image of the input network;
step S2, inputting the characteristic diagrams into a variable convolution space-time fusion module respectively to obtain the characteristic diagrams strengthened by the module;
step S3, detecting the characteristic target according to the characteristic diagram obtained in the step S2 to obtain the coordinate of the center point of the characteristic target
Figure 696600DEST_PATH_IMAGE001
And length and width of the feature object
Figure 389750DEST_PATH_IMAGE002
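By way of illustration only, the three steps map onto a small pipeline. The following is a minimal sketch in Python/PyTorch, where extractor, fuse_prev, fuse_next and head are hypothetical stand-ins for the feature extraction network, the two variable convolution space-time fusion modules and the detection head; none of these names come from the patent:

import torch

def detect(frames, extractor, fuse_prev, fuse_next, head):
    """frames = (I_prev, I_cur, I_next): the three frames chosen in step S1."""
    # Step S1: one feature map per input frame.
    f_prev, f_cur, f_next = (extractor(x) for x in frames)
    # Step S2: each neighbouring feature map enters its own variable
    # convolution space-time fusion module together with the current one.
    e_prev = fuse_prev(f_prev, f_cur)
    e_next = fuse_next(f_next, f_cur)
    fused = e_prev + f_cur + e_next  # point addition (step S2.4)
    # Step S3: regress center coordinates (x, y) and sizes (w, h).
    return head(fused)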
The video target detection method based on space-time feature deformable convolution fusion described above is characterized in that step S1 specifically comprises the following steps:
step S1.1, selecting from the video sequence the (t-τ)-th, t-th and (t+τ)-th frames (the current frame and two neighbouring frames) as the input images of the network, denoted I_{t-τ}, I_t and I_{t+τ};
step S1.2, inputting I_{t-τ}, I_t and I_{t+τ} respectively into a feature extraction network to extract features, obtaining their feature maps F_{t-τ}, F_t and F_{t+τ}.
the video target detection method based on the space-time feature deformable convolution fusion is characterized in that: step S2 specifically includes the following steps
Step s2.1. will
Figure 837917DEST_PATH_IMAGE009
And
Figure 962868DEST_PATH_IMAGE012
input into a variable convolution space-time fusion module together, will
Figure 40546DEST_PATH_IMAGE011
And
Figure 214038DEST_PATH_IMAGE013
inputting the two signals into another variable convolution space-time fusion module;
step s2.2. after the characteristic diagram is input into the variable convolution space-time fusion module, firstly splicing the channels of two characteristic diagrams input into the same variable convolution space-time fusion module, then respectively obtaining the space-time fusion offset of each module through a plurality of layers of variable convolution calculation formulas, and finally, according to the respective offsetAmount of movement guidance
Figure 303348DEST_PATH_IMAGE009
And
Figure 536883DEST_PATH_IMAGE011
by variable convolution calculation of the obtained features
Figure 836277DEST_PATH_IMAGE014
The video target detection method based on spatio-temporal feature deformable convolution fusion described above further comprises the following steps after step S2.2:
step S2.3, calculating the similarity between F'_{t-τ} and F_t to obtain a similarity weight, passing the similarity weight through SoftMax, and multiplying it with F'_{t-τ} to obtain the feature-enhanced F''_{t-τ}; likewise, calculating the similarity between F'_{t+τ} and F_t to obtain a similarity weight, passing it through SoftMax, and multiplying it with F'_{t+τ} to obtain the feature-enhanced F''_{t+τ};
step S2.4, performing a point addition of F''_{t-τ}, F_t and F''_{t+τ} to carry out feature fusion, obtaining the feature fusion result of the feature maps of the variable convolution space-time fusion modules.
In the video target detection method based on spatio-temporal feature deformable convolution fusion described above, the variable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the region involved in the variable convolution calculation, p_0 is the coordinate of its center point, p_n is the enumeration of each position in R, Δp_n is the coordinate offset given by the variable convolution layer, w(p_n) is the convolution weight at that position, and x(p_0 + p_n + Δp_n) is the feature value of the corresponding pixel point.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the value of x(p_0 + p_n + Δp_n) is obtained by bilinear interpolation.
In the foregoing video target detection method based on space-time feature deformable convolution fusion, in step S2.3 the similarity between the aligned features F'_{t-τ}, F'_{t+τ} and the current feature map F_t is calculated according to a cosine similarity measurement formula.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the cosine similarity measurement formula in step S2.3 is specifically:

cos(θ) = (a · b) / (‖a‖ ‖b‖)

wherein a and b are the vectors whose similarity is calculated, and the two vectors have the same dimension.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the space-time fusion offset of each module is respectively obtained through a 3-layer variable convolution calculation formula in step S2.2.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the specific process of bilinear interpolation is as follows.
Let the point represented by x(p_0 + p_n + Δp_n) be p, with feature value f(p) and coordinates (u, v). Since these coordinates are not integers, the point p has four nearest neighbouring points, denoted Q_11, Q_12, Q_21 and Q_22, whose coordinates are (u_1, v_1), (u_1, v_2), (u_2, v_1) and (u_2, v_2) respectively, and whose corresponding feature values on the feature map are f(Q_11), f(Q_12), f(Q_21) and f(Q_22) respectively.
First, two one-directional interpolations are performed along the abscissa, computing the feature values f(R_1) and f(R_2) of the temporary points R_1 and R_2. The specific calculation formulas are:

f(R_1) = ((u_2 - u) / (u_2 - u_1)) f(Q_11) + ((u - u_1) / (u_2 - u_1)) f(Q_21)
f(R_2) = ((u_2 - u) / (u_2 - u_1)) f(Q_12) + ((u - u_1) / (u_2 - u_1)) f(Q_22)

Then one single-directional interpolation along the ordinate is performed on the temporary points R_1 and R_2, obtaining the feature value f(p) of the point p. The specific calculation formula is:

f(p) = ((v_2 - v) / (v_2 - v_1)) f(R_1) + ((v - v_1) / (v_2 - v_1)) f(R_2)
the invention has the beneficial technical effects that:
the invention fully utilizes the time context information in the video sequence image, designs the variable convolution space-time fusion module to overcome the difficulty of extracting the characteristics of the video sequence image, obtains the characteristic diagram strengthened by space-time fusion, and finally carries out target detection on the characteristic diagram to obtain the final detection result, thereby being applicable to the video target detection scene and having the following advantages:
(1) aiming at the condition of target deformation of a video sequence image, establishing geometric modeling of a target through a plurality of layers of variable convolution layers to ensure accurate extraction of image characteristics;
(2) aiming at the condition of low quality of video sequence images, the target characteristics are enhanced through spatio-temporal context characteristic fusion, and the detection of subsequent networks is promoted.
Drawings
FIG. 1 is a flow chart of a method for video target detection based on spatiotemporal feature deformable convolution fusion in accordance with the present invention;
FIG. 2 is a flow diagram of a variable convolution spatiotemporal fusion module in accordance with the present invention;
FIG. 3 is a flow chart of feature fusion in accordance with the present invention.
Detailed Description
In order to make the technical solutions of the present invention more clear and definite for those skilled in the art, the present invention is further described in detail below with reference to the examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
As shown in figs. 1-3, the video target detection method based on spatio-temporal feature deformable convolution fusion provided by this embodiment comprises:
step S1, selecting images from the video as network input, and obtaining a feature map for each input image;
step S2, inputting the feature maps into variable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature map obtained in step S2 to obtain the center point coordinates (x, y) and the length and width (w, h) of each target.
In the present embodiment, as shown in fig. 1, step S1 specifically comprises the following steps:
step S1.1, selecting from the video sequence the (t-τ)-th, t-th and (t+τ)-th frames (the current frame and two neighbouring frames) as the input images of the network, denoted I_{t-τ}, I_t and I_{t+τ};
step S1.2, inputting I_{t-τ}, I_t and I_{t+τ} respectively into a feature extraction network to extract features, obtaining their feature maps F_{t-τ}, F_t and F_{t+τ}.
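The patent does not fix a particular feature extraction network; the following is a minimal sketch of step S1 in Python, assuming a ResNet-50 backbone truncated before its classification head (torch and torchvision are assumptions, not components named in the patent):

import torch
import torchvision

# Assumed backbone: ResNet-50 without its pooling and fc layers, so it maps
# an image (B, 3, H, W) to a feature map (B, 2048, H/32, W/32).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
).eval()

def extract_features(frames):
    """Step S1.2: run each of I_{t-tau}, I_t, I_{t+tau} through the
    feature extraction network and return F_{t-tau}, F_t, F_{t+tau}."""
    with torch.no_grad():
        return [backbone(f) for f in frames]

# Step S1.1: three frames sampled around time t (dummy tensors here).
frames = [torch.randn(1, 3, 512, 512) for _ in range(3)]
f_prev, f_cur, f_next = extract_features(frames)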
in the present embodiment, as shown in fig. 1, step S2 specifically includes the following steps
Step s2.1. will
Figure 481835DEST_PATH_IMAGE009
And
Figure 459018DEST_PATH_IMAGE012
input into a variable convolution space-time fusion module together, will
Figure 901370DEST_PATH_IMAGE011
And
Figure 509068DEST_PATH_IMAGE013
inputting the two signals into another variable convolution space-time fusion module;
step s2.2. after the feature map is input into the variable convolution space-time fusion module, firstly splicing the channels of two feature maps input into the same variable convolution space-time fusion module, then respectively obtaining the space-time fusion offset of each module through a plurality of layers of variable convolution calculation formulas, and finally guiding the space-time fusion offset according to the respective offset
Figure 358076DEST_PATH_IMAGE009
And
Figure 201267DEST_PATH_IMAGE011
by variable convolution calculation of the obtained features
Figure 389803DEST_PATH_IMAGE014
And the two modules are calculated simultaneously without any sequence.
In this embodiment, in step s2.2, the spatio-temporal fusion offsets of the modules are obtained through the 3-layer variable convolution calculation formula, that is, one module obtains its spatio-temporal fusion offset through the 3-layer variable convolution calculation formula, and the other module obtains its spatio-temporal fusion offset through the 3-layer variable convolution calculation formula.
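A sketch of step S2.2 under stated assumptions: channel splicing, a stack of three convolution layers predicting the space-time fusion offsets, and torchvision.ops.deform_conv2d performing the offset-guided alignment. The layer widths and kernel size are illustrative choices, and using plain convolutions in the offset branch is a simplification of the patent's 3-layer variable convolution:

import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class STFusionAlign(nn.Module):
    """Alignment part of a variable convolution space-time fusion module."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        # Three stacked layers predict 2*k*k offset values per position,
        # i.e. one (dx, dy) pair for each sampling location of the kernel.
        self.offset_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 2 * k * k, 3, padding=1),
        )
        self.weight = nn.Parameter(torch.randn(channels, channels, k, k) * 0.01)

    def forward(self, f_neigh, f_cur):
        # Splice the channels of the two feature maps (step S2.2).
        offsets = self.offset_net(torch.cat([f_neigh, f_cur], dim=1))
        # Deformable convolution guided by the predicted offsets: samples
        # f_neigh at p_0 + p_n + delta_p_n, aligning it to the current frame.
        return deform_conv2d(f_neigh, offsets, self.weight, padding=self.k // 2)

# Illustrative usage with 256-channel feature maps:
f_cur = torch.randn(1, 256, 64, 64)
f_prev = torch.randn(1, 256, 64, 64)
aligned_prev = STFusionAlign(256)(f_prev, f_cur)  # F'_{t-tau}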
In this embodiment, as shown in fig. 1, step S2.2 is followed by the following steps:
step S2.3, calculating the similarity between F'_{t-τ} and F_t to obtain a similarity weight, passing the similarity weight through SoftMax, and multiplying it with F'_{t-τ} to obtain the feature-enhanced F''_{t-τ}; likewise, calculating the similarity between F'_{t+τ} and F_t to obtain a similarity weight, passing it through SoftMax, and multiplying it with F'_{t+τ} to obtain the feature-enhanced F''_{t+τ}.
In this embodiment, the SoftMax calculation formula in step S2.3 is specifically:

SoftMax(x_i) = e^{x_i} / Σ_{j=1}^{n} e^{x_j}

wherein {x_1, …, x_n} is the set of all numbers requiring the SoftMax calculation, x_i is its i-th number, and n is the number of values over which SoftMax is calculated.
Step S2.4, performing a point addition of F''_{t-τ}, F_t and F''_{t+τ} to carry out feature fusion, obtaining the feature fusion result of the feature maps of the variable convolution space-time fusion modules.
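A sketch of steps S2.3 and S2.4 under the same assumptions: per-position cosine similarity between an aligned neighbour and the current feature map, a SoftMax across the two temporal weights, multiplication, and point addition. Taking the SoftMax over the two neighbour weights at each position is one plausible reading of the description, not a detail the patent spells out:

import torch
import torch.nn.functional as F

def enhance_and_fuse(f_prev_aligned, f_cur, f_next_aligned, eps=1e-6):
    """Steps S2.3 + S2.4: similarity-weighted enhancement, then point addition."""
    # Cosine similarity along the channel dimension: one weight per position.
    w_prev = F.cosine_similarity(f_prev_aligned, f_cur, dim=1, eps=eps)
    w_next = F.cosine_similarity(f_next_aligned, f_cur, dim=1, eps=eps)
    # SoftMax over the two similarity weights at each spatial position.
    w = torch.softmax(torch.stack([w_prev, w_next], dim=0), dim=0)
    # Multiply each aligned neighbour by its weight (broadcast over channels).
    e_prev = f_prev_aligned * w[0].unsqueeze(1)   # F''_{t-tau}
    e_next = f_next_aligned * w[1].unsqueeze(1)   # F''_{t+tau}
    # Step S2.4: point addition fuses the three feature maps.
    return e_prev + f_cur + e_next

# fused = enhance_and_fuse(aligned_prev, f_cur, aligned_next)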
In this embodiment, as shown in fig. 1, the variable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the region involved in the variable convolution calculation, p_0 is the coordinate of its center point, p_n is the enumeration of each position in R, Δp_n is the coordinate offset given by the variable convolution layer, w(p_n) is the convolution weight at that position, and x(p_0 + p_n + Δp_n) is the feature value of the corresponding pixel point.
In this embodiment, as shown in fig. 1, the value of x(p_0 + p_n + Δp_n) is obtained by bilinear interpolation, and the specific process is as follows.
Let the point represented by x(p_0 + p_n + Δp_n) be p, with feature value f(p) and coordinates (u, v). Since these coordinates are not integers, the point p has four nearest neighbouring points, denoted Q_11, Q_12, Q_21 and Q_22, whose coordinates are (u_1, v_1), (u_1, v_2), (u_2, v_1) and (u_2, v_2) respectively, and whose corresponding feature values on the feature map are f(Q_11), f(Q_12), f(Q_21) and f(Q_22) respectively.
First, two one-directional interpolations are performed along the abscissa, computing the feature values f(R_1) and f(R_2) of the temporary points R_1 and R_2. The specific calculation formulas are:

f(R_1) = ((u_2 - u) / (u_2 - u_1)) f(Q_11) + ((u - u_1) / (u_2 - u_1)) f(Q_21)
f(R_2) = ((u_2 - u) / (u_2 - u_1)) f(Q_12) + ((u - u_1) / (u_2 - u_1)) f(Q_22)

Then one single-directional interpolation along the ordinate is performed on the temporary points R_1 and R_2, obtaining the feature value f(p) of the point p. The specific calculation formula is:

f(p) = ((v_2 - v) / (v_2 - v_1)) f(R_1) + ((v - v_1) / (v_2 - v_1)) f(R_2)
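A direct transcription of these formulas into Python, assuming the feature map is indexed on a unit grid so that u_2 = u_1 + 1 and v_2 = v_1 + 1 (the usual case when sampling the fractional coordinates produced by a variable convolution):

import math

def bilinear_sample(feat, u, v):
    """Bilinear interpolation of a 2-D feature map `feat` (list of rows,
    indexed feat[vi][ui]) at the non-integer point p = (u, v)."""
    u1, v1 = math.floor(u), math.floor(v)
    u2, v2 = u1 + 1, v1 + 1
    q11, q21 = feat[v1][u1], feat[v1][u2]   # f(Q_11), f(Q_21)
    q12, q22 = feat[v2][u1], feat[v2][u2]   # f(Q_12), f(Q_22)
    # Two one-directional interpolations along the abscissa (u2 - u1 = 1).
    r1 = (u2 - u) * q11 + (u - u1) * q21    # f(R_1)
    r2 = (u2 - u) * q12 + (u - u1) * q22    # f(R_2)
    # One interpolation along the ordinate gives f(p).
    return (v2 - v) * r1 + (v - v1) * r2

grid = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 1.5, the average of the four corners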
in this embodiment, as shown in fig. 1, in step s2.3, the cosine similarity measure formula is calculated
Figure 962802DEST_PATH_IMAGE061
And
Figure 322239DEST_PATH_IMAGE062
the similarity of (c).
In this embodiment, as shown in fig. 1, the cosine similarity measure formula in step s2.3 is specifically,
Figure 384873DEST_PATH_IMAGE025
wherein the content of the first and second substances,
Figure 321605DEST_PATH_IMAGE026
in order to calculate the vectors of similarity, the two vectors have the same dimension.
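A one-line check of this formula in Python (torch assumed), using two parallel vectors of the same dimension, whose cosine similarity must equal 1:

import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])
cos = torch.dot(a, b) / (a.norm() * b.norm())
print(cos)  # tensor(1.) for parallel vectors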
In summary, the present invention fully utilizes the temporal context information in video sequence images, designs the variable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains the feature map strengthened by space-time fusion, and finally performs target detection on the feature map to obtain the final detection result, making it applicable to video target detection scenes, with the following advantages:
(1) for the target deformation present in video sequence images, geometric modeling of the target is established through multiple variable convolution layers, ensuring accurate extraction of image features;
(2) for low-quality video sequence images, the target features are enhanced through space-time context feature fusion, promoting detection by the subsequent network.
The above description is only illustrative of the present invention and is not intended to limit its scope; any substitution or modification of the technical solution of the present invention and its inventive concept made by a person skilled in the art shall fall within the scope of the present invention.

Claims (10)

1. A video target detection method based on space-time feature deformable convolution fusion, characterized in that it comprises:
step S1, selecting images from a video as network input, and obtaining a feature map for each input image;
step S2, inputting the feature maps into variable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature map obtained in step S2 to obtain the center point coordinates (x, y) and the length and width (w, h) of each target.
2. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 1, characterized in that step S1 specifically comprises the following steps:
step S1.1, selecting from the video sequence the (t-τ)-th, t-th and (t+τ)-th frames as the input images of the network, denoted I_{t-τ}, I_t and I_{t+τ};
step S1.2, inputting I_{t-τ}, I_t and I_{t+τ} respectively into a feature extraction network to extract features, obtaining their feature maps F_{t-τ}, F_t and F_{t+τ}.
3. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 2, characterized in that step S2 specifically comprises the following steps:
step S2.1, inputting F_{t-τ} and F_t together into one variable convolution space-time fusion module, and inputting F_{t+τ} and F_t together into another variable convolution space-time fusion module;
step S2.2, after the feature maps are input into a variable convolution space-time fusion module, first splicing the channels of the two feature maps input into the same module, then obtaining the space-time fusion offset of each module through several layers of the variable convolution calculation formula, and finally, guided by the respective offsets, obtaining the aligned features F'_{t-τ} and F'_{t+τ} from F_{t-τ} and F_{t+τ} by variable convolution calculation.
4. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that the following steps are performed after step S2.2:
step S2.3, calculating the similarity between F'_{t-τ} and F_t to obtain a similarity weight, passing the similarity weight through SoftMax, and multiplying it with F'_{t-τ} to obtain the feature-enhanced F''_{t-τ}; likewise, calculating the similarity between F'_{t+τ} and F_t to obtain a similarity weight, passing it through SoftMax, and multiplying it with F'_{t+τ} to obtain the feature-enhanced F''_{t+τ};
step S2.4, performing a point addition of F''_{t-τ}, F_t and F''_{t+τ} to carry out feature fusion, obtaining the feature fusion result of the feature maps of the variable convolution space-time fusion modules.
5. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 4, characterized in that the variable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the region involved in the variable convolution calculation, p_0 is the coordinate of its center point, p_n is the enumeration of each position in R, Δp_n is the coordinate offset given by the variable convolution layer, w(p_n) is the convolution weight at that position, and x(p_0 + p_n + Δp_n) is the feature value of the corresponding pixel point.
6. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 5, characterized in that the value of x(p_0 + p_n + Δp_n) is obtained by bilinear interpolation.
7. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 4, characterized in that in step S2.3 the similarity between the aligned features F'_{t-τ}, F'_{t+τ} and the current feature map F_t is calculated according to a cosine similarity measurement formula.
8. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 7, characterized in that the cosine similarity measurement formula in step S2.3 is specifically:

cos(θ) = (a · b) / (‖a‖ ‖b‖)

wherein a and b are the vectors whose similarity is calculated, and the two vectors have the same dimension.
9. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that in step S2.2 the space-time fusion offset of each module is respectively obtained through a 3-layer variable convolution calculation formula.
10. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 6, characterized in that the specific process of the bilinear interpolation is as follows:
let the point represented by x(p_0 + p_n + Δp_n) be p, with feature value f(p) and coordinates (u, v); since these coordinates are not integers, the point p has four nearest neighbouring points, denoted Q_11, Q_12, Q_21 and Q_22, whose coordinates are (u_1, v_1), (u_1, v_2), (u_2, v_1) and (u_2, v_2) respectively, and whose corresponding feature values on the feature map are f(Q_11), f(Q_12), f(Q_21) and f(Q_22) respectively;
first, two one-directional interpolations are performed along the abscissa, computing the feature values f(R_1) and f(R_2) of the temporary points R_1 and R_2, with the specific calculation formulas:

f(R_1) = ((u_2 - u) / (u_2 - u_1)) f(Q_11) + ((u - u_1) / (u_2 - u_1)) f(Q_21)
f(R_2) = ((u_2 - u) / (u_2 - u_1)) f(Q_12) + ((u - u_1) / (u_2 - u_1)) f(Q_22)

then one single-directional interpolation along the ordinate is performed on the temporary points R_1 and R_2, obtaining the feature value f(p) of the point p, with the specific calculation formula:

f(p) = ((v_2 - v) / (v_2 - v_1)) f(R_1) + ((v - v_1) / (v_2 - v_1)) f(R_2)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110196121.8A CN112561001A (en) 2021-02-22 2021-02-22 Video target detection method based on space-time feature deformable convolution fusion


Publications (1)

Publication Number Publication Date
CN112561001A (en) 2021-03-26

Family

ID=75036042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110196121.8A Pending CN112561001A (en) 2021-02-22 2021-02-22 Video target detection method based on space-time feature deformable convolution fusion

Country Status (1)

Country Link
CN (1) CN112561001A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822172A (en) * 2021-08-30 2021-12-21 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences (中国科学院上海微系统与信息技术研究所) Video spatiotemporal behavior detection method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287826A (en) * 2019-06-11 2019-09-27 Beijing University of Technology (北京工业大学) A video object detection method based on an attention mechanism


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BRANDON懂你: "Bilinear interpolation in image processing" (图像处理之双线性插值), https://blog.csdn.net/qq_37577735/article/details/80041586 *
GEDAS BERTASIUS et al.: "Object Detection in Video with Spatiotemporal Sampling Networks", https://arxiv.org/pdf/1803.05549.pdf *



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210326)