CN112561001A - Video target detection method based on space-time feature deformable convolution fusion - Google Patents
- Publication number
- CN112561001A (application CN202110196121.8A)
- Authority
- CN
- China
- Prior art keywords
- fusion
- feature
- space
- characteristic
- convolution
- Prior art date
- 2021-02-22
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/251—Fusion techniques of input or preprocessed data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video target detection method based on space-time feature deformable convolution fusion, relating to the technical field of image processing. The method comprises: step S1, selecting images from a video as input to the network and obtaining feature maps of those images; step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules; step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width. The invention makes full use of the temporal context information in video sequence images, designs a deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on those feature maps to obtain the final detection result. The method is suitable for video target detection scenarios.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a video target detection method based on space-time feature deformable convolution fusion.
Background
Target detection is an important task in the field of computer vision and has wide application in real life. With the rapid development of internet and communication technology, video data has become an indispensable part of daily life, so video target detection is becoming increasingly important.
Current research on target detection is mostly based on static images; research on video sequence images is comparatively scarce. Compared with static images, targets in video sequence images may, for various reasons, appear at different sizes, in different postures, under viewpoint changes, or even with non-rigid deformation. Because target detection algorithms based on static images mostly rely on the conventional convolution structure and lack sufficient capability for modeling geometric deformation, these problems can prevent the feature extraction network from extracting image features effectively, so video sequence images are not handled well. Meanwhile, video sequence images contain extremely rich temporal context information that static-image target detection algorithms cannot exploit, so the detection accuracy fails to meet practical requirements.
Disclosure of Invention
The main purpose of the invention is to provide a video target detection method based on space-time feature deformable convolution fusion that makes full use of the temporal context information in video sequence images, designs a deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on these feature maps to obtain the final detection result, offering good prospects for application.
The purpose of the invention can be achieved by adopting the following technical scheme:
a video target detection method based on space-time feature deformable convolution fusion comprises:
step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width.
In the foregoing video target detection method based on space-time feature deformable convolution fusion, step S1 specifically comprises the following steps:
step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
step S1.2, inputting I_{t-1}, I_t and I_{t+1} respectively into the feature extraction network to extract features, obtaining their feature maps F_{t-1}, F_t and F_{t+1}.
In the foregoing video target detection method based on space-time feature deformable convolution fusion, step S2 specifically comprises the following steps:
step S2.1, inputting the feature maps F_{t-1} (preceding frame) and F_t (current frame) together into one deformable convolution space-time fusion module, and inputting F_{t+1} (following frame) and F_t into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution.
The foregoing video target detection method based on spatio-temporal feature deformable convolution fusion further comprises, after step S2.2, the following steps:
step S2.3, calculating the similarity between the aligned feature F'_{t-1} and the current-frame feature F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and likewise calculating the similarity between F'_{t+1} and F_t, applying SoftMax to the resulting weight and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
step S2.4, performing point-wise addition of F''_{t-1}, F_t and F''_{t+1} to fuse the features, obtaining the feature fusion result of the deformable convolution space-time fusion modules.
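The weighting and fusion of steps S2.3 and S2.4 can be illustrated with a minimal sketch in Python. The patent does not spell out whether SoftMax is normalized jointly over the two neighbouring frames or within each module, so this sketch assumes joint normalization of the two cosine-similarity scores at a single spatial position; the function and variable names are illustrative, not from the patent.

```python
import math

def cosine(a, b):
    # cos(A, B) = (A . B) / (|A| * |B|), for two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse_pixel(f_prev, f_t, f_next):
    """Steps S2.3-S2.4 at one spatial position: similarity-weight the
    aligned neighbour features, then point-add them with the current frame."""
    s_prev = cosine(f_prev, f_t)          # similarity weight for F'_{t-1}
    s_next = cosine(f_next, f_t)          # similarity weight for F'_{t+1}
    e_prev, e_next = math.exp(s_prev), math.exp(s_next)
    w_prev = e_prev / (e_prev + e_next)   # SoftMax over the two weights
    w_next = e_next / (e_prev + e_next)
    return [w_prev * p + c + w_next * n   # point-wise addition (step S2.4)
            for p, c, n in zip(f_prev, f_t, f_next)]

fused = fuse_pixel([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])
```

With these symmetric inputs the two neighbours receive equal weight 0.5, so each output channel is 0.5 · 1 + 1 + 0.5 · 0 = 1.5.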
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the deformable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the sampling region involved in the deformable convolution calculation, p_0 is the coordinate of the center point of R, p_n enumerates each position in R, Δp_n is the coordinate offset output by the deformable convolution layer for position p_n, w(p_n) is the convolution kernel weight at p_n, and x(·) is the feature value of the corresponding pixel.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the value of x(p_0 + p_n + Δp_n) is derived by bilinear interpolation, since the offset sampling position is generally fractional.
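The formula above, together with the bilinear interpolation it requires, can be sketched as follows. This is a generic single-position deformable sampling under the assumption of a 3x3 grid R with given per-position offsets and kernel weights; the names and the linear toy feature map are illustrative, not from the patent.

```python
import math

def bilinear(feat, x, y):
    # Feature value at fractional (x, y), from the four nearest grid points.
    x1, y1 = int(math.floor(x)), int(math.floor(y))
    x2, y2 = x1 + 1, y1 + 1
    r1 = (x2 - x) * feat[y1][x1] + (x - x1) * feat[y1][x2]  # along abscissa at y1
    r2 = (x2 - x) * feat[y2][x1] + (x - x1) * feat[y2][x2]  # along abscissa at y2
    return (y2 - y) * r1 + (y - y1) * r2                    # along ordinate

def deform_sample(feat, p0, offsets, weights):
    """y(p0) = sum over p_n in R of w(p_n) * x(p0 + p_n + dp_n),
    with R a 3x3 grid and dp_n the learned offset for each position."""
    grid = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # the region R
    out = 0.0
    for (gx, gy), (ox, oy), w in zip(grid, offsets, weights):
        out += w * bilinear(feat, p0[0] + gx + ox, p0[1] + gy + oy)
    return out

# 5x5 map with value x + 10*y; zero offsets reduce to an ordinary 3x3 average.
feat = [[x + 10 * y for x in range(5)] for y in range(5)]
y0 = deform_sample(feat, (2, 2), [(0.0, 0.0)] * 9, [1 / 9] * 9)
```

On this linear toy map the averaged sample equals the center value (22.0), and a uniform fractional offset of (0.5, 0) shifts every sample half a pixel to the right, adding exactly 0.5.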
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, in step S2.3 the similarity between each offset-aligned neighbouring feature map and the current-frame feature map F_t is calculated according to the cosine similarity measurement formula.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

wherein A and B are the two vectors whose similarity is to be calculated; the two vectors have the same dimension.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the spatio-temporal fusion offsets of each module are respectively obtained through a 3-layer deformable convolution calculation in step S2.2.
In the foregoing video target detection method based on spatio-temporal feature deformable convolution fusion, the specific process of bilinear interpolation is as follows.

Let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points, denoted Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22).

First, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2); the specific calculation formulas are as follows:

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

Then one single-direction interpolation is performed along the ordinate between the temporary points R_1 and R_2, obtaining the feature value f(P) of point P; the specific calculation formula is as follows:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
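As a concrete check of the two-stage procedure above, the following snippet interpolates at the midpoint of a unit cell; the corner values 1, 2, 3, 4 are arbitrary illustration values, not from the patent.

```python
def bilinear_two_stage(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """f(P) at P = (x, y) inside the cell with corners Q11 = (x1, y1),
    Q21 = (x2, y1), Q12 = (x1, y2), Q22 = (x2, y2)."""
    # Stage 1: two single-direction interpolations along the abscissa.
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21  # f(R1) at (x, y1)
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22  # f(R2) at (x, y2)
    # Stage 2: one single-direction interpolation along the ordinate.
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2  # f(P)

# Midpoint of the unit cell: the result is the mean of the four corner values.
val = bilinear_two_stage(0.5, 0.5, 0, 0, 1, 1, 1.0, 2.0, 3.0, 4.0)
```

At the midpoint the result is (1 + 2 + 3 + 4) / 4 = 2.5, and at a corner the formula returns that corner's value exactly.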
the invention has the beneficial technical effects that:
the invention fully utilizes the time context information in the video sequence image, designs the variable convolution space-time fusion module to overcome the difficulty of extracting the characteristics of the video sequence image, obtains the characteristic diagram strengthened by space-time fusion, and finally carries out target detection on the characteristic diagram to obtain the final detection result, thereby being applicable to the video target detection scene and having the following advantages:
(1) aiming at the condition of target deformation of a video sequence image, establishing geometric modeling of a target through a plurality of layers of variable convolution layers to ensure accurate extraction of image characteristics;
(2) aiming at the condition of low quality of video sequence images, the target characteristics are enhanced through spatio-temporal context characteristic fusion, and the detection of subsequent networks is promoted.
Drawings
FIG. 1 is a flow chart of a method for video target detection based on spatiotemporal feature deformable convolution fusion in accordance with the present invention;
FIG. 2 is a flow diagram of a variable convolution spatiotemporal fusion module in accordance with the present invention;
FIG. 3 is a flow chart of feature fusion in accordance with the present invention.
Detailed Description
In order to make the technical solutions of the present invention more clear and definite for those skilled in the art, the present invention is further described in detail below with reference to the examples and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
As shown in fig. 1-3, the method for detecting a video target based on spatio-temporal feature deformable convolution fusion provided by this embodiment includes
Step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into the deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
step S3, detecting targets on the feature maps obtained in step S2 to obtain the coordinates of each target's center point and the target's length and width.
In the present embodiment, as shown in fig. 1, step S1 specifically includes the following steps
Step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
step S1.2, inputting I_{t-1}, I_t and I_{t+1} respectively into the feature extraction network to extract features, obtaining their feature maps F_{t-1}, F_t and F_{t+1}.
in the present embodiment, as shown in fig. 1, step S2 specifically includes the following steps
Step S2.1, inputting F_{t-1} and F_t together into one deformable convolution space-time fusion module, and inputting F_{t+1} and F_t into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution; the two modules perform their calculations simultaneously, in no particular order.
In this embodiment, in step S2.2 the space-time fusion offsets are obtained through 3 layers of deformable convolution; that is, each of the two modules obtains its own space-time fusion offsets through its own 3-layer deformable convolution calculation.
In this embodiment, as shown in fig. 1, the step s2.2 is followed by the following steps
Step S2.3, calculating the similarity between F'_{t-1} and the current-frame feature F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and calculating the similarity between F'_{t+1} and F_t to obtain a similarity weight, applying SoftMax to it and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
In this embodiment, the SoftMax calculation formula in step S2.3 is specifically:

softmax(x_i) = e^{x_i} / Σ_{j=1}^{K} e^{x_j}

wherein X = {x_1, ..., x_K} is the collection of all numbers requiring the SoftMax calculation, x_i is the i-th number in X, and K is the number of values over which SoftMax is calculated.
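The SoftMax formula above can be written directly in Python; subtracting the maximum before exponentiation is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(xs):
    # softmax(x_i) = e^{x_i} / sum over j = 1..K of e^{x_j}
    m = max(xs)                             # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

weights = softmax([1.0, 2.0, 3.0])          # sums to 1; the largest input dominates
```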
Step S2.4, performing point-wise addition of F''_{t-1}, F_t and F''_{t+1} to fuse the features, obtaining the feature fusion result of the deformable convolution space-time fusion modules.
In this embodiment, as shown in fig. 1, the deformable convolution calculation formula in step S2.2 is specifically:

y(p_0) = Σ_{p_n ∈ R} w(p_n) · x(p_0 + p_n + Δp_n)

wherein R is the sampling region involved in the deformable convolution calculation, p_0 is the coordinate of the center point of R, p_n enumerates each position in R, Δp_n is the coordinate offset output by the deformable convolution layer for position p_n, w(p_n) is the convolution kernel weight at p_n, and x(·) is the feature value of the corresponding pixel.
In this embodiment, as shown in fig. 1, the value of x(p_0 + p_n + Δp_n) is obtained by bilinear interpolation; the specific process is as follows.

Let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points, denoted Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22).

First, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2); the specific calculation formulas are as follows:

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

Then one single-direction interpolation is performed along the ordinate between the temporary points R_1 and R_2, obtaining the feature value f(P) of point P; the specific calculation formula is as follows:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
in this embodiment, as shown in fig. 1, in step s2.3, the cosine similarity measure formula is calculatedAndthe similarity of (c).
In this embodiment, as shown in fig. 1, the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

wherein A and B are the two vectors whose similarity is to be calculated; the two vectors have the same dimension.
In summary, the invention makes full use of the temporal context information in video sequence images, designs the deformable convolution space-time fusion module to overcome the difficulty of extracting features from video sequence images, obtains feature maps strengthened by space-time fusion, and finally performs target detection on these feature maps to obtain the final detection result. It is suitable for video target detection scenarios and has the following advantages:
(1) for targets that deform across a video sequence, geometric modeling of the target is established through several deformable convolution layers to ensure accurate extraction of image features;
(2) for low-quality video sequence images, the target features are enhanced through space-time context feature fusion, which benefits detection by the subsequent network.
The above description is only intended to illustrate the present invention and not to limit its scope; any person skilled in the art may make substitutions or changes to the technical solution and its conception within the scope of the present invention.
Claims (10)
1. A video target detection method based on space-time feature deformable convolution fusion, characterized in that it comprises:
step S1, selecting images from the video as input to the network, and obtaining feature maps of these input images;
step S2, inputting the feature maps respectively into deformable convolution space-time fusion modules to obtain feature maps strengthened by the modules;
2. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 1, characterized in that: step S1 specifically includes the following steps
Step S1.1, selecting a preceding frame, the current frame and a following frame of the video sequence (the (t-1)-th, t-th and (t+1)-th frames) as input images of the network, denoted I_{t-1}, I_t and I_{t+1};
3. the method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 2, characterized in that: step S2 specifically includes the following steps
Step S2.1, inputting the feature maps F_{t-1} and F_t of the preceding and current frames together into one deformable convolution space-time fusion module, and inputting the feature maps F_{t+1} and F_t of the following and current frames into another deformable convolution space-time fusion module;
step S2.2, after the feature maps are input into a deformable convolution space-time fusion module, first concatenating the channels of the two feature maps input to the same module, then obtaining each module's space-time fusion offsets through several layers of deformable convolution, and finally, guided by the respective offsets, computing the offset-aligned features F'_{t-1} and F'_{t+1} by deformable convolution.
4. the method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that: also included after step s2.2 is the following
Step S2.3, calculating the similarity between F'_{t-1} and F_t to obtain a similarity weight, applying SoftMax to the weight and multiplying it with F'_{t-1} to obtain the feature-enhanced F''_{t-1}; and calculating the similarity between F'_{t+1} and F_t to obtain a similarity weight, applying SoftMax to it and multiplying it with F'_{t+1} to obtain the feature-enhanced F''_{t+1};
5. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 4, characterized in that:
the variable convolution calculation formula in step s2.2 is specifically:
wherein the content of the first and second substances,the regions involved in the calculation for the variable convolution are,is composed ofThe coordinates of the center point of (a),is a pair ofThe enumeration of each of the positions in (a),the amount of coordinate shift given to the variable convolution layer,the characteristic value of the corresponding pixel point.
8. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 7, characterized in that: the cosine similarity measurement formula in step S2.3 is specifically:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)
9. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 3, characterized in that: in step S2.2, the space-time fusion offsets of each module are respectively obtained through a 3-layer deformable convolution calculation.
10. The method for detecting video targets based on spatio-temporal feature deformable convolution fusion of claim 6, characterized in that: the specific process of the bilinear interpolation is as follows:
let the point to be interpolated be P, with feature value f(P) and coordinates (x, y); since these coordinates are not integers, P has four nearest surrounding grid points Q_11, Q_21, Q_12 and Q_22, whose coordinates are (x_1, y_1), (x_2, y_1), (x_1, y_2) and (x_2, y_2) and whose corresponding feature values on the feature map are f(Q_11), f(Q_21), f(Q_12) and f(Q_22);
first, 2 single-direction interpolations are carried out along the abscissa, calculating the feature values f(R_1) and f(R_2) of the temporary points R_1 = (x, y_1) and R_2 = (x, y_2):

f(R_1) = ((x_2 - x)/(x_2 - x_1)) · f(Q_11) + ((x - x_1)/(x_2 - x_1)) · f(Q_21)
f(R_2) = ((x_2 - x)/(x_2 - x_1)) · f(Q_12) + ((x - x_1)/(x_2 - x_1)) · f(Q_22)

then one single-direction interpolation is performed along the ordinate between R_1 and R_2, obtaining the feature value f(P) of point P:

f(P) = ((y_2 - y)/(y_2 - y_1)) · f(R_1) + ((y - y_1)/(y_2 - y_1)) · f(R_2)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110196121.8A | 2021-02-22 | 2021-02-22 | Video target detection method based on space-time feature deformable convolution fusion |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN112561001A | 2021-03-26 |
Family
ID=75036042
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110196121.8A (Pending) | Video target detection method based on space-time feature deformable convolution fusion | 2021-02-22 | 2021-02-22 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN112561001A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113822172A | 2021-08-30 | 2021-12-21 | 中国科学院上海微系统与信息技术研究所 (Shanghai Institute of Microsystem and Information Technology, CAS) | Video spatiotemporal behavior detection method |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110287826A | 2019-06-11 | 2019-09-27 | 北京工业大学 (Beijing University of Technology) | A video object detection method based on attention mechanism |
Non-Patent Citations (2)

- BRANDON懂你: "图像处理之双线性插值" ("Bilinear interpolation in image processing"), CSDN blog, https://blog.csdn.net/qq_37577735/article/details/80041586
- Gedas Bertasius et al.: "Object Detection in Video with Spatiotemporal Sampling Networks", https://arxiv.org/pdf/1803.05549.pdf
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210326 |