WO2023116632A1 - Video instance segmentation method and apparatus based on spatio-temporal memory information - Google Patents

Info

Publication number
WO2023116632A1
Authority
WO
WIPO (PCT)
Prior art keywords
memory
feature map
instance
frame image
segmented
Prior art date
Application number
PCT/CN2022/140070
Other languages
French (fr)
Chinese (zh)
Inventor
周翊民
马壮
Original Assignee
中国科学院深圳先进技术研究院
Priority date
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2023116632A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Definitions

  • the invention belongs to the technical field of video processing, and in particular relates to a method for segmenting video instances based on spatio-temporal memory information, a segmenting device, a computer-readable storage medium, and computer equipment.
  • The goal of video instance segmentation is to segment objects of certain specific categories and obtain their segmentation masks without any human intervention. Unlike unsupervised video object segmentation, video instance segmentation needs to identify specific objects, not just salient objects. To complete the detection, segmentation and tracking tasks at the same time, most video instance segmentation methods take an object detection method as the base framework and extend it with a segmentation module. They are divided into single-stage and two-stage methods.
  • The two-stage method is the "detect first, then segment" method: the bounding box of the target object is located first, and the target object is then segmented within that box.
  • The typical representative is Mask R-CNN.
  • Mask R-CNN adds a branch that predicts segmentation masks on top of Faster R-CNN and relies heavily on ROI features and operations. It first generates a set of candidate proposals and then predicts the foreground mask on each ROI.
  • The two-stage approach has three problems. First, features cannot be shared between segmentation and detection, so end-to-end backpropagation cannot be performed. Second, ROI features are cropped to a fixed resolution, so some large objects lose segmentation accuracy. Finally, the ROI itself is a problem: the ROI candidate area is much larger than the final prediction, which limits the operating efficiency of the algorithm.
  • Single-stage methods treat detection, segmentation, and tracking in video instance segmentation as problems that can be solved simultaneously.
  • Early single-stage methods do not perform target detection but segment directly, which loses the category information of the object and gives low accuracy.
  • Later single-stage methods mainly design the combination relationship between prototype masks and target instances: by learning a set of coefficients, the target position is matched with the semantic segmentation result.
  • In 2019, YOLACT decomposed instance segmentation into two parallel tasks, generating a set of prototype masks and predicting mask coefficients for each instance, which further improved accuracy.
  • SG-Net and ST-Mask, proposed in 2021, refine the segmentation module of the aforementioned methods and add the segmentation result of the previous frame to guide the segmentation of the current frame.
  • However, the historical segmentation results span many frames and include segmentation results for different states of the target instance, which are important guidance for making the segmentation branch resistant to physical deformation and occlusion.
  • the technical problem solved by the invention is: how to make full use of the historical segmentation results in video instance segmentation to improve the robustness of the segmentation.
  • a video instance segmentation method based on spatiotemporal memory information comprising:
  • the instance segmentation result of the current frame image is obtained according to the global feature map, the query key feature map and the query value feature map.
  • the method for obtaining the query key feature map and the query value feature map of the instance to be segmented of the current frame image in the video includes:
  • the method for obtaining the memory key feature map and memory value feature map of the instance to be segmented in each memory frame image in the video includes:
  • the overall memory key feature map and the overall memory value feature map of the historical frame image are binarized to obtain the memory key feature map and the memory value feature map of each instance in the historical frame image.
  • the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image are selected from the memory key feature maps and memory value feature maps of each instance in each of the historical frame images.
  • the method for calculating the weight value of the memory key feature map of the instance to be segmented in each of the memory frame images when performing attention matching includes:
  • a global pooling process is performed on the feature maps connected by the channel dimensions to obtain each weight value.
  • the video instance segmentation method also includes:
  • the predetermined number is determined according to the cosine similarity and the total number of memory frame images containing the instance to be segmented.
  • the method for obtaining a global feature map with weighted spatio-temporal information according to each of the weight coefficients, the memory key feature map and the memory value feature map of the memory frame image corresponding to each of the weight coefficients includes:
  • the global feature map G_{u,n} with weighted spatio-temporal information is calculated according to the following formula:
  • 1 ≤ u ≤ g_{q,n}, where g_{q,n} represents the predetermined number
  • KM_{u,n} represents the memory key feature map of a memory frame image
  • the method for obtaining the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map includes:
  • the attention matrix and the query value feature map of the instance to be segmented are connected in the channel dimension, and the result of the connection operation is sent to the decoder for deconvolution and upsampling to obtain the instance segmentation result.
  • the present application also discloses a video instance segmentation device based on a spatiotemporal memory weighted network, and the video instance segmentation device includes:
  • the feature map acquisition module is used to obtain the query key feature map and the query value feature map of the instance to be segmented in the current frame image in the video and the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image,
  • the memory frame image is the historical frame image containing the instance to be segmented before the current frame image in the video
  • Weight value calculation module used to calculate the weight value of the memory key feature map of the instance to be segmented in each of the memory frame images when performing attention matching
  • a weight coefficient screening module configured to select a predetermined number of weight values from all weight values as weight coefficients in descending order
  • a weighted value calculation module used to obtain a global feature map with weighted spatiotemporal information according to each of the weight coefficients, the memory key feature map and the memory value feature map of the memory frame image corresponding to each of the weight coefficients;
  • An attention matching module configured to obtain an instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
  • the present application also discloses a computer-readable storage medium that stores a video instance segmentation program based on a spatiotemporal memory weighted network; when this program is executed by a processor, it implements the above-mentioned video instance segmentation method based on spatiotemporal memory information.
  • the present application also discloses a computer device, which includes a computer-readable storage medium, a processor, and a video instance segmentation program based on a spatiotemporal memory weighted network stored in the computer-readable storage medium.
  • the invention discloses a video instance segmentation method and segmentation device based on spatio-temporal memory information. Compared with the existing method, it has the following technical effects:
  • This method can make full use of the historical information of memory frame images and improve the robustness of the segmentation results.
  • by selecting only the high-weight memory frame images for weighted matching, it avoids using all memory frame images directly in the calculation and so reduces the amount of calculation.
  • the feature map is binarized, so that the spatial attention matching is only performed locally, which reduces the influence of similar objects on the segmentation results.
  • Fig. 1 is the overall flowchart of the video instance segmentation method based on spatio-temporal memory information of Embodiment 1 of the present invention
  • Fig. 2 is the detailed flowchart of the video instance segmentation method based on spatio-temporal memory information of Embodiment 1 of the present invention
  • FIG. 3 is a schematic diagram of the calculation process of the weight value of the memory key feature map according to Embodiment 1 of the present invention.
  • FIG. 4 is a functional block diagram of a video instance segmentation device based on a spatiotemporal memory weighted network according to Embodiment 2 of the present invention
  • FIG. 5 is a schematic diagram of computer equipment according to Embodiment 4 of the present invention.
  • the feature map performs spatio-temporal weighted matching on the key feature map and value feature map of the instance to be segmented in the current frame image to obtain the final video instance segmentation result.
  • This method can make full use of the historical information of memory frame images and improve the robustness of the segmentation results.
  • by selecting only the memory frame images with high weights for weighted matching, it avoids using all memory frame images directly in the calculation and so reduces the amount of calculation.
  • the video instance segmentation method based on spatiotemporal memory information in the first embodiment includes the following steps:
  • Step S10: Obtain the query key feature map and query value feature map of the instance to be segmented in the current frame image in the video, and the memory key feature map and memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame image is a historical frame image containing the instance to be segmented before the current frame image in the video;
  • Step S20: Calculate the weight value of the memory key feature map of the instance to be segmented in each memory frame image when performing attention matching;
  • Step S30: Select a predetermined number of weight values from all weight values in descending order as weight coefficients;
  • Step S40: Obtain a global feature map with weighted spatiotemporal information according to each weight coefficient and the memory key feature map and memory value feature map of the memory frame image corresponding to each weight coefficient;
  • Step S50: Obtain an instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
  • In step S10, the method for obtaining the query key feature map and the query value feature map of the instance to be segmented of the current frame image in the video includes the following steps:
  • Step S101: Perform feature extraction on the current frame image to obtain several feature maps of different scales.
  • The instance segmentation framework adopts the FCOS single-stage framework and adds a segmentation branch on top of the center point and bounding box prediction.
  • The backbone network is the same as in FCOS and uses ResNet and FPN to extract input features.
  • ResNet is used to extract the features of the current frame image, yielding five convolution feature maps of different scales, r1, r2, r3, r4 and r5. Three of them, r3, r4 and r5, are laterally connected to the FPN network to obtain pyramid feature maps P_3, P_4 and P_5 at three different scales. A pooling operation on P_5 then produces the pyramid feature maps P_6 and P_7, and the pyramid feature maps P_3, P_4, P_5, P_6 and P_7 of different scales serve as the subsequent input.
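  • As an illustration of how P_6 and P_7 could be derived from P_5 by pooling, the following is a minimal sketch. The source only says "pooling", so the stride-2, 2x2 max pooling used here is an assumption, as are the function name `max_pool_2x2` and the toy single-channel 4x4 input; real pyramid feature maps have many channels.

```python
from typing import List

Grid = List[List[float]]


def max_pool_2x2(feature: Grid) -> Grid:
    """Downsample a 2-D map with non-overlapping 2x2 max pooling (stride 2).

    Assumption: the source does not state the pooling type or stride.
    Trailing odd rows/columns are pooled over the cells that exist.
    """
    rows, cols = len(feature), len(feature[0])
    pooled: Grid = []
    for r in range(0, rows, 2):
        pooled_row = []
        for c in range(0, cols, 2):
            window = [
                feature[i][j]
                for i in range(r, min(r + 2, rows))
                for j in range(c, min(c + 2, cols))
            ]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Toy single-channel stand-in for P_5.
p5 = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
p6 = max_pool_2x2(p5)  # half the resolution of p5
p7 = max_pool_2x2(p6)  # half the resolution of p6
```

  • Each pooling step halves the spatial resolution, so P_6 and P_7 cover progressively larger receptive fields without further backbone computation.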
  • Step S102: Obtain the center point and bounding box of the instance to be segmented, and the overall query key feature map and overall query value feature map of the current frame image, according to the feature maps of different scales.
  • The pyramid feature maps P_3, P_4, P_5, P_6 and P_7 of different scales are respectively input into the center point regression branch network and the bounding box regression and classification prediction branch network to obtain the center point CE_{t,i}, bounding box B_{t,i} and class CL_{t,i} of the instance to be segmented.
  • The fourth convolutional block of ResNet combined with a convolutional layer serves as the query frame encoder. The original image of the current frame is input to the query frame encoder, which outputs the overall query key feature map K_q and the overall query value feature map V_q of the current frame image. Here q stands for query, and the query frame is the current frame.
  • Step S103: Binarize the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, and obtain the query key feature map and query value feature map of the instance to be segmented.
  • Binarization uses the bounding box of each instance to be segmented enlarged to 1.5 times its size: within the overall query key feature map and the overall query value feature map of the current frame image, the pixel gray value of the area where the bounding box of the instance to be segmented is located is set to 1, and the pixel gray value of other areas is set to 0.
  • N is the number of instances to be segmented, and n ∈ [1, N].
  • The binarization process prevents similar instances from affecting the segmentation results, which improves segmentation accuracy.
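  • The binarization described above can be sketched as building a 0/1 mask from the enlarged bounding box. This is a hedged illustration, not the patent's implementation: the helper names `enlarged_box_mask` and `apply_mask`, the single-channel grid, the pixel-inclusion rule, and applying the mask by elementwise multiplication are all assumptions.

```python
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


def enlarged_box_mask(
    height: int,
    width: int,
    center: Tuple[float, float],
    box_size: Tuple[float, float],
    scale: float = 1.5,
) -> Mask:
    """Return a mask that is 1 inside the box scaled about its center, else 0.

    center is (row, col); box_size is (box_height, box_width).
    Assumption: a pixel counts as inside when its index lies within
    half the scaled box size of the center along both axes.
    """
    cy, cx = center
    half_h = box_size[0] * scale / 2.0
    half_w = box_size[1] * scale / 2.0
    return [
        [1 if abs(y - cy) <= half_h and abs(x - cx) <= half_w else 0
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(feature: Grid, mask: Mask) -> Grid:
    """Keep feature values inside the mask and zero the rest (assumed usage)."""
    return [[v * m for v, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature, mask)]


# Toy 6x6 single-channel feature map with an instance centered at (2, 2).
feature = [[2.0] * 6 for _ in range(6)]
mask = enlarged_box_mask(6, 6, center=(2, 2), box_size=(2, 2))
instance_feature = apply_mask(feature, mask)
```

  • With a 2x2 box scaled by 1.5, the mask covers rows and columns 1 to 3, so features of a similar object elsewhere in the frame are zeroed before attention matching.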
  • In step S10, the method for obtaining the memory key feature map and memory value feature map of the instance to be segmented in each memory frame image in the video includes:
  • Step S111: Acquire the segmentation results, the original images, and the center point and bounding box of each instance for all historical frame images before the current frame image in the video.
  • For all historical frame images I_1 to I_{t-1}, the segmentation results, the original images, and the center point and bounding box of each instance have been stored in advance.
  • Step S112: According to the segmentation results and original images corresponding to the historical frame images, obtain the overall memory key feature map and overall memory value feature map of the historical frame images.
  • Step S113: According to the center point and bounding box of each instance in the historical frame image, binarize the overall memory key feature map and the overall memory value feature map of the historical frame image to obtain the memory key feature map and memory value feature map of each instance in the historical frame image.
  • Binarization uses the bounding box of each instance enlarged to 1.5 times its size: within the overall memory key feature map and the overall memory value feature map of the historical frame image, the pixel gray value of the area where the bounding box of the instance is located is set to 1, and the pixel gray value of other areas is set to 0.
  • N is the number of instances.
  • The binarization process prevents similar instances from affecting the segmentation results, which improves segmentation accuracy.
  • Step S114: According to the category of the instance to be segmented, select the memory key feature map and memory value feature map of the instance to be segmented in each memory frame image from the memory key feature maps and memory value feature maps of each instance in each historical frame image.
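  • Step S114 amounts to selecting, from everything stored for the historical frames, only the entries that belong to the instance being segmented. The sketch below shows one possible way to organize that lookup. The `InstanceMemory` record, its field names and the function `select_memory_for` are hypothetical, and matching on the category label alone is a simplification: this passage does not say how several instances of the same category would be told apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceMemory:
    """Hypothetical record of one instance in one historical frame."""
    frame_index: int
    category: str
    key_map: List[float]    # flattened memory key feature map (toy values)
    value_map: List[float]  # flattened memory value feature map (toy values)


def select_memory_for(category: str,
                      history: List[InstanceMemory]) -> List[InstanceMemory]:
    """Keep only the stored entries whose category matches the target instance.

    Frames that contribute at least one entry play the role of the
    "memory frame images" for that instance.
    """
    return [entry for entry in history if entry.category == category]


history = [
    InstanceMemory(1, "person", [0.1], [0.2]),
    InstanceMemory(1, "dog", [0.3], [0.4]),
    InstanceMemory(2, "dog", [0.5], [0.6]),
    InstanceMemory(3, "person", [0.7], [0.8]),
]
dog_memory = select_memory_for("dog", history)
dog_frames = [entry.frame_index for entry in dog_memory]
```

  • Frame 3 contains no "dog" entry, so it is not a memory frame for that instance and contributes nothing to the later weighting.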
  • The method for calculating the weight value of the memory key feature map of the instance to be segmented in each memory frame image when performing attention matching includes: spatially connecting the memory key feature maps of the instance to be segmented in each memory frame image to obtain the feature map connected by channel dimensions; and performing global pooling on the feature map connected by channel dimensions to obtain each weight value.
  • The memory key feature map KM_{T,n} of the instance to be segmented has a dimension of H × W × C, and the total number of memory frame images is L.
  • The feature map C_n connected by channel dimensions is obtained; its dimension is HWC × L. Global pooling is performed with a convolution kernel of size H × W × C to obtain the weight vector W_n, which contains L weight values.
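  • The weight computation can be pictured as flattening each of the L memory key feature maps into a column of length HWC and reducing every column to one scalar. The sketch below is an assumption-laden illustration: the function names `flatten` and `memory_weights` are hypothetical, and the default uniform averaging kernel stands in for the H × W × C convolution kernel, whose actual values (learned or fixed) the source does not give.

```python
from typing import List, Optional

FeatureMap = List[List[List[float]]]  # H x W x C nested lists


def flatten(feature: FeatureMap) -> List[float]:
    """Flatten an H x W x C map into a vector of length H*W*C."""
    return [v for row in feature for pixel in row for v in pixel]


def memory_weights(key_maps: List[FeatureMap],
                   kernel: Optional[List[float]] = None) -> List[float]:
    """Reduce L memory key feature maps to L scalar weights.

    Each map becomes one column of an (HWC x L) matrix; a kernel of
    length HWC is applied to every column. Assumption: when no kernel
    is supplied, a uniform averaging kernel (global average pooling)
    is used as a stand-in for the kernel described in the source.
    """
    columns = [flatten(m) for m in key_maps]
    size = len(columns[0])
    if any(len(col) != size for col in columns):
        raise ValueError("all memory key feature maps must share H, W and C")
    if kernel is None:
        kernel = [1.0 / size] * size
    return [sum(k * v for k, v in zip(kernel, col)) for col in columns]


# Two toy memory key feature maps with H=1, W=2, C=2.
km_1 = [[[1.0, 2.0], [3.0, 4.0]]]
km_2 = [[[0.0, 0.0], [0.0, 4.0]]]
weights = memory_weights([km_1, km_2])
```

  • The result is one weight per memory frame, so the output length always equals L regardless of the spatial size of the feature maps.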
  • the video instance segmentation method also includes:
  • the predetermined number is determined according to the cosine similarity and the total number L_n of memory frame images containing the instance to be segmented; specifically, the predetermined number g_{q,n} is calculated according to the following formula:
  • the top g_{q,n} weight values are selected from the L weight values in W_n as the weight coefficients W_n[u]; that is, only the feature maps of memory frame images with a high degree of correlation are used for the subsequent attention matching calculations. This makes full use of historical information while reducing the amount of calculation, computing time and memory usage.
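  • Selecting the weight coefficients is a top-k selection in descending order. The following sketch assumes the predetermined number is already known and passes it in as `count`; the formula that derives g_{q,n} from the cosine similarity and L_n is not reproduced in this text, so it is deliberately not implemented. The function name `select_top_weights` is hypothetical.

```python
from typing import List, Tuple


def select_top_weights(weights: List[float],
                       count: int) -> List[Tuple[int, float]]:
    """Return (memory_frame_index, weight) pairs for the largest weights.

    Pairs are ordered from largest to smallest weight. Ties keep the
    earlier memory frame first, because Python's sort is stable.
    `count` plays the role of the predetermined number and is supplied
    by the caller (assumption: computed elsewhere).
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(i, weights[i]) for i in order[:count]]


# Four memory frames, of which the two most relevant are kept.
frame_weights = [0.2, 0.9, 0.5, 0.7]
coefficients = select_top_weights(frame_weights, count=2)
```

  • Returning the frame indices alongside the weights lets the caller fetch the matching memory key and memory value feature maps, so only the selected frames take part in the later weighted matching.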
  • In step S40, the method for obtaining the global feature map G_{u,n} with weighted spatio-temporal information according to each weight coefficient W_n[u] and the memory key feature map KM_{u,n} and memory value feature map VM_{u,n} of the memory frame image corresponding to each weight coefficient W_n[u] includes:
  • the global feature map G_{u,n} with weighted spatio-temporal information is calculated according to the following formula:
  • 1 ≤ u ≤ g_{q,n}, where g_{q,n} represents the predetermined number
  • KM_{u,n} represents the memory key feature map of the memory frame image
  • D_{u,n} is the feature map, containing attention information and historical segmentation details, that corresponds to the u-th memory frame; the dimension of the global feature map G_{u,n} is fixed.
  • In step S50, the method for obtaining the instance segmentation result of the current frame image according to the global feature map G_{u,n}, the query key feature map K_{q,n} and the query value feature map V_{q,n} includes:
  • the attention matrix and the query value feature map V_{q,n} of the instance to be segmented are connected in the channel dimension, and the result of the connection operation is sent to the decoder for deconvolution and upsampling to obtain the instance segmentation result.
  • the video instance segmentation device based on the spatiotemporal memory weighted network of the second embodiment includes a feature map acquisition module 10, a weight value calculation module 20, a weight coefficient screening module 30, a weighted value calculation module 40 and an attention matching module 50.
  • The feature map acquisition module 10 is used to obtain the query key feature map and the query value feature map of the instance to be segmented in the current frame image in the video, and the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image. The weight value calculation module 20 is used to calculate the weight value of the memory key feature map of the instance to be segmented in each memory frame image when performing attention matching. The weight coefficient screening module 30 is used to select a predetermined number of weight values from all weight values in descending order as weight coefficients. The weighted value calculation module 40 is used to obtain a global feature map with weighted spatio-temporal information according to each weight coefficient and the memory key feature map and the memory value feature map of the memory frame image corresponding to each weight coefficient. The attention matching module 50 is used to obtain the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
  • the feature map acquisition module 10 is used to: perform feature extraction on the current frame image to obtain several feature maps of different scales; obtain the center point and bounding box of the instance to be segmented and the overall query key feature map and the overall query value feature map of the current frame image according to the feature maps of different scales; and binarize the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, to obtain the query key feature map and the query value feature map of the instance to be segmented.
  • the feature map acquisition module 10 is also used to: acquire the segmentation results, the original images, and the center point and bounding box of each instance for all historical frame images before the current frame image in the video; obtain the overall memory key feature map and the overall memory value feature map of the historical frame image according to the segmentation results and original images corresponding to the historical frame images; and binarize the overall memory key feature map and the overall memory value feature map of the historical frame image according to the center point and bounding box of each instance in the historical frame image.
  • The weight value calculation module 20 is used to spatially connect the memory key feature maps of the instance to be segmented in each memory frame image to obtain the feature map connected by channel dimensions, and to perform global pooling on the feature map connected by channel dimensions to obtain each weight value.
  • For specific processing details of the weight coefficient screening module 30, the weighted value calculation module 40 and the attention matching module 50, reference may be made to the relevant description in Embodiment 1, and details are not repeated here.
  • Embodiment 3 of the present application also discloses a computer-readable storage medium.
  • the computer-readable storage medium stores a video instance segmentation program based on a spatiotemporal memory weighted network.
  • When the video instance segmentation program based on a spatiotemporal memory weighted network is executed by a processor, it implements the above-mentioned video instance segmentation method based on a spatio-temporal memory weighted network.
  • Embodiment 4 also discloses a computer device.
  • the computer device includes a processor 12 , an internal bus 13 , a network interface 14 , and a computer-readable storage medium 11 .
  • the processor 12 reads the corresponding computer program from the computer-readable storage medium and executes it, forming a request processing device on a logical level.
  • one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is to say, the execution subject of the following processing flow is not limited to individual logic units and can also be a hardware or logic device.
  • the computer-readable storage medium 11 stores a video instance segmentation program based on a spatiotemporal memory weighted network.
  • When the video instance segmentation program based on a spatiotemporal memory weighted network is executed by a processor, the above-mentioned video instance segmentation method based on a spatiotemporal memory weighted network is realized.
  • Computer-readable storage media includes both volatile and non-permanent, removable and non-removable media by any method or technology for storage of information.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer readable storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices or any other non-transmission media that can be used to store information that can be accessed by computing devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed in the present invention are a video instance segmentation method and apparatus based on spatio-temporal memory information. The video instance segmentation method comprises: acquiring a query key feature map and a query value feature map of an instance to be segmented in the current frame image of a video, and a memory key feature map and a memory value feature map of the instance to be segmented in each memory frame image; calculating a weight value of the memory key feature map of the instance to be segmented in each memory frame image when attention matching is performed; selecting a predetermined number of weight values from among all the weight values in descending order and taking them as weight coefficients; obtaining a global feature map having weighted spatio-temporal information according to the weight coefficients and the memory key feature maps and memory value feature maps of the memory frame images corresponding to the weight coefficients; and obtaining an instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map. By means of the method, full use can be made of the historical information of the memory frame images, thereby improving the robustness of the segmentation result.

Description

基于时空记忆信息的视频实例分割方法和分割装置Video instance segmentation method and segmentation device based on spatio-temporal memory information 技术领域technical field
本发明属于视频处理技术领域,具体地讲,涉及一种基于时空记忆信息的视频实例分割方法、分割装置、计算机可读存储介质、计算机设备。The invention belongs to the technical field of video processing, and in particular relates to a method for segmenting video instances based on spatio-temporal memory information, a segmenting device, a computer-readable storage medium, and computer equipment.
背景技术Background technique
视频实例分割的目标是在不需要任何人为干预的情况下,对某些特定类别的物体进行分割,得到其分割掩膜。与无监督的视频目标分割不同,视频实例分割需要识别出特定物体,而不仅仅是显著的物体,为了同时完成检测、分割和跟踪任务,大多数视频实例分割方法往往以目标检测的方法为基础框架扩展分割模块,分为单阶段和两阶段方法。The goal of video instance segmentation is to segment some specific categories of objects and obtain their segmentation masks without any human intervention. Different from unsupervised video object segmentation, video instance segmentation needs to identify specific objects, not just salient objects. In order to complete detection, segmentation and tracking tasks at the same time, most video instance segmentation methods are often based on object detection methods. The framework extends segmentation modules into one-stage and two-stage methods.
The two-stage approach is a "detect first, then segment" approach: it first locates the bounding box of the target object and then segments the object inside that box. The typical representative is Mask R-CNN, which adds a mask-prediction branch to Faster R-CNN and relies heavily on ROI features and operations. It first generates a set of candidate proposals and then predicts a foreground mask on each ROI. The two-stage approach has three problems. First, segmentation and detection cannot share features, so end-to-end backpropagation is not possible. Second, ROI features are cropped to a fixed resolution, so some large objects lose segmentation accuracy. Third, the ROI candidate regions far exceed the final predictions, which limits the running efficiency of the algorithm.
Single-stage methods treat detection, segmentation and tracking in video instance segmentation as problems that can be solved simultaneously. Early single-stage methods skipped object detection and segmented directly, which lost the category information of the objects and gave low accuracy. Later single-stage methods mainly design the combination relationship between prototype masks and target instances, learning a set of coefficients that associates target locations with semantic segmentation results. In 2019, YOLACT decomposed instance segmentation into two parallel tasks, generating a set of prototype masks and predicting mask coefficients for each instance, which further improved accuracy. SG-Net and ST-Mask, proposed in 2021, refined the segmentation module on the basis of these methods and added the segmentation result of the previous frame to guide the segmentation of the current frame. However, the historical segmentation results span many frames and contain segmentation results for different states of the target instance, and these provide important guidance for making the segmentation branch resistant to physical deformation and occlusion.
Existing single-stage video instance segmentation methods often ignore historical segmentation results and are not robust to drastic changes in object appearance or to occlusion.
Contents of the invention
(1) Technical problem to be solved by the present invention
The technical problem solved by the present invention is how to make full use of historical segmentation results in video instance segmentation so as to improve segmentation robustness.
(2) Technical solution adopted by the present invention
A video instance segmentation method based on spatio-temporal memory information, the video instance segmentation method comprising:
obtaining a query key feature map and a query value feature map of an instance to be segmented in a current frame image of a video, and a memory key feature map and a memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame images are historical frame images of the video that precede the current frame image and contain the instance to be segmented;
calculating the weight value that the memory key feature map of the instance to be segmented in each memory frame image takes during attention matching;
selecting, in descending order, a predetermined number of weight values from all the weight values as weight coefficients;
obtaining a global feature map with weighted spatio-temporal information according to each weight coefficient and the memory key feature map and memory value feature map of the memory frame image corresponding to each weight coefficient;
obtaining an instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
Preferably, the method for obtaining the query key feature map and the query value feature map of the instance to be segmented in the current frame image of the video comprises:
performing feature extraction on the current frame image to obtain several feature maps of different scales;
obtaining, from the several feature maps of different scales, the center point and bounding box of the instance to be segmented as well as an overall query key feature map and an overall query value feature map of the current frame image;
binarizing the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, to obtain the query key feature map and the query value feature map of the instance to be segmented.
Preferably, the method for obtaining the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image of the video comprises:
obtaining, for all historical frame images that precede the current frame image in the video, the corresponding segmentation results, the original images, and the center point and bounding box of each instance;
obtaining an overall memory key feature map and an overall memory value feature map of each historical frame image from its segmentation result and original image;
binarizing the overall memory key feature map and the overall memory value feature map of the historical frame image according to the center point and bounding box of each instance in the historical frame image, to obtain a memory key feature map and a memory value feature map of each instance in the historical frame image;
selecting, according to the category of the instance to be segmented, the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image from the memory key feature maps and memory value feature maps of the instances in the historical frame images.
Preferably, the method for calculating the weight value that the memory key feature map of the instance to be segmented in each memory frame image takes during attention matching comprises:
spatially concatenating the memory key feature maps of the instance to be segmented in the memory frame images to obtain a feature map concatenated along the channel dimension;
performing global pooling on the channel-concatenated feature map to obtain the weight values.
Preferably, the video instance segmentation method further comprises:
determining a predicted region of the instance to be segmented in the current frame image and a historical region of the instance to be segmented in the memory frame image adjacent to the current frame image;
calculating the cosine similarity between the predicted region and the historical region, wherein the value of the cosine similarity is greater than 0 and less than 1;
determining the predetermined number according to the cosine similarity and the total number of memory frame images containing the instance to be segmented.
Preferably, the method for obtaining the global feature map with weighted spatio-temporal information according to each weight coefficient and the memory key feature map and memory value feature map of the memory frame image corresponding to each weight coefficient comprises:
calculating the weighted memory key feature map $KMW_{u,n}$ of the memory frame image corresponding to each weight coefficient $W_n[u]$:
$KMW_{u,n} = KM_{u,n} \cdot W_n[u]$
performing matrix multiplication of the weighted memory key feature map $KMW_{u,n}$ and the memory value feature map $VM_{u,n}$ of the memory frame image:
[Formula image PCTCN2022140070-appb-000001; not reproduced in the text]
calculating the global feature map $G_{u,n}$ with weighted spatio-temporal information according to the following formula:
[Formula image PCTCN2022140070-appb-000002; not reproduced in the text]
wherein $1 \le u \le g_{q,n}$, $g_{q,n}$ denotes the predetermined number, and $KM_{u,n}$ denotes the memory key feature map of the memory frame image.
Preferably, the method for obtaining the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map comprises:
performing a matrix dot product of the global feature map and the query key feature map of the instance to be segmented to obtain an attention matrix;
concatenating the attention matrix with the query value feature map of the instance to be segmented along the channel dimension, and feeding the concatenation result into a decoder for deconvolution and upsampling to obtain the instance segmentation result.
The present application also discloses a video instance segmentation apparatus based on a spatio-temporal memory weighted network, the video instance segmentation apparatus comprising:
a feature map acquisition module, configured to obtain the query key feature map and the query value feature map of the instance to be segmented in the current frame image of the video, and the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame images are historical frame images of the video that precede the current frame image and contain the instance to be segmented;
a weight value calculation module, configured to calculate the weight value that the memory key feature map of the instance to be segmented in each memory frame image takes during attention matching;
a weight coefficient screening module, configured to select, in descending order, a predetermined number of weight values from all the weight values as weight coefficients;
a weighted value calculation module, configured to obtain a global feature map with weighted spatio-temporal information according to each weight coefficient and the memory key feature map and memory value feature map of the memory frame image corresponding to each weight coefficient;
an attention matching module, configured to obtain the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
The present application also discloses a computer-readable storage medium storing a video instance segmentation program based on a spatio-temporal memory weighted network, wherein the program, when executed by a processor, implements the above video instance segmentation method based on spatio-temporal memory information.
The present application also discloses a computer device comprising a computer-readable storage medium, a processor, and a video instance segmentation program based on a spatio-temporal memory weighted network stored in the computer-readable storage medium, wherein the program, when executed by the processor, implements the above video instance segmentation method based on spatio-temporal memory information.
(3) Beneficial effects
The present invention discloses a video instance segmentation method and segmentation apparatus based on spatio-temporal memory information which, compared with existing methods, have the following technical effects:
The method makes full use of the historical information in the memory frame images, which improves the robustness of the segmentation result. By selecting highly weighted memory frame images for weighted matching, it avoids computing directly over all memory frame images, which reduces the amount of computation, and it lets the network itself learn the attention matching weights of the memory frames, which optimizes the attention matching.
In addition, the feature maps are binarized so that spatial attention matching is performed only locally, which reduces the influence of similar objects on the segmentation result.
Description of drawings
FIG. 1 is an overall flowchart of the video instance segmentation method based on spatio-temporal memory information according to Embodiment 1 of the present invention;
FIG. 2 is a detailed flowchart of the video instance segmentation method based on spatio-temporal memory information according to Embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the calculation of the weight values of the memory key feature maps according to Embodiment 1 of the present invention;
FIG. 4 is a functional block diagram of the video instance segmentation apparatus based on a spatio-temporal memory weighted network according to Embodiment 2 of the present invention;
FIG. 5 is a schematic diagram of the computer device according to Embodiment 4 of the present invention.
Detailed description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.
Before the embodiments of the present application are described in detail, the inventive concept is briefly outlined. Prior-art video instance segmentation often fails to use historical frame information effectively, which greatly weakens the robustness of the segmentation result under drastic changes in object appearance and under occlusion. The present application therefore provides a video instance segmentation method based on spatio-temporal memory information. The method first obtains the key feature maps and value feature maps of the instance to be segmented in the current frame image and in each memory frame image. It then calculates the weight value of each memory key feature map for attention matching and selects the key feature maps with the larger weight values for weighted matching. Finally, the selected weight values, together with the memory key feature maps and memory value feature maps of the corresponding memory frame images, are used to perform spatio-temporal weighted matching against the key feature map and value feature map of the instance to be segmented in the current frame image, giving the final video instance segmentation result. The method makes full use of the historical information in the memory frame images, which improves the robustness of the segmentation result, and by selecting only highly weighted memory frame images for weighted matching it avoids computing over all memory frame images and reduces the amount of computation.
Specifically, as shown in FIG. 1 and FIG. 2, the video instance segmentation method based on spatio-temporal memory information of Embodiment 1 comprises the following steps:
Step S10: obtain the query key feature map and the query value feature map of the instance to be segmented in the current frame image of the video, and the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame images are historical frame images of the video that precede the current frame image and contain the instance to be segmented;
Step S20: calculate the weight value that the memory key feature map of the instance to be segmented in each memory frame image takes during attention matching;
Step S30: select, in descending order, a predetermined number of weight values from all the weight values as weight coefficients;
Step S40: obtain a global feature map with weighted spatio-temporal information according to each weight coefficient and the memory key feature map and memory value feature map of the memory frame image corresponding to each weight coefficient;
Step S50: obtain the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
In step S10, the method for obtaining the query key feature map and the query value feature map of the instance to be segmented in the current frame image of the video comprises the following steps:
Step S101: perform feature extraction on the current frame image to obtain several feature maps of different scales.
Exemplarily, the instance segmentation framework adopts the FCOS single-stage framework and adds a segmentation branch on top of the center-point and bounding-box prediction. As in FCOS, the backbone uses ResNet and FPN to extract input features. ResNet extracts features from the current frame image to produce convolutional feature maps at five scales, r1, r2, r3, r4, r5. Three of them, r3, r4, r5, are laterally connected to the FPN to produce pyramid feature maps at three scales, $P_3$, $P_4$, $P_5$. The pyramid feature map $P_5$ is then downsampled by pooling to obtain the pyramid feature maps $P_6$, $P_7$, and the pyramid feature maps $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ serve as the subsequent input.
Step S102: obtain, from the several feature maps of different scales, the center point and bounding box of the instance to be segmented as well as the overall query key feature map and the overall query value feature map of the current frame image.
On the one hand, the pyramid feature maps $P_3$, $P_4$, $P_5$, $P_6$, $P_7$ are fed into the center-point regression branch network and the bounding-box regression and classification prediction branch network, respectively, to obtain the center point $CE_{t,i}$, bounding box $B_{t,i}$ and category $CL_{t,i}$ of each instance to be segmented.
On the other hand, the fourth convolutional block of ResNet combined with one convolutional layer is used as the query frame encoder. The original current frame image is input to the query frame encoder, which outputs the overall query key feature map $K_q$ and the overall query value feature map $V_q$ of the current frame image. Here q stands for query, and the query frame is the current frame.
Step S103: binarize the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, to obtain the query key feature map and the query value feature map of the instance to be segmented.
Using the obtained center points and bounding boxes, binarization is performed with the bounding box of each instance to be segmented enlarged to 1.5 times its size: within the overall query key feature map and the overall query value feature map of the current frame image, the pixel values in the region covered by the bounding box of each instance to be segmented are set to 1 and the pixel values elsewhere are set to 0. This yields N query key feature maps $K_{q,n}$ and N query value feature maps $V_{q,n}$, where N is the number of instances to be segmented and $n \in [1, N]$. Binarization prevents similar instances from affecting the segmentation result and thus improves segmentation accuracy.
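The binarization of step S103 can be sketched in Python with NumPy. This is a minimal illustration, not the applicant's implementation. It assumes that the binarization acts as a 0/1 mask that is multiplied into the overall feature map, that boxes are given as (x_min, y_min, x_max, y_max) in feature-map coordinates, and that the enlarged box is clipped to the map. The function names `box_mask` and `instance_feature_maps` are hypothetical.

```python
import numpy as np


def box_mask(height, width, box, scale=1.5):
    """Return an (H, W) mask that is 1 inside `box` enlarged by `scale`, else 0.

    `box` is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = scale * (x1 - x0) / 2.0
    half_h = scale * (y1 - y0) / 2.0
    # Clip the enlarged box to the extent of the feature map.
    left = max(int(np.floor(cx - half_w)), 0)
    right = min(int(np.ceil(cx + half_w)), width)
    top = max(int(np.floor(cy - half_h)), 0)
    bottom = min(int(np.ceil(cy + half_h)), height)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:bottom, left:right] = 1.0
    return mask


def instance_feature_maps(feature_map, boxes, scale=1.5):
    """Produce one masked copy of an (H, W, C) feature map per instance box."""
    h, w, _ = feature_map.shape
    return [feature_map * box_mask(h, w, b, scale)[..., None] for b in boxes]
```

Calling `instance_feature_maps` once on the overall key map and once on the overall value map gives one per-instance key map and value map for each box, so matching later only sees features near that instance.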
Further, in step S10, the method for obtaining the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image of the video comprises:
Step S111: obtain, for all historical frame images that precede the current frame image in the video, the corresponding segmentation results, the original images, and the center point and bounding box of each instance.
The segmentation results, original images, center point and bounding box of each instance, and the instances themselves, for all historical frame images $I_1$ to $I_{t-1}$, have been stored in advance.
Step S112: obtain the overall memory key feature map and the overall memory value feature map of each historical frame image from its segmentation result and original image.
The fourth convolutional block of ResNet is used as the memory frame encoder. The segmentation result and the original image of a historical frame image are spatially concatenated, the concatenation result is input to the memory frame encoder, and the memory frame encoder outputs the overall memory key feature map $KM_T$ and the overall memory value feature map $VM_T$ of the historical frame image, where $T \in [1, t-1]$.
Step S113: binarize the overall memory key feature map and the overall memory value feature map of the historical frame image according to the center point and bounding box of each instance in the historical frame image, to obtain the memory key feature map and the memory value feature map of each instance in the historical frame image.
Using the obtained center points and bounding boxes, binarization is performed with the bounding box of each instance enlarged to 1.5 times its size: within the overall memory key feature map and the overall memory value feature map of the historical frame image, the pixel values in the region covered by the bounding box of each instance are set to 1 and the pixel values elsewhere are set to 0. This yields N memory key feature maps $KM_{T,n}$ and N memory value feature maps $VM_{T,n}$, where N is the number of instances. Binarization prevents similar instances from affecting the segmentation result and thus improves segmentation accuracy.
Step S114: select, according to the category of the instance to be segmented, the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image from the memory key feature maps and memory value feature maps of the instances in the historical frame images.
Binarization helps to separate the individual instances accurately. The instances whose category matches that of the instance to be segmented are then found in each historical frame image, and their memory key feature maps and memory value feature maps are taken as the memory key feature map and memory value feature map of the instance to be segmented in each memory frame image.
Further, in step S20, the method for calculating the weight value that the memory key feature map of the instance to be segmented in each memory frame image takes during attention matching comprises: spatially concatenating the memory key feature maps of the instance to be segmented in the memory frame images to obtain a feature map concatenated along the channel dimension; and performing global pooling on the channel-concatenated feature map to obtain the weight values.
As shown in FIG. 3, suppose the memory key feature map $KM_{T,n}$ of the instance to be segmented has dimension H×W×C and the total number of memory frame images is L. After spatial concatenation, the channel-concatenated feature map $C_n$ has dimension HWC×L. Global pooling with an H×W×C convolution kernel then gives the weight vector $W_n$, which contains L weight values.
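The weight computation of step S20 can be sketched in Python with NumPy as follows. This is a sketch under stated assumptions, not the applicant's network. It assumes that the H×W×C kernel is applied to each memory key map as a single inner product, so that each memory frame yields one scalar, and it applies no normalization because the text does not mention one. The name `memory_weights` and the `kernel` argument are hypothetical.

```python
import numpy as np


def memory_weights(memory_keys, kernel):
    """Compute one weight per memory frame.

    memory_keys: array of shape (L, H, W, C), one key map per memory frame.
    kernel:      array of shape (H, W, C), the (learned) pooling kernel.
    Returns an array of shape (L,).
    """
    num_frames = memory_keys.shape[0]
    # Flatten every key map into a column, giving the HWC x L matrix C_n.
    stacked = memory_keys.reshape(num_frames, -1).T
    # A kernel covering the whole map reduces each column to one scalar.
    return kernel.reshape(-1) @ stacked
```

With a uniform kernel of value 1/(H·W·C) this reduces to global average pooling; a learned kernel lets the network decide how much each memory frame should count.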
Further, the video instance segmentation method also comprises:
determining the predicted region of the instance to be segmented in the current frame image and the historical region of the instance to be segmented in the memory frame image adjacent to the current frame image;
calculating the cosine similarity $P_{gate}$ between the predicted region and the historical region, wherein the value of the cosine similarity is greater than 0 and less than 1;
determining the predetermined number according to the cosine similarity and the total number $L_n$ of memory frame images containing the instance to be segmented. Specifically, the predetermined number $g_{q,n}$ is calculated according to the following formula:
$g_{q,n} = [P_{gate} \cdot L_n]$
where a cosine similarity $P_{gate}$ closer to 1 indicates that the predicted region of the instance is more similar to the historical region, and [ ] denotes rounding to an integer.
After the predetermined number $g_{q,n}$ has been calculated, the $g_{q,n}$ largest of the L weight values in $W_n$ are selected as the weight coefficients $W_n[u]$. In other words, only the feature maps of highly relevant memory frame images are used in the subsequent attention matching, which makes full use of historical information while reducing the amount of computation, the running time and the memory footprint.
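The gating and selection described above can be sketched in Python with NumPy. This is a minimal sketch with several assumptions: the text does not say whether the rounding is to the nearest integer or downward, so rounding down is used here; at least one memory frame is always kept; and the regions are compared as flattened vectors. The names `cosine_similarity` and `select_top_weights` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two regions, compared as flattened vectors."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def select_top_weights(weights, p_gate):
    """Keep the g largest weights, where g = [p_gate * L].

    Returns (indices, values), both ordered from largest to smallest weight.
    """
    weights = np.asarray(weights, dtype=np.float64)
    num_frames = len(weights)
    g = int(np.floor(p_gate * num_frames))  # assumption: round down
    g = min(max(g, 1), num_frames)          # assumption: keep at least one frame
    order = np.argsort(weights)[::-1][:g]
    return order, weights[order]
```

For example, with four memory frames and a similarity of 0.5, two frames are kept. A low similarity, meaning the instance has changed a lot since the adjacent memory frame, keeps fewer frames.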
Further, in step S40, the method for obtaining the global feature map $G_{u,n}$ with weighted spatio-temporal information according to each weight coefficient $W_n[u]$ and the memory key feature map $KM_{u,n}$ and memory value feature map $VM_{u,n}$ of the memory frame image corresponding to each weight coefficient $W_n[u]$ comprises:
calculating the weighted memory key feature map $KMW_{u,n}$ of the memory frame image corresponding to each weight coefficient $W_n[u]$:
$KMW_{u,n} = KM_{u,n} \cdot W_n[u]$
performing matrix multiplication of the weighted memory key feature map $KMW_{u,n}$ and the memory value feature map $VM_{u,n}$ of the memory frame image:
[Formula image PCTCN2022140070-appb-000003; not reproduced in the text]
calculating the global feature map $G_{u,n}$ with weighted spatio-temporal information according to the following formula:
[Formula image PCTCN2022140070-appb-000004; not reproduced in the text]
where $1 \le u \le g_{q,n}$, $g_{q,n}$ denotes the predetermined number, $KM_{u,n}$ denotes the memory key feature map of the memory frame image, and $D_{u,n}$ is the feature map corresponding to memory frame number u, which contains attention information and historical segmentation details. The global feature map $G_{u,n}$ has a fixed dimension.
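The two formulas of step S40 are published only as images, so their exact form cannot be read from the text. The following Python (NumPy) sketch shows one possible realization that is consistent with the surrounding description, purely as an assumption: each per-frame map D is taken to be the product of the transposed, weighted key map and the value map, and the global map is taken to be the sum of the D maps over the selected frames, which gives a size that does not depend on the number of frames. The function name `global_feature_map` and the array layouts are hypothetical.

```python
import numpy as np


def global_feature_map(memory_keys, memory_values, coeffs):
    """Aggregate the selected memory frames into one fixed-size map.

    memory_keys:   (g, H, W, Ck) key maps of the g selected memory frames.
    memory_values: (g, H, W, Cv) value maps of the same frames.
    coeffs:        (g,) weight coefficients W_n[u].
    Returns an array of shape (Ck, Cv).
    """
    g, h, w, ck = memory_keys.shape
    cv = memory_values.shape[-1]
    result = np.zeros((ck, cv), dtype=np.float64)
    for u in range(g):
        # KMW_u = KM_u * W_n[u]: scale the key map by its scalar coefficient.
        weighted_key = memory_keys[u].reshape(h * w, ck) * coeffs[u]
        value = memory_values[u].reshape(h * w, cv)
        # Assumed form of D_u: (Ck x HW) times (HW x Cv).
        result += weighted_key.T @ value
    return result
```

Under this assumption the result has shape (Ck, Cv) whether two or twenty memory frames are selected, which matches the statement that the global feature map has a fixed dimension.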
In step S50, the method for obtaining the instance segmentation result of the current frame image according to the global feature map $G_{u,n}$, the query key feature map $K_{q,n}$ and the query value feature map $V_{q,n}$ comprises:
performing a matrix dot product of the global feature map $G_{u,n}$ and the query key feature map $K_{q,n}$ of the instance to be segmented to obtain an attention matrix;
concatenating the attention matrix with the query value feature map $V_{q,n}$ of the instance to be segmented along the channel dimension, and feeding the concatenation result into a decoder for deconvolution and upsampling to obtain the instance segmentation result.
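The matching of step S50 up to the decoder input can be sketched in Python with NumPy. This is a minimal sketch that assumes the global feature map is a (Ck, Cv) matrix and that the query maps are stored as (H, W, channels); neither layout is stated in the text. The decoder itself (deconvolution and upsampling) is not shown. The name `attention_readout` is hypothetical.

```python
import numpy as np


def attention_readout(global_map, query_key, query_value):
    """Match the query against the global map and build the decoder input.

    global_map:  (Ck, Cv) global feature map (assumed layout).
    query_key:   (H, W, Ck) query key feature map of the instance.
    query_value: (H, W, Cq) query value feature map of the instance.
    Returns an array of shape (H, W, Cv + Cq).
    """
    h, w, ck = query_key.shape
    # Matrix product of every query position with the global map.
    attention = query_key.reshape(h * w, ck) @ global_map
    attention = attention.reshape(h, w, -1)
    # Concatenate along the channel dimension before decoding.
    return np.concatenate([attention, query_value], axis=-1)
```

The returned tensor is what would be passed to the decoder. Because the global map has a fixed size, the cost of this step does not grow with the number of memory frames.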
Finally, the instance segmentation result of the current frame image, the center point $CE_{t,i}$, the bounding box $B_{t,i}$ and the category $CL_{t,i}$ are stored in the memory module for use in segmenting subsequent frame images. The above process is repeated until every frame image in the video has been segmented.
进一步地,如图4所示,实施例二的基于时空记忆加权网络的视频实例分割装置包括特征图获取模块10、权重值计算模块20、权重系数筛选模块30、加权值计算模块40和注意力匹配模块50。特征图获取模块10用于获取视频中当前帧图像的待分割实例的查询键特征图和查询值特征图以及各幅记忆帧图像中待分割实例的记忆键特征图和记忆值特征图;权重值计算模块20用于计算各幅记忆帧图像中待分割实例的记忆键特征图在进行注意力匹配时的权重值;权重系数筛选模块30用于按照从大至小的顺序从所有权重值选出预定数目的权重值作为权重系数;加权值计算模块40用于根据各个权重系数、各个权重系数对应的记忆帧图像的记忆键特征图和记忆值特征图得到具有加权时空信息的全局特征图;注意力匹配模块50用于根据全局特征图、查询键特征图和查询值特征图得到当前帧图像的实例分割结果。Further, as shown in FIG. 4 , the video instance segmentation device based on the spatiotemporal memory weighted network of the second embodiment includes a feature map acquisition module 10, a weight value calculation module 20, a weight coefficient screening module 30, a weight value calculation module 40 and attention matching module 50 . Feature map acquisition module 10 is used to obtain the query key feature map and the query value feature map of the instance to be segmented in the current frame image in the video and the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image; weight value The calculation module 20 is used to calculate the weight value of the memory key feature map of the example to be segmented in each piece of memory frame images when performing attention matching; the weight coefficient screening module 30 is used to select from all weight values in order from large to small A predetermined number of weight values are used as weight coefficients; weight value calculation module 40 is used to obtain a global feature map with weighted spatio-temporal information according to each weight coefficient, the memory key feature map and the memory value feature map of the memory frame image corresponding to each weight coefficient; note The force matching module 50 is used to obtain the instance segmentation result of the current frame image according to the global feature map, the query key feature map and the query value feature map.
Specifically, the feature map acquisition module 10 is configured to: perform feature extraction on the current frame image to obtain several feature maps of different scales; obtain, from these feature maps, the center point and bounding box of the instance to be segmented together with the overall query key feature map and overall query value feature map of the current frame image; and binarize the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, to obtain the query key feature map and query value feature map of the instance to be segmented. The feature map acquisition module 10 is further configured to: obtain the segmentation results, the original images, and the center point and bounding box of each instance for all historical frame images preceding the current frame image in the video; obtain the overall memory key feature map and overall memory value feature map of each historical frame image from its segmentation result and original image; binarize the overall memory key feature map and the overall memory value feature map of each historical frame image according to the center point and bounding box of each instance in that image, to obtain the memory key feature map and memory value feature map of each instance in the historical frame image; and, according to the category of the instance to be segmented, select from the memory key feature maps and memory value feature maps of the instances in the historical frame images those of the instance to be segmented in each memory frame image. For the specific processing details of the feature map acquisition module 10, refer to the relevant description in Embodiment 1; they are not repeated here.
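The binarization step can be illustrated with a short NumPy sketch. It assumes that "binarization" means multiplying the overall feature map by a 0/1 mask built from the instance's bounding box, so that features outside the box are zeroed. The patent also uses the center point, which this sketch omits. The function names and the `(x0, y0, x1, y1)` box layout are assumptions.

```python
import numpy as np


def box_mask(height, width, bbox):
    """Binary (H, W) mask: 1 inside bbox (x0, y0, x1, y1), 0 elsewhere.

    x1 and y1 are treated as exclusive bounds (an assumption of this sketch).
    """
    x0, y0, x1, y1 = bbox
    mask = np.zeros((height, width), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask


def instance_feature_map(overall, bbox):
    """Keep the features of a (C, H, W) map that fall inside bbox; zero the rest."""
    _, h, w = overall.shape
    # Broadcast the (H, W) mask over the channel axis.
    return overall * box_mask(h, w, bbox)[None, :, :]
```

The same helper would apply to both key and value maps, and to both the current frame (query) and the historical frames (memory).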
Further, the weight value calculation module 20 is configured to spatially concatenate the memory key feature maps of the instance to be segmented in the memory frame images to obtain a feature map concatenated along the channel dimension, and to perform global pooling on the concatenated feature map to obtain the weight values.
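The concatenate-then-pool idea can be sketched as follows. This passage does not fix several details, so the sketch assumes average pooling, one weight per memory frame obtained by averaging that frame's pooled channels, and a softmax normalization. All three choices, and the function name, are assumptions for illustration only.

```python
import numpy as np


def memory_key_weights(memory_keys):
    """Compute one weight per memory frame.

    memory_keys: list of N arrays of shape (C, H, W), one per memory frame.
    Returns an array of N weights that sum to 1.
    """
    n = len(memory_keys)
    c = memory_keys[0].shape[0]
    # Concatenate along the channel dimension: (N * C, H, W).
    stacked = np.concatenate(memory_keys, axis=0)
    # Global average pooling over the spatial dimensions: (N * C,).
    pooled = stacked.mean(axis=(1, 2))
    # Collapse each frame's C channels into a single score (assumption).
    per_frame = pooled.reshape(n, c).mean(axis=1)
    # Softmax normalization (assumption), shifted for numerical stability.
    exp = np.exp(per_frame - per_frame.max())
    return exp / exp.sum()
```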
Further, for the specific processing details of the weight coefficient screening module 30, the weighted value calculation module 40, and the attention matching module 50, refer to the relevant description in Embodiment 1; they are not repeated here.
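A hedged sketch of the first two of these modules: selecting the largest weight values in descending order, then scaling each selected memory key feature map by its coefficient, as in KMW_{u,n} = KM_{u,n} * W_n[u]. The function names are hypothetical. The later matrix multiplication with the memory value feature maps and the aggregation into the global feature map follow formulas given only as figures in the claims, so they are deliberately not reproduced here.

```python
import numpy as np


def select_weight_coefficients(weights, g):
    """Return (indices, coefficients) of the g largest weights, largest first."""
    weights = np.asarray(weights, dtype=np.float64)
    # Stable sort on the negated weights gives descending order.
    order = np.argsort(-weights, kind="stable")[:g]
    return order, weights[order]


def weighted_memory_keys(memory_keys, weights, g):
    """Scale each selected memory key map KM_u by its coefficient W[u]."""
    idx, coeffs = select_weight_coefficients(weights, g)
    return [memory_keys[i] * w for i, w in zip(idx, coeffs)]
```

With weights `[0.1, 0.5, 0.4]` and `g = 2`, frames 1 and 2 are kept and frame 0 is dropped.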
Embodiment 3 of the present application further discloses a computer-readable storage medium. The computer-readable storage medium stores a video instance segmentation program based on a spatio-temporal memory weighted network, and when this program is executed by a processor, it implements the above video instance segmentation method based on a spatio-temporal memory weighted network.
Embodiment 4 further discloses a computer device. At the hardware level, as shown in FIG. 5, the computer device includes a processor 12, an internal bus 13, a network interface 14, and a computer-readable storage medium 11. The processor 12 reads the corresponding computer program from the computer-readable storage medium and runs it, forming a request processing apparatus at the logical level. Of course, besides a software implementation, one or more embodiments of this specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logical units and may also be hardware or a logic device. The computer-readable storage medium 11 stores a video instance segmentation program based on a spatio-temporal memory weighted network, and when this program is executed by the processor, it implements the above video instance segmentation method based on a spatio-temporal memory weighted network.
Computer-readable storage media include permanent and non-permanent, removable and non-removable media, in which information storage may be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
Specific embodiments of the present invention have been described in detail above. Although some embodiments have been shown and described, those skilled in the art should understand that these embodiments may be modified and refined without departing from the principle and spirit of the present invention, the scope of which is defined by the claims and their equivalents, and that such modifications and refinements also fall within the protection scope of the present invention.

Claims (10)

  1. A video instance segmentation method based on spatio-temporal memory information, characterized in that the video instance segmentation method comprises:
    obtaining a query key feature map and a query value feature map of an instance to be segmented in a current frame image of a video, and a memory key feature map and a memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame images are historical frame images in the video that precede the current frame image and contain the instance to be segmented;
    calculating a weight value, for use in attention matching, of the memory key feature map of the instance to be segmented in each of the memory frame images;
    selecting a predetermined number of weight values from all the weight values, in descending order, as weight coefficients;
    obtaining a global feature map with weighted spatio-temporal information according to each of the weight coefficients and the memory key feature map and memory value feature map of the memory frame image corresponding to each of the weight coefficients;
    obtaining an instance segmentation result of the current frame image according to the global feature map, the query key feature map, and the query value feature map.
  2. The video instance segmentation method based on spatio-temporal memory information according to claim 1, characterized in that obtaining the query key feature map and the query value feature map of the instance to be segmented in the current frame image of the video comprises:
    performing feature extraction on the current frame image to obtain several feature maps of different scales;
    obtaining, from the several feature maps of different scales, the center point and bounding box of the instance to be segmented and the overall query key feature map and overall query value feature map of the current frame image;
    binarizing the overall query key feature map and the overall query value feature map of the current frame image according to the center point and bounding box of the instance to be segmented, to obtain the query key feature map and the query value feature map of the instance to be segmented.
  3. The video instance segmentation method based on spatio-temporal memory information according to claim 1, characterized in that obtaining the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image of the video comprises:
    obtaining the segmentation results, the original images, and the center point and bounding box of each instance for all historical frame images preceding the current frame image in the video;
    obtaining the overall memory key feature map and the overall memory value feature map of each historical frame image from the segmentation result and the original image of that historical frame image;
    binarizing the overall memory key feature map and the overall memory value feature map of the historical frame image according to the center point and bounding box of each instance in the historical frame image, to obtain the memory key feature map and the memory value feature map of each instance in the historical frame image;
    selecting, according to the category of the instance to be segmented, the memory key feature map and the memory value feature map of the instance to be segmented in each memory frame image from the memory key feature maps and memory value feature maps of the instances in the historical frame images.
  4. The video instance segmentation method based on spatio-temporal memory information according to claim 1, characterized in that calculating the weight value, for use in attention matching, of the memory key feature map of the instance to be segmented in each of the memory frame images comprises:
    spatially concatenating the memory key feature maps of the instance to be segmented in the memory frame images to obtain a feature map concatenated along the channel dimension;
    performing global pooling on the feature map concatenated along the channel dimension to obtain the weight values.
  5. The video instance segmentation method based on spatio-temporal memory information according to claim 4, characterized in that the video instance segmentation method further comprises:
    determining a predicted region of the instance to be segmented in the current frame image and a historical region of the instance to be segmented in the memory frame image adjacent to the current frame image;
    calculating the cosine similarity between the predicted region and the historical region, wherein the value of the cosine similarity is greater than 0 and less than 1;
    determining the predetermined number according to the cosine similarity and the total number of memory frame images containing the instance to be segmented.
  6. The video instance segmentation method based on spatio-temporal memory information according to claim 1, characterized in that obtaining the global feature map with weighted spatio-temporal information according to each of the weight coefficients and the memory key feature map and memory value feature map of the memory frame image corresponding to each of the weight coefficients comprises:
    calculating the weighted memory key feature map KMW_{u,n} of the memory frame image corresponding to each of the weight coefficients W_n[u]:
    KMW_{u,n} = KM_{u,n} * W_n[u]
    performing matrix multiplication on the weighted memory key feature map KMW_{u,n} and the memory value feature map VM_{u,n} of the memory frame image:
    Figure PCTCN2022140070-appb-100001
    calculating the global feature map G_{u,n} with weighted spatio-temporal information according to the following formula:
    Figure PCTCN2022140070-appb-100002
    wherein 1 ≤ u ≤ g_{q,n}, g_{q,n} denotes the predetermined number, and KM_{u,n} denotes the memory key feature map of the memory frame image.
  7. The video instance segmentation method based on spatio-temporal memory information according to claim 6, characterized in that obtaining the instance segmentation result of the current frame image according to the global feature map, the query key feature map, and the query value feature map comprises:
    performing a matrix dot product of the global feature map and the query key feature map of the instance to be segmented to obtain an attention matrix;
    concatenating the attention matrix with the query value feature map of the instance to be segmented along the channel dimension, and feeding the result of the concatenation into a decoder for deconvolution and upsampling to obtain the instance segmentation result.
  8. A video instance segmentation apparatus based on a spatio-temporal memory weighted network, characterized in that the video instance segmentation apparatus comprises:
    a feature map acquisition module, configured to obtain a query key feature map and a query value feature map of an instance to be segmented in a current frame image of a video, and a memory key feature map and a memory value feature map of the instance to be segmented in each memory frame image, wherein the memory frame images are historical frame images in the video that precede the current frame image and contain the instance to be segmented;
    a weight value calculation module, configured to calculate a weight value, for use in attention matching, of the memory key feature map of the instance to be segmented in each of the memory frame images;
    a weight coefficient screening module, configured to select a predetermined number of weight values from all the weight values, in descending order, as weight coefficients;
    a weighted value calculation module, configured to obtain a global feature map with weighted spatio-temporal information according to each of the weight coefficients and the memory key feature map and memory value feature map of the memory frame image corresponding to each of the weight coefficients;
    an attention matching module, configured to obtain an instance segmentation result of the current frame image according to the global feature map, the query key feature map, and the query value feature map.
  9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a video instance segmentation program based on a spatio-temporal memory weighted network, and when the video instance segmentation program based on the spatio-temporal memory weighted network is executed by a processor, it implements the video instance segmentation method based on spatio-temporal memory information according to any one of claims 1 to 7.
  10. A computer device, characterized in that the computer device comprises a computer-readable storage medium, a processor, and a video instance segmentation program based on a spatio-temporal memory weighted network stored in the computer-readable storage medium, and when the video instance segmentation program based on the spatio-temporal memory weighted network is executed by the processor, it implements the video instance segmentation method based on spatio-temporal memory information according to any one of claims 1 to 7.
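As an illustration of how the predetermined number in claim 5 might be derived, the sketch below computes the cosine similarity of two regions and turns it into a frame count. The claims state only that the number depends on the similarity and on the total number of memory frames. The specific mapping used here (scaling the total by the similarity, rounding up, keeping at least one frame) is purely an assumption, as are the function names. The two regions are assumed to have been resized to the same shape and to be non-zero.

```python
import math

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two equally sized regions, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predetermined_number(similarity, total_memory_frames):
    """Hypothetical mapping from similarity to the number of frames kept."""
    # Scale the frame count by the similarity; always keep at least one frame.
    return max(1, math.ceil(similarity * total_memory_frames))
```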
PCT/CN2022/140070 2021-12-22 2022-12-19 Video instance segmentation method and apparatus based on spatio-temporal memory information WO2023116632A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111581049.7A CN114241388A (en) 2021-12-22 2021-12-22 Video instance segmentation method and segmentation device based on space-time memory information
CN202111581049.7 2021-12-22

Publications (1)

Publication Number Publication Date
WO2023116632A1 true WO2023116632A1 (en) 2023-06-29

Family

ID=80761294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140070 WO2023116632A1 (en) 2021-12-22 2022-12-19 Video instance segmentation method and apparatus based on spatio-temporal memory information

Country Status (2)

Country Link
CN (1) CN114241388A (en)
WO (1) WO2023116632A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974722A (en) * 2024-04-02 2024-05-03 江西师范大学 Single-target tracking system and method based on attention mechanism and improved transducer
CN118470413A (en) * 2024-05-16 2024-08-09 淮阴工学院 Mango classification recognition method based on UpCPFNet model

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114241388A (en) * 2021-12-22 2022-03-25 中国科学院深圳先进技术研究院 Video instance segmentation method and segmentation device based on space-time memory information
CN114782861A (en) * 2022-03-31 2022-07-22 腾讯科技(深圳)有限公司 Instance partitioning method, related device, and storage medium
WO2023226009A1 (en) * 2022-05-27 2023-11-30 中国科学院深圳先进技术研究院 Image processing method and device
CN117315530B (en) * 2023-09-19 2024-07-12 天津大学 Instance matching method based on multi-frame information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109118519A (en) * 2018-07-26 2019-01-01 北京纵目安驰智能科技有限公司 Target Re-ID method, system, terminal and the storage medium of Case-based Reasoning segmentation
CN112669324A (en) * 2020-12-31 2021-04-16 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN113361519A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN114241388A (en) * 2021-12-22 2022-03-25 中国科学院深圳先进技术研究院 Video instance segmentation method and segmentation device based on space-time memory information

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974722A (en) * 2024-04-02 2024-05-03 江西师范大学 Single-target tracking system and method based on attention mechanism and improved transducer
CN117974722B (en) * 2024-04-02 2024-06-11 江西师范大学 Single-target tracking system and method based on attention mechanism and improved transducer
CN118470413A (en) * 2024-05-16 2024-08-09 淮阴工学院 Mango classification recognition method based on UpCPFNet model

Also Published As

Publication number Publication date
CN114241388A (en) 2022-03-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909944

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE