WO2022218012A1 - Feature extraction method, apparatus, device, storage medium, and program product - Google Patents
Feature extraction method, apparatus, device, storage medium, and program product
- Publication number
- WO2022218012A1 (PCT/CN2022/075069)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- pixel
- feature map
- level feature
- level
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/48—Extraction of image or video features by mapping characteristic values of the pattern into a parameter space, e.g. Hough transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
Definitions
- the present disclosure relates to the field of artificial intelligence, in particular to computer vision and deep learning technologies.
- VOS: Video Object Segmentation
- Semi-supervised video object segmentation must segment the target by extracting features from a video sequence in which only the initial mask (Mask) is given.
- current semi-supervised video object segmentation methods usually extract the features of the preceding and following frames of a video separately.
- the embodiments of the present disclosure propose a feature extraction method, apparatus, device, storage medium, and program product.
- an embodiment of the present disclosure proposes a feature extraction method, including: acquiring a predicted target segmentation annotation image of the T-1th frame in a video and a pixel-level feature map of the Tth frame, where T is a positive integer greater than 2; performing feature mapping on the predicted target segmentation annotation image of the T-1th frame and the pixel-level feature map of the T-th frame respectively, to obtain a mapped feature map of the T-1th frame and a mapped feature map of the T-th frame; and convolving the mapped feature map of the T-th frame with the convolution kernel of the mapped feature map of the T-1th frame, to obtain a score map of the T-th frame, where each point of the score map represents the similarity between a position of the pixel-level feature map of the T-th frame and the predicted target segmentation annotation image of the T-1th frame.
- an embodiment of the present disclosure proposes a feature extraction apparatus, including: an acquisition module configured to acquire a predicted target segmentation annotation image of the T-1th frame in a video and a pixel-level feature map of the T-th frame, where T is a positive integer greater than 2; a mapping module configured to perform feature mapping on the predicted target segmentation annotation image of the T-1th frame and the pixel-level feature map of the T-th frame respectively, to obtain a mapped feature map of the T-1th frame and a mapped feature map of the T-th frame;
- a convolution module configured to convolve the mapped feature map of the T-th frame with the convolution kernel of the mapped feature map of the T-1th frame to obtain a score map of the T-th frame, where each point of the score map represents the similarity between a position of the pixel-level feature map of the T-th frame and the predicted target segmentation annotation image of the T-1th frame.
- an embodiment of the present disclosure provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described in any implementation of the first aspect.
- an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to execute the method described in any implementation manner of the first aspect.
- an embodiment of the present disclosure provides a computer program product, including a computer program that, when executed by a processor, implements the method described in any implementation of the first aspect.
- the feature extraction method provided by the embodiments of the present disclosure extracts the features of the subsequent frame in combination with the features of the previous frame, so that the information between the preceding and subsequent frames can be better extracted.
- FIG. 1 is an exemplary system architecture diagram to which the present disclosure may be applied;
- FIG. 2 is a flowchart of one embodiment of a feature extraction method according to the present disclosure
- FIG. 3 is a scene diagram in which a feature extraction method according to an embodiment of the present disclosure can be implemented
- FIG. 4 is a flowchart of one embodiment of a feature fusion method according to the present disclosure.
- FIG. 5 is a flowchart of one embodiment of a segmentation prediction method according to the present disclosure.
- FIG. 6 is a scene diagram in which a segmentation prediction method according to an embodiment of the present disclosure can be implemented
- FIG. 7 is a schematic structural diagram of an embodiment of a feature extraction apparatus according to the present disclosure.
- FIG. 8 is a block diagram of an electronic device used to implement the feature extraction method of an embodiment of the present disclosure.
- FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the feature extraction method or feature extraction apparatus of the present disclosure may be applied.
- the system architecture 100 may include a video capture device 101 , a network 102 and a server 103 .
- the network 102 is a medium used to provide a communication link between the video capture device 101 and the server 103 .
- the network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
- the video capture device 101 can interact with the server 103 through the network 102 to receive or send images and the like.
- the video capture device 101 may be hardware or software. When the video capture device 101 is hardware, it can be various electronic devices with cameras. When the video capture device 101 is software, it can be installed in the above electronic device. It can be implemented as a plurality of software or software modules, and can also be implemented as a single software or software module. There is no specific limitation here.
- the server 103 can provide various services.
- the server 103 may perform processing such as analysis on the video stream obtained from the video capture device 101, and generate a processing result (eg, a score map of video frames in the video).
- the server 103 may be hardware or software.
- when the server 103 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server.
- when the server 103 is software, it may be implemented as a plurality of software or software modules (for example, for providing distributed services), or as a single software or software module. There is no specific limitation here.
- the feature extraction method provided by the embodiment of the present disclosure is generally executed by the server 103 , and accordingly, the feature extraction apparatus is generally set in the server 103 .
- it should be understood that the numbers of video capture devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of video capture devices, networks and servers according to implementation needs.
- the feature extraction method includes the following steps:
- Step 201 Acquire the predicted segmentation and annotation image of the T-1th frame in the video and the pixel-level feature map of the Tth frame.
- in this embodiment, the execution body of the feature extraction method can obtain the predicted segmentation annotation image (Prediction T-1) of the T-1th frame in the video and the pixel-level feature map (Pixel-level Embedding) of the T-th frame.
- a video capture device can capture video within the range of its camera.
- when the target appears within the camera range of the video capture device, the target will exist in the captured video.
- the target can be any tangible object existing in the real world, including but not limited to people, animals, plants, buildings, objects and so on.
- the predicted segmentation annotation image of the T-1th frame may be a predicted annotation image for segmenting the target in the T-1th frame, for example, an image generated by annotating the edges of the target in frame T-1, or an image generated by marking the edge of the target in the T-1th frame and then setting the pixels belonging to the target and the pixels not belonging to the target to different pixel values.
- the pixel-level feature map of the T-th frame may be obtained by using a feature extraction network to perform pixel-level feature extraction, and is used to represent the pixel-level features of the T-th frame.
- the predicted segmentation and labeling image of the T-1th frame may be predicted by using the segmentation prediction method provided by the embodiment of the present disclosure, or may be predicted by using other VOS networks, which is not specifically limited here.
- the feature extraction network used to extract the pixel-level feature map of the T-th frame can be the backbone network (Backbone) in the CFBI (Collaborative Video Object Segmentation by Foreground-Background Integration) network, or the backbone network in another VOS network; it is not specifically limited here.
- Step 202 Perform feature mapping on the predicted segmented and labeled image of the T-1th frame and the pixel-level feature map of the Tth frame, respectively, to obtain the mapped feature map of the T-1th frame and the mapped feature map of the Tth frame.
- in this embodiment, the above-mentioned execution body may perform feature mapping on the predicted segmentation annotation image of the T-1th frame and the pixel-level feature map of the Tth frame respectively, so as to obtain the mapped feature map of the T-1th frame and the mapped feature map of the T-th frame.
- the mapped feature maps lie in the same feature space. For example, a 127×127×3 predicted segmentation annotation image yields a 6×6×128 mapped feature map through the feature mapping operation. Similarly, a 255×255×3 pixel-level feature map yields a 22×22×128 mapped feature map.
- in some optional implementations, a transformation matrix is used to map the predicted segmentation annotation image of the T-1th frame and the pixel-level feature map of the Tth frame from one feature space to another, thereby obtaining the mapped feature map of the T-1th frame and the mapped feature map of the Tth frame.
- the transformation matrix can linearly transform the image, mapping the image from one space to another.
- in other optional implementations, the above-mentioned execution body may use a convolutional layer and a pooling layer in a CNN (Convolutional Neural Network) to map the predicted segmentation annotation image of the T-1th frame and the pixel-level feature map of the T-th frame, respectively, to a preset feature space, thereby obtaining the mapped feature map of the T-1th frame and the mapped feature map of the T-th frame.
- the deep learning method is used for mapping, which can not only perform linear transformation on the image, but also perform nonlinear transformation on the image. By setting different convolutional layers and pooling layers, the image can be mapped to any space, which is more flexible.
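- as a concrete illustration, the sketch below implements such a convolution-and-pooling mapping in PyTorch. The layer configuration is an assumption (the disclosure does not specify the layers of the mapping network); a SiamFC-style stack is used here because it reproduces the shapes quoted above (127×127×3 → 6×6×128 and 255×255×3 → 22×22×128):

```python
import torch
import torch.nn as nn

# Hypothetical feature-mapping network phi (a sketch, not the patent's exact
# architecture): convolution + pooling layers that map a 3-channel image into
# a 128-channel feature space.
phi = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, stride=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 192, kernel_size=3, stride=1), nn.ReLU(),
    nn.Conv2d(192, 192, kernel_size=3, stride=1), nn.ReLU(),
    nn.Conv2d(192, 128, kernel_size=3, stride=1),
)

z = torch.randn(1, 3, 127, 127)  # predicted segmentation annotation image of frame T-1
x = torch.randn(1, 3, 255, 255)  # input for the pixel-level feature map of frame T
print(phi(z).shape)  # torch.Size([1, 128, 6, 6])
print(phi(x).shape)  # torch.Size([1, 128, 22, 22])
```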
- Step 203 Convolve the mapped feature map of the Tth frame with the convolution kernel of the mapped feature map of the T-1th frame, to obtain a score map of the Tth frame.
- in this embodiment, the above-mentioned execution body can convolve the mapped feature map of the Tth frame with the convolution kernel of the mapped feature map of the T-1th frame, obtaining a score map (Score map) of the Tth frame.
- each point of the score map can represent the similarity between each position of the pixel-level feature map of the T-th frame and the predicted segmentation and annotation image of the T-1-th frame.
- for example, the 22×22×128 mapped feature map is convolved with the 6×6×128 mapped feature map serving as a 6×6 convolution kernel, yielding a 17×17×1 score map.
- a point of the 17×17×1 score map can represent the similarity between a 15×15×3 region of the 255×255×3 pixel-level feature map and the 127×127×3 predicted segmentation annotation image.
- a point in the score map corresponds to a 15×15×3 region of the pixel-level feature map.
- the above-mentioned execution body can also find the position with the highest similarity in the T-th frame based on the score map of the T-th frame, and reversely calculate the position of the target in the T-1th frame, so as to verify the accuracy of the score map of the T-th frame.
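- a minimal sketch of this cross-correlation step in PyTorch, assuming the Siamese-tracker-style formulation suggested by the shapes above (tensor contents are random placeholders):

```python
import torch
import torch.nn.functional as F

fz = torch.randn(1, 128, 6, 6)    # mapped feature map of frame T-1, used as the kernel
fx = torch.randn(1, 128, 22, 22)  # mapped feature map of frame T
score_map = F.conv2d(fx, fz)      # (1, 1, 17, 17), since 22 - 6 + 1 = 17
# Each of the 17x17 points scores one location of frame T against the predicted
# target of frame T-1; the argmax indicates the most likely target position.
best_idx = score_map.flatten().argmax()
```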
- the feature extraction method provided by the embodiments of the present disclosure first obtains the predicted target segmentation annotation image of the T-1th frame in the video and the pixel-level feature map of the T-th frame; then performs feature mapping on each of them, obtaining the mapped feature map of the T-1th frame and the mapped feature map of the T-th frame; and finally convolves the mapped feature map of the T-th frame with the convolution kernel of the mapped feature map of the T-1th frame to obtain the score map of the T-th frame. The pixel-level feature map of the subsequent frame is taken as a whole as input, and the similarity matching between the feature maps of the preceding and subsequent frames is calculated directly, which saves computation.
- FIG. 3 shows a scene diagram in which the feature extraction method according to the embodiment of the present disclosure can be implemented.
- z represents the 127×127×3 predicted segmentation annotation image of the T-1th frame.
- x represents the 255×255×3 pixel-level feature map of frame T.
- φ represents a feature mapping operation that maps the original image to a specific feature space, where the convolutional and pooling layers in a CNN are used. After z passes through φ, a 6×6×128 mapped feature map is obtained. In the same way, x passes through φ to obtain a 22×22×128 mapped feature map.
- a point in the 17×17×1 score map can represent the similarity between a 15×15×3 region of the 255×255×3 pixel-level feature map and the 127×127×3 predicted segmentation annotation image.
- a point in the score map corresponds to a 15×15×3 region of the pixel-level feature map.
- the feature fusion method includes the following steps:
- Step 401 obtaining the predicted segmentation and annotation image of the T-1th frame in the video and the pixel-level feature map of the Tth frame.
- Step 402 perform feature mapping on the predicted segmentation and labeling image of the T-1th frame and the pixel-level feature map of the T-th frame, respectively, to obtain the mapping feature map of the T-1th frame and the mapping feature map of the T-th frame.
- Step 403 Convolve the mapped feature map of the Tth frame with the convolution kernel of the mapped feature map of the T-1th frame, to obtain a score map of the Tth frame.
- steps 401-403 have been described in detail in steps 201-203 in the embodiment shown in FIG. 2, and are not repeated here.
- Step 404 Obtain the pixel-level feature map of the reference frame in the video, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the reference frame to obtain a first matching feature map of the T-th frame.
- in this embodiment, the execution body of the feature extraction method can obtain the pixel-level feature map of the reference frame in the video, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the reference frame to obtain the first matching feature map of the T-th frame.
- the reference frame, which is usually the first frame in the video, has a segmentation annotation image. By segmenting and labeling the target in the reference frame, the segmentation annotation image of the reference frame can be obtained.
- the segmentation annotations here are usually manual segmentation annotations.
- when applied to the FEELVOS (Fast End-to-End Embedding Learning for Video Object Segmentation) network, the above execution body can directly match the pixel-level feature map of the T-th frame with the pixel-level feature map of the reference frame.
- the above-mentioned execution body can also first separate the pixel-level feature map of the reference frame into the foreground pixel-level feature map and the background pixel-level feature map of the reference frame, and then match them with the pixel-level feature map of the T-th frame.
- the foreground refers to the object in the picture that is in front of the target or even close to the camera.
- the background refers to the object in the picture that is behind the target and away from the camera.
- the first matching feature map belongs to the pixel-level feature map, and each point of the first matching feature map can represent the matching degree at each point between the pixel-level feature map of the T-th frame and the pixel-level feature map of the reference frame.
- Step 405 Obtain the pixel-level feature map of the T-1th frame, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the T-1th frame to obtain a second matching feature map of the Tth frame.
- in this embodiment, the above-mentioned execution body may obtain the pixel-level feature map of the T-1th frame, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the T-1th frame to obtain the second matching feature map of the T-th frame.
- the above execution body can directly match the pixel-level feature map of the T-th frame with the pixel-level feature map of the T-1th frame, or it can first separate the pixel-level feature map of the T-1th frame into the foreground pixel-level feature map (Pixel-level FG) and the background pixel-level feature map (Pixel-level BG) of the T-1th frame, and then match them with the pixel-level feature map of the T-th frame.
- the second matching feature map belongs to the pixel-level feature map, and each point thereof can represent the matching degree of the pixel-level feature map of the Tth frame and the pixel-level feature map of the T-1th frame at each point.
- Step 406 fuse the score map, the first matching feature map, and the second matching feature map of the T-th frame to obtain a fused pixel-level feature map.
- the execution subject may fuse the score map, the first matching feature map, and the second matching feature map of the T-th frame to obtain a fused pixel-level feature map.
- a fused pixel-level feature map can be obtained by performing a concat operation on the score map, the first matching feature map, and the second matching feature map of the T-th frame.
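- a minimal sketch of this fusion step (the "concat operation") in PyTorch, assuming the three maps have already been brought to the same spatial size; channel counts are illustrative:

```python
import torch

score_map = torch.randn(1, 1, 30, 30)   # score map of frame T
match_ref = torch.randn(1, 1, 30, 30)   # first matching feature map (vs. reference frame)
match_prev = torch.randn(1, 1, 30, 30)  # second matching feature map (vs. frame T-1)
fused = torch.cat([score_map, match_ref, match_prev], dim=1)  # channel-wise concat -> (1, 3, 30, 30)
```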
- steps 401-403, step 404 and step 405 may be executed simultaneously, or some may be executed before others; the execution order is not limited here.
- the feature fusion method extracts the features of the subsequent frame in combination with the features of the previous frame, so that the information between the preceding and subsequent frames can be better extracted.
- the feature matching is performed based on the reference frame and the previous frame respectively.
- the network structure is simple and fast, and the matching features of the subsequent frame can be obtained quickly, which reduces the workload of feature matching.
- the score map, the first matching feature map and the second matching feature map of the T-th frame are fused to obtain the fused pixel-level feature map, so that the fused pixel-level feature map fully considers the features of the preceding and subsequent frames; its information content is richer and it includes more of the information required to segment the target.
- the segmentation prediction method includes the following steps:
- Step 501 Obtain the predicted segmentation and annotation image of the T-1th frame in the video and the pixel-level feature map of the Tth frame.
- Step 502 Perform feature mapping on the predicted segmented and labeled image of the T-1th frame and the pixel-level feature map of the Tth frame, respectively, to obtain the mapped feature map of the T-1th frame and the mapped feature map of the Tth frame.
- Step 503 Convolve the mapped feature map of the T-th frame with the convolution kernel of the mapped feature map of the T-1th frame, to obtain a score map of the T-th frame.
- steps 501-503 have been described in detail in steps 401-403 in the embodiment shown in FIG. 4, and will not be repeated here.
- Step 504 down-sampling the segmented and labeled images of the reference frame to obtain a mask of the reference frame.
- the execution body of the feature extraction method may downsample (Downsample) the segmented annotation image (Groundtruth) of the reference frame to obtain the mask of the reference frame.
- the segmentation annotation image of the reference frame may be an image generated by labeling the edge of the target in the reference frame, and then setting the pixels belonging to the target and the pixels not belonging to the target to different pixel values. For example, pixels belonging to the target are set to 1 and pixels not belonging to the target are set to 0; or, conversely, pixels belonging to the target are set to 0 and pixels not belonging to the target are set to 1.
- downsampling, i.e., shrinking the image, mainly serves to make the image fit the size of the display area and to generate thumbnails of the image. The principle is: for an image of size M×N, each s×s window of the image is turned into a single pixel whose value is usually the average of all pixels in the window, yielding an image of size (M/s)×(N/s), where M, N and s are positive integers and s is a common divisor of M and N.
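- a minimal sketch of this downsampling rule in PyTorch (average pooling over s×s windows; sizes are illustrative):

```python
import torch
import torch.nn.functional as F

groundtruth = torch.randint(0, 2, (1, 1, 256, 256)).float()  # binary annotation image (M = N = 256)
s = 8                                                        # s divides both M and N
mask = F.avg_pool2d(groundtruth, kernel_size=s)              # (1, 1, 32, 32) downsampled mask
```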
- the mask of the reference frame can be used to extract regions of interest from the pixel-level feature map of the reference frame. For example, an AND operation of the mask of the reference frame with the pixel-level feature map of the reference frame yields the region-of-interest image, which contains only the foreground or only the background.
- Step 505 Input the reference frame into a pre-trained feature extraction network to obtain a pixel-level feature map of the reference frame.
- the above-mentioned execution subject may input the reference frame to a pre-trained feature extraction network to obtain a pixel-level feature map of the reference frame.
- the reference frame is input to the backbone network in the CFBI network for pixel-level feature extraction, and the pixel-level feature map of the reference frame can be obtained.
- Step 506 using the mask of the reference frame to perform pixel-level separation on the pixel-level feature map of the reference frame, to obtain the foreground pixel-level feature map and the background pixel-level feature map of the reference frame.
- the above-mentioned execution body may use the mask of the reference frame to perform pixel-level separation (Pixel Separation) on the pixel-level feature map of the reference frame to obtain the foreground pixel-level feature map and the background pixel-level feature map of the reference frame.
- for example, ANDing a mask whose foreground pixels are 1 and whose background pixels are 0 with the pixel-level feature map yields the foreground pixel-level feature map.
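- a minimal sketch of pixel separation in PyTorch, where elementwise multiplication with a binary mask plays the role of the AND operation (shapes are illustrative):

```python
import torch

features = torch.randn(1, 128, 32, 32)              # pixel-level feature map of the reference frame
mask = torch.randint(0, 2, (1, 1, 32, 32)).float()  # downsampled mask: 1 = foreground, 0 = background
fg_features = features * mask                       # foreground pixel-level feature map
bg_features = features * (1.0 - mask)               # background pixel-level feature map
```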
- Step 507 perform foreground-background global matching on the pixel-level feature map of the T-th frame, the foreground pixel-level feature map and the background pixel-level feature map of the reference frame, and obtain the first matching feature map of the T-th frame.
- in this embodiment, the above-mentioned execution body may perform foreground-background global matching (F-G Global Matching) between the pixel-level feature map of the T-th frame and the foreground pixel-level feature map and background pixel-level feature map of the reference frame, to obtain the first matching feature map of the T-th frame.
- the pixel-level feature map of the T-th frame is globally matched with the foreground pixel-level feature map and the background pixel-level feature map of the reference frame, respectively.
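- the disclosure does not spell out the matching function; the sketch below assumes a CFBI-style formulation, in which each pixel of frame T is scored by its minimum embedding distance to the foreground (or background) pixels of the reference frame:

```python
import torch

def global_match(feat_t: torch.Tensor, feat_ref: torch.Tensor, mask_ref: torch.Tensor) -> torch.Tensor:
    """feat_t, feat_ref: (C, H, W) embeddings; mask_ref: (H, W), 1 = foreground."""
    C, H, W = feat_t.shape
    t = feat_t.reshape(C, -1).t()                               # (H*W, C)
    fg = feat_ref.reshape(C, -1).t()[mask_ref.reshape(-1) > 0]  # (N_fg, C) foreground embeddings
    dist = torch.cdist(t, fg)                                   # pairwise distances (H*W, N_fg)
    return dist.min(dim=1).values.reshape(H, W)                 # per-pixel distance to nearest fg pixel

feat_t, feat_ref = torch.randn(64, 30, 30), torch.randn(64, 30, 30)
mask_ref = torch.zeros(30, 30)
mask_ref[10:20, 10:20] = 1.0
fg_match = global_match(feat_t, feat_ref, mask_ref)        # foreground matching map
bg_match = global_match(feat_t, feat_ref, 1.0 - mask_ref)  # background matching map
```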
- Step 508 down-sampling the predicted segmentation and labeling image of the T-1th frame to obtain a mask of the T-1th frame.
- the above-mentioned execution subject may down-sample the predicted segmentation and annotation image of the T-1th frame to obtain the mask of the T-1th frame.
- the segmentation annotation image of the T-1th frame may be an image generated by labeling the edge of the target in the T-1th frame, and then setting the pixels belonging to the target and the pixels not belonging to the target to different pixel values. For example, pixels belonging to the target are set to 1 and pixels not belonging to the target are set to 0, or vice versa.
- the mask at frame T-1 can be used to extract regions of interest from the pixel-level feature map at frame T-1.
- the image of the region of interest can be obtained by ANDing the mask of the T-1th frame with the pixel-level feature map of the T-1th frame.
- the region of interest image includes only one of foreground or background.
- Step 509 Input the T-1th frame to the pre-trained feature extraction network to obtain a pixel-level feature map of the T-1th frame.
- the above-mentioned execution subject may input the T-1 th frame to a pre-trained feature extraction network to obtain a pixel-level feature map of the T-1 th frame.
- the T-1th frame is input to the backbone network in the CFBI network for pixel-level feature extraction, and the pixel-level feature map of the T-1th frame can be obtained.
- Step 510 using the mask of the T-1th frame to perform pixel-level separation on the pixel-level feature map of the T-1th frame, to obtain the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame.
- in this embodiment, the above-mentioned execution body may use the mask of the T-1th frame to perform pixel-level separation on the pixel-level feature map of the T-1th frame, to obtain the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame.
- a mask with a foreground pixel of 1 and a background pixel of 0 perform an AND operation with the pixel-level feature map to obtain the foreground pixel-level feature map.
- Step 511 perform foreground-background multi-local matching on the pixel-level feature map of the T-th frame, the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame, and obtain a second matching feature map of the T-th frame.
- in this embodiment, the above-mentioned execution body may perform foreground-background multi-local matching (F-G Multi-Local Matching) between the pixel-level feature map of the T-th frame and the foreground pixel-level feature map and background pixel-level feature map of the T-1th frame, to obtain the second matching feature map of the T-th frame.
- the pixel-level feature map of the T-th frame, the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame are respectively subjected to multi-local matching.
- multi-local matching sets multiple windows from small to large, and performs local matching within each window.
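- a hedged sketch of one plausible multi-local matching in PyTorch (the exact window sizes and similarity are not given in the disclosure; here each pixel of frame T is compared against the k×k neighbourhood of the same location in frame T-1, for several window sizes k):

```python
import torch
import torch.nn.functional as F

def local_match(feat_t: torch.Tensor, feat_prev: torch.Tensor, k: int) -> torch.Tensor:
    """feat_t, feat_prev: (C, H, W). Best dot-product match within a k x k local window."""
    C, H, W = feat_t.shape
    neigh = F.unfold(feat_prev[None], k, padding=k // 2)  # (1, C*k*k, H*W) local neighbourhoods
    neigh = neigh.reshape(C, k * k, H * W)
    sim = (feat_t.reshape(C, 1, H * W) * neigh).sum(0)    # (k*k, H*W) dot-product similarities
    return sim.max(dim=0).values.reshape(H, W)            # best local match per pixel

feat_t, feat_prev = torch.randn(64, 30, 30), torch.randn(64, 30, 30)
# windows from small to large, one local matching per window
multi_local = torch.stack([local_match(feat_t, feat_prev, k) for k in (3, 5, 7)])
```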
- Step 512 fuse the score map, the first matching feature map, and the second matching feature map of the T-th frame to obtain a fused pixel-level feature map.
- step 512 has been described in detail in step 406 in the embodiment shown in FIG. 4 , and will not be repeated here.
- Step 513 Perform global pooling on the feature channel of the foreground pixel-level feature map and the background pixel-level feature map of the reference frame to obtain the foreground instance-level feature vector and the background instance-level feature vector of the reference frame.
- the above-mentioned executive body may globally pool the foreground pixel-level feature map and background pixel-level feature map of the reference frame on the feature channel to obtain the foreground instance-level feature vector (Instance-level FG) and Background instance-level feature vector (Instance-level BG).
- the foreground pixel feature map and the background pixel feature map are globally pooled on the feature channel, and the pixel-scale feature map is converted into an instance-scale pooling vector.
- the pooling vector adjusts the channels of features in the Collaborative Ensembler of the CFBI network based on the attention mechanism. As a result, the network can better obtain instance-scale information.
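- a minimal sketch of this pooling step in PyTorch; the channel re-weighting shown at the end is an assumption (a squeeze-and-excitation-style gate), since the disclosure only states that the vector adjusts feature channels via attention:

```python
import torch
import torch.nn.functional as F

fg_map = torch.randn(1, 128, 32, 32)                  # foreground pixel-level feature map
fg_vec = F.adaptive_avg_pool2d(fg_map, 1).flatten(1)  # (1, 128) foreground instance-level vector

ensembler_feats = torch.randn(1, 128, 32, 32)         # features inside the Collaborative Ensembler
gated = ensembler_feats * torch.sigmoid(fg_vec)[..., None, None]  # attention-style channel gating
```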
- Step 514 Perform global pooling on the feature channel of the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame to obtain the foreground instance-level feature vector and the background instance-level feature vector of the T-1th frame.
- the above-mentioned executive body may globally pool the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame on the feature channel to obtain the foreground instance-level feature vector of the T-1th frame and Background instance-level feature vector.
- the foreground pixel feature map and the background pixel feature map are globally pooled on the feature channel, and the pixel-scale feature map is converted into an instance-scale pooling vector.
- the pooling vector adjusts the channels of features in the Collaborative Ensembler of the CFBI network based on the attention mechanism. As a result, the network can better obtain instance-scale information.
- Step 515 fuse the foreground instance-level feature vector and the background instance-level feature vector of the reference frame, and the foreground instance-level feature vector and the background instance-level feature vector of the T-1th frame to obtain a fused instance-level feature vector.
- in this embodiment, the above-mentioned execution body may fuse the foreground instance-level feature vector and the background instance-level feature vector of the reference frame with the foreground instance-level feature vector and the background instance-level feature vector of the T-1th frame, to obtain a fused instance-level feature vector. For example, the fused instance-level feature vector can be obtained by concatenating the foreground and background instance-level feature vectors of the reference frame with the foreground and background instance-level feature vectors of the T-1th frame.
- Step 516 Input the low-level pixel-level feature map of the T-th frame, the fused pixel-level feature vector, and the fused instance-level feature vector into the Collaborative Ensembler, to obtain the predicted segmentation annotation image of the T-th frame.
- in this embodiment, the above-mentioned execution body may input the low-level pixel-level feature map (low-level feature) of the T-th frame, the fused pixel-level feature vector, and the fused instance-level feature vector into the Collaborative Ensembler, to obtain the predicted segmentation annotation image (Prediction T) of the T-th frame. The target in the T-th frame can then be obtained by segmenting the T-th frame based on this predicted segmentation annotation image.
- the Collaborative Ensembler is used to construct a large receptive field for accurate prediction.
- the segmentation prediction method provided by the embodiments of the present disclosure performs embedding learning not only from foreground pixels but also from background pixels for collaboration, and contrasts the features of the foreground and background to alleviate background confusion, thereby improving the accuracy of the segmentation prediction results. With the cooperation of foreground and background pixels, embedding matching is further performed at both the pixel level and the instance level. For pixel-level matching, the robustness of local matching under various target moving rates is improved. For instance-level matching, an attention mechanism is designed to effectively enhance pixel-level matching. Based on the CFBI network, the idea of a tracking network is added, so that the information between the preceding and subsequent frames can be better extracted. This is equivalent to adding an additional layer of supervision signals to the CFBI network, and the extracted features better represent what the model needs, thereby improving the segmentation effect of the network.
- the feature extraction method can be used not only in the CFBI network, but also in other VOS networks, and the location of the embedded network can be adjusted according to the actual situation.
- FIG. 6 shows a scene diagram in which the segmentation prediction method according to the embodiment of the present disclosure can be implemented.
- input the 1st frame, the T-1th frame and the Tth frame of the video into the Backbone of the CFBI network to obtain the Pixel-level Embedding of the 1st frame, the T-1th frame and the Tth frame, and Downsample the Groundtruth of the 1st frame and the Prediction T-1 of the T-1th frame to obtain the Masks of the 1st frame and the T-1th frame.
- Convolve the mapped feature map of the Pixel-level Embedding of the T-th frame with the convolution kernel of the mapped feature map of the Prediction T-1 of the T-1th frame, to obtain the Score map of the T-th frame.
- the present disclosure provides an embodiment of a feature extraction apparatus.
- the apparatus embodiment corresponds to the method embodiment shown in FIG. 2 .
- the apparatus may be applied to various electronic devices.
- the feature extraction apparatus 700 in this embodiment may include: an acquisition module 701 , a mapping module 702 and a convolution module 703 .
- the acquisition module 701 is configured to acquire the predicted segmentation and annotation image of the T-1th frame and the pixel-level feature map of the T-th frame in the video, where T is a positive integer greater than 2;
- the mapping module 702 is configured to perform feature mapping on the predicted segmentation annotation image of the T-1th frame and the pixel-level feature map of the T-th frame respectively, to obtain the mapped feature map of the T-1th frame and the mapped feature map of the T-th frame.
- the convolution module 703 is configured to convolve the mapped feature map of the T-th frame with the convolution kernel of the mapped feature map of the T-1th frame to obtain a score map of the T-th frame, where each point of the score map represents the similarity between a position of the pixel-level feature map of the T-th frame and the predicted segmentation annotation image of the T-1th frame.
- for the specific processing of the acquisition module 701, the mapping module 702 and the convolution module 703 and the technical effects they bring, reference may be made to the related descriptions of steps 201-203 in the embodiment corresponding to FIG. 2, which are not repeated here.
- in some optional implementations, the mapping module 702 is further configured to: use the convolutional layer and the pooling layer in the convolutional neural network to map the predicted segmentation annotation image of the T-1th frame and the pixel-level feature map of the T-th frame, respectively, to a preset feature space.
- in some optional implementations, the feature extraction apparatus 700 further includes: a first matching module, configured to acquire a pixel-level feature map of a reference frame in the video, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the reference frame to obtain a first matching feature map of the T-th frame, where the reference frame has a segmentation annotation image; a second matching module, configured to acquire the pixel-level feature map of the T-1th frame, and match the pixel-level feature map of the T-th frame with the pixel-level feature map of the T-1th frame to obtain a second matching feature map of the T-th frame; and a first fusion module, configured to fuse the score map, the first matching feature map and the second matching feature map of the T-th frame to obtain a fused pixel-level feature map.
- in some optional implementations, the first matching module is further configured to: downsample the segmentation annotation image of the reference frame to obtain a mask of the reference frame; input the reference frame into a pre-trained feature extraction network to obtain the pixel-level feature map of the reference frame; perform pixel-level separation on the pixel-level feature map of the reference frame by using the mask of the reference frame, to obtain the foreground pixel-level feature map and the background pixel-level feature map of the reference frame; and perform foreground-background global matching between the pixel-level feature map of the T-th frame and the foreground and background pixel-level feature maps of the reference frame, to obtain the first matching feature map of the T-th frame.
- in some optional implementations, the second matching module is further configured to: downsample the predicted segmentation annotation image of the T-1th frame to obtain a mask of the T-1th frame; input the T-1th frame into the pre-trained feature extraction network to obtain the pixel-level feature map of the T-1th frame; perform pixel-level separation on the pixel-level feature map of the T-1th frame by using the mask of the T-1th frame, to obtain the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame; and perform foreground-background multi-local matching between the pixel-level feature map of the T-th frame and the foreground and background pixel-level feature maps of the T-1th frame, to obtain the second matching feature map of the T-th frame.
- in some optional implementations, the feature extraction apparatus 700 further includes: a first pooling module, configured to perform global pooling on the foreground pixel-level feature map and the background pixel-level feature map of the reference frame over the feature channels, to obtain the foreground instance-level feature vector and the background instance-level feature vector of the reference frame; a second pooling module, configured to perform global pooling on the foreground pixel-level feature map and the background pixel-level feature map of the T-1th frame over the feature channels, to obtain the foreground instance-level feature vector and the background instance-level feature vector of the T-1th frame; and a second fusion module, configured to fuse the foreground and background instance-level feature vectors of the reference frame with the foreground and background instance-level feature vectors of the T-1th frame, to obtain a fused instance-level feature vector.
- in some optional implementations, the feature extraction apparatus 700 further includes: a prediction module, configured to input the low-level pixel-level feature map of the T-th frame, the fused pixel-level feature vector and the fused instance-level feature vector into the Collaborative Ensembler, to obtain the predicted segmentation annotation image of the T-th frame.
- the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
- FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure.
- Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
- Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smart phones, wearable devices, and other similar computing devices.
- the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
- the device 800 includes a computing unit 801 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the device 800.
- the computing unit 801, the ROM 802, and the RAM 803 are connected to each other through a bus 804.
- An input/output (I/O) interface 805 is also connected to bus 804 .
- Various components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, etc.; an output unit 807, such as various types of displays, speakers, etc.; a storage unit 808, such as a magnetic disk, an optical disk, etc.; and a communication unit 809, such as a network card, a modem, a wireless communication transceiver, and the like.
- the communication unit 809 allows the device 800 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
- Computing unit 801 may be various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various specialized artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, etc.
- the computing unit 801 performs the various methods and processes described above, such as the feature extraction method.
- the feature extraction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 808 .
- part or all of the computer program may be loaded and/or installed on device 800 via ROM 802 and/or communication unit 809.
- When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the feature extraction method described above may be performed.
- the computing unit 801 may be configured to perform the feature extraction method by any other suitable means (eg, by means of firmware).
- Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
- These various embodiments may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
- Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
- the program code may execute entirely on the machine, partly on the machine, partly on the machine and partly on a remote machine as a stand-alone software package or entirely on the remote machine or server.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
- the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
- machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
- to provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
- Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
- the systems and techniques described herein may be implemented on a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user may interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components.
- the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
- a computer system can include clients and servers.
- Clients and servers are generally remote from each other and usually interact through a communication network.
- the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
- the server can be a cloud server, a distributed system server, or a server combined with blockchain.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Claims (17)
- 一种特征提取方法,包括:获取视频中的第T-1帧的预测目标分割标注图像和第T帧的像素级特征图,其中,T为大于2的正整数;对所述第T-1帧的预测目标分割标注图像和所述第T帧的像素级特征图分别进行特征映射,得到所述第T-1帧的映射特征图和所述第T帧的映射特征图;利用所述第T-1帧的映射特征图的卷积核对所述第T帧的映射特征图卷积,得到所述第T帧的得分图,其中,所述得分图的各个点表征所述第T帧的像素级特征图的各个位置与所述第T-1帧的预测目标分割标注图像的相似度。
- 根据权利要求1所述的方法,其中,所述对所述第T-1帧的预测目标分割标注图像和所述第T帧的像素级特征图分别进行特征映射,包括:采用卷积神经网络中的卷积层和池化层,分别将所述第T-1帧的预测目标分割标注图像和所述第T帧的像素级特征图映射到预设的特征空间。
- 根据权利要求1或2所述的方法,其中,所述方法还包括:获取所述视频中的参考帧的像素级特征图,以及将所述第T帧的像素级特征图与所述参考帧的像素级特征图进行匹配,得到所述第T帧的第一匹配特征图,其中,所述参考帧具有目标分割标注图像;获取所述第T-1帧的像素级特征图,以及将所述第T帧的像素级特征图与所述第T-1帧的像素级特征图进行匹配,得到所述第T帧的第二匹配特征图;将所述第T帧的得分图、第一匹配特征图和第二匹配特征图进行融合,得到融合像素级特征图。
- The method according to claim 3, wherein acquiring the pixel-level feature map of the reference frame of the video, and matching the pixel-level feature map of frame T against the pixel-level feature map of the reference frame to obtain the first matching feature map of frame T, comprises: down-sampling the object segmentation annotation image of the reference frame to obtain a mask of the reference frame; inputting the reference frame into a pre-trained feature extraction network to obtain the pixel-level feature map of the reference frame; performing pixel-level separation on the pixel-level feature map of the reference frame using the mask of the reference frame to obtain a foreground pixel-level feature map and a background pixel-level feature map of the reference frame; and performing foreground-background global matching between the pixel-level feature map of frame T and the foreground and background pixel-level feature maps of the reference frame to obtain the first matching feature map of frame T.
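A sketch of the claim-4 pipeline (annotation down-sampling, pixel-level separation by mask, and foreground-background global matching), again assuming PyTorch; the claim does not name a similarity measure, so per-position maximum cosine similarity over all reference positions is assumed here:

```python
import torch
import torch.nn.functional as F

def fg_bg_global_match(curr_feat, ref_feat, ref_annotation):
    """curr_feat, ref_feat: (B, C, H, W) pixel-level feature maps of frame T
    and the reference frame; ref_annotation: (B, 1, H0, W0) binary object
    segmentation annotation of the reference frame."""
    # Down-sample the annotation to feature resolution to obtain the mask.
    mask = F.interpolate(ref_annotation, size=ref_feat.shape[-2:], mode="nearest")
    # Pixel-level separation into foreground and background feature maps.
    fg, bg = ref_feat * mask, ref_feat * (1.0 - mask)
    b, c, h, w = curr_feat.shape
    q = F.normalize(curr_feat.flatten(2), dim=1)            # (B, C, H*W)
    channels = []
    for feats in (fg, bg):
        k = F.normalize(feats.flatten(2), dim=1)            # (B, C, H*W)
        sim = torch.einsum("bcq,bck->bqk", q, k)            # all-pairs cosine sim
        channels.append(sim.max(dim=-1).values.view(b, 1, h, w))
    return torch.cat(channels, dim=1)                       # (B, 2, H, W)
```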
- The method according to claim 4, wherein acquiring the pixel-level feature map of frame T-1, and matching the pixel-level feature map of frame T against the pixel-level feature map of frame T-1 to obtain the second matching feature map of frame T, comprises: down-sampling the predicted object segmentation annotation image of frame T-1 to obtain a mask of frame T-1; inputting frame T-1 into a pre-trained feature extraction network to obtain the pixel-level feature map of frame T-1; performing pixel-level separation on the pixel-level feature map of frame T-1 using the mask of frame T-1 to obtain a foreground pixel-level feature map and a background pixel-level feature map of frame T-1; and performing foreground-background multi-local matching between the pixel-level feature map of frame T and the foreground and background pixel-level feature maps of frame T-1 to obtain the second matching feature map of frame T.
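The multi-local matching of claim 5 is read here as the same foreground-background comparison restricted to a local window around each position, which is one plausible realization; the window size and mechanics are not fixed by the claim:

```python
import torch
import torch.nn.functional as F

def fg_bg_multi_local_match(curr_feat, prev_feat, prev_mask, radius: int = 2):
    """curr_feat, prev_feat: (B, C, H, W) feature maps of frames T and T-1;
    prev_mask: (B, 1, H, W) mask down-sampled from frame T-1's predicted
    annotation. Similarity is taken only within a (2r+1)x(2r+1) window."""
    b, c, h, w = curr_feat.shape
    k = 2 * radius + 1
    q = F.normalize(curr_feat, dim=1)
    channels = []
    for feats in (prev_feat * prev_mask, prev_feat * (1.0 - prev_mask)):
        kf = F.normalize(feats, dim=1)
        # Gather the local neighbourhood of every frame T-1 position.
        patches = F.unfold(kf, kernel_size=k, padding=radius)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, k * k, h * w)
        sim = (q.flatten(2).unsqueeze(2) * patches).sum(dim=1)  # (B, k*k, H*W)
        channels.append(sim.max(dim=1).values.view(b, 1, h, w))
    return torch.cat(channels, dim=1)                           # (B, 2, H, W)
```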
- The method according to claim 5, further comprising: globally pooling the foreground and background pixel-level feature maps of the reference frame over the feature channels to obtain a foreground instance-level feature vector and a background instance-level feature vector of the reference frame; globally pooling the foreground and background pixel-level feature maps of frame T-1 over the feature channels to obtain a foreground instance-level feature vector and a background instance-level feature vector of frame T-1; and fusing the foreground and background instance-level feature vectors of the reference frame with the foreground and background instance-level feature vectors of frame T-1 to obtain a fused instance-level feature vector.
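Claim 6's pooling reduces each foreground/background pixel-level feature map to one value per feature channel; concatenation is assumed for the final fusion, which the claim again leaves open:

```python
import torch

def instance_level_vectors(fg_feat: torch.Tensor, bg_feat: torch.Tensor):
    """Global pooling over the spatial grid of (B, C, H, W) feature maps
    yields (B, C) foreground and background instance-level vectors."""
    return fg_feat.mean(dim=(2, 3)), bg_feat.mean(dim=(2, 3))

def fuse_instance_vectors(ref_fg, ref_bg, prev_fg, prev_bg):
    # Fused instance-level feature vector, shape (B, 4*C); the fusion
    # operator (here: concatenation) is an assumption.
    return torch.cat([ref_fg, ref_bg, prev_fg, prev_bg], dim=1)
```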
- The method according to claim 6, further comprising: inputting a low-level pixel-level feature map of frame T, the fused pixel-level feature map, and the fused instance-level feature vector into a collaborative ensembler to obtain a predicted object segmentation annotation image of frame T.
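The collaborative ensembler of claim 7 is specified only through its inputs and output; the sketch below assumes a small convolutional decoder in which the fused instance-level vector is broadcast over the spatial grid before decoding:

```python
import torch

class CollaborativeEnsembler(torch.nn.Module):
    """Hypothetical decoder: the patent does not disclose the ensembler's
    internals, so a two-layer convolutional head is assumed here."""
    def __init__(self, low_ch: int, pix_ch: int, inst_ch: int, n_classes: int = 2):
        super().__init__()
        self.decode = torch.nn.Sequential(
            torch.nn.Conv2d(low_ch + pix_ch + inst_ch, 64, 3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(64, n_classes, kernel_size=1),
        )

    def forward(self, low_feat, fused_pix, fused_inst):
        # low_feat: (B, low_ch, H, W) low-level features of frame T;
        # fused_pix: (B, pix_ch, H, W), assumed already at the same size;
        # fused_inst: (B, inst_ch), broadcast to every spatial position.
        inst = fused_inst[:, :, None, None].expand(-1, -1, *low_feat.shape[-2:])
        return self.decode(torch.cat([low_feat, fused_pix, inst], dim=1))
```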
- A feature extraction apparatus, comprising: an acquisition module configured to acquire, from a video, a predicted object segmentation annotation image of frame T-1 and a pixel-level feature map of frame T, where T is a positive integer greater than 2; a mapping module configured to perform feature mapping on the predicted object segmentation annotation image of frame T-1 and on the pixel-level feature map of frame T, respectively, to obtain a mapped feature map of frame T-1 and a mapped feature map of frame T; and a convolution module configured to convolve the mapped feature map of frame T with a convolution kernel formed from the mapped feature map of frame T-1 to obtain a score map of frame T, where each point of the score map characterizes the similarity between the corresponding position of the pixel-level feature map of frame T and the predicted object segmentation annotation image of frame T-1.
- The apparatus according to claim 8, wherein the mapping module is further configured to: use a convolutional layer and a pooling layer of a convolutional neural network to map the predicted object segmentation annotation image of frame T-1 and the pixel-level feature map of frame T, respectively, into a preset feature space.
- The apparatus according to claim 8 or 9, further comprising: a first matching module configured to acquire a pixel-level feature map of a reference frame of the video, and to match the pixel-level feature map of frame T against the pixel-level feature map of the reference frame to obtain a first matching feature map of frame T, where the reference frame has an object segmentation annotation image; a second matching module configured to acquire a pixel-level feature map of frame T-1, and to match the pixel-level feature map of frame T against the pixel-level feature map of frame T-1 to obtain a second matching feature map of frame T; and a first fusion module configured to fuse the score map, the first matching feature map, and the second matching feature map of frame T to obtain a fused pixel-level feature map.
- The apparatus according to claim 10, wherein the first matching module is further configured to: down-sample the object segmentation annotation image of the reference frame to obtain a mask of the reference frame; input the reference frame into a pre-trained feature extraction network to obtain the pixel-level feature map of the reference frame; perform pixel-level separation on the pixel-level feature map of the reference frame using the mask of the reference frame to obtain a foreground pixel-level feature map and a background pixel-level feature map of the reference frame; and perform foreground-background global matching between the pixel-level feature map of frame T and the foreground and background pixel-level feature maps of the reference frame to obtain the first matching feature map of frame T.
- The apparatus according to claim 11, wherein the second matching module is further configured to: down-sample the predicted object segmentation annotation image of frame T-1 to obtain a mask of frame T-1; input frame T-1 into a pre-trained feature extraction network to obtain the pixel-level feature map of frame T-1; perform pixel-level separation on the pixel-level feature map of frame T-1 using the mask of frame T-1 to obtain a foreground pixel-level feature map and a background pixel-level feature map of frame T-1; and perform foreground-background multi-local matching between the pixel-level feature map of frame T and the foreground and background pixel-level feature maps of frame T-1 to obtain the second matching feature map of frame T.
- The apparatus according to claim 12, further comprising: a first pooling module configured to globally pool the foreground and background pixel-level feature maps of the reference frame over the feature channels to obtain a foreground instance-level feature vector and a background instance-level feature vector of the reference frame; a second pooling module configured to globally pool the foreground and background pixel-level feature maps of frame T-1 over the feature channels to obtain a foreground instance-level feature vector and a background instance-level feature vector of frame T-1; and a second fusion module configured to fuse the foreground and background instance-level feature vectors of the reference frame with the foreground and background instance-level feature vectors of frame T-1 to obtain a fused instance-level feature vector.
- The apparatus according to claim 13, further comprising: a prediction module configured to input a low-level pixel-level feature map of frame T, the fused pixel-level feature map, and the fused instance-level feature vector into a collaborative ensembler to obtain a predicted object segmentation annotation image of frame T.
- An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-7.
- A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.
- A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020227038225A KR20220153667A (ko) | 2021-04-13 | 2022-01-29 | Feature extraction method, apparatus, electronic device, storage medium, and computer program |
JP2022560927A JP2023525462A (ja) | 2021-04-13 | 2022-01-29 | Method, apparatus, electronic device, storage medium, and computer program for extracting features |
US17/963,865 US20230030431A1 (en) | 2021-04-13 | 2022-10-11 | Method and apparatus for extracting feature, device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110396281.7A CN112861830B (zh) | 2021-04-13 | 2021-04-13 | Feature extraction method, apparatus, device, storage medium, and program product |
CN202110396281.7 | 2021-04-13 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/963,865 Continuation US20230030431A1 (en) | 2021-04-13 | 2022-10-11 | Method and apparatus for extracting feature, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022218012A1 true WO2022218012A1 (zh) | 2022-10-20 |
Family
ID=75992531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/075069 WO2022218012A1 (zh) | Feature extraction method, apparatus, device, storage medium, and program product | 2021-04-13 | 2022-01-29 |
Country Status (5)
Country | Link |
---|---|
US (1) | US20230030431A1 (zh) |
JP (1) | JP2023525462A (zh) |
KR (1) | KR20220153667A (zh) |
CN (1) | CN112861830B (zh) |
WO (1) | WO2022218012A1 (zh) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112861830B (zh) * | 2021-04-13 | 2023-08-25 | 北京百度网讯科技有限公司 | Feature extraction method, apparatus, device, storage medium, and program product |
CN113570607B (zh) * | 2021-06-30 | 2024-02-06 | 北京百度网讯科技有限公司 | Object segmentation method and apparatus, and electronic device |
CN113610885B (zh) * | 2021-07-12 | 2023-08-22 | 大连民族大学 | Semi-supervised object video segmentation method and system using a difference contrastive learning network |
CN116580249B (zh) * | 2023-06-06 | 2024-02-20 | 河北中废通拍卖有限公司 | Auction item classification method, system, and storage medium based on an ensemble learning model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111260688A (zh) * | 2020-01-13 | 2020-06-09 | 深圳大学 | Siamese dual-path object tracking method |
CN112132232A (zh) * | 2020-10-19 | 2020-12-25 | 武汉千屏影像技术有限责任公司 | Classification and annotation method and system for medical images, and server |
CN112861830A (zh) * | 2021-04-13 | 2021-05-28 | 北京百度网讯科技有限公司 | Feature extraction method, apparatus, device, storage medium, and program product |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109214238B (zh) * | 2017-06-30 | 2022-06-28 | 阿波罗智能技术(北京)有限公司 | Multi-object tracking method, apparatus, device, and storage medium |
US10671855B2 (en) * | 2018-04-10 | 2020-06-02 | Adobe Inc. | Video object segmentation by reference-guided mask propagation |
CN108898086B (zh) * | 2018-06-20 | 2023-05-26 | 腾讯科技(深圳)有限公司 | Video image processing method and apparatus, computer-readable medium, and electronic device |
US10269125B1 (en) * | 2018-10-05 | 2019-04-23 | StradVision, Inc. | Method for tracking object by using convolutional neural network including tracking network and computing device using the same |
CN110427839B (zh) * | 2018-12-26 | 2022-05-06 | 厦门瞳景物联科技股份有限公司 | Video object detection method based on multi-layer feature fusion |
US11763565B2 (en) * | 2019-11-08 | 2023-09-19 | Intel Corporation | Fine-grain object segmentation in video with deep features and multi-level graphical models |
CN111462132A (zh) * | 2020-03-20 | 2020-07-28 | 西北大学 | Video object segmentation method and system based on deep learning |
CN111507997B (zh) * | 2020-04-22 | 2023-07-25 | 腾讯科技(深圳)有限公司 | Image segmentation method, apparatus, device, and computer storage medium |
CN112434618B (zh) * | 2020-11-26 | 2023-06-23 | 西安电子科技大学 | Video object detection method based on sparse foreground prior, storage medium, and device |
- 2021
  - 2021-04-13 CN CN202110396281.7A patent/CN112861830B/zh active Active
- 2022
  - 2022-01-29 KR KR1020227038225A patent/KR20220153667A/ko not_active Application Discontinuation
  - 2022-01-29 JP JP2022560927A patent/JP2023525462A/ja not_active Ceased
  - 2022-01-29 WO PCT/CN2022/075069 patent/WO2022218012A1/zh active Application Filing
  - 2022-10-11 US US17/963,865 patent/US20230030431A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN112861830A (zh) | 2021-05-28 |
CN112861830B (zh) | 2023-08-25 |
JP2023525462A (ja) | 2023-06-16 |
US20230030431A1 (en) | 2023-02-02 |
KR20220153667A (ko) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022218012A1 (zh) | Feature extraction method, apparatus, device, storage medium, and program product | |
WO2020199931A1 (zh) | Facial key point detection method and apparatus, storage medium, and electronic device | |
US20210004984A1 (en) | Method and apparatus for training 6d pose estimation network based on deep learning iterative matching | |
US11270158B2 (en) | Instance segmentation methods and apparatuses, electronic devices, programs, and media | |
US11222211B2 (en) | Method and apparatus for segmenting video object, electronic device, and storage medium | |
US11841921B2 (en) | Model training method and apparatus, and prediction method and apparatus | |
WO2019011249A1 (zh) | Method, apparatus, and device for determining object pose in an image, and storage medium | |
CN111783620A (zh) | Expression recognition method, apparatus, device, and storage medium | |
US11915439B2 (en) | Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image | |
CN111767853B (zh) | Lane line detection method and apparatus | |
WO2022227768A1 (zh) | Dynamic gesture recognition method, apparatus, device, and storage medium | |
WO2023273173A1 (zh) | Object segmentation method and apparatus, and electronic device | |
CN113343982A (zh) | Entity relation extraction method, apparatus, and device based on multi-modal feature fusion | |
CN113435408A (zh) | Face liveness detection method and apparatus, electronic device, and storage medium | |
CN111382647B (zh) | Picture processing method, apparatus, device, and storage medium | |
CN114092759A (zh) | Training method and apparatus for an image recognition model, electronic device, and storage medium | |
CN112528858A (zh) | Training method, apparatus, device, medium, and product for a human pose estimation model | |
CN113378712A (zh) | Training method for an object detection model, image detection method, and apparatuses thereof | |
CN113343981A (zh) | Character recognition method, apparatus, and device with visual feature enhancement | |
JP7126586B2 (ja) | Face synthesis image detection method, face synthesis image detection apparatus, electronic device, storage medium, and computer program | |
CN113569855A (zh) | Tongue image segmentation method, device, and storage medium | |
CN116363429A (zh) | Training method for an image recognition model, image recognition method, apparatus, and device | |
CN113177483B (zh) | Video object segmentation method, apparatus, device, and storage medium | |
CN114549904A (zh) | Visual processing and model training method, device, storage medium, and program product | |
CN113537359A (zh) | Training data generation method and apparatus, computer-readable medium, and electronic device | |
Legal Events
Date | Code | Title | Description
---|---|---|---
| ENP | Entry into the national phase | Ref document number: 2022560927; Country of ref document: JP; Kind code of ref document: A
| ENP | Entry into the national phase | Ref document number: 20227038225; Country of ref document: KR; Kind code of ref document: A
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22787242; Country of ref document: EP; Kind code of ref document: A1
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22787242; Country of ref document: EP; Kind code of ref document: A1