CN111242003B - Video salient object detection method based on multi-scale constrained self-attention mechanism - Google Patents

Video salient object detection method based on multi-scale constrained self-attention mechanism

Info

Publication number
CN111242003B
Authority
CN
China
Prior art keywords
video
feature
features
training
spatial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010024556.XA
Other languages
Chinese (zh)
Other versions
CN111242003A (en)
Inventor
程明明 (Ming-Ming Cheng)
顾宇超 (Yu-Chao Gu)
卢少平 (Shao-Ping Lu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202010024556.XA
Publication of CN111242003A
Application granted
Publication of CN111242003B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a video salient object detection method based on a multi-scale constrained self-attention mechanism. The method adopts a constrained attention mechanism: the similarity between a query element and multiple frames of a video clip is measured within a constrained surrounding region to generate an attention map, and the attention map is then used as weights to aggregate the feature information of those frames within the constrained region, thereby enhancing the features of the query element. In addition, a multi-branch design lets each constrained attention branch sample at a different scale range, so the resulting multi-scale constrained self-attention mechanism adapts to inputs of different scales. The method exploits the visual connection between inter-frame elements within a video clip and thereby addresses the problem of modeling the motion of salient objects across frames. A video salient object detection system built on this method achieves both a very high detection speed and high detection accuracy.

Description

Video salient object detection method based on multi-scale constrained self-attention mechanism
Technical Field
The invention belongs to the technical field of video processing, and particularly relates to a video salient object detection method based on a multi-scale constrained self-attention mechanism.
Background
Video salient object detection is used to segment the most attention-grabbing objects in a video clip. The technique commonly serves as a pre-processing step for numerous real-time applications, such as video tracking, video segmentation and human-computer interaction. Therefore, both efficiency and accuracy are important in video saliency model design.
Laurent Itti et al. showed in 1998 that, in dynamic scenes, the motion of objects attracts attention, so temporal cues play a decisive role in video saliency. Among early methods, Tao Xi et al. proposed learning salient object information in video in "Salient Object Detection With Spatiotemporal Background Priors for Video" (2016). With the rise of deep learning, early hand-crafted features gave way to different network structures that encode spatio-temporal information. Among them, Trung-Nghia Le et al. proposed extracting spatio-temporal features with 3D convolution in "Video Salient Object Detection Using Spatiotemporal Deep Features" and then constructing a spatio-temporal map to determine temporal consistency. Hongmei Song et al., in "Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection", encode temporal information mainly with convolutional long short-term memory (ConvLSTM) networks. Guanbin Li et al., in "Flow Guided Recurrent Neural Encoder for Video Salient Object Detection", use optical flow to acquire motion information. 3D convolution, optical-flow extraction and long short-term memory networks all incur a large computational overhead. In addition, these methods can only process adjacent frames one pair at a time and extract motion information step by step; they cannot acquire motion information simultaneously and directly from multiple frames of a video clip, so they are relatively slow.
Recently, Wang et al. proposed the "non-local neural network", which extends the self-attention mechanism from machine translation to video classification. It models long-range temporal dependencies by measuring the pairwise similarity between elements and aggregating information from multiple frames according to that similarity. The non-local approach, however, carries a large computation and memory overhead, and subsequent work has explored how to mitigate it. Kaiyu Yue et al., in "Compact Generalized Non-local Network", jointly model the relations between spatial and channel elements and propose a compact kernel form that reduces the computational cost. Computing pixel-by-pixel relations is particularly difficult in pixel-level dense prediction tasks, because dense prediction requires maintaining a high-resolution feature map, which makes the computation enormous. Zilong Huang et al., in "CCNet: Criss-Cross Attention for Semantic Segmentation", propose criss-cross sampling, learning pairwise relations between sparse elements for semantic segmentation and reducing the computation. In video salient object detection, however, no prior work acquires motion information by analyzing the pairwise relations between inter-frame elements, and applying these methods directly to video saliency detection still incurs a large computational overhead.
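For comparison only, the following Python/PyTorch-style sketch (an editorial illustration, not part of the invention) shows the dense non-local attention described above; its similarity matrix has (T·H·W)² entries, which is the computational burden that the constrained mechanism of the invention avoids.

```python
import torch

# Dense non-local self-attention over a clip (illustration of the baseline above):
# every element attends to every other element, so the attention matrix is
# (T*H*W) x (T*H*W) -- prohibitively large for dense prediction.
def dense_non_local(x):                  # x: (B, C, T, H, W)
    b, c, t, h, w = x.shape
    flat = x.view(b, c, t * h * w)                                # all clip elements
    att = torch.softmax(flat.transpose(1, 2) @ flat, dim=-1)      # (B, THW, THW)
    return (flat @ att.transpose(1, 2)).view(b, c, t, h, w)
```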
Disclosure of Invention
Aiming at the problems of current video salient object detection, such as low speed and the inability to model temporal relations across multiple frames, the invention provides a video salient object detection method based on a multi-scale constrained self-attention mechanism. The method groups the spatial features of multiple input frames and, within each group, measures the relations between inter-frame elements using sampling windows of different scales. By exploiting the prior that motion is spatio-temporally continuous across frames, the method reduces the overhead of dense relation sampling, quickly acquires motion information across multiple frames, and achieves fast and accurate video saliency detection.
Technical scheme of the invention
A video salient object detection method based on a multi-scale constrained self-attention mechanism comprises three stages: spatial feature training, temporal feature training and model deployment. The specific steps are as follows:
step 1, spatial feature training;
step 1.1, collect an image saliency data set;
step 1.2, preprocess the image saliency data set, including random flipping and scale transformation;
step 1.3, train a backbone network on the image saliency data set with the BP algorithm to obtain an image saliency feature extraction network;
step 2, temporal feature training;
step 2.1, collect a video saliency data set;
step 2.2, perform data enhancement on the video saliency data set, including random flipping and extracting training frames at different interval lengths;
step 2.3, during training, extract a clip of video frames from the video saliency data set and extract spatial features frame by frame with the neural network;
step 2.4, group the extracted spatial features and set a window of a different scale for each group;
step 2.5, for each group of spatial features, exploiting the fact that an object does not undergo large displacement between frames, generate an attention map for each position by measuring the similarity between the feature at that position on the feature map and the features in the spatially neighboring region of that position on the adjacent frames, where the size of the neighboring region is given by the scale window preset in step 2.4;
step 2.6, for each group of spatial features and for each position on the feature map, aggregate the temporal information of the surrounding frames using the attention map generated in step 2.5 as weights, obtaining the spatio-temporal feature of that position;
step 2.7, linearly fuse the spatio-temporal features obtained from the different groups, and pass the fused spatio-temporal features through a decoder to obtain the predicted saliency result;
step 2.8, repeat the training process until convergence to obtain the trained neural network parameters;
step 3, model deployment;
step 3.1, acquire the video to be detected;
step 3.2, split the video to be detected into frames and group the frames into mini-batches of a given size;
step 3.3, initialize the neural network and load the parameters trained in step 2.8;
step 3.4, perform video saliency prediction on each mini-batch formed in step 3.2 and assemble the detection results into a result video.
The advantages and beneficial effects of the invention are as follows: the method acquires motion information by measuring element-level relations across multiple frames of a video clip. By constraining the reference window, the query range is restricted to the positions where the object may have moved between frames, which greatly reduces the cost of modeling element-wise relations across video frames. By introducing different window sizes in different branches, the method adapts to different object scales and to the large inter-frame displacements caused by different motion speeds. The method offers a new solution for the temporal modeling of video salient object detection and achieves extremely high detection speed and accuracy.
Drawings
Fig. 1 is a flow diagram of video salient object detection based on a multi-scale constrained self-attention mechanism.
Fig. 2 is the specific architecture of the image saliency feature extraction network, which is based on MobileNetV3 with the fully connected layer removed and the convolution strides of the 3rd and 5th stages set to 1.
Fig. 3 is a schematic diagram of a sampling window setting.
Fig. 4 is a schematic diagram of video salient object detection based on the multi-scale constrained self-attention mechanism, including a schematic of the overall framework and a schematic of the constrained self-attention module.
Fig. 5 is a graph of the results of the generated attention maps.
FIG. 6 shows saliency map generation results on public data sets. The first row shows the input video frames, the second row the ground-truth annotations, and the third row the results generated by the method of the invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings, but the present invention is not limited to the following.
Referring to fig. 1, the flow of video salient object detection based on the multi-scale constrained self-attention mechanism includes three stages: spatial feature training, temporal feature training and model deployment:
Firstly, spatial feature training comprises the following steps:
S1. Collect an image saliency data set. Specifically, we use the training-set portion of the public image saliency detection data set DUTS.
S2. Preprocess the image saliency data set: first scale each image to the fixed size 224 x 224, then apply scale transformation to obtain images that are (0.5, 0.75, 1, 1.25, 1.5) times the original size. Each input image is normalized by subtracting the mean (0.485, 0.456, 0.406) and dividing by the standard deviation (0.229, 0.224, 0.225).
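As an illustration only (the patent does not prescribe a specific framework; torchvision is an assumed choice), the preprocessing above could be sketched as follows.

```python
from PIL import Image
from torchvision import transforms

MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
SCALES = (0.5, 0.75, 1.0, 1.25, 1.5)

def preprocess(img: Image.Image, scale: float = 1.0):
    # scale transform around the fixed 224 x 224 base size, random flip, normalization
    size = int(224 * scale)
    pipeline = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(MEAN, STD),
    ])
    return pipeline(img)
```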
S3. Train the backbone network with the image saliency data set and the BP algorithm. The detailed structure is shown in fig. 2: we use MobileNetV3 as the backbone network and remove its fully connected layer. To preserve the spatial information of the feature map, we set the convolution stride of the last two stages of MobileNetV3 to 1, so the spatial feature map output by the network has 1/8 of the original resolution. Training yields the image saliency feature extraction backbone.
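For illustration only, a sketch of this backbone modification is given below; the exact stage indices of torchvision's MobileNetV3 are an assumption, and the code only illustrates the idea of dropping the classifier head and keeping 1/8 resolution.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

def build_spatial_backbone():
    net = mobilenet_v3_large(pretrained=True).features   # drop avgpool and FC head
    # force the stride-2 convolutions of the later stages (roughly children 7+ in
    # torchvision's layout) to stride 1 so the output stays at 1/8 resolution
    for stage in list(net.children())[7:]:
        for m in stage.modules():
            if isinstance(m, nn.Conv2d) and m.stride == (2, 2):
                m.stride = (1, 1)
    return net
```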
The backbone trained on the image saliency data set is able to extract spatially salient objects and provides a good initialization for the temporal saliency training.
Secondly, temporal feature training comprises the following steps:
S4. Collect video saliency data sets. Specifically, we use the training-set portions of the two public video saliency data sets DAVIS and DAVSOD.
S5. Perform data enhancement on the video saliency data sets: k frames form one batch of video clips as training data; the training clips are randomly flipped horizontally, and the k frames are selected at random intervals (1 to 5) to form the training data.
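A minimal sketch of this clip sampling (the helper name and frame-list format are assumptions, not from the patent):

```python
import random

def sample_training_clip(frame_paths, k=5):
    # cap the interval so that k frames always fit inside the video
    max_interval = max(1, min(5, (len(frame_paths) - 1) // (k - 1)))
    interval = random.randint(1, max_interval)
    start = random.randint(0, len(frame_paths) - (k - 1) * interval - 1)
    clip = [frame_paths[start + i * interval] for i in range(k)]
    flip = random.random() < 0.5      # one flip decision shared by the whole clip
    return clip, flip
```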
S6. During training, a group of video clips is input; spatial features are extracted frame by frame with the pre-trained image saliency feature extraction network and concatenated along the temporal dimension.
S7. Group the spatial features of the input video clip and set windows of different scales.
This step comprises the following sub-steps. For the spatial features X of the video clip, the size of X is T x H x W x C, where T, H, W and C denote the number of frames, the height, the width and the number of feature channels of the video spatial features, respectively. Three convolutions with 1 x 1 kernels are applied to X to linearly project it into three subspaces Q, K and V, which represent the query features, the key (metric) features and the value features, respectively. Q, K and V are split along the feature channel into g feature groups, each with C/g-dimensional features Qi, Ki and Vi, where i denotes the i-th group. For each feature group a sampling window of a given scale is set. Referring to FIG. 3, we specify a window radius parameter r and a window dilation parameter d; for each query element position x_q (the gray element in the figure), the sampling window is a dilated square window on the given video clip centered on that element (the black elements in the figure are the sampled points; in the example the window radius r is 1 and the dilation d is 2). The sampling position function may be defined as:
S(x_q, K_i) = { K_i(h + j·d, w + k·d, t′) | j, k ∈ {−r, ..., r}, t′ ∈ {1, ..., T} }
where x_q = (h, w, t) denotes the position of the query element, K_i denotes the key (metric) features and T denotes the total number of frames. The sampling window is centered on the query element, at the same spatial location on successive frames. Since the motion of an object follows a continuous trajectory, this location prior lets the sampling window give a rough localization of the query element in the preceding and following frames.
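The sampling function can be made concrete with a small sketch (for illustration; the function name is not from the patent): for a query at spatial location (h, w), the sampled positions form a (2r+1) x (2r+1) dilated grid centered on (h, w) on every frame of the clip.

```python
def window_sample_positions(h, w, num_frames, r=1, d=2):
    return [(h + j * d, w + k * d, t)
            for t in range(num_frames)          # global in the temporal dimension
            for j in range(-r, r + 1)           # dilated square window in space
            for k in range(-r, r + 1)]

# e.g. with r=1, d=2 on a 4-frame clip, each query samples 4 * 9 = 36 positions
assert len(window_sample_positions(10, 10, num_frames=4)) == 36
```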
S8. For each group of spatial features, traverse all feature elements and generate an attention map by measuring the similarity between the query element's features and the elements of the surrounding frames inside the given sampling window.
This step comprises the following sub-steps. Referring to fig. 4, all elements of the grouped i-th set of query features Q_i are traversed. For each element x_q of Q_i, the key features of the elements within its sampling window are gathered from K_i, and an attention map is generated through a relation metric. For the relation metric function we use the dot product as the similarity measure. The attention map is generated as:
W_att = f(Q_i, K_i) = Q_i · S(Q_i, K_i)^T
where W_att denotes the generated attention map and S(Q_i, K_i) denotes the key features sampled within the window of each query element. We apply softmax normalization to the generated attention map. The sampling window defined in this step gives an approximate range of positions of the query element in the adjacent frames, and the similarity-based attention then locates the element in the adjacent frames more precisely. This window setting greatly reduces the number of element-similarity computations while still capturing object motion; compared with earlier methods that compute similarity over the whole feature map, the continuity of adjacent frames allows a large number of irrelevant regions to be skipped. Referring to fig. 5, which shows the window attention maps at the current frame and the fifth frame for queries made at two positions in the first frame, salient objects are attended to well across frames.
S9. For each group of spatial features, aggregate the information of the surrounding frames through attention-map weighting and update the features of the current frame.
Specifically, all feature elements of the grouped i-th set of query features Q_i are traversed. For each query element x_q of Q_i, the feature vectors of the elements inside its sampling window on the value features V_i are summed over spatial positions, weighted by the attention map W_att, and the resulting feature vector is taken as the new feature of x_q, expressed as:
Y_i = W_att · S(Q_i, V_i)
where Y_i denotes the spatio-temporal features of the i-th group after aggregating multi-frame information through this operation, and S(Q_i, V_i) denotes the value features sampled within the window of each query element. The new feature is a weighted aggregation over all input frames, with the similarity-based attention map as the weights; through this, the invention models the exchange of information between frames.
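Steps S7-S9 for a single branch can be sketched in PyTorch as follows; this is an editorial illustration under assumptions (framework, tensor layout, use of F.unfold for window gathering), not the patented implementation itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedSelfAttention(nn.Module):
    """One constrained self-attention branch: each query attends to a dilated
    (2r+1) x (2r+1) window at the same spatial location on all T frames."""

    def __init__(self, channels, radius=1, dilation=1):
        super().__init__()
        self.r, self.d = radius, dilation
        self.query = nn.Conv3d(channels, channels, kernel_size=1)
        self.key = nn.Conv3d(channels, channels, kernel_size=1)
        self.value = nn.Conv3d(channels, channels, kernel_size=1)

    def _unfold(self, x):
        # (B, C, T, H, W) -> (B, T, C, K, H*W), K = (2r+1)^2 window samples per position
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        k = 2 * self.r + 1
        x = F.unfold(x, kernel_size=k, dilation=self.d, padding=self.r * self.d)
        return x.view(b, t, c, k * k, h * w)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        s = t * (2 * self.r + 1) ** 2          # samples per query: all frames x window
        q = self.query(x).permute(0, 2, 3, 4, 1).reshape(b, t, h * w, c)
        k_win = self._unfold(self.key(x)).permute(0, 4, 2, 1, 3).reshape(b, h * w, c, s)
        v_win = self._unfold(self.value(x)).permute(0, 4, 2, 1, 3).reshape(b, h * w, c, s)
        att = torch.einsum('btpc,bpcs->btps', q, k_win)    # dot-product similarity
        att = F.softmax(att, dim=-1)                       # softmax-normalized attention map
        y = torch.einsum('btps,bpcs->btpc', att, v_win)    # weighted aggregation of values
        return y.view(b, t, h, w, c).permute(0, 4, 1, 2, 3)
```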
S10. Linearly fuse the spatio-temporal features Y_i of the different scale groups, and pass the fused spatio-temporal features through the decoder to obtain the predicted saliency result.
Specifically, each feature group {X_i | i = 1, 2, ..., g} is given a window of a different scale, and the spatio-temporal features {Y_i | i = 1, 2, ..., g} that aggregate multi-frame information under the given scale window are obtained. The features Y_i from the different scale groups are linearly fused with a 1 x 1 convolution to obtain the multi-scale spatio-temporal features Y, and the fusion result is added to the original features X in residual form, i.e.:
X′ = X + Y
We feed the spatio-temporal features X′ into the neural network decoder to obtain the saliency prediction for this video clip.
S11. Compute the cross-entropy loss between the prediction and the data annotations, and update the parameters with an SGD optimizer until the network converges, obtaining the trained network parameters.
Thirdly, deploying the model;
the model is divided into two parts, one is an image saliency segmentation network, the specific architecture refers to step S3, the other is a time-series feature extraction module, namely a restricted self-attention module, the specific process is described in steps S7-S10, and referring to fig. 4, the restricted attention module (CSA in the figure) is involved between the feature encoder and decoder of the image saliency segmentation network. The training process comprises space significance training and time sequence significance training, wherein in the space significance training, the limited self-attention module is removed, an image data set is used for pre-training an image significance segmentation network, in the time sequence significance training, the limited attention module is added, and all modules are trained by a video data set. Through two-step training, the obtained model has the capability of extracting space-time significance, and referring to fig. 1, the following steps are provided for deploying the trained model for real application:
s12, acquiring a video to be detected
And S13, framing the video to be detected, and forming small-batch video clips by the obtained frames according to a given number to serve as the input of the network.
S14, constructing a limited video saliency detection model as shown in the figure 4, and loading model parameters obtained through training.
And S15, performing video significance prediction on the data of each video clip, and splicing detection results to form a video form for output. As shown in fig. 6, the input video segment is 5 frames, and the output is the segmentation result of the video segment. And splicing the output prediction results according to time to obtain the predicted video output.
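For completeness, a sketch of the deployment loop (S12-S15) is given below; the model interface, frame loading and video writing are placeholders, not APIs defined by the patent.

```python
import torch

@torch.no_grad()
def predict_video(model, frames, clip_len=5):
    # frames: list of preprocessed (C, H, W) tensors in temporal order
    model.eval()
    saliency_maps = []
    for i in range(0, len(frames), clip_len):
        clip = torch.stack(frames[i:i + clip_len], dim=1).unsqueeze(0)  # (1, C, T, H, W)
        pred = model(clip)                     # per-frame saliency, e.g. (1, 1, T, H, W)
        saliency_maps.extend(pred.squeeze(0).unbind(dim=1))
    return saliency_maps                       # stitch back in temporal order for output
```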

Claims (4)

1. A video salient object detection method based on a multi-scale constrained self-attention mechanism, characterized in that the method comprises three stages of spatial feature training, temporal feature training and model deployment, with the following specific steps:
step 1, spatial feature training;
step 1.1, collect an image saliency data set;
step 1.2, preprocess the image saliency data set, including random flipping and scale transformation;
step 1.3, train a backbone network on the image saliency data set with the BP algorithm to obtain an image saliency feature extraction network;
step 2, temporal feature training;
step 2.1, collect a video saliency data set;
step 2.2, perform data enhancement on the video saliency data set, including random flipping and extracting training frames at different interval lengths;
step 2.3, during training, extract a clip of video frames from the video saliency data set and extract spatial features frame by frame with the neural network;
step 2.4, group the extracted spatial features and set a window of a different scale for each group;
step 2.5, for each group of spatial features, exploiting the fact that an object does not undergo large displacement between frames, generate an attention map for each position by measuring the similarity between the feature at that position on the feature map and the features in the spatially neighboring region of that position on the adjacent frames, where the size of the neighboring region is given by the scale window preset in step 2.4;
step 2.6, for each group of spatial features and for each position on the feature map, aggregate the temporal information of the surrounding frames using the attention map generated in step 2.5 as weights, obtaining the spatio-temporal feature of that position;
step 2.7, linearly fuse the spatio-temporal features obtained from the different groups, and pass the fused spatio-temporal features through a decoder to obtain the predicted saliency result;
step 2.8, repeat the training process until convergence to obtain the trained neural network parameters;
step 3, model deployment;
step 3.1, acquire the video to be detected;
step 3.2, split the video to be detected into frames and group the frames into mini-batches of a given size;
step 3.3, initialize the neural network and load the parameters trained in step 2.8;
step 3.4, perform video saliency prediction on each mini-batch formed in step 3.2 and assemble the detection results into a result video.
2. The method for video salient object detection based on the multi-scale constrained self-attention mechanism as claimed in claim 1, wherein the step 2.4 comprises the following sub-steps: for the spatial features X of the input video clip, the size of X is T x H x W x C, where T, H, W and C respectively denote the number of frames, the height, the width and the number of feature channels of the video spatial features; three convolution kernels of size 1 x 1 are used to convolve the spatial features X of the video clip and linearly project X into three subspaces Q, K and V, where Q, K and V respectively represent the query features, the key (metric) features and the value features; Q, K and V are split along the feature channel into g feature groups, each with C/g-dimensional features, and for the features Qi, Ki and Vi of each feature group, different window radius parameters ri and window dilation parameters di are set to initialize windows of different sizes; the window is global in the temporal dimension and, in the spatial dimension, is a region centered on the query point, so the window localizes the query position well in the temporal context and facilitates the transfer of information between frames.
3. The method for video salient object detection based on the multi-scale constrained self-attention mechanism as claimed in claim 2, wherein the step 2.5 comprises the following sub-steps: for the grouped i-th group of query features Qi, all spatial positions are traversed and the feature vector at each position is extracted as the query feature vector; through the window initialized in step 2.4, the key features Ki of the elements inside the window are extracted and a dot product with the query feature Qi is computed to obtain the similarity, generating an attention map for each query position.
4. The method for video salient object detection based on the multi-scale constrained self-attention mechanism according to claim 2 or 3, wherein the step 2.6 comprises the following steps: for the grouped i-th group of query features Qi, all spatial positions are traversed, and the features of Vi within the window range are weighted and summed using the attention map generated in step 2.5.
CN202010024556.XA 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism Active CN111242003B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010024556.XA CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010024556.XA CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Publications (2)

Publication Number Publication Date
CN111242003A CN111242003A (en) 2020-06-05
CN111242003B (en) 2022-05-27

Family

ID=70872403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010024556.XA Active CN111242003B (en) 2020-01-10 2020-01-10 Video salient object detection method based on multi-scale constrained self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111242003B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528899B (en) * 2020-12-17 2022-04-12 南开大学 Image salient object detection method and system based on implicit depth information recovery
CN113591868B (en) * 2021-07-30 2023-09-01 南开大学 Video target segmentation method and system based on full duplex strategy

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9196053B1 (en) * 2007-10-04 2015-11-24 Hrl Laboratories, Llc Motion-seeded object based attention for dynamic visual imagery
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN110210278A (en) * 2018-11-21 2019-09-06 腾讯科技(深圳)有限公司 A kind of video object detection method, device and storage medium
CN109993774A (en) * 2019-03-29 2019-07-09 大连理工大学 Online Video method for tracking target based on depth intersection Similarity matching
CN110097115A (en) * 2019-04-28 2019-08-06 南开大学 A kind of saliency object detecting method based on attention metastasis
CN110288597A (en) * 2019-07-01 2019-09-27 哈尔滨工业大学 Wireless capsule endoscope saliency detection method based on attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Jingyu Lu et al., "An effective visual saliency detection method based on maximum entropy random walk", 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), 2016. *
Mohammad Shokri et al., "Salient Object Detection in Video using Deep Non-Local Neural Networks", arXiv preprint, 2018. *
Quan Rong et al., "Unsupervised Salient Object Detection via Inferring From Imperfect Saliency Models", IEEE Transactions on Multimedia, vol. 20, no. 5, 2018. *
Cong Runmin et al., "Research Progress of Video Saliency Detection" (视频显著性检测研究进展), Journal of Software (软件学报), vol. 29, no. 8, 2018. *
Wen Yahong, "Research on Image Retrieval Methods Based on Salient Object Detection" (基于显著性目标检测的图像检索方法研究), China Master's Theses Full-text Database, Information Science and Technology, 2019. *

Also Published As

Publication number Publication date
CN111242003A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN108764308B (en) Pedestrian re-identification method based on convolution cycle network
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109446889B (en) Object tracking method and device based on twin matching network
CN113516012B (en) Pedestrian re-identification method and system based on multi-level feature fusion
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN106570893A (en) Rapid stable visual tracking method based on correlation filtering
CN110796057A (en) Pedestrian re-identification method and device and computer equipment
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN107067410B (en) Manifold regularization related filtering target tracking method based on augmented samples
CN111369522B (en) Light field significance target detection method based on generation of deconvolution neural network
CN108154133B (en) Face portrait-photo recognition method based on asymmetric joint learning
CN112001278A (en) Crowd counting model based on structured knowledge distillation and method thereof
CN111242003B (en) Video salient object detection method based on multi-scale constrained self-attention mechanism
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
Zhu et al. A multi-scale and multi-level feature aggregation network for crowd counting
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN112991394B (en) KCF target tracking method based on cubic spline interpolation and Markov chain
CN113158904B (en) Twin network target tracking method and device based on double-mask template updating
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113780129A (en) Motion recognition method based on unsupervised graph sequence predictive coding and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant