CN117237842A - Pseudo tag generated video significance detection method based on time sequence features - Google Patents

Pseudo tag generated video significance detection method based on time sequence features

Info

Publication number
CN117237842A
Authority
CN
China
Prior art keywords
video
lstm
sequence
saliency
pseudo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311185586.9A
Other languages
Chinese (zh)
Inventor
徐涛
史肖丽
蔡磊
柴豪杰
赵未硕
蒋靓峣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Institute of Science and Technology
Original Assignee
Henan Institute of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Institute of Science and Technology filed Critical Henan Institute of Science and Technology
Priority to CN202311185586.9A priority Critical patent/CN117237842A/en
Publication of CN117237842A publication Critical patent/CN117237842A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a pseudo-tag-generated video saliency detection method based on time-series features, comprising the following steps: inputting the video sequences in the dataset into an LSTM model and extracting their time-series features; generating pseudo tags from the similarity between adjacent frames of a video sequence, and combining the generated pseudo tags with the real tags to form a training dataset for the LSTM model; using a confidence-aware saliency extraction scheme to score noisy-labeled samples according to sample confidence and the training progress, and selecting high-confidence samples; adopting the scoring mechanism to guide the LSTM model to extract saliency knowledge from simple to difficult samples; repeatedly training the LSTM model with the pseudo tags and the real tags; and inputting the data to be detected into the saliency detection model to obtain the saliency detection result of each frame of image. The invention improves detection accuracy on difficult samples and can effectively improve the accuracy and stability of video saliency detection.

Description

Pseudo tag generated video significance detection method based on time sequence features
Technical Field
The invention relates to the technical field of computer vision, in particular to a method for detecting the saliency of a pseudo tag generated video based on time sequence characteristics.
Background
Video saliency detection is an important research direction in computer vision. It aims to automatically identify the regions of a video that attract the most attention and provides basic support for applications such as video analysis, editing, and compression. Video saliency detection has been widely used in video advertising, video surveillance, video photography, and related fields. It identifies and analyzes the most attention-attracting regions of a video (for example, moving objects, scene changes, and illumination changes), helping a computer locate these regions automatically and thereby improving the efficiency and accuracy of video analysis, editing, compression, and other applications. Several important families of video saliency detection techniques exist: (1) Deep-learning-based video saliency detection: deep learning has become the dominant approach, and methods based on convolutional neural networks (CNNs) are currently among the most commonly used; a deep model such as a CNN can automatically learn the features of the most attention-attracting regions and detect saliency. (2) Video saliency detection based on spatio-temporal attention mechanisms: a spatio-temporal attention mechanism models object motion and scene changes in a video, and introducing it improves the accuracy and robustness of saliency detection. (3) Video saliency detection based on image segmentation: a video frame is first segmented by an image segmentation algorithm, and the segmentation result is then used for saliency detection; this improves accuracy and adapts well to complex scenes. In summary, video saliency detection is an interdisciplinary subject involving multiple fields and a wide variety of techniques; with the continuous development of the technology, its research and application will become deeper and broader.
Traditional video saliency detection methods rely mainly on hand-crafted features and shallow models, such as color, texture, and edge features. They tend to be sensitive to noise and illumination changes and perform poorly in complex scenes and dynamic backgrounds. To address these shortcomings, researchers have proposed deep-learning-based video saliency detection methods. Methods represented by the SOD (Salient Object Detection) task use convolutional neural networks (CNNs) to extract image features and obtain video saliency detection results through subsequent processing. However, these methods still have problems. First, they require a large amount of annotated data, and manual annotation is time-consuming and costly. In addition, the annotations often suffer from noise and subjectivity, which affects the generalization performance of the model. More recently, methods based on recurrent neural networks (RNNs), such as the long short-term memory network (LSTM), can model temporal information and effectively improve the accuracy and stability of video saliency detection.
Disclosure of Invention
To address the technical problems of low detection accuracy and poor stability in existing video saliency detection methods, the invention provides a pseudo-tag-generated video saliency detection method based on time-series features.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows: a pseudo tag generation video saliency detection method based on time sequence features comprises the following steps:
s1: inputting the video sequence in the data set into an LSTM model for encoding and decoding, and extracting the time sequence characteristics of the video sequence;
s2: generating a pseudo tag according to the similarity between adjacent frames in a video sequence by a pseudo tag generation algorithm based on time sequence characteristics, and putting the generated pseudo tag and a real tag together to be used as a training data set for training an LSTM model;
s3: scoring the samples with noise labels according to the confidence coefficient and training progress of the samples by using a confidence coefficient perception significance extraction scheme, and selecting samples with high confidence coefficient as new pseudo labels;
s4: a scoring mechanism is adopted to guide the LSTM model to gradually extract significance knowledge from simple to difficult;
s5: forming a new training data set by using the new pseudo tag and the image marked by each frame in the data set, and repeatedly training the LSTM model by using the new training data set to obtain a significance detection model;
s6: inputting the new training data set to be detected into the saliency detection model to obtain a saliency detection result of each frame of image in the new training data set, and obtaining a saliency map of the video sequence to be detected.
Preferably, each frame of image in each video sequence is input into a pre-trained convolutional neural network, and the convolutional neural network converts each frame of image into a feature vector with fixed dimension to obtain a feature representation of each frame of image; and inputting the characteristic representation of each frame of image into the LSTM model, and extracting the time sequence characteristic.
Preferably, the dataset is the DAVIS dataset, which provides pixel-level annotations, including foreground object segmentation and bounding box annotations, for each video sequence; during training, the video sequences of the DAVIS dataset are processed using data enhancement techniques of random cropping, horizontal flipping, and brightness adjustment;
the convolutional neural network is a classical ResNet network, pre-trained using a DAVIS dataset, the ResNet network is initialized with pre-trained weights, and ResNet network parameters are fine-tuned by a back-propagation and gradient descent algorithm.
Preferably, the LSTM model includes an input layer, an encoder, a decoder, and an output layer connected in sequence, the input layer receives an input video sequence, the encoder encodes the video sequence and extracts time-series features, the decoder maps the time-series features of the encoder to a required output space using the full connection layer, and the output layer outputs feature vectors of the time-series.
Preferably, the encoder comprises a plurality of LSTM layers and Bi-LSTM layers connected in sequence, the LSTM layers capturing short-term dependencies of the video sequence by learning time dependencies; the Bi-LSTM layer captures more comprehensive context information by processing both forward and reverse video sequences.
Preferably, the encoder comprises a first LSTM layer, a second LSTM layer, a Bi-LSTM layer, a third LSTM layer and a fourth LSTM layer which are sequentially connected, wherein the first LSTM layer receives an input video sequence and learns the time dependency relationship thereof, and captures the short-term dependency relationship of the video sequence and transmits the short-term dependency relationship to the next layer; the second LSTM layer further learns the long-term dependence of the input video sequence, memorizes information in a longer time interval and transmits the information to the next layer; the forward LSTM layer of the Bi-LSTM layer processes the order of the input sequence, and the reverse LSTM layer of the Bi-LSTM layer processes the reverse order of the input sequence; the third LSTM layer further extracts time dependency relationships in the input sequence and captures more abstract features; the fourth LSTM layer encodes the input sequence at a higher level of abstraction and generates the final time series feature representation.
Preferably, the method for generating the pseudo tag comprises the following steps: calculating similarity scores between feature vectors of the front and rear adjacent frame images, and generating pseudo tags according to the values of the similarity scores;
the frame images with high similarity scores are marked as 1 and used as pseudo labels, and the frame images with low similarity scores are marked as 0 and filtered;
projecting the pseudo-tag detection result of the previous frame image into the current frame image to generate a group of candidate pseudo tags, and selecting the best pseudo tag by computing the similarity scores of the two frames;
the calculation method of the similarity score comprises the following steps: average of squares of differences between predicted and real values:
MSE = (1/n) * Σ(actual - prediction)²
where Σ is the summation symbol; n is the total number of pixels, i.e., the number of pixels compared between the two frames; actual is the pixel gray value in the previous frame, regarded as the actual data value; prediction is the pixel gray value in the subsequent frame, regarded as the predicted data value.
Preferably, the method for selecting the sample with high confidence comprises the following steps:
for each sample in the training data set consisting of the pseudo tag and the real tag, calculating a confidence score by using information of tag noise; the confidence coefficient calculation formula is:
C = p(1 - p)/m
where C represents the confidence score, p represents the sample value carrying the noisy label, i.e., the confidence value, ranging from 0 to 1, and m represents the total number of samples.
Preferably, the implementation method of step S4 is as follows: introducing a factor ρ to dynamically adjust the gradient of the sample; as training proceeds, the factor ρ is linearly increased from 0 to 1; during training, for samples with high confidence scores, the factor ρ is set to 1;
the loss function L_csd is:
where n is the number of pixels and Φ(x_i) is the saliency prediction value of pixel x_i;
the partial derivative ∂L_csd/∂Φ(x_i) is:
where the sign function sign(k_i) ∈ {-1, 1} takes a negative or positive value respectively, k_i is the index value, and sign(k_i) determines the sign of the output according to the index value k_i.
Preferably, the method further comprises step S7: generating a new pseudo tag according to the significance detection result, and circularly executing the step S5 and the step S6 until the LSTM model is sufficiently and stably trained; the saliency detection result of each frame image is a saliency score of each pixel.
Compared with the prior art, the invention has the beneficial effects that:
1) Pseudo tag generation based on time-series features: the LSTM network models the temporal information in the video, and pseudo tags are generated automatically for training by combining image features and semantic information. Compared with existing methods that require a large amount of manually annotated data, this significantly reduces the quantity and cost of the annotation data, while also reducing noise and subjectivity in the annotations and improving the generalization performance of the LSTM model. By adding pseudo tags and confidence-aware saliency extraction, the invention improves the detection accuracy on difficult samples and further optimizes the prediction maps generated by the backbone network, so that their interiors are more uniform and their boundaries are clearer.
2) Improved detection accuracy and stability: the invention effectively improves the accuracy and stability of video saliency detection. Compared with traditional methods and existing LSTM-based methods, the proposed method performs better in complex scenes and with dynamic backgrounds.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a general flow chart of the present invention.
Fig. 2 is a network architecture diagram of the time series feature acquisition model of the present invention.
Fig. 3 is a diagram of LSTM model architecture of the present invention.
FIG. 4 is a gradient landscape of the loss function L_csd in the confidence-aware saliency extraction scheme of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the embodiment of the invention provides a pseudo-tag-generated video saliency detection method based on time-series features. First, the video sequence in the dataset is input into an LSTM model to extract time-series features, including encoding and decoding the video sequence. Then, a pseudo-tag generation algorithm based on time-series features generates pseudo tags from the similarity scores between adjacent frames. Next, the pseudo tags are scored and processed using a confidence-aware saliency extraction scheme. Finally, the video sequence to be detected is acquired and input into the saliency detection model for training until the model converges, and the saliency map of the video sequence to be detected is output. The method comprises the following steps:
s1: and inputting the video sequences in the data set into an LSTM model for encoding and decoding, and extracting the time sequence characteristics.
The present invention evaluates the proposed LSTM model on a widely used video segmentation dataset for saliency detection, namely the DAVIS dataset, one of the most important datasets for the Video Object Segmentation (VOS) task and commonly used to evaluate the performance of video segmentation algorithms. The DAVIS dataset provides high-quality pixel-level annotations, including foreground object segmentation and bounding box annotations, so researchers can use it for training and testing video segmentation algorithms. The DAVIS dataset contains 50 video sequences, each with 20 to 50 frames, multiple foreground objects, and complex backgrounds. Each video sequence provides pixel-level annotations in which foreground objects are marked white and the background is marked black. Following many outstanding saliency detection models of recent years, the present invention trains the LSTM model on the DAVIS dataset. During training, data enhancement techniques such as random cropping, horizontal flipping, and brightness adjustment are used to improve the generalization capability of the model. First, each frame of each video sequence is input into a pretrained convolutional neural network (CNN) to obtain a feature representation of the frame. The pretrained CNN is a classical ResNet image classification network, pretrained on the DAVIS dataset and appropriately adjusted according to the number of categories in the DAVIS dataset so as to learn the visual characteristics of video frames. The ResNet network is initialized with pretrained weights, which may come from training on other tasks (e.g., ImageNet classification). The ResNet parameters are fine-tuned during pretraining via back-propagation and gradient descent to suit the needs of the specific task. With the pretrained CNN, each frame is converted into a fixed-dimensional feature vector that captures high-level semantic information of the image, such as edges, textures, and object shapes. Such a feature representation reduces the dimensionality of the input data and is more amenable to extraction of time-series features by the LSTM model. These feature representations are then input into the LSTM model to extract time-series features. The raw video frames X in fig. 2 represent the original video sequence, each frame being an image, and the optimized video frames Y are obtained through pseudo-labeling and model training. The encoder of the LSTM model comprises several LSTM layers and a Bi-LSTM layer connected in sequence; these layers encode the input sequence and extract time-series features. The LSTM layers capture short-term dependencies of the input sequence by learning temporal dependencies, while the Bi-LSTM layer captures more comprehensive context information by processing both the forward and the reverse input sequence. The decoder of the LSTM model uses a Dense (fully connected) layer to map the time-series features of the encoder to the required output space. The Dense layer can perform dimensionality reduction, feature selection, or convert features into other forms to meet the requirements of the task.
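As an illustration of the per-frame feature extraction described above, the following is a minimal PyTorch sketch; the ResNet-18 variant, the 512-dimensional feature size, and the tensor shapes are assumptions for illustration, since the text only specifies a classical ResNet network that is pretrained and fine-tuned for the task.

```python
import torch
import torch.nn as nn
from torchvision import models

class FrameFeatureExtractor(nn.Module):
    """Converts each video frame into a fixed-dimensional feature vector
    using a pretrained ResNet backbone (ResNet-18 assumed here)."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Drop the classification head; keep everything up to global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.out_dim = 512  # ResNet-18 yields 512-d vectors after pooling

    def forward(self, frames):
        # frames: (batch, time, 3, H, W) -> flatten time into the batch dimension
        b, t, c, h, w = frames.shape
        x = frames.view(b * t, c, h, w)
        feats = self.features(x).flatten(1)      # (b*t, 512)
        return feats.view(b, t, self.out_dim)    # (batch, time, 512)

# Example: a batch of 2 clips, 16 frames each, 224x224 RGB
extractor = FrameFeatureExtractor()
clip = torch.randn(2, 16, 3, 224, 224)
print(extractor(clip).shape)  # torch.Size([2, 16, 512])
```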
As shown in fig. 3, which presents the LSTM model architecture of the present invention, the LSTM (long short-term memory) model is a variant of the recurrent neural network (RNN) particularly suited to processing time-series data. The LSTM model includes an input layer, multiple LSTM layers, and an output layer, and is used to extract time-series features of the video sequence. First, the input layer receives the input video sequence; next, multiple LSTM layers encode and decode the input sequence to extract time-series features; finally, the output layer outputs the result of the LSTM model, namely the time-series feature vector. The encoder, which encodes the video sequence and extracts temporal features, comprises a first LSTM layer, a second LSTM layer, a Bi-LSTM layer, a third LSTM layer, and a fourth LSTM layer connected in sequence. The first LSTM layer receives the input video sequence, learns its temporal dependencies, captures short-term dependencies of the input sequence, and passes them to the next layer. The second LSTM layer further learns the long-term dependencies of the input sequence, memorizing information over longer time intervals and passing it on to the next layer. The Bi-LSTM (bidirectional LSTM) layer captures more comprehensive context information by processing the input sequence in both directions: its forward LSTM processes the sequence in order, while its reverse LSTM processes it in reverse order. The third LSTM layer further extracts temporal dependencies in the input sequence and captures more abstract features. The fourth LSTM layer encodes the input sequence at a higher level of abstraction and generates the final time-series feature representation. The decoder consists of a Dense layer, i.e., a fully connected layer, which maps the time-series features of the encoder to the desired output space and predicts the saliency detection result. The LSTM model output Y in fig. 3 refers to the result generated directly by the LSTM model during training, without the optimization provided by the pseudo tags and the retrained model.
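A minimal PyTorch sketch of this encoder-decoder stack follows; the hidden sizes, the per-frame scalar output, and the sigmoid activation are assumptions for illustration, while the layer ordering (LSTM, LSTM, Bi-LSTM, LSTM, LSTM, Dense) follows the description of fig. 3.

```python
import torch
import torch.nn as nn

class SaliencyLSTM(nn.Module):
    """Encoder of stacked LSTM layers (with one bidirectional layer)
    followed by a Dense decoder, following the layer order of Fig. 3.
    Hidden sizes and the output dimension are illustrative assumptions."""
    def __init__(self, in_dim=512, hidden=256, out_dim=1):
        super().__init__()
        self.lstm1 = nn.LSTM(in_dim, hidden, batch_first=True)      # short-term dependencies
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)      # long-term dependencies
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True,
                              bidirectional=True)                   # forward + reverse context
        self.lstm3 = nn.LSTM(2 * hidden, hidden, batch_first=True)  # more abstract features
        self.lstm4 = nn.LSTM(hidden, hidden, batch_first=True)      # final temporal encoding
        self.decoder = nn.Linear(hidden, out_dim)                   # Dense (fully connected) layer

    def forward(self, x):
        # x: (batch, time, in_dim) per-frame CNN features
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        x, _ = self.bilstm(x)
        x, _ = self.lstm3(x)
        x, _ = self.lstm4(x)
        return torch.sigmoid(self.decoder(x))  # per-frame saliency score in [0, 1]

model = SaliencyLSTM()
feats = torch.randn(2, 16, 512)  # 2 clips, 16 frames, 512-d features per frame
print(model(feats).shape)        # torch.Size([2, 16, 1])
```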
S2: generating a pseudo tag according to the similarity between adjacent frames in a video sequence by using a pseudo tag generation algorithm based on time sequence characteristics (LSTM), and putting the generated pseudo tag and a real tag together to be used as a training data set for training an LSTM model.
The similarity between the feature vectors of two adjacent frames is calculated, and a pseudo tag is generated from the value of this similarity for subsequent training. For each pair of adjacent frames in a video sequence of the dataset (two frames that are immediately adjacent in time; the difference or change between them can serve as part of the time-series features), the feature vectors of the adjacent frames are extracted using the image feature representations obtained by the convolutional neural network (CNN), fed into a similarity calculation module, and a similarity score between the two feature vectors is computed. Each frame (a video frame in the dataset) is then marked 1 or 0 according to its similarity score with its adjacent frame: frames with a high similarity score are marked 1 and frames with a low similarity score are marked 0, so that the LSTM model focuses more on learning from samples with high similarity scores. Using the marked frames, pseudo tags are generated (a frame marked 1 is used as a pseudo tag; a frame marked 0 is filtered out). The detection result of the previous frame is projected into the current frame to generate a group of candidate pseudo tags (candidate pseudo tags are pseudo tags that have not yet been screened), and the best pseudo tag is selected by computing the similarity scores of the two frames. The LSTM network model is then trained with the selected pseudo tags used as additional training data alongside the real tags (the labels actually annotated on the samples). Using pseudo tags increases the diversity of the training data and improves the generalization capability and detection accuracy of the LSTM model; by learning from real tags and pseudo tags simultaneously, the model improves its learning on samples with high similarity. MSE (mean squared error) is one of the simplest similarity measures and can be used to compute the similarity between two consecutive frames. The MSE formula measures the average degree of difference between the predicted value and the true value, i.e., the mean of the squared differences between them. The calculation formula is:
MSE = (1/n) * Σ(actual - prediction)²
where Σ is the summation symbol; n is the total number of pixels, i.e., the number of pixels compared between the two frames; actual is the pixel gray value in the previous frame, regarded as the actual data value; prediction is the pixel gray value in the subsequent frame, regarded as the predicted data value. The smaller the computed MSE value, the higher the similarity between the two frames, i.e., the smaller the difference between them.
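For illustration, a minimal sketch of the adjacent-frame MSE score and the resulting 1/0 marking is given below; the similarity threshold and the handling of the first frame are assumptions, since the text does not specify them.

```python
import numpy as np

def frame_mse(prev_frame: np.ndarray, next_frame: np.ndarray) -> float:
    """MSE = (1/n) * sum((actual - prediction)^2) over pixel gray values."""
    diff = prev_frame.astype(np.float64) - next_frame.astype(np.float64)
    return float(np.mean(diff ** 2))

def mark_frames(frames, mse_threshold=100.0):
    """Mark each frame 1 (high similarity with its previous frame, candidate
    pseudo tag) or 0 (low similarity, filtered out). Threshold is illustrative."""
    marks = [1]  # the first frame has no predecessor; kept by convention (assumption)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        marks.append(1 if frame_mse(prev, nxt) < mse_threshold else 0)
    return marks

# Example with random 8-bit grayscale frames
frames = [np.random.randint(0, 256, (64, 64), dtype=np.uint8) for _ in range(5)]
print(mark_frames(frames))
```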
S3: and scoring the samples with the noise labels according to the confidence coefficient of the samples and the training progress by using a confidence coefficient perception significance extraction scheme, selecting samples with high confidence coefficient, and continuing training the LSTM model.
To address the problem that the saliency knowledge of difficult samples hidden in noise is not fully mined, a confidence-aware saliency extraction scheme is provided, which scores samples carrying noisy labels according to their confidence and the training progress. For each sample in the training dataset composed of pseudo tags and real tags, a confidence score is calculated using information about the label noise. Samples with noisy labels are scored according to the confidence score and the training progress; based on these scores, high-scoring samples can be selected for further analysis, feature extraction, or model updating, so that difficult samples are better understood and processed and their saliency knowledge is better mined. By tracking the training progress of the LSTM model, the current number of training rounds or training samples can be obtained. Video frame noise refers to unnecessary or redundant interference information present in video frame data; when noisy video frames are annotated, the resulting samples carry noisy labels. Unlike the pseudo tags in step S2, the sample scoring in this scheme is applied to samples with noisy labels in order to evaluate their importance based on confidence and training progress. Combining the two allows the LSTM model to be trained better and improves the detection accuracy and the ability to mine difficult samples.
The confidence calculation formula is a statistical method for measuring the credibility of a hypothesis or inference; it was proposed by Karl Pearson and treats probability as a statistical model from the viewpoint of mathematical statistics. The confidence formula helps determine the confidence level of a hypothesis or inference, i.e., the likelihood that it is accepted. In the invention, the confidence score is computed as:
C = p(1 - p)/m
where C denotes the confidence score, p denotes the sample value carrying the noisy label, i.e., the confidence value, ranging from 0 to 1, and m denotes the total number of samples. From the noisy-labeled sample value and the total number of samples, this formula yields a confidence score indicating whether the sample is usable; low-scoring samples are filtered out, and high-scoring samples continue to be used for training.
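A minimal sketch of this confidence scoring and filtering step follows; the cut-off separating high- and low-scoring samples is an assumption for illustration, since no specific threshold is given.

```python
def confidence_score(p: float, m: int) -> float:
    """C = p * (1 - p) / m, with p in [0, 1] the noisy-label sample value
    and m the total number of samples."""
    return p * (1.0 - p) / m

def select_high_confidence(samples, threshold):
    """Keep only samples whose confidence score reaches the threshold.
    `samples` is a list of (sample, p) pairs; the threshold is illustrative."""
    m = len(samples)
    scored = [(s, confidence_score(p, m)) for s, p in samples]
    return [s for s, c in scored if c >= threshold]

# Example: four samples with noisy-label values p
samples = [("frame_a", 0.9), ("frame_b", 0.5), ("frame_c", 0.1), ("frame_d", 0.7)]
print(select_high_confidence(samples, threshold=0.05))
```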
S4: a scoring mechanism is employed to guide the LSTM model to progressively extract significance knowledge from simple to difficult.
The foregoing addresses the case of simple samples, but real samples are not all simple: some video frames contain complex scenes with multiple foreground objects and complex backgrounds, and these are the difficult samples. The role of step S4 is to extract saliency knowledge from simple to difficult samples, mainly solving the problem of extracting the saliency knowledge of difficult samples.
A factor ρ is introduced to dynamically adjust the gradient of each sample. As training proceeds, ρ increases linearly from 0 to 1, gradually guiding the LSTM model to extract saliency knowledge from simple to difficult samples. During training, ρ is set to 1 for samples with a high confidence score, so that the LSTM model focuses more on learning from these high-confidence samples while being guided from simple to difficult saliency knowledge. Samples with high confidence scores often correspond to salient regions that the model has already largely determined, and placing more weight on them helps the LSTM model learn and capture salient features better.
The loss L_csd of the invention can be expressed as:
where n is the number of pixels and Φ(x_i) is the saliency prediction value of pixel x_i; for a newly input pixel x_i, the corresponding saliency prediction value Φ(x_i) is obtained by feeding it into the trained LSTM model.
The partial derivative ∂L_csd/∂Φ(x_i) of the invention is:
where the sign function sign(k_i) ∈ {-1, 1} takes a negative or positive value respectively, k_i is the index value, and sign(k_i) determines the sign of the output according to the index value k_i.
The partial derivatives are used to compute the gradient of the loss function with respect to the model parameters, which drives the update step of the gradient descent optimization algorithm. By computing the partial derivatives, one can determine which parameter changes contribute to reducing the loss function and adjust the model parameters accordingly to minimize it; the partial derivatives thus serve as the optimization signal during training. As shown in fig. 4, the loss L_csd of the invention changes as training proceeds and serves as a reference for distinguishing simple from difficult samples, selecting appropriate samples according to the training situation. Specifically, at the beginning, the loss L_csd assigns low gradients to the difficult samples of the dataset, so that reliable saliency knowledge is first learned from simple samples. As training progresses, the gradient assigned to difficult samples increases, so that more valuable saliency knowledge can be mined.
S5: and (3) obtaining a new training data set containing part of useful labeling information by using the new pseudo tag, and repeatedly training the LSTM model by using the new training data set to obtain the saliency detection model.
The dataset is reconstructed using the new pseudo tags generated in step S2: the generated pseudo tags are combined with the annotated frames of the video sequence to form a new training dataset, i.e., a dataset containing part of the useful annotation information that is used for subsequent training. Specifically, the useful pseudo tags retained after the confidence scoring of step S3 are combined with the annotated video frames to form a new and more accurate training dataset. Putting the more accurate, high-confidence pseudo tags together with the annotated video frames improves the detection accuracy of subsequent training.
S6: and inputting the new training data set to be detected into the saliency detection model to obtain a saliency detection result of each frame of image in the new training data set.
Each frame of the new training dataset to be detected is input into the trained LSTM model, and the saliency detection model yields the saliency detection result of each frame, namely a saliency score for each pixel. As shown in figs. 2 and 3, the saliency detection result of each frame is obtained through repeated training of the LSTM model.
S7: generating a new pseudo tag according to the significance detection result for training of the next round; steps S5 and S6 are performed in a loop until the LSTM model is trained sufficiently and stably.
For each frame of each video sequence, a new pseudo tag is generated from the resulting saliency detection result and used for the next round of training, and steps S5 to S6 are executed in a loop.
Experiments were carried out on the DAVIS dataset to evaluate the effect of the proposed pseudo-tag-generated video saliency detection method based on time-series features, and comparison experiments with other methods verified its effectiveness. On the DAVIS dataset, cross-validation is used to divide the data into a training set and a test set; the proposed method and the comparison methods are trained on the training set and evaluated; common indices such as F-Measure and the PR curve are used to measure performance; the experiment is repeated several times and the average is taken as the final evaluation result. F-Measure combines Precision and Recall and therefore jointly accounts for the accuracy and the recall rate of the detection task; its evaluation result yields a comprehensive performance index measuring the overall performance of the method across categories. The PR curve is an important index for evaluating the performance of machine learning models and reflects the accuracy and recall rate of a model more comprehensively: by computing precision and recall at different thresholds and plotting the PR curve, the performance of the model can be understood more intuitively. When using PR curves, particular attention is paid to the area under the curve and to the precision value when recall equals 1, with reasonable adjustments made according to the requirements of the method proposed in the invention. The evaluation results show that the proposed method effectively improves the accuracy and stability of video saliency detection.
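For reference, a minimal sketch of the F-Measure computation from a binarized saliency map is given below; the binarization threshold and the β² = 0.3 weighting (a common convention in saliency evaluation) are assumptions, as the text does not specify them.

```python
import numpy as np

def f_measure(pred_map, gt_mask, threshold=0.5, beta2=0.3):
    """Weighted F-Measure between a predicted saliency map in [0, 1]
    and a binary ground-truth mask. beta2 = 0.3 emphasizes precision,
    a common convention in saliency evaluation (assumed here)."""
    pred = (pred_map >= threshold)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / (pred.sum() + 1e-8)
    recall = tp / (gt.sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)

# Example with a random prediction against a random mask
pred_map = np.random.rand(64, 64)
gt_mask = (np.random.rand(64, 64) > 0.7).astype(np.uint8)
print(round(float(f_measure(pred_map, gt_mask)), 4))
```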
By exploiting the characteristics of the LSTM model, the invention effectively utilizes the time-series information in the video sequence to generate accurate pseudo tags, and by introducing the pseudo tags and confidence-aware saliency extraction, the detection accuracy on difficult samples is improved, so that saliency extraction is carried out better. The LSTM model is a recurrent neural network capable of efficiently processing sequence data: by feeding successive frames of a video sequence into the LSTM model, the temporal dependencies between frames can be learned. Owing to its unique design, the LSTM is suited to processing and predicting important events in a time series with very long intervals and delays. Within the LSTM model, time-series features of the video sequence, including inter-frame differences and motion vectors, can be extracted, and by learning them the model can capture dynamic changes and correlations between frames. Meanwhile, the invention requires only a small amount of manually annotated saliency labels, which reduces the cost of manual annotation and gives the method good practicability and value for wider adoption. The LSTM model, as a recurrent neural network that captures time-series information, has been widely used in video saliency detection. It effectively captures temporal dependencies in a video sequence, improves detection accuracy, and can generate pseudo tags for training by predicting the saliency information of future frames, thereby improving the generalization capability of the model. The pseudo-tag generation algorithm and the confidence-aware saliency extraction scheme based on time-series features effectively extract saliency information from the video and generate accurate pseudo tags for training. The invention combines the two to extract saliency information comprehensively and effectively, and handles both simple and difficult samples: when difficult samples must be learned starting from simple ones, the confidence-aware saliency extraction scheme copes well. By combining these techniques, the invention improves the accuracy and efficiency of video saliency detection while reducing model complexity and computation, making it more suitable for scenarios with limited hardware resources such as mobile devices. The proposed LSTM network model improves the stability and generalization performance of the model while maintaining detection accuracy.
Compared with traditional manual data annotation, introducing pseudo tags reduces the amount and cost of the annotation data required, reduces noise and subjectivity in the annotations, and improves the generalization performance of the LSTM model. Compared with existing LSTM-based methods, the invention makes full use of the time-series features in the video sequence, combined with the pseudo-tag and confidence-aware saliency extraction schemes, thereby improving detection accuracy and stability. The invention therefore has broad application prospects and economic benefits.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. The method for detecting the video saliency generated by the pseudo tag based on the time sequence features is characterized by comprising the following steps of:
s1: inputting the video sequence in the data set into an LSTM model for encoding and decoding, and extracting the time sequence characteristics of the video sequence;
s2: generating a pseudo tag according to the similarity between adjacent frames in a video sequence by a pseudo tag generation algorithm based on time sequence characteristics, and putting the generated pseudo tag and a real tag together to be used as a training data set for training an LSTM model;
s3: scoring the samples with noise labels according to the confidence coefficient and training progress of the samples by using a confidence coefficient perception significance extraction scheme, and selecting samples with high confidence coefficient to train an LSTM model;
s4: a scoring mechanism is adopted to guide the LSTM model to gradually extract significance knowledge from simple to difficult;
s5: forming a new training data set by using the pseudo tag and the image marked by each frame in the data set, and repeatedly training the LSTM model by using the new training data set to obtain a significance detection model;
s6: inputting the new training data set to be detected into the saliency detection model to obtain a saliency detection result of each frame of image in the new training data set, and obtaining a saliency map of the video sequence to be detected.
2. The method for detecting the saliency of a video generated by a pseudo tag based on time series characteristics according to claim 1, wherein each frame of image in each video sequence is input into a pretrained convolutional neural network, and the convolutional neural network converts each frame of image into a feature vector with fixed dimension to obtain a feature representation of each frame of image; and inputting the characteristic representation of each frame of image into the LSTM model, and extracting the time sequence characteristic.
3. The method of claim 2, wherein the dataset is the DAVIS dataset, which provides pixel-level annotations for each video sequence, including foreground object segmentation and bounding box annotations; during training, processing the video sequences of the DAVIS dataset using data enhancement techniques of random cropping, horizontal flipping, and brightness adjustment;
the convolutional neural network is a classical ResNet network, pre-trained using a DAVIS dataset, the ResNet network is initialized with pre-trained weights, and ResNet network parameters are fine-tuned by a back-propagation and gradient descent algorithm.
4. A method of detecting video saliency based on pseudo tag of time series features as claimed in any one of claims 1 to 3, wherein the LSTM model includes an input layer, an encoder, a decoder and an output layer connected in sequence, the input layer receiving an input video sequence, the encoder encoding the video sequence and extracting time series features, the decoder mapping the time series features of the encoder to a desired output space using a full connection layer, the output layer outputting feature vectors of the time series.
5. The method for detecting the saliency of a video generated by a pseudo tag based on time series characteristics according to claim 4, wherein the encoder comprises a plurality of LSTM layers and Bi-LSTM layers connected in sequence, the LSTM layers capturing short-term dependencies of the video sequence by learning time dependencies; the Bi-LSTM layer captures more comprehensive context information by processing both forward and reverse video sequences.
6. The method for detecting the saliency of a video generated by pseudo tags based on time series features according to claim 5, wherein the encoder comprises a first LSTM layer, a second LSTM layer, a Bi-LSTM layer, a third LSTM layer and a fourth LSTM layer connected in sequence, the first LSTM layer receiving an input video sequence and learning its time dependency, capturing short-term dependency of the video sequence and passing it to the next layer; the second LSTM layer further learns the long-term dependence of the input video sequence, memorizes information in a longer time interval and transmits the information to the next layer; the forward LSTM layer of the Bi-LSTM layer processes the order of the input sequence, and the reverse LSTM layer of the Bi-LSTM layer processes the reverse order of the input sequence; the third LSTM layer further extracts time dependency relationships in the input sequence and captures more abstract features; the fourth LSTM layer encodes the input sequence at a higher level of abstraction and generates the final time series feature representation.
7. The method for detecting the video saliency generated by pseudo tags based on time series features according to any one of claims 1 to 3, 5 and 6, wherein the method for generating the pseudo tags is as follows: calculating similarity scores between feature vectors of the front and rear adjacent frame images, and generating pseudo tags according to the values of the similarity scores;
the frame images with high similarity scores are marked as 1 and used as pseudo labels, and the frame images with low similarity scores are marked as 0 and filtered;
projecting the pseudo-tag detection result of the previous frame image into the current frame image to generate a group of candidate pseudo tags, and selecting the best pseudo tag by computing the similarity scores of the two frames;
the calculation method of the similarity score comprises the following steps: average of squares of differences between predicted and real values:
MSE = (1/n) * Σ(actual - prediction)²
where Σ is the summation symbol; n is the total number of pixels, i.e., the number of pixels compared between the two frames; actual is the pixel gray value in the previous frame, regarded as the actual data value; prediction is the pixel gray value in the subsequent frame, regarded as the predicted data value.
8. The method for detecting the video saliency of pseudo tag generation based on time series features according to claim 7, wherein the method for selecting samples with high confidence is as follows:
for each sample in the training data set consisting of the pseudo tag and the real tag, calculating a confidence score by using information of tag noise; the confidence coefficient calculation formula is:
C = p(1 - p)/m
where C represents the confidence score, p represents the sample value carrying the noisy label, i.e., the confidence value, ranging from 0 to 1, and m represents the total number of samples.
9. The method for detecting the saliency of a generated video based on pseudo tags of time series characteristics according to claim 8, wherein the implementation method of step S4 is as follows: introducing a factor ρ to dynamically adjust the gradient of the sample; as training proceeds, the factor ρ is linearly increased from 0 to 1; during training, for samples with high confidence scores, the factor ρ is set to 1;
the loss function L_csd is:
where n is the number of pixels and Φ(x_i) is the saliency prediction value of pixel x_i;
the partial derivative ∂L_csd/∂Φ(x_i) is:
where the sign function sign(k_i) ∈ {-1, 1} takes a negative or positive value respectively, k_i is the index value, and sign(k_i) determines the sign of the output according to the index value k_i.
10. The method for detecting the saliency of a generated video of a pseudo tag based on time series characteristics according to any one of claims 1 to 3, 5, 6, 8, 9, further comprising step S7: generating a new pseudo tag according to the significance detection result, and circularly executing the step S5 and the step S6 until the LSTM model is sufficiently and stably trained; the saliency detection result of each frame image is a saliency score of each pixel.
CN202311185586.9A 2023-09-14 2023-09-14 Pseudo tag generated video significance detection method based on time sequence features Pending CN117237842A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311185586.9A CN117237842A (en) 2023-09-14 2023-09-14 Pseudo tag generated video significance detection method based on time sequence features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311185586.9A CN117237842A (en) 2023-09-14 2023-09-14 Pseudo tag generated video significance detection method based on time sequence features

Publications (1)

Publication Number Publication Date
CN117237842A true CN117237842A (en) 2023-12-15

Family

ID=89097833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311185586.9A Pending CN117237842A (en) 2023-09-14 2023-09-14 Pseudo tag generated video significance detection method based on time sequence features

Country Status (1)

Country Link
CN (1) CN117237842A (en)


Legal Events

Date Code Title Description
PB01 Publication