CN110334718A - Two-dimensional video saliency detection method based on long short-term memory - Google Patents

Two-dimensional video saliency detection method based on long short-term memory Download PDF

Info

Publication number
CN110334718A
CN110334718A (application CN201910614888.0A)
Authority
CN
China
Prior art keywords
two-dimensional video
long short-term memory
temporal features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910614888.0A
Other languages
Chinese (zh)
Inventor
方玉明
黄汉秦
乐晨阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201910614888.0A
Publication of CN110334718A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The present invention relates to a two-dimensional video saliency detection method based on long short-term memory, characterized in that: short-term temporal features are first extracted with a 3D convolutional network (3D-ConvNet); long-term temporal features are then extracted with a bidirectional convolutional long short-term memory network (B-ConvLSTM); the extracted short-term and long-term temporal features are fused; finally, the fused result is deconvolved to obtain the saliency map. By combining long-term and short-term temporal features, the model effectively preserves the motion information of salient targets in the video, and two-dimensional video saliency prediction experiments show that the proposed model achieves good detection performance.

Description

Two-dimensional video saliency detection method based on long short-term memory
Technical field
The present invention relates to a visual attention method for detecting saliency in two-dimensional video. It belongs to the field of multimedia technology, in particular to digital image and digital video processing, and specifically concerns a two-dimensional video saliency detection method based on long short-term memory.
Background art
Visual attention is an important mechanism in visual perception: it rapidly detects salient information in natural images. When observing a natural scene, selective attention allows the visual system to concentrate its limited processing resources on specific salient information while ignoring other, less important content. Visual attention methods can be broadly divided into two kinds: bottom-up and top-down. Bottom-up processing is data-driven, task-independent, automatic salient-region detection, while top-down methods involve cognitive processes tied to specific tasks. Saliency detection models aim to predict the important regions of a visual scene that people attend to during scene viewing. Many saliency prediction methods have been designed for various visual tasks, such as image segmentation, object detection, video summarization, and video compression. However, most existing saliency detection models are designed for static images, and research on video saliency detection remains limited because extracting complex motion features is difficult.
Traditional saliency detection models mainly start from two aspects: low-level features, such as luminance, color, texture, and contrast; and semantic information, such as faces, people, and text. However, these hand-crafted methods cannot comprehensively account for all relevant factors, and manual feature extraction is also time-consuming. Early video saliency detection models were designed by simply extending static saliency detection models with additional temporal information. Because motion features were extracted by means of optical flow, which has high computational complexity, the applicability of these methods was limited. Such video saliency detection models mainly follow two steps: (1) extract spatial and temporal information from the video sequence to compute a spatial saliency map and a temporal saliency map separately; (2) combine the spatial and temporal saliency maps through a fusion strategy to compute the final spatiotemporal saliency map.
Itti et al. described an early method for extracting still-image saliency, which predicts saliency using a multi-scale center-surround mechanism over intensity, color, and orientation. Li et al. designed a bottom-up saliency method that uses regional features and image detail, and optimizes image boundaries to provide more accurate saliency results. Sun et al. built a saliency detection model using Markov absorption probabilities, incorporating image boundary information in addition to common low-level features such as luminance, texture, color, and orientation. Zhu et al. proposed the concept of boundary connectivity, which describes the spatial layout of image regions relative to the image boundary. Compared with still-image saliency, video saliency prediction must take the motion information in the video sequence into account. Mahadevan et al. designed a spatiotemporal saliency detection method by combining perceptual motion grouping with a center-surround mechanism. Liu et al. proposed a superpixel-based spatiotemporal saliency prediction model. Leboran et al. assumed that perceptual features can be represented by higher-order statistics and designed a saliency detection model accordingly. Fang et al. proposed a video saliency method that fuses spatial saliency and temporal saliency through uncertainty weighting.
With the rapid development of deep learning, many studies have built still-image saliency detection models using deep learning, and these models have proven effective at extracting salient regions. Compared with still-image saliency, video saliency prediction is more challenging because of the complex temporal information involved. Tran et al. introduced 3D ConvNets by using three-dimensional convolutional neural networks (3D-CNN) to learn the spatiotemporal features of video sequences. However, most deep-learning-based video saliency studies do not consider long-term temporal information. To overcome these drawbacks, the present invention designs a new deep video saliency detection network (DevsNet) that performs spatiotemporal saliency prediction with a new 3D convolutional network (3D-ConvNet) and a bidirectional convolutional long short-term memory network (B-ConvLSTM). The steps are as follows: first, a 3D convolutional network (3D-ConvNet) is built to extract short-term temporal features; second, a bidirectional long short-term memory network (B-ConvLSTM) is used to obtain long-term temporal features; the final saliency map is then predicted by combining the short-term and long-term temporal features. In short, the main contributions of the proposed method are: (1) a new video saliency detection model that uses a 3D convolutional network (3D-ConvNet) and a bidirectional long short-term memory network (B-ConvLSTM) to extract short-term and long-term temporal features respectively; combining short-term with long-term temporal features improves video saliency prediction performance. (2) A new two-layer bidirectional long short-term memory network (B-ConvLSTM) structure for extracting the long-term temporal features used in video saliency detection; the proposed B-ConvLSTM can extract temporal information not only from previous video frames but also from subsequent video frames, which means the proposed network considers forward and backward temporal characteristics simultaneously.
Although many visual attention models have been proposed, as noted above, their application to two-dimensional video remains limited. New methods are therefore needed in this field to improve two-dimensional video saliency detection performance.
Summary of the invention
To overcome the current limitations of visual attention research on two-dimensional video, the present invention proposes a new visual attention method for two-dimensional video. The extracted features include short-term temporal features and long-term temporal features, obtained respectively with a 3D convolutional network (3D-ConvNet) and a bidirectional long short-term memory network (B-ConvLSTM); the short-term and long-term temporal features are then fused and deconvolved to obtain the final saliency map.
The present invention relates to a two-dimensional video saliency detection method based on long short-term memory, characterized in that: short-term temporal features are first extracted with a 3D convolutional network (3D-ConvNet); long-term temporal features are then extracted with a bidirectional long short-term memory network (B-ConvLSTM); the extracted short-term and long-term temporal features are fused; finally, the fused result is deconvolved to obtain the saliency map. By combining long-term and short-term temporal features, the model effectively preserves the motion information of salient targets in the video, and two-dimensional video saliency prediction experiments show that the proposed model achieves good detection performance.
The specific operation of each part of the present invention is as follows:
Extraction of short-term temporal features:
A 3D convolutional network (3D-ConvNet) is designed to extract the short-term temporal features of the video. The kernel size of every 3D convolutional layer is 3*3*3, and the stride and padding of each Conv3D layer are 1*1*1. The numbers of convolution kernels used in the three layers are 16, 32, and 64; the number of kernels determines the number of output channels. Each pooling layer reduces the resolution and the temporal dimension of the video frames to half of their original size, which means the 3D convolutional network learns only local short-term spatiotemporal features, as mentioned before. Batch normalization (BN) is also applied to accelerate deep network training by reducing internal covariate shift.
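For illustration, the following is a minimal PyTorch sketch of such a 3D-ConvNet. The 3*3*3 kernels, 1*1*1 stride and padding, the 16/32/64 channel counts, BN, and the halving pooling follow the description above; the class name, input size, and ReLU activation are assumptions, since the description does not fix them.

```python
import torch
import torch.nn as nn

class ShortTermNet3D(nn.Module):
    """Sketch of the 3D-ConvNet for short-term spatiotemporal features.

    Kernel size 3x3x3, stride/padding 1x1x1, and channel counts 16/32/64
    follow the text; the pooling halves the temporal and spatial sizes
    after each conv block, matching "reduced to half" in the description.
    """
    def __init__(self, in_channels: int = 3):
        super().__init__()
        layers, prev = [], in_channels
        for ch in (16, 32, 64):
            layers += [
                nn.Conv3d(prev, ch, kernel_size=3, stride=1, padding=1),
                nn.BatchNorm3d(ch),           # BN reduces internal covariate shift
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=2),  # halves T, H, and W
            ]
            prev = ch
        self.features = nn.Sequential(*layers)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, time, height, width)
        return self.features(clip)

# Example: a 16-frame RGB clip at 112x112 -> features of shape (1, 64, 2, 14, 14)
feat = ShortTermNet3D()(torch.randn(1, 3, 16, 112, 112))
```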
Extraction of long-term temporal features:
The fully connected LSTM (FC-LSTM) is extended to convolutional layers, yielding the proposed convolutional long short-term memory network (ConvLSTM), which is used to capture effective spatiotemporal information; its internal structure handles the input-to-state and state-to-state transitions. In the present invention, long-term temporal features are extracted by using a series of video frames as the network input; these features are obtained from the original video frames with pre-trained VGG16 parameters.
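Since the description states that the ConvLSTM inputs are obtained from the original frames with pre-trained VGG16 parameters, the following sketch shows one way to compute such frame-level features with torchvision; truncating VGG16 after its convolutional part, and the input size, are assumptions for illustration.

```python
import torch
from torchvision import models

# Pre-trained VGG16; keep only the convolutional part as a frame encoder.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
encoder = vgg.features.eval()  # truncation point is an assumption

@torch.no_grad()
def frame_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, H, W) ImageNet-normalized frames -> (T, 512, H/32, W/32)."""
    return encoder(frames)

feats = frame_features(torch.randn(8, 3, 224, 224))  # -> (8, 512, 7, 7)
```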
Advantages and technical effects of the invention:
The proposed algorithm is reasonable and efficient: it is a novel method that combines the short-term and long-term temporal features of two-dimensional video, extracting them respectively with a 3D convolutional network (3D-ConvNet) and a bidirectional long short-term memory network (B-ConvLSTM). Combining short-term with long-term temporal features improves the performance of video saliency prediction. The present invention is highly robust, its evaluation results surpass those of the current best algorithms, and it has strong scalability.
To achieve the above goals, the technical solution adopted by the present invention is as follows:
A two-dimensional video saliency detection method based on long short-term memory, characterized by comprising the following steps:
Step 1: extract the short-term temporal features in the two-dimensional video frames, using a 3D convolutional neural network (3D-ConvNet);
Step 2: extract the long-term temporal features in the two-dimensional video frames, using a bidirectional long short-term memory network (B-ConvLSTM);
Step 3: fuse the extracted short-term and long-term temporal features, and deconvolve the result to obtain a saliency map with the same resolution as the input video frames (a sketch of this step follows below).
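As referenced in step 3 above, the following is a minimal PyTorch sketch of one plausible fusion-and-deconvolution stage. The concatenation-based fusion, the channel sizes, and the total upsampling factor are assumptions; the description fixes only that the fused features are deconvolved to a saliency map at the input-frame resolution.

```python
import math
import torch
import torch.nn as nn

class FuseAndUpsample(nn.Module):
    """Sketch of step 3: concatenate short- and long-term feature maps,
    fuse them with a convolution, and deconvolve (transposed convolution)
    back to the resolution of the input video frame."""
    def __init__(self, short_ch: int = 64, long_ch: int = 64, scale: int = 8):
        super().__init__()
        self.fuse = nn.Conv2d(short_ch + long_ch, 64, kernel_size=3, padding=1)
        ups, ch = [], 64
        for _ in range(int(math.log2(scale))):  # one 2x deconvolution per factor of 2
            ups += [nn.ConvTranspose2d(ch, ch // 2, kernel_size=4, stride=2, padding=1),
                    nn.ReLU(inplace=True)]
            ch //= 2
        ups += [nn.Conv2d(ch, 1, kernel_size=1), nn.Sigmoid()]
        self.up = nn.Sequential(*ups)

    def forward(self, f_short, f_long):
        x = torch.relu(self.fuse(torch.cat([f_short, f_long], dim=1)))
        return self.up(x)  # (batch, 1, H, W) saliency map with values in [0, 1]

# Example: 14x14 feature maps upsampled by 8 -> a 112x112 saliency map
sal = FuseAndUpsample()(torch.randn(1, 64, 14, 14), torch.randn(1, 64, 14, 14))
```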
Further, the short-term temporal features in the two-dimensional video frames described in step 1 include motion information.
Further, the 3D convolutional network is calculated according to formula (1):
h_n = σ(Σ W_n * h_(n-1) + b_n)   (1)
where W_n denotes the 3D convolution kernel parameters applied to the (n-1)-th hidden layer; h_(n-1) denotes the (n-1)-th hidden layer; b_n denotes the corresponding bias term; the operator '*' denotes the convolution operation; and σ denotes the activation function.
Further, batch normalization (the BN algorithm) is used to accelerate network training; it is calculated by formula (2):
x̂^(k) = (x^(k) - E(x^(k))) / √Var(x^(k))   (2)
where E(x^(k)) and Var(x^(k)) denote the expectation and variance of the batch data x^(k) respectively, and x̂^(k) denotes the standardized result.
Further, batch normalization (the BN algorithm) may change the distribution of the original data; to overcome this problem, formula (3) is used for adjustment:
y^(k) = γ^(k) · x̂^(k) + β^(k)   (3)
where γ^(k) and β^(k) are the corresponding adjustment parameters; x̂^(k) denotes the standardized (BN) result; and y^(k) denotes the adjusted result.
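A small PyTorch sketch of formulas (2)-(3) for a batch of feature vectors follows; the feature dimension in the example is arbitrary, and the eps guard is a standard numerical safeguard not shown in the formulas above.

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Sketch of formulas (2)-(3): standardize each feature over the batch,
    then restore representational freedom with the parameters gamma and beta."""
    mean = x.mean(dim=0)                        # E(x^(k)), per feature k
    var = x.var(dim=0, unbiased=False)          # Var(x^(k)), per feature k
    x_hat = (x - mean) / torch.sqrt(var + eps)  # formula (2)
    return gamma * x_hat + beta                 # formula (3)

x = torch.randn(32, 64)                         # batch of 32 samples, 64 features
y = batch_norm(x, gamma=torch.ones(64), beta=torch.zeros(64))
```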
Further, the long-term temporal features in the two-dimensional video frames described in step 2 include motion information.
Further, taking one ConvLSTM layer as an example, each ConvLSTM unit is calculated by formulas (4)-(8):
i_t = σ(W_xi * X_t + W_hi * H_(t-1) + b_i)   (4)
f_t = σ(W_xf * X_t + W_hf * H_(t-1) + b_f)   (5)
o_t = σ(W_xo * X_t + W_ho * H_(t-1) + b_o)   (6)
C_t = f_t ∘ C_(t-1) + i_t ∘ tanh(W_xc * X_t + W_hc * H_(t-1) + b_c)   (7)
H_t = o_t ∘ tanh(C_t)   (8)
where σ and tanh denote the sigmoid and hyperbolic tangent activation functions respectively; W_xi, W_hi, W_xf, W_hf, W_xo, W_ho, W_xc and W_hc are the convolution kernel parameters of the corresponding layers, and b_i, b_f, b_o and b_c are the corresponding bias terms; i_t, f_t and o_t denote the input gate, forget gate and output gate of the t-th video frame respectively; C_t and H_t are the memory cell and the hidden state; '*' denotes the convolution operation; and '∘' denotes the Hadamard (element-wise) product.
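A compact PyTorch sketch of one such ConvLSTM unit follows. Computing the four gate pre-activations with a single convolution over the concatenated input and hidden state is an implementation convenience that is mathematically equivalent to formulas (4)-(8); the kernel size and class name are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Sketch of one ConvLSTM unit, formulas (4)-(8): the dense products of
    FC-LSTM are replaced by convolutions ('*'); '∘' is the Hadamard product."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        # One convolution produces all four gate pre-activations (i, f, o, g)
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x_t, h_prev, c_prev):
        z = self.gates(torch.cat([x_t, h_prev], dim=1))
        i, f, o, g = torch.chunk(z, 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # (4)-(6)
        c_t = f * c_prev + i * torch.tanh(g)                            # (7)
        h_t = o * torch.tanh(c_t)                                       # (8)
        return h_t, c_t

# Example: one time step on 64-channel frame features
cell = ConvLSTMCell(in_ch=64, hid_ch=64)
h = c = torch.zeros(1, 64, 14, 14)
h, c = cell(torch.randn(1, 64, 14, 14), h, c)
```

Running such a cell over the frame sequence once forward and once backward and combining the two hidden-state streams yields the bidirectional B-ConvLSTM described above.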
Further, the loss function used in training is calculated by formulas (9)-(10), where y_i denotes the i-th label map in the training dataset, with y_i ∈ (y_1, y_2, ..., y_N); N denotes the total number of images in the training dataset; y'_i denotes the i-th saliency map computed by the model; δ denotes the parameters of the whole network; and δ' denotes the parameters after Adam optimization.
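The bodies of formulas (9)-(10) are not reproduced in this text. One common choice consistent with the definitions above (a per-pixel comparison of the predicted saliency maps y'_i against the label maps y_i, averaged over the N training images) is binary cross-entropy; the sketch below implements that assumption and should not be read as the patent's exact loss.

```python
import torch

def saliency_loss(pred, label, eps=1e-7):
    """Assumed per-pixel binary cross-entropy, averaged over a batch of
    predicted saliency maps y'_i and label maps y_i (values in [0, 1])."""
    pred = pred.clamp(eps, 1 - eps)  # numerical stability
    bce = -(label * torch.log(pred) + (1 - label) * torch.log(1 - pred))
    return bce.mean()
```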
Further, the Adam optimizer of each sub-network can be expressed by formulas (11)-(14):
m_t = β_1 · m_(t-1) + (1 - β_1) · g_t   (11)
v_t = β_2 · v_(t-1) + (1 - β_2) · g_t²   (12)
m̂_t = m_t / (1 - β_1^t),  v̂_t = v_t / (1 - β_2^t)   (13)
W_(t+1) = W_t - η · m̂_t / (√v̂_t + ∈)   (14)
where m_t and v_t are the first-order and second-order momentum terms respectively; β_1 and β_2 are the corresponding decay rates, usually taken as 0.9 and 0.999; m̂_t and v̂_t are the corresponding bias-corrected values; W_t denotes the model parameters at time step t, i.e. at the t-th iteration; g_t = ∇J(W_t) denotes the gradient of the cost function with respect to W at the t-th iteration; η denotes the learning rate; and ∈ is a very small number (typically 1e-8) used to avoid a zero denominator.
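A plain NumPy sketch of one Adam update step implementing formulas (11)-(14) follows; the learning rate value is an assumption, since it is not given above.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, formulas (11)-(14).
    w: parameters; g: gradient at step t (t starts at 1);
    m, v: running first- and second-order momentum terms."""
    m = beta1 * m + (1 - beta1) * g              # (11) first-order momentum
    v = beta2 * v + (1 - beta2) * g * g          # (12) second-order momentum
    m_hat = m / (1 - beta1 ** t)                 # (13) bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # (14) parameter update
    return w, m, v
```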
Description of the drawings
Fig. 1 is the algorithm framework diagram of the present invention.
Fig. 2 shows example comparisons of different saliency detection models.
Specific embodiment
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative labor shall fall within the protection scope of the present invention.
The technical features, abbreviations, and symbols involved herein are to be interpreted according to the knowledge and common understanding of those skilled in the art and the definitions and explanations given herein.
The process of the present invention is shown in Fig. 1; the detailed procedure is as follows:
The present invention designs a 3D convolutional network (3D-ConvNet) to extract the short-term temporal features of the video. The kernel size of every 3D convolutional layer is 3*3*3, and the stride and padding of each Conv3D layer are 1*1*1; the numbers of convolution kernels used in the three layers are 16, 32, and 64, where the number of kernels determines the number of output channels. Each pooling layer reduces the resolution and the temporal dimension of the video frames to half of their original size, which means the 3D convolutional network learns only local short-term spatiotemporal features, as previously mentioned. Batch normalization (BN) is also applied here to accelerate deep network training by reducing internal covariate shift. The fully connected LSTM (FC-LSTM) is extended to convolutional layers, yielding the proposed convolutional long short-term memory network (ConvLSTM), which is used to capture effective spatiotemporal information; its internal structure handles the input-to-state and state-to-state transitions. In the present invention, long-term temporal features are extracted by using a series of video frames as the network input; these features are obtained from the original video frames with pre-trained VGG16 parameters.
Experiments show that the two-dimensional video saliency detection method proposed by the present invention is substantially better than other current methods. It is assessed mainly by three metrics: mean absolute error (MAE), the Pearson linear correlation coefficient (PLCC), and the area under the ROC curve (AUC). The ROC curve is widely used to evaluate visual attention model performance: by defining a threshold, the saliency map of a visual attention model is divided into salient and non-salient points. The true positive rate (TPR) indicates the percentage of target points detected as salient points, and the false positive rate (FPR) indicates the percentage of background points detected as salient points. AUC is the area under the ROC curve and gives a better performance assessment: the better the visual attention model, the larger its AUC value. The Pearson linear correlation coefficient (PLCC) measures the degree of linear correlation between the saliency map and the ground-truth map; the coefficient lies between 0 and 1, and the larger it is, the better the performance of the visual attention model. MAE measures the difference between the predicted saliency map and the label image; a smaller MAE value means a smaller difference between the two maps and a better model. Implementations of these three metrics are sketched below.
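As referenced above, the following NumPy sketch gives straightforward implementations of the three metrics under the stated definitions; the AUC variant shown treats saliency values at fixation points as positives and all other locations as negatives, which is one common formulation rather than necessarily the exact one used in the experiments.

```python
import numpy as np

def mae(pred, label):
    """Mean absolute error between the predicted saliency map and the label map."""
    return float(np.mean(np.abs(pred - label)))

def plcc(pred, label):
    """Pearson linear correlation coefficient between the two maps."""
    return float(np.corrcoef(pred.ravel(), label.ravel())[0, 1])

def auc(pred, fixation):
    """Area under the ROC curve: saliency values at fixation points are
    positives, values everywhere else are negatives. Equals the probability
    that a random positive outranks a random negative.
    (O(P*N) memory; adequate for a sketch.)"""
    pos = pred[fixation > 0].ravel()
    neg = pred[fixation == 0].ravel()
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

# Example on random data
p = np.random.rand(64, 64)
g = (np.random.rand(64, 64) > 0.9).astype(float)
print(mae(p, g), plcc(p, g), auc(p, g))
```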
A two-dimensional video saliency detection method based on long short-term memory, characterized by comprising the following steps:
Step 1: extract the short-term temporal features in the two-dimensional video frames, using a 3D convolutional neural network (3D-ConvNet);
Step 2: extract the long-term temporal features in the two-dimensional video frames, using a bidirectional long short-term memory network (B-ConvLSTM);
Step 3: fuse the extracted short-term and long-term temporal features, and deconvolve the result to obtain a saliency map with the same resolution as the input video frames.
Here, the short-term temporal features in the two-dimensional video frames described in step 1 include motion information.
The 3D convolutional network is calculated according to formula (1):
h_n = σ(Σ W_n * h_(n-1) + b_n)   (1)
where W_n denotes the 3D convolution kernel parameters applied to the (n-1)-th hidden layer; h_(n-1) denotes the (n-1)-th hidden layer; b_n denotes the corresponding bias term; the operator '*' denotes the convolution operation; and σ denotes the activation function.
Batch normalization (the BN algorithm) is used to accelerate network training; it is calculated by formula (2):
x̂^(k) = (x^(k) - E(x^(k))) / √Var(x^(k))   (2)
where E(x^(k)) and Var(x^(k)) denote the expectation and variance of the batch data x^(k) respectively, and x̂^(k) denotes the standardized result.
Batch normalization (the BN algorithm) may change the distribution of the original data; to overcome this problem, formula (3) is used for adjustment:
y^(k) = γ^(k) · x̂^(k) + β^(k)   (3)
where γ^(k) and β^(k) are the corresponding adjustment parameters; x̂^(k) denotes the standardized (BN) result; and y^(k) denotes the adjusted result.
The long-term temporal features in the two-dimensional video frames described in step 2 include motion information.
Taking one ConvLSTM layer as an example, each ConvLSTM unit is calculated by formulas (4)-(8):
i_t = σ(W_xi * X_t + W_hi * H_(t-1) + b_i)   (4)
f_t = σ(W_xf * X_t + W_hf * H_(t-1) + b_f)   (5)
o_t = σ(W_xo * X_t + W_ho * H_(t-1) + b_o)   (6)
C_t = f_t ∘ C_(t-1) + i_t ∘ tanh(W_xc * X_t + W_hc * H_(t-1) + b_c)   (7)
H_t = o_t ∘ tanh(C_t)   (8)
where σ and tanh denote the sigmoid and hyperbolic tangent activation functions respectively; W_xi, W_hi, W_xf, W_hf, W_xo, W_ho, W_xc and W_hc are the convolution kernel parameters of the corresponding layers, and b_i, b_f, b_o and b_c are the corresponding bias terms; i_t, f_t and o_t denote the input gate, forget gate and output gate of the t-th video frame respectively; C_t and H_t are the memory cell and the hidden state; '*' denotes the convolution operation; and '∘' denotes the Hadamard (element-wise) product.
The loss function used in training is calculated by formulas (9)-(10), where y_i denotes the i-th label map in the training dataset, with y_i ∈ (y_1, y_2, ..., y_N); N denotes the total number of images in the training dataset; y'_i denotes the i-th saliency map computed by the model; δ denotes the parameters of the whole network; and δ' denotes the parameters after Adam optimization.
The Adam optimizer of each sub-network can be expressed by formulas (11)-(14):
m_t = β_1 · m_(t-1) + (1 - β_1) · g_t   (11)
v_t = β_2 · v_(t-1) + (1 - β_2) · g_t²   (12)
m̂_t = m_t / (1 - β_1^t),  v̂_t = v_t / (1 - β_2^t)   (13)
W_(t+1) = W_t - η · m̂_t / (√v̂_t + ∈)   (14)
where m_t and v_t are the first-order and second-order momentum terms respectively; β_1 and β_2 are the corresponding decay rates, usually taken as 0.9 and 0.999; m̂_t and v̂_t are the corresponding bias-corrected values; W_t denotes the model parameters at time step t, i.e. at the t-th iteration; g_t = ∇J(W_t) denotes the gradient of the cost function with respect to W at the t-th iteration; η denotes the learning rate; and ∈ is a very small number (typically 1e-8) used to avoid a zero denominator.
Fig. 2 compares different saliency detection algorithms. From the first column to the last, it shows: the original two-dimensional video frame, the reference (ground-truth) image, the saliency maps computed by CE, Fang, Seo, SAGE, GAFL, MC, and MR, and the experimental image of the present invention.
From these comparisons: the CE saliency detection model detects the background; the Fang saliency detection model blurs the boundaries of the detected target; the Seo saliency detection model falsely detects background as foreground; the SAGE saliency detection model over-detects targets; the GAFL saliency detection model misses targets; the edges produced by the MC saliency detection model are not clear enough; and the MR saliency detection model also falsely detects background as foreground. In the end, the saliency detection method proposed by the present invention is closest to the reference images.
Table 1: comparison of different saliency detection models.
The above embodiments are a description of the present invention, not a limitation of it. It will be understood that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the protection scope of the present invention is defined by the appended claims and their equivalents.

Claims (9)

1. A two-dimensional video saliency detection method based on long short-term memory, characterized by comprising the following steps:
Step 1: extract the short-term temporal features in the two-dimensional video frames, using a 3D convolutional neural network (3D-ConvNet);
Step 2: extract the long-term temporal features in the two-dimensional video frames, using a bidirectional long short-term memory network (B-ConvLSTM);
Step 3: fuse the extracted short-term and long-term temporal features, and deconvolve the result to obtain a saliency map with the same resolution as the input video frames.
2. The two-dimensional video saliency detection method based on long short-term memory according to claim 1, characterized in that: the short-term temporal features in the two-dimensional video frames described in step 1 include motion information.
3. The two-dimensional video saliency detection method based on long short-term memory according to claim 2, characterized in that: the 3D convolutional network is calculated according to formula (1):
h_n = σ(Σ W_n * h_(n-1) + b_n)   (1)
where W_n denotes the 3D convolution kernel parameters applied to the (n-1)-th hidden layer; h_(n-1) denotes the (n-1)-th hidden layer; b_n denotes the corresponding bias term; the operator '*' denotes the convolution operation; and σ denotes the activation function.
4. The two-dimensional video saliency detection method based on long short-term memory according to claim 2, characterized in that: batch normalization (the BN algorithm) is used to accelerate network training, calculated as shown in formula (2):
x̂^(k) = (x^(k) - E(x^(k))) / √Var(x^(k))   (2)
where E(x^(k)) and Var(x^(k)) denote the expectation and variance of the batch data x^(k) respectively, and x̂^(k) denotes the standardized result.
5. The two-dimensional video saliency detection method based on long short-term memory according to claim 4, characterized in that: after batch normalization (the BN algorithm), formula (3) is used to adjust for the change in the distribution of the original data:
y^(k) = γ^(k) · x̂^(k) + β^(k)   (3)
where γ^(k) and β^(k) are the corresponding adjustment parameters; x̂^(k) denotes the result after batch normalization; and y^(k) denotes the adjusted result.
6. The two-dimensional video saliency detection method based on long short-term memory according to claim 1, characterized in that: the long-term temporal features in the two-dimensional video frames described in step 2 include motion information.
7. The two-dimensional video saliency detection method based on long short-term memory according to claim 6, characterized in that: in one ConvLSTM layer, each ConvLSTM unit is calculated as shown in formulas (4)-(8):
i_t = σ(W_xi * X_t + W_hi * H_(t-1) + b_i)   (4)
f_t = σ(W_xf * X_t + W_hf * H_(t-1) + b_f)   (5)
o_t = σ(W_xo * X_t + W_ho * H_(t-1) + b_o)   (6)
C_t = f_t ∘ C_(t-1) + i_t ∘ tanh(W_xc * X_t + W_hc * H_(t-1) + b_c)   (7)
H_t = o_t ∘ tanh(C_t)   (8)
where σ and tanh denote the sigmoid and hyperbolic tangent activation functions respectively; W_xi, W_hi, W_xf, W_hf, W_xo, W_ho, W_xc and W_hc are the convolution kernel parameters of the corresponding layers, and b_i, b_f, b_o and b_c are the corresponding bias terms; i_t, f_t and o_t denote the input gate, forget gate and output gate of the t-th video frame respectively; C_t and H_t are the memory cell and the hidden state; '*' denotes the convolution operation; and '∘' denotes the Hadamard (element-wise) product.
8. The two-dimensional video saliency detection method based on long short-term memory according to claim 1, characterized in that: the loss function used in training is calculated by formulas (9)-(10), where y_i denotes the i-th label map in the training dataset, with y_i ∈ (y_1, y_2, ..., y_N); N denotes the total number of images in the training dataset; y'_i denotes the i-th saliency map computed by the model; δ denotes the parameters of the whole network; and δ' denotes the parameters after Adam optimization.
9. The two-dimensional video saliency detection method based on long short-term memory according to claim 8, characterized in that: the Adam optimizer of each sub-network is expressed by formulas (11)-(14):
m_t = β_1 · m_(t-1) + (1 - β_1) · g_t   (11)
v_t = β_2 · v_(t-1) + (1 - β_2) · g_t²   (12)
m̂_t = m_t / (1 - β_1^t),  v̂_t = v_t / (1 - β_2^t)   (13)
W_(t+1) = W_t - η · m̂_t / (√v̂_t + ∈)   (14)
where m_t and v_t are the first-order and second-order momentum terms respectively; β_1 and β_2 are the decay rates, taken as 0.9 and 0.999 respectively; m̂_t and v̂_t are the corresponding bias-corrected values; W_t denotes the model parameters at time step t, i.e. at the t-th iteration; g_t = ∇J(W_t) denotes the gradient of the cost function with respect to W at the t-th iteration; η denotes the learning rate; and ∈ = 1e-8 is a very small number used to avoid a zero denominator.
CN201910614888.0A 2019-07-09 2019-07-09 Two-dimensional video saliency detection method based on long short-term memory Pending CN110334718A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910614888.0A CN110334718A (en) 2019-07-09 2019-07-09 Two-dimensional video saliency detection method based on long short-term memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910614888.0A CN110334718A (en) 2019-07-09 2019-07-09 Two-dimensional video saliency detection method based on long short-term memory

Publications (1)

Publication Number Publication Date
CN110334718A true CN110334718A (en) 2019-10-15

Family

ID=68143390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910614888.0A Pending CN110334718A (en) Two-dimensional video saliency detection method based on long short-term memory

Country Status (1)

Country Link
CN (1) CN110334718A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018028255A1 (en) * 2016-08-11 2018-02-15 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN107273800A (en) * 2017-05-17 2017-10-20 大连理工大学 Action recognition method using a convolutional recurrent neural network based on an attention mechanism
CN107451552A (en) * 2017-07-25 2017-12-08 北京联合大学 Gesture recognition method based on 3D CNN and convolutional LSTM
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 Basketball game event recognition method fusing domain knowledge and multi-level deep features
CN108960261A (en) * 2018-07-25 2018-12-07 扬州万方电子技术有限责任公司 Salient object detection method based on an attention mechanism
CN109064507A (en) * 2018-08-21 2018-12-21 北京大学深圳研究生院 Multi-motion-flow deep convolutional network model method for video prediction
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 Video saliency detection method based on 3D convolutional neural networks

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091045A (en) * 2019-10-25 2020-05-01 重庆邮电大学 Sign language recognition method based on a spatiotemporal attention mechanism
CN111091045B (en) * 2019-10-25 2022-08-23 重庆邮电大学 Sign language recognition method based on a spatiotemporal attention mechanism
CN111008939A (en) * 2019-11-27 2020-04-14 温州大学 Neural network video deblurring method based on a controllable feature space
CN111008939B (en) * 2019-11-27 2022-04-05 温州大学 Neural network video deblurring method based on a controllable feature space
CN111507215A (en) * 2020-04-08 2020-08-07 常熟理工学院 Video object segmentation method based on spatiotemporal convolutional recurrent neural networks and dilated convolution
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video salient object detection method based on an attention mechanism
CN111523410B (en) * 2020-04-09 2022-08-26 哈尔滨工业大学 Video salient object detection method based on an attention mechanism
CN112101382A (en) * 2020-09-11 2020-12-18 北京航空航天大学 Spatiotemporal joint model and video saliency prediction method based on it
CN112101382B (en) * 2020-09-11 2022-10-14 北京航空航天大学 Spatiotemporal joint model and video saliency prediction method based on it
CN113298154A (en) * 2021-05-27 2021-08-24 安徽大学 RGB-D image salient object detection method
CN113298154B (en) * 2021-05-27 2022-11-11 安徽大学 RGB-D image salient object detection method
CN114979801A (en) * 2022-05-10 2022-08-30 上海大学 Dynamic video summarization algorithm and system based on a bidirectional convolutional long short-term memory network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination