CN109034001A - Cross-modal video saliency detection method based on space-time clues - Google Patents

Cross-modal video saliency detection method based on space-time clues

Info

Publication number
CN109034001A
Authority
CN
China
Prior art keywords
mode
saliency
frame
cross
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810725499.0A
Other languages
Chinese (zh)
Other versions
CN109034001B (en)
Inventor
汤进
范东哲
李成龙
王逍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201810725499.0A priority Critical patent/CN109034001B/en
Publication of CN109034001A publication Critical patent/CN109034001A/en
Application granted granted Critical
Publication of CN109034001B publication Critical patent/CN109034001B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Studio Devices (AREA)
  • Transforming Light Signals Into Electric Signals (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention discloses a cross-modal video saliency detection method based on space-time clues. The method comprises: obtaining a pair of matched multi-modal video sequence frames and segmenting them into superpixels with the SLIC algorithm; computing the saliency of each pixel of the superpixel segmentation map and selecting the nodes with high similarity as foreground points; constructing a saliency map by combining the saliency values of the previous stage with the weights of the visible-light and thermal-infrared modalities; comparing the saliency values of two adjacent frames and computing the maximum overlap ratio of their spatial positions, so as to find the inherent relation between adjacent frames and obtain a space-time-based multi-modal video saliency result; and solving the model with the Lagrange multiplier method to obtain the final result.

Description

A cross-modal video saliency detection method based on space-time clues
Technical field
The present invention relates to computer vision techniques, and in particular to a cross-modal video saliency detection method based on space-time clues.
Background technique
Video saliency detection is a fundamental task in computer vision. It aims to locate the most attention-grabbing target regions in a video sequence and has wide applications in fields such as video classification, video retrieval, video summarization, scene understanding, and target tracking; it is a basic and key problem of computer vision that has attracted increasing attention from researchers in recent years. Although some progress has been made, video saliency detection under visible light remains very challenging because of factors such as cluttered backgrounds and harsh weather. To cope with these challenges, integrating multiple different but complementary modalities, such as visible-light and thermal-infrared spectral information, may further improve video saliency detection results.
At present, most saliency detection algorithms are based on visible-spectrum information, but visible-light imaging is highly susceptible to environmental conditions such as illumination changes and haze. Some research has therefore introduced data of other modalities, such as thermal infrared. A thermal infrared sensor images the surface temperature of any target above absolute zero and is insensitive to challenging factors such as smoke, low illumination, and rain or snow. Because of its wide application in the military and security-monitoring fields, it has drawn the attention of research institutions at home and abroad, has made significant progress, and is gradually being applied in many other fields.
Combining multiple different complementary cues, such as visible-light and thermal-infrared data, is therefore an effective means of coping with the above challenging scenes and improving saliency detection. Visible-light and thermal-infrared information are complementary and can promote saliency detection in different respects. For example, a thermal infrared sensor is a passive imaging sensor that captures the infrared radiation (wavelength range 0.75~13 um) emitted by any target above absolute zero. Compared with existing sensors, a thermal infrared camera has the following main advantages: long-range imaging capability; insensitivity to light, which avoids interference from different illumination conditions; and strong penetration, e.g., through haze and smoke.
Therefore, under poor illumination and hazy weather, a thermal infrared sensor is more effective than a visible-spectrum camera. As shown in Fig. 1, which illustrates the advantage of thermal infrared data in harsh environments.
Images acquired by visible-light cameras generally have high resolution and contain rich geometric and texture details, but they are sensitive to light, and video image quality declines sharply in complex scenes and environments; Fig. 1(a) and Fig. 1(b) show the imaging results of visible-light and thermal-infrared data under hazy weather and low illumination, respectively. Since a thermal infrared image reflects the surface temperature distribution of objects in the scene, it is insensitive to illumination, penetrates cloud and mist well, and can even identify camouflage. It thus complements visible-light data and yields more robust video saliency detection results.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the influence of factors such as low illumination, haze, and cluttered backgrounds by fusing multiple complementary visual modalities, and to provide a cross-modal video saliency detection method based on space-time clues.
The present invention solves the above technical problem through the following technical scheme, comprising the steps of:
(1) obtaining a pair of matched multi-modal video sequence frames and segmenting them into superpixels with the SLIC algorithm;
(2) computing the saliency of each pixel of the superpixel segmentation map with a multitask manifold ranking algorithm to obtain foreground points; then screening the obtained foreground points, selecting the nodes above a given threshold, comparing all nodes with the nodes retained after screening, and selecting the nodes with high similarity as foreground points;
(3) constructing a saliency map by combining the saliency values of the previous stage with the weights of the visible-light and thermal-infrared modalities;
(4) comparing the saliency values of two adjacent frames and computing the maximum overlap ratio of their spatial positions, so as to find the inherent relation between adjacent frames and obtain a space-time-based multi-modal video saliency result;
(5) solving the model with the Lagrange multiplier method to obtain the final result.
In step (2), the original video sequence is divided into many short windows, each containing five consecutive frames, and the multitask collaborative manifold ranking formula is proposed as follows:
where t denotes the t-th frame, and the squared l2 norm of a vector is the sum of the squares of its elements;
i, j denote different superpixel blocks, so the edge weight between nodes in the video sequences of the visible-light and thermal-infrared modalities is defined through the color feature of each superpixel block, with k indexing the modality (K modalities in total) and γk the scale parameter of the k-th modality;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α is a balance parameter;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities.
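The formula itself appears only as an image in the original patent and is not reproduced in this text. Based on the terms listed above and the graph-based manifold ranking model cited later in the description (Yang et al., 2013), one plausible shape of the multitask collaborative objective is the following hedged reconstruction (μ is an assumed fidelity weight; this is a sketch, not the patent's exact formula):

```latex
\min_{s,\,r}\ \sum_{k=1}^{K} r_k \left(
  \frac{1}{2}\sum_{i,j} W^{k}_{ij}
  \left\| \frac{s^{k,t}_i}{\sqrt{D_{ii}}} - \frac{s^{k,t}_j}{\sqrt{D_{jj}}} \right\|^2
  + \mu \left\| s^{k,t} - y^{k,t} \right\|_2^2 \right)
  + \alpha \left\| r \right\|_2^2
  + \beta \sum_{k \ne k'} \left\| s^{k,t} - s^{k',t} \right\|_2^2,
\qquad
W^{k}_{ij} = \exp\!\left( -\frac{\left\| c^{k}_i - c^{k}_j \right\|}{\gamma_k} \right)
```

Here the first term is the graph-smoothness term of manifold ranking weighted by the modality reliability r_k, the second fits the seed indicator y, the third (α‖r‖²) prevents overfitting of r, and the fourth (β-term) is the cross-modal consistency constraint, matching the roles described above.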
In a preferred embodiment of the invention, K takes the value 2 in the formula, i.e., the two modalities of visible light and thermal infrared.
In step (2), the nodes on the top, bottom, left, and right boundaries of the image are taken in turn as background seed points, i.e., query objects. With the queries on one boundary (e.g., the top), the ranking score of each node in the graph relative to the queries is computed and then subtracted from 1; finally, the foreground vectors obtained for the four directions are combined by elementwise product to compute the initial saliency value of each modality in the first stage:
where the ∘ symbol denotes the elementwise (Hadamard) product of vectors, i.e., multiplication of corresponding elements;
and the factors denote, respectively, the ranking value of each superpixel block of the t-th frame under modality k when the nodes of the top, bottom, left, and right image boundaries are used as background seed points.
In step (3), the ranking values produced by the ranking algorithm and the modality weights r are computed as in the previous stage; the obtained ranking values are regularized so that they range between 0 and 1. Finally, by combining the ranking values with the modality weights r, the final saliency value of each superpixel block under modality k is obtained as the product of the modality weight rk and its ranking value, which yields the final saliency map.
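As a concrete illustration of this fusion step, the sketch below (an assumption about the exact arithmetic, since the formula images are not reproduced in this text) regularizes each modality's ranking values into [0, 1] and combines them with the modality weights r:

```python
import numpy as np

def normalize01(x):
    # regularize ranking values so that they range between 0 and 1
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def fuse_modalities(rank_visible, rank_thermal, r):
    # final saliency per superpixel: modality weight r_k times the
    # regularized ranking value, summed over the two modalities
    s_v = normalize01(rank_visible)
    s_t = normalize01(rank_thermal)
    return r[0] * s_v + r[1] * s_t
```

With equal weights r = [0.5, 0.5], fuse_modalities([0, 1, 2], [2, 1, 0], r) gives a uniform 0.5 saliency, since the two hypothetical modality rankings disagree completely.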
In step (4), for a given pair of visible-light and thermal-infrared video sequences, the goal is to find the salient object in every frame of the multi-modal video pair, using the formula:
where t denotes the t-th frame, and the squared l2 norm of a vector is the sum of the squares of its elements;
i, j denote different superpixel blocks, so the edge weight between nodes in the video sequences of the two modalities is defined through the color feature of each superpixel block, with K the number of modalities and γk the scale parameter of the k-th modality;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α, λ, β are hyperparameters;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities;
the temporal term corrects the saliency estimate of the t-th frame with the (t-1)-th frame, based on the maximum overlap ratio of spatial positions between the t-th and (t-1)-th frames;
its purpose is to uncover the inherent relation between adjacent frames: by computing the maximum overlap ratio of adjacent frames, motion information is found and a space-time-based multi-modal video saliency result is obtained.
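A minimal sketch of the maximum-overlap-ratio idea, under the assumption (not stated explicitly in the text) that salient regions are compared as binary masks and the overlap ratio is intersection over union:

```python
import numpy as np

def overlap_ratio(mask_a, mask_b):
    # overlap ratio of two binary salient-region masks
    # (intersection over union)
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum()) / union if union else 0.0

def max_overlap(prev_regions, cur_region):
    # maximum overlap ratio between the current frame's salient
    # region and candidate regions of the previous frame
    return max(overlap_ratio(r, cur_region) for r in prev_regions)
```

A high maximum overlap suggests the same moving object persists across the two frames, which is the motion cue the temporal term exploits.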
The model is optimized as follows:
where i, j denote different superpixel blocks, and the edge weight between nodes in the video sequences of the visible-light and thermal-infrared modalities is defined through the color feature of each superpixel block, with γk the scale parameter of the k-th modality;
the squared Frobenius norm of a matrix is the sum of the squares of its elements;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α is a balance parameter;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities; a space-time matrix Pt,t-1 and an auxiliary variable zk,t are introduced to replace sk,t in the formula of step (6).
Compared with the prior art, the present invention has the following advantages: from the perspective of information fusion, the invention overcomes the influence of factors such as low illumination, haze, and cluttered backgrounds by fusing multiple complementary visual modalities, and introduces a weight for each modality to represent its reliability, thereby achieving adaptive and collaborative fusion of heterogeneous data. In addition, the application incorporates space-time clues into the multitask model to obtain a smoother temporal result. By iteratively solving multiple subproblems, the modality weights and the ranking function are obtained, yielding a more robust video saliency detection result.
Detailed description of the invention
Fig. 1 shows the imaging of thermal infrared data in complex scenes;
Fig. 2 is a flow diagram of the present invention.
Specific embodiment
The embodiments of the present invention are elaborated below. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation and specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments.
As shown in Fig. 2, the specific steps of the present embodiment are as follows:
(1) Given a pair of visible-light and thermal-infrared videos, the thermal-infrared video sequence is treated as one of the video frame channels. Simple SLIC superpixel segmentation is first performed on the provided video frame sequences to produce the superpixels of each frame, thereby preserving the initial structural elements of the video content of the two modalities.
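SLIC itself is available in common libraries (e.g., skimage.segmentation.slic); as a dependency-free illustration of what this step produces, the sketch below uses a regular-grid stand-in for SLIC and computes each superpixel's mean color, the "initial structural element" of a modality:

```python
import numpy as np

def grid_superpixels(height, width, step):
    # stand-in for SLIC: a regular grid of labels; real SLIC refines
    # such a grid by local k-means clustering in color+position space
    rows = np.arange(height) // step
    cols = np.arange(width) // step
    n_cols = -(-width // step)  # ceiling division
    return rows[:, None] * n_cols + cols[None, :]

def mean_colors(image, labels):
    # mean color of each superpixel block (usable later as the color
    # feature of each graph node)
    n_labels = int(labels.max()) + 1
    feats = np.zeros((n_labels, image.shape[2]))
    for k in range(n_labels):
        feats[k] = image[labels == k].mean(axis=0)
    return feats
```

The grid labeling is only a sketch of the segmentation output format (a per-pixel label map); it does not follow image boundaries the way SLIC does.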
(2) Multi-modal information fusion has inherent complexity, including the heterogeneity between modalities, the reliability of each modality, and the noise in the initial seed points of the sequence. Therefore, on the basis of the conventional graph-based manifold ranking algorithm ("Saliency Detection via Graph-Based Manifold Ranking"), the present embodiment introduces modality reliability weights and seed-point optimization to overcome the above problems, and proposes a new collaborative manifold ranking model. The present embodiment divides the original video sequence into many short windows, each containing five consecutive frames, and formulates the multitask collaborative manifold ranking algorithm with the following formula:
where K takes 2 in the present embodiment, i.e., the two modalities of visible light and thermal infrared; t denotes the t-th frame; and the squared l2 norm of a vector is the sum of the squares of its elements.
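The per-modality building block here is the cited graph-based manifold ranking of Yang et al. (2013), whose closed form is f* = (D − αW)⁻¹y; the sketch below shows this single-modality core (the patent's multitask, modality-weighted extension is not reproduced):

```python
import numpy as np

def manifold_rank(W, y, alpha=0.99):
    # graph-based manifold ranking (Yang et al., 2013): rank all
    # nodes against the query indicator y on affinity graph W via
    # the closed form f* = (D - alpha * W)^(-1) y
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # degree matrix
    return np.linalg.solve(D - alpha * W, np.asarray(y, dtype=float))
```

On a toy three-node graph with node 0 as the query, a node strongly connected to the query ranks above a weakly connected one, which is the behavior the boundary-query stage relies on.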
(3) The nodes on the top, bottom, left, and right boundaries of the image are taken in turn as background seed points, i.e., query objects. With the queries on the top boundary of the image, the ranking score of each node in the graph relative to the queries is computed and then subtracted from 1; finally, the foreground vectors obtained for the four directions are combined by elementwise product to compute the initial saliency value of each modality in the first stage:
where the ∘ symbol denotes the elementwise product of vectors, i.e., multiplication of corresponding elements;
and the factors denote, respectively, the ranking value of each superpixel block of the t-th frame under modality k when the nodes of the top, bottom, left, and right image boundaries are used as background seed points.
Taking the right boundary of the t-th frame as an example, the superpixels on this side are used as the labeled query nodes, and the rest serve as unlabeled data.
According to the formula given above, the ranking score is computed by the proposed ranking algorithm and regularized, so that the new ranking score lies between 0 and 1.
By analogy, the nodes on the bottom, left, and top boundaries are taken in turn as background seed points, i.e., query objects, and the ranking score of each node in the graph relative to the queries on that boundary is computed
and subtracted from 1; finally, the foreground vectors obtained for the four directions are combined by elementwise product to obtain the foreground saliency values of the first stage.
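The four-direction combination described above can be sketched as follows (a hedged reading: each boundary yields a background-based ranking score, the foreground cue is one minus that score, and the four cues are multiplied elementwise):

```python
import numpy as np

def first_stage_saliency(score_top, score_bottom, score_left, score_right):
    # each score_* is the per-superpixel ranking score obtained with
    # one image boundary as background queries; 1 - score is the
    # foreground cue, and the four cues are combined by elementwise
    # (Hadamard) product
    cues = [1 - np.asarray(s, dtype=float)
            for s in (score_top, score_bottom, score_left, score_right)]
    return cues[0] * cues[1] * cues[2] * cues[3]
```

A superpixel ranked as background by any single boundary (score near 1) is suppressed by the product, which is the intended effect of the boundary prior.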
(4) According to the formula given in step (2), the proposed ranking algorithm computes the ranking values and the modality weights r. As in the previous stage, the obtained ranking values are regularized; the regularization constrains the parameters to be optimized in order to prevent overfitting, so that the values range between 0 and 1. Finally, combining the ranking values with the modality weights r gives the final saliency value, and hence the final saliency map.
(5) Given a pair of visible-light and thermal-infrared video sequences, the goal is to find the salient object in every frame of the multi-modal video pair. In the formula of step (2), the multitask concept is introduced to make the different modalities cooperate effectively. However, it can only guarantee the spatial smoothness of each frame and ignores temporal clues. Therefore, for each frame, the present embodiment imposes three important requirements: 1. the modalities are kept consistent; 2. the saliency is also kept consistent; 3. the temporal information of adjacent frames is considered. Specifically, the present embodiment gives the following formula:
where the temporal term denotes the saliency estimate of the t-th frame corrected with the (t-1)-th frame; the principle is based on the maximum overlap ratio of spatial positions between the t-th and (t-1)-th frames. The last term serves to uncover the inherent relation between adjacent frames: by computing the maximum overlap ratio of adjacent frames, motion information is found, yielding a space-time-based multi-modal video saliency detection method.
(6) To fuse multi-modal information adaptively and mine the internal relations between image blocks (graph nodes), accurately computing the importance weights of the nodes is essential. Therefore, the present embodiment studies a joint model in which the computation of the graph structure, edge weights, and node weights is fused into a unified optimization framework to boost algorithm performance. However, since the joint model contains multiple optimization variables, it is often very difficult to solve.
The present embodiment therefore studies and proposes an efficient solving algorithm for the joint model, leading to the following optimization model:
where i, j denote different superpixel blocks, so the edge weight between nodes in the video sequences of the two modalities is defined through the color feature of each superpixel block, with γk the scale parameter of the k-th modality; the squared Frobenius norm of a matrix is the sum of the squares of its elements;
D = diag{D11, ..., Dnn} is the degree matrix, where diag denotes the diagonal operation;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α is a balance parameter;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities.
To solve the above formula, the present embodiment introduces a space-time matrix Pt,t-1 in the last term and an auxiliary variable zk,t to replace sk,t in the formula of step (5). This optimization model is solved with the Augmented Lagrange Multiplier (ALM) method, which alternately updates each parameter and converges very efficiently. With the other variables fixed each time, every subproblem has its own closed-form solution; in the update process, S and r are updated alternately. The fast convergence of the algorithm is verified in experiments. The solved modality weights, image-block features, importance weights, and the maximum overlap ratio of adjacent frames are fused to form a collaborative feature representation of the target, achieving accurate multi-modal video saliency detection.
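The ALM alternating-update pattern can be illustrated on a toy split-variable problem (a generic sketch of the solver idea, not the patent's actual subproblems): minimize ½‖s − a‖² + λ‖z‖₁ subject to s = z, where z plays the role of the auxiliary variable replacing s in one term:

```python
import numpy as np

def alm_solve(a, lam=0.5, rho=1.0, iters=300):
    # augmented-Lagrangian (ADMM-style) alternating updates: with the
    # other variables fixed, each subproblem has a closed-form solution
    a = np.asarray(a, dtype=float)
    s = np.zeros_like(a)
    z = np.zeros_like(a)
    u = np.zeros_like(a)  # scaled Lagrange multiplier
    for _ in range(iters):
        s = (a + rho * (z - u)) / (1 + rho)                     # quadratic subproblem
        v = s + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0)   # soft-threshold subproblem
        u = u + s - z                                           # multiplier update
    return s
```

The iterates converge to the soft-thresholding of a, the known solution of this toy problem, illustrating how alternating closed-form updates of the primal variables and the multiplier solve a constrained model.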
The present embodiment improves on the original visible-light-based video saliency detection method by adding thermal-infrared frame pairs as input, which yields more effective detection results for video saliency detection in complex scenes. Meanwhile, space-time clues are incorporated into the multitask model: the maximum overlap ratio is obtained by comparing the differences between the two adjacent frames, producing a smoother temporal result.
The above is only a description of the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent replacements, and improvements made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (7)

1. A cross-modal video saliency detection method based on space-time clues, characterized by comprising the following steps:
(1) obtaining a pair of matched multi-modal video sequence frames and segmenting them into superpixels with the SLIC algorithm;
(2) computing the saliency of each pixel of the superpixel segmentation map with a multitask manifold ranking algorithm to obtain foreground points; then screening the obtained foreground points, selecting the nodes above a given threshold, comparing all nodes with the nodes retained after screening, and selecting the nodes with high similarity as foreground points;
(3) constructing a saliency map by combining the saliency values of the previous stage with the weights of the visible-light and thermal-infrared modalities;
(4) comparing the saliency values of two adjacent frames and computing the maximum overlap ratio of their spatial positions, so as to find the inherent relation between adjacent frames and obtain a space-time-based multi-modal video saliency result;
(5) solving the model with the Lagrange multiplier method to obtain the final result.
2. The cross-modal video saliency detection method based on space-time clues according to claim 1, characterized in that, in step (2), the original video sequence is divided into many short windows, each containing five consecutive frames, and the multitask collaborative manifold ranking formula is proposed as follows:
where t denotes the t-th frame, and the squared l2 norm of a vector is the sum of the squares of its elements;
i, j denote different superpixel blocks, so the edge weight between nodes in the video sequences of the visible-light and thermal-infrared modalities is defined through the color feature of each superpixel block, with K the number of modalities and γk the scale parameter of the k-th modality;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α is a balance parameter;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities.
3. The cross-modal video saliency detection method based on space-time clues according to claim 2, characterized in that, in the formula, K takes the value 2, i.e., the two modalities of visible light and thermal infrared.
4. The cross-modal video saliency detection method based on space-time clues according to claim 2, characterized in that, in step (2), the nodes on the top, bottom, left, and right boundaries of the image are taken in turn as background seed points, i.e., query objects; the ranking score of each node in the graph relative to the queries on the top boundary is computed and then subtracted from 1; finally, the foreground vectors obtained for the four directions are combined by elementwise product to compute the initial saliency value of each modality in the first stage:
where the ∘ symbol denotes the elementwise product of vectors, i.e., multiplication of corresponding elements;
and the factors denote, respectively, the ranking value of each superpixel block of the t-th frame under modality k when the nodes of the top, bottom, left, and right image boundaries are used as background seed points.
5. The cross-modal video saliency detection method based on space-time clues according to claim 4, characterized in that, in step (3), the ranking values computed by the ranking algorithm and the modality weights r are obtained as in the previous stage; the obtained ranking values are regularized so that they range between 0 and 1; finally, combining the ranking values with the modality weights r gives the final saliency value of each superpixel block under modality k as the product of the modality weight rk and its ranking value, which yields the final saliency map.
6. The cross-modal video saliency detection method based on space-time clues according to claim 5, characterized in that, in step (4), for a given pair of visible-light and thermal-infrared video sequences, the goal is to find the salient object in every frame of the multi-modal video pair, using the formula:
where t denotes the t-th frame, and the squared l2 norm of a vector is the sum of the squares of its elements;
i, j denote different superpixel blocks, so the edge weight between nodes in the video sequences of the two modalities is defined through the color feature of each superpixel block, with K the number of modalities and γk the scale parameter of the k-th modality;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α, λ, β are hyperparameters;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities;
the temporal term denotes the saliency estimate of the t-th frame corrected with the (t-1)-th frame, based on the maximum overlap ratio of spatial positions between the t-th and (t-1)-th frames;
its purpose is to uncover the inherent relation between adjacent frames: by computing the maximum overlap ratio of adjacent frames, motion information is found and a space-time-based multi-modal video saliency result is obtained.
7. The cross-modal video saliency detection method based on space-time clues according to claim 6, characterized in that the model is optimized as follows:
where i, j denote different superpixel blocks, and the edge weight between nodes in the video sequences of the visible-light and thermal-infrared modalities is defined through the color feature of each superpixel block, with γk the scale parameter of the k-th modality;
the squared Frobenius norm of a matrix is the sum of the squares of its elements;
D = diag{D11, ..., Dnn} is the degree matrix;
Γ = [Γ1, ..., ΓK]T is an adaptive parameter vector, initialized after the first iteration;
r is the modality weight vector, r = [r1, r2, ..., rM]T;
α is a balance parameter;
the third term of the formula prevents overfitting of r;
the fourth term is the cross-modal consistency constraint, which balances the coherence between the two modalities; a space-time matrix Pt,t-1 and an auxiliary variable zk,t are introduced to replace sk,t in the formula of step (6).
CN201810725499.0A 2018-07-04 2018-07-04 Cross-modal video saliency detection method based on space-time clues Active CN109034001B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810725499.0A CN109034001B (en) 2018-07-04 2018-07-04 Cross-modal video saliency detection method based on space-time clues

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810725499.0A CN109034001B (en) 2018-07-04 2018-07-04 Cross-modal video saliency detection method based on space-time clues

Publications (2)

Publication Number Publication Date
CN109034001A true CN109034001A (en) 2018-12-18
CN109034001B CN109034001B (en) 2021-06-25

Family

ID=65522401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810725499.0A Active CN109034001B (en) 2018-07-04 2018-07-04 Cross-modal video saliency detection method based on space-time clues

Country Status (1)

Country Link
CN (1) CN109034001B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097555A (en) * 2019-04-26 2019-08-06 绵阳慧视光电技术有限责任公司 Electronic equipments safety monitoring method based on thermometric dot matrix fusion visible images
CN110188239A (en) * 2018-12-26 2019-08-30 北京大学 A kind of double-current video classification methods and device based on cross-module state attention mechanism
CN111426691A (en) * 2020-03-26 2020-07-17 浙江东瞳科技有限公司 Detection method based on multi-modal visual imaging
CN111783524A (en) * 2020-05-19 2020-10-16 普联国际有限公司 Scene change detection method and device, storage medium and terminal equipment
CN111881915A (en) * 2020-07-15 2020-11-03 武汉大学 Satellite video target intelligent detection method based on multiple prior information constraints
CN113011324A (en) * 2021-03-18 2021-06-22 安徽大学 Target tracking method and device based on feature map matching and super-pixel map sorting
CN116595343A (en) * 2023-07-17 2023-08-15 山东大学 Manifold ordering learning-based online unsupervised cross-modal retrieval method and system


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHENGLONG LI ET AL.: "A unified RGB-T saliency detection benchmark: dataset, baselines, analysis and a novel approach", arXiv *
CHUAN YANG ET AL.: "Saliency Detection via Graph-Based Manifold Ranking", 2013 IEEE Conference on Computer Vision and Pattern Recognition *
WU JIANGUO ET AL.: "Salient object detection in RGB-D images by fusing salient depth features", Journal of Electronics & Information Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188239A (en) * 2018-12-26 2019-08-30 Peking University Two-stream video classification method and device based on cross-modal attention mechanism
CN110188239B (en) * 2018-12-26 2021-06-22 Peking University Two-stream video classification method and device based on cross-modal attention mechanism
CN110097555A (en) * 2019-04-26 2019-08-06 Mianyang Huishi Optoelectronic Technology Co., Ltd. Electronic equipment safety monitoring method based on fusing a thermometric dot matrix with visible-light images
CN111426691A (en) * 2020-03-26 2020-07-17 Zhejiang Dongtong Technology Co., Ltd. Detection method based on multi-modal visual imaging
CN111783524A (en) * 2020-05-19 2020-10-16 Pulian International Ltd. Scene change detection method and device, storage medium and terminal device
CN111783524B (en) * 2020-05-19 2023-10-17 Pulian International Ltd. Scene change detection method and device, storage medium and terminal device
CN111881915A (en) * 2020-07-15 2020-11-03 Wuhan University Intelligent satellite video target detection method based on multiple prior information constraints
CN111881915B (en) * 2020-07-15 2022-07-15 Wuhan University Intelligent satellite video target detection method based on multiple prior information constraints
CN113011324A (en) * 2021-03-18 2021-06-22 Anhui University Target tracking method and device based on feature map matching and superpixel graph ranking
CN116595343A (en) * 2023-07-17 2023-08-15 Shandong University Online unsupervised cross-modal retrieval method and system based on manifold ranking learning
CN116595343B (en) * 2023-07-17 2023-10-03 Shandong University Online unsupervised cross-modal retrieval method and system based on manifold ranking learning

Also Published As

Publication number Publication date
CN109034001B (en) 2021-06-25

Similar Documents

Publication Publication Date Title
CN109034001A (en) Cross-modal video saliency detection method based on space-time clues
CN107274419B (en) Deep learning significance detection method based on global prior and local context
KR20150079576A (en) Depth map generation from a monoscopic image based on combined depth cues
Suárez et al. Deep learning based single image dehazing
CN105160649A (en) Multi-target tracking method and system based on kernel function unsupervised clustering
CN109993052B (en) Scale-adaptive target tracking method and system under complex scene
CN106952286A (en) Moving target segmentation method for dynamic backgrounds based on motion saliency map and optical flow vector analysis
CN107527370B (en) Target tracking method based on camshift
CN110544269A (en) Siamese network infrared target tracking method based on feature pyramid
Cvejic et al. The effect of pixel-level fusion on object tracking in multi-sensor surveillance video
Hidayatullah et al. CAMSHIFT improvement on multi-hue and multi-object tracking
CN113592911A (en) Appearance-enhanced deep target tracking method
CN107194949B (en) Interactive video segmentation method and system based on block matching and enhanced OneCut
CN107609571A (en) Adaptive target tracking method based on LARK features
CN113449658A (en) Night video sequence saliency detection method based on spatial, frequency and temporal domains
CN108491857B (en) Multi-camera target matching method with overlapping fields of view
Hui RETRACTED ARTICLE: Motion video tracking technology in sports training based on Mean-Shift algorithm
CN102223545B (en) Rapid multi-view video color correction method
CN108021857A (en) Building detection method based on depth recovery from UAV image sequences
Chen et al. Visual depth guided image rain streaks removal via sparse coding
Zhang et al. Spatiotemporal saliency detection based on maximum consistency superpixels merging for video analysis
CN108171651B (en) Image alignment method based on multi-model geometric fitting and layered homography transformation
Sun et al. Research on cloud computing modeling based on fusion difference method and self-adaptive threshold segmentation
CN109064444A (en) Track slab defect detection method based on saliency analysis
Xu et al. Moving target detection and tracking in FLIR image sequences based on thermal target modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant