CN105228033B - A video processing method and electronic device - Google Patents
A video processing method and electronic device
- Publication number: CN105228033B
- Application number: CN201510535580.9A
- Authority
- CN
- China
- Prior art keywords
- feature
- video
- face
- video frame
- calculated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
Abstract
The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from video frames, the first feature set including a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature; calculating a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment; and performing fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
Description
Technical field
The present invention relates to video processing technology, and more particularly to a video processing method and an electronic device.
Background technology
Intelligent terminals such as smartphones have become constant companions in people's work and daily life, and by downloading videos and shooting their own, users easily accumulate large numbers of them. For phones equipped with binocular cameras in particular, the amount of data to be stored is even larger. Given the relatively limited storage capacity of mobile phones, the management of video files has become a problem in urgent need of a solution.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided by an embodiment of the present invention includes:
extracting a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
calculating a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment;
performing fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
The electronic device provided by an embodiment of the present invention includes:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment;
a second processing unit, configured to perform fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
In the technical solution of the embodiments of the present invention, a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature are first extracted from video frames. Based on these, a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment are calculated. These three features are then fused to obtain a video summary. In this way, semantically condensed and important video segments are extracted from the original video, which effectively reduces the amount of data to be stored on the electronic device, improves the utilization of the device's memory and the user experience, and makes it easier for the user to locate a desired video among a smaller number of files later. Moreover, the technical solution of the embodiments combines information from the visual modality and the textual modality, and can therefore capture the high-level semantics of video content more effectively. The face attention feature incorporates the depth of objects in the scene, which helps grasp high-level semantics from a more complete perspective. The technical solution does not depend on heuristic rules tailored to specific video genres, so it is applicable to a broad range of video types.
Description of the drawings
Fig. 1 is a schematic flowchart of the video processing method of the first embodiment of the present invention;
Fig. 2 is a schematic flowchart of the video processing method of the second embodiment of the present invention;
Fig. 3 is an overall flowchart of the video summary extraction of an embodiment of the present invention;
Fig. 4 is a flowchart of calculating the semantic indicative feature of a video segment according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the electronic device of the first embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the electronic device of the second embodiment of the present invention.
Detailed description
To provide a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are provided for reference and illustration only and are not intended to limit the embodiments of the present invention.
In the era of information explosion, traditional ways of browsing and managing video data face unprecedented challenges. It is therefore of practical significance to provide video users with summaries that are brief and concentrate the key information of the original video. Video summaries generally fall into two types, dynamic and static: a dynamic video summary is a shortened version of the original video and may contain a series of segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Traditional video summaries are generated by extracting visual or textual features from the video. However, most methods in this direction rely on heuristic rules or simple text analysis (e.g., based on word-frequency statistics). In addition, traditional attention-model methods that use face features consider only information such as the planar position and size of the detected faces in the scene, and make no use of depth information.
The technical solution of the embodiments of the present invention estimates the relative importance of video segments based on a user attention model, the semantic information of the video, and the depth information of the video frames, using iterative re-weighting, so as to generate a dynamic video summary.
Fig. 1 is a schematic flowchart of the video processing method of the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature.
Referring to Fig. 3, the first feature set is first extracted from the video frames. The first feature set is a set of low-level features and comprises four of them: the color moment feature, the wavelet texture feature, the motion feature, and the local keypoint feature. The four low-level features of the first feature set are described in detail below.
(1) Color moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks. For each pixel block, the first-order moment and the second- and third-order central moments are computed for each of the three channels of the Lab color space. The color moments of the 25 pixel blocks of frame i together constitute its color moment feature vector fcm(i).
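The per-block moment computation can be sketched as follows. This is a minimal illustration of step (1), assuming the frame has already been converted to the Lab color space; the helper name and grid parameter are ours, not the patent's.

```python
import numpy as np

def color_moment_feature(lab_frame, grid=5):
    """Color moment feature for one frame, as described in step (1).

    `lab_frame` is assumed to be an H x W x 3 array already in Lab space.
    The frame is split into grid x grid non-overlapping blocks; for each
    block and each channel we compute the first-order moment (mean) and the
    second- and third-order central moments, giving grid*grid*3*3 values.
    """
    h, w, _ = lab_frame.shape
    bh, bw = h // grid, w // grid
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = lab_frame[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw, :]
            pixels = block.reshape(-1, 3).astype(np.float64)
            mean = pixels.mean(axis=0)            # 1st-order moment
            centered = pixels - mean
            var = (centered ** 2).mean(axis=0)    # 2nd central moment
            skew = (centered ** 3).mean(axis=0)   # 3rd central moment
            feats.append(np.concatenate([mean, var, skew]))
    return np.concatenate(feats)  # length grid*grid*9 = 225 for grid=5
```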
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks. A three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical, and diagonal directions are computed at each level. All the wavelet coefficient variances of the video frame together constitute its wavelet texture feature vector fwt(i).
(3) Motion feature
The human eye discerns changes in visual content with great sensitivity. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks of 16 × 16 pixels each, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors together constitute the motion feature fmv(i) of the frame.
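The patent does not name a particular motion estimation algorithm; as one possible instantiation, exhaustive block matching with a sum-of-absolute-differences (SAD) criterion can be sketched as:

```python
import numpy as np

def block_motion_vectors(prev, curr, block=16, search=4):
    """Exhaustive block-matching motion estimation (an assumed algorithm).

    For each `block` x `block` block of the current frame, find the
    displacement within +/-`search` pixels that minimizes the SAD against
    the previous frame. Returns an (M, N, 2) array of (dy, dx) vectors.
    """
    h, w = curr.shape
    M, N = h // block, w // block
    mv = np.zeros((M, N, 2), dtype=np.int64)
    for m in range(M):
        for n in range(N):
            y0, x0 = m * block, n * block
            ref = curr[y0:y0 + block, x0:x0 + block].astype(np.int64)
            best, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate block leaves the frame
                    cand = prev[y:y + block, x:x + block].astype(np.int64)
                    sad = np.abs(ref - cand).sum()
                    if best is None or sad < best:
                        best, best_d = sad, (dy, dx)
            mv[m, n] = best_d
    return mv
```

In practice a phone implementation would reuse the encoder's motion vectors rather than recompute them, but the output has the same M × N layout either way.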
(4) Local keypoint feature
In semantic-level video analysis, a bag of features (BoF) built on local keypoints can serve as a strong complement to features computed from global information. A soft-weighted local keypoint feature is therefore used to capture salient regions; the feature is defined by the importance of the keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints of the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, described by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into the 500 visual words. The keypoint feature vector fkp(i) is defined as the weighted similarity between the keypoints and their four nearest visual words.
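The soft weighting over the four nearest visual words can be sketched as below. The exact weighting scheme is not spelled out in this text, so a common 1/2^rank decay is assumed; the function name and normalization are ours.

```python
import numpy as np

def soft_bof(descriptors, codebook, k=4):
    """Soft-weighted bag-of-features histogram over a visual vocabulary.

    `descriptors` is an (n, d) array of SIFT descriptors and `codebook` a
    (V, d) array of visual-word centroids (V = 500 in the scheme). Each
    descriptor votes for its `k` nearest words with weights decaying by rank.
    """
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dist = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dist)[:k]
        for rank, w in enumerate(nearest):
            hist[w] += 1.0 / (2.0 ** rank)  # assumed rank-decay weighting
    total = hist.sum()
    return hist / total if total > 0 else hist
```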
Step 102: calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment.
Next, based on these low-level features, higher-level visual and semantic features, referred to as the second feature set, are computed, including: the motion attention feature, the face attention feature based on depth information, and the semantic indicative feature of each video segment. For any given video segment χs (starting at frame i1(s) and ending at frame i2(s)), the higher-level visual and semantic features are computed from the low-level features above. Video segmentation is performed by shot-cut detection.
Each feature of the second feature set is described in detail below.
(1) Motion attention feature
Psychological research on human attention laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is crucial to analyzing and understanding human thought and activity, and can therefore guide the selection of the relatively important content of the original video into the summary. This scheme uses a motion attention model to compute a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block of the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centered on the (m, n)-th block of frame i. The phase range [0, 2π) is evenly divided into 8 bins; a spatial phase histogram is accumulated over the spatial window and a temporal phase histogram over the temporal window. The spatial consistency indicator Cs(i, m, n) and the temporal consistency indicator Ct(i, m, n) are then obtained as:
Cs(i, m, n) = -Σζ ps(ζ) log ps(ζ) (1a)
Ct(i, m, n) = -Σζ pt(ζ) log pt(ζ) (2a)
where ps(ζ) and pt(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from these indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χs, the motion attention feature is obtained by pooling the filtered single-frame features:
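The entropy of a phase histogram, which is the common form of Eqs. (1a) and (2a), can be sketched as below; the helper name and the (dy, dx) vector convention are assumptions.

```python
import numpy as np

def phase_entropy(vectors, bins=8):
    """Entropy of the motion-vector phase distribution inside a window,
    matching the form of Eqs. (1a)/(2a).

    `vectors` is an (n, 2) array of (dy, dx) motion vectors. Low entropy
    means the phases in the window are consistent; high entropy means the
    motion directions are scattered.
    """
    phases = np.arctan2(vectors[:, 0], vectors[:, 1]) % (2 * np.pi)
    hist, _ = np.histogram(phases, bins=bins, range=(0.0, 2 * np.pi))
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(p * np.log(p)).sum())
```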
(2) Face attention feature based on depth information
In a video, the appearance of faces usually signals more important content. This scheme uses a face detection algorithm to obtain the area AF(j) and position of each face (indexed by the letter j) in each video frame. For the j-th detected face, based on the depth image di corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, a depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of a face within the video frame, a position weight wfp(j) is also defined to approximate the relative attention the face receives from viewers (regions closer to the center of the frame have larger weights), as shown in Table 1:
Table 1: face weights assigned to the different regions of a video frame. Central regions have large weights; edge regions have small weights.
The face attention feature of the i-th frame can then be computed as:
where Afrm is the area of the video frame and Dmax(i) = maxx di(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting sequence of face attention features is also smoothed by a 5th-order median filter. The face attention feature of segment χs is computed from the smoothed features {FAC(i) | i = i1(s), …, i2(s)} by the following formula:
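The per-frame combination of face area, position weight, and depth can be sketched as below. The patent's exact formulas are given as images and do not survive in this text, so the combination here is an assumed, plausible instantiation only: each face contributes its relative area times its position weight times a depth term in which nearer faces (smaller mean depth relative to Dmax) score higher.

```python
import numpy as np

def face_attention(face_masks, depth, frame_area, weights):
    """Assumed sketch of the per-frame face attention feature FAC(i).

    `face_masks` is a list of boolean H x W masks (one per detected face,
    the pixel set Λ(j)), `depth` the frame's depth image d_i, `frame_area`
    A_frm, and `weights` the position weights w_fp(j) from Table 1.
    """
    d_max = depth.max()  # D_max(i) = max_x d_i(x)
    score = 0.0
    for mask, w_pos in zip(face_masks, weights):
        area = mask.sum()                     # face area, |Λ(j)| pixels
        mean_depth = depth[mask].mean()       # average depth over the face
        depth_sal = 1.0 - mean_depth / d_max  # nearer face -> more salient
        score += (area / frame_area) * w_pos * depth_sal
    return score
```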
(3) Semantic indicative feature of a video segment
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture, and local keypoint features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of a video segment is shown in Fig. 4:
For segment χs, the color moment feature fcm(im(s)), the wavelet texture feature fwt(im(s)), and the local keypoint feature fkp(im(s)) of its middle frame im(s) are first extracted. SVM prediction then yields the probability values {ucm(s, j), uwt(s, j), ukp(s, j) | j = 1, 2, …, 374}, from which the concept closeness is computed:
Next, the caption information corresponding to the video segment is processed. Based on the set Γst(s) of subtitle words and the set Γcp(j) of concept words, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool built on the external dictionary WordNet:
where η(γ, ω) denotes the similarity value in WordNet::Similarity between subtitle word γ and concept word ω.
To reduce the influence of irrelevant concepts, the following textual relatedness is defined:
where Q is a normalization coefficient ensuring that the corresponding constraint holds. Since what an SVM provides is the probability of a two-class classification problem, the threshold 0.5 is used naturally in the formula above.
Finally, the semantic indicative feature fE(s) of the segment is defined as the weighted sum of ρ(s, j) with u(s, j) as the weights:
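The final combination of concept closeness and textual relatedness can be sketched as below. The closeness formula itself is image-only in this text, so averaging the three per-feature SVM probabilities is an assumption; the weighted sum matches the definition of fE(s) just given.

```python
import numpy as np

def semantic_indicative_feature(u_cm, u_wt, u_kp, relatedness):
    """Sketch of f_E(s) for one segment.

    `u_cm`, `u_wt`, `u_kp` are the 374 per-concept SVM probabilities from
    the three features; `relatedness` holds the textual relatedness values
    ρ(s, j). The concept closeness u(s, j) is assumed to be the mean of the
    three probabilities, and f_E(s) = Σ_j ρ(s, j) * u(s, j).
    """
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0
    return float((np.asarray(relatedness) * u).sum())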
Step 103: perform fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain the video summary.
Finally, the three high-level features are fused using a linear model whose weights are iteratively re-estimated, and a video summary of the length required by the user is generated.
In the embodiments of the present invention, the video summary is ultimately determined by the saliency score of each video segment. The three high-level features are therefore fused with the following linear model, the fusion result being the saliency score of the segment:
fSAL(s) = wM(s)fM(s) + wF(s)fF(s) + wE(s)fE(s) (12a)
where wM(s), wF(s), and wE(s) are the feature weights. Before the linear fusion, each feature is normalized to the interval [0, 1].
The feature weights are calculated by an iterative re-weighting method. In the k-th iteration, the weight w#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α#(s) and a microscopic factor β#(s), i.e., w#(s) = α#(s) · β#(s):
where r#(s) is the rank of feature f#(s) after {f#(s) | s = 1, 2, …, NS} is sorted in descending order, and NS is the total number of video segments in the video. The saliency fSAL(s) of each segment can then be computed and sorted in descending order. According to the length required by the user, segments are selected into the video summary one by one, from high fSAL(s) to low.
Before the first iteration, the feature weights are initialized according to the equal-weight principle. The iterative process terminates after 15 iterations.
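The fusion and selection stage of Eq. (12a) can be sketched as below. The update formulas for the macroscopic and microscopic factors are image-only in this text, so the weights are taken as inputs here rather than re-estimated; the min-max normalization and descending-saliency selection follow the text.

```python
import numpy as np

def select_summary(fM, fF, fE, wM, wF, wE, k):
    """Linear fusion per Eq. (12a) followed by summary selection.

    `fM`, `fF`, `fE` are per-segment feature arrays; `wM`, `wF`, `wE` the
    (here: scalar) fusion weights; `k` the number of segments the user's
    requested summary length allows. Returns the saliency scores and the
    indices of the selected segments in temporal order.
    """
    def norm(f):  # normalize each feature to [0, 1] before fusion
        f = np.asarray(f, dtype=np.float64)
        span = f.max() - f.min()
        return (f - f.min()) / span if span > 0 else np.zeros_like(f)
    sal = wM * norm(fM) + wF * norm(fF) + wE * norm(fE)  # f_SAL(s)
    order = np.argsort(-sal)        # descending saliency
    return sal, sorted(order[:k])   # keep the top-k segments, in time order
```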
In summary, the technical solution of the embodiments of the present invention first extracts low-level features such as color moments, wavelet textures, motion, and local keypoints from the video frames. Based on these low-level features, the higher-level visual and semantic features are then computed, including the motion attention feature, the face attention feature that considers depth information, and the semantic indicative feature of each video segment. Finally, the three high-level features are fused with an iteratively re-weighted linear model to generate a video summary of the length required by the user.
Fig. 2 is a schematic flowchart of the video processing method of the second embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step 201: extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature.
Referring to Fig. 3, the first feature set is first extracted from the video frames. The first feature set is a set of low-level features and comprises four of them: the color moment feature, the wavelet texture feature, the motion feature, and the local keypoint feature. The four low-level features of the first feature set are described in detail below.
(1) Color moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks. For each pixel block, the first-order moment and the second- and third-order central moments are computed for each of the three channels of the Lab color space. The color moments of the 25 pixel blocks of frame i together constitute its color moment feature vector fcm(i).
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks. A three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical, and diagonal directions are computed at each level. All the wavelet coefficient variances of the video frame together constitute its wavelet texture feature vector fwt(i).
(3) Motion feature
The human eye discerns changes in visual content with great sensitivity. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks of 16 × 16 pixels each, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors together constitute the motion feature fmv(i) of the frame.
(4) Local keypoint feature
In semantic-level video analysis, a bag of features (BoF) built on local keypoints can serve as a strong complement to features computed from global information. A soft-weighted local keypoint feature is therefore used to capture salient regions; the feature is defined by the importance of the keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints of the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, described by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into the 500 visual words. The keypoint feature vector fkp(i) is defined as the weighted similarity between the keypoints and their four nearest visual words.
Step 202: calculate the motion attention feature from the motion feature of the first feature set.
Next, based on these low-level features, higher-level visual and semantic features, referred to as the second feature set, are computed, including: the motion attention feature, the face attention feature based on depth information, and the semantic indicative feature of each video segment. For any given video segment χs (starting at frame i1(s) and ending at frame i2(s)), the higher-level visual and semantic features are computed from the low-level features above. Video segmentation is performed by shot-cut detection.
Psychological research on human attention laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is crucial to analyzing and understanding human thought and activity, and can therefore guide the selection of the relatively important content of the original video into the summary. This scheme uses a motion attention model to compute a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block of the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centered on the (m, n)-th block of frame i. The phase range [0, 2π) is evenly divided into 8 bins; a spatial phase histogram is accumulated over the spatial window and a temporal phase histogram over the temporal window. The spatial consistency indicator Cs(i, m, n) and the temporal consistency indicator Ct(i, m, n) are then obtained as:
Cs(i, m, n) = -Σζ ps(ζ) log ps(ζ) (1b)
Ct(i, m, n) = -Σζ pt(ζ) log pt(ζ) (2b)
where ps(ζ) and pt(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from these indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χs, the motion attention feature is obtained by pooling the filtered single-frame features:
Step 203: obtain the area and position of each face in each video frame by a face detection algorithm, and calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
In a video, the appearance of faces usually signals more important content. This scheme uses a face detection algorithm to obtain the area AF(j) and position of each face (indexed by the letter j) in each video frame. For the j-th detected face, based on the depth image di corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, a depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of a face within the video frame, a position weight wfp(j) is also defined to approximate the relative attention the face receives from viewers (regions closer to the center of the frame have larger weights), as shown in Table 1:
Table 1: face weights assigned to the different regions of a video frame. Central regions have large weights; edge regions have small weights.
The face attention feature of the i-th frame can then be computed as:
where Afrm is the area of the video frame and Dmax(i) = maxx di(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting sequence of face attention features is also smoothed by a 5th-order median filter. The face attention feature of segment χs is computed from the smoothed features {FAC(i) | i = i1(s), …, i2(s)} by the following formula:
Step 204: perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature by SVMs, and obtain the concept closeness.
In the embodiments of the present invention, SVMs are trained based on the color moment feature, the wavelet texture feature, and the local keypoint feature. The LibSVM package is used: a radial basis function (RBF) kernel is applied to the color moment and wavelet texture features, and a Chi-square kernel to the local keypoint feature.
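The Chi-square kernel named here for the keypoint histograms can be sketched as below; the RBF kernel for the other two features is built into LibSVM directly. The exponential form and the `gamma` parameter are a common convention (the text does not spell them out), and a kernel matrix computed this way would be passed to LibSVM as a precomputed kernel.

```python
import numpy as np

def chi_square_kernel(X, Y, gamma=1.0):
    """Chi-square kernel between two sets of histograms (assumed form):
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    with zero-sum bins skipped to avoid division by zero."""
    K = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            s = x + y
            nz = s > 0
            d = ((x[nz] - y[nz]) ** 2 / s[nz]).sum()
            K[a, b] = np.exp(-gamma * d)
    return K
```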
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of a video segment based on the 374 semantic concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture, and local keypoint features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of a video segment is shown in Fig. 4:
For segment χs, the color moment feature fcm(im(s)), the wavelet texture feature fwt(im(s)), and the local keypoint feature fkp(im(s)) of its middle frame im(s) are first extracted. SVM prediction then yields the probability values {ucm(s, j), uwt(s, j), ukp(s, j) | j = 1, 2, …, 374}, from which the concept closeness is computed:
In the embodiments of the present invention, text information related to the video content is obtained from the audio signal of the video frames by speech recognition technology; alternatively, text information related to the video content is obtained from the subtitles of the video frames.
Step 205: calculate the textual semantic similarity based on the text information and the concept vocabulary.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γst(s) of subtitle words and the set Γcp(j) of concept words, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool built on the external dictionary WordNet:
where η(γ, ω) denotes the similarity value in WordNet::Similarity between subtitle word γ and concept word ω.
To reduce the influence of irrelevant concepts, the following textual relatedness is defined:
where Q is a normalization coefficient ensuring that the corresponding constraint holds. Since what an SVM provides is the probability of a two-class classification problem, the threshold 0.5 is used naturally in the formula above.
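The aggregation of per-word-pair scores η(γ, ω) into a segment-to-concept similarity can be sketched as below. The exact aggregation formula is image-only in this text, so taking the mean of each subtitle word's best-matching concept word is an assumption; `eta` stands in for a WordNet::Similarity lookup.

```python
def textual_semantic_similarity(subtitle_words, concept_words, eta):
    """Assumed sketch of the textual semantic similarity between a segment's
    subtitle vocabulary Γst(s) and a concept's vocabulary Γcp(j).

    `eta(gamma, omega)` should return the η(γ, ω) score, e.g. from a
    WordNet::Similarity lookup table. For each subtitle word we keep its
    best match among the concept words, then average over subtitle words.
    """
    if not subtitle_words or not concept_words:
        return 0.0
    best = [max(eta(g, w) for w in concept_words) for g in subtitle_words]
    return sum(best) / len(best)
```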
Step 206:It is spent closely based on the word semantic similarity and the concept, it is special that the semantic instruction is calculated
Sign.
With reference to Fig. 4, in order to excavate semantic information, the three of 374 concepts and each concept of this programme based on VIREO-374
Kind SVM (Support Vector Machine, abbreviation SVM) extracts the semantic indicative character of video-frequency band.Support vector
Machine is trained based on previously described colour moment, wavelet texture and local key point feature, it is estimated that one in prediction
The probability value of degree in close relations between a given video frame and concept.Calculate the flow of the semantic indicative character of video-frequency band
As shown in Figure 4:
For video-frequency band χs, its intermediate frame i is extracted firstm(s) colour moment feature fcm(im(s)), Wavelet Texture
fwt(imAnd local key point feature f (s))kp(im(s)), then by the prediction of SVM probability value { u is obtainedcm(s, j),
uwt(s, j), ukp(s, j) | j=1,2 ..., 374 }, and then calculate concept and spend closely:
Next, handling the corresponding caption information of video-frequency band.The set Γ constituted based on subtitle vocabularyst(s) with
The set Γ of concept vocabularycp(j), pass through the similarity measurement tool WordNet of external dictionary WordNet::Similarity,
Word semantic similarity is calculated:
where η(γ, ω) denotes the WordNet::Similarity score between subtitle word γ and concept word ω.
To reduce the influence of unrelated concepts, the following word relatedness is defined:
where Q is a normalization coefficient ensuring that the constraint holds. Since the SVM outputs probabilities for a two-class classification problem, 0.5 is the natural threshold in the formula above.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the weighted sum of ρ(s, j) with u(s, j) as weights:
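Under the weighted-sum definition above, f_E(s) reduces to a dot product of the relatedness and closeness vectors over the 374 concepts:

```python
def semantic_indicative_feature(rho, u):
    """f_E(s) = sum over j of rho(s, j) * u(s, j): the word relatedness
    and the concept closeness multiplied term by term and summed."""
    return sum(r * c for r, c in zip(rho, u))
```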
Step 207: The features in the second feature set are linearly superposed according to the feature weight values to obtain the saliency score of each video segment.
Finally, the three high-level features are fused using an iteratively re-weighted linear model, and a video summary of the length required by the user is generated.
In this embodiment of the present invention, the video summary is ultimately determined by the saliency score of each video segment, so the following linear model is adopted to fuse the three high-level features; the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)    (12b)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Before linear fusion, each feature is separately normalized to the interval [0, 1].
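As an illustrative sketch (not the patent's reference implementation), the normalization and the linear superposition of equation (12b) can be written as:

```python
def normalize(values):
    """Min-max normalize one feature over all segments to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def fuse_saliency(f_m, f_f, f_e, w_m, w_f, w_e):
    """Equation (12b): f_SAL(s) = w_M(s) f_M(s) + w_F(s) f_F(s) + w_E(s) f_E(s),
    with each feature normalized to [0, 1] beforehand."""
    f_m, f_f, f_e = normalize(f_m), normalize(f_f), normalize(f_e)
    return [w_m[s] * f_m[s] + w_f[s] * f_f[s] + w_e[s] * f_e[s]
            for s in range(len(f_m))]

# Two segments, equal weights of 1/3 for every feature:
w = [1.0 / 3.0, 1.0 / 3.0]
sal = fuse_saliency([0.2, 0.8], [0.5, 0.1], [0.3, 0.9], w, w, w)
```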
The feature weights are calculated by the following iterative re-weighting method. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) after {f_#(s) | s = 1, 2, ..., N_S} is sorted in descending order, and N_S is the total number of video segments in the video. Next, the saliency f_SAL(s) of each video segment can be calculated and sorted in descending order. According to the length required by the user, video segments can be selected into the video summary one by one, from the highest f_SAL(s) to the lowest.
Before the first iteration, the feature weights are initialized to equal values. The iterative process terminates after 15 iterations.
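The loop structure can be sketched as follows. The patent's formulas for the macroscopic factor α_#(s) and microscopic factor β_#(s) are not reproduced in this text, so a hypothetical rank-based factor stands in for their product; the equal-weight initialization, the 15 iterations, and the descending-saliency selection follow the description above:

```python
def iterative_reweighted_fusion(features, n_iter=15):
    """Iteratively re-weighted linear fusion (sketch).

    `features` maps 'M', 'F', 'E' to per-segment scores already
    normalized to [0, 1]. The patent sets w_#(s) = alpha_#(s) * beta_#(s),
    but the factor formulas are omitted from this text; here a rank-based
    stand-in (1 / rank of f_#(s) in descending order) replaces the
    product. Weights start equal and the loop runs 15 times.
    """
    n = len(next(iter(features.values())))
    weights = {k: [1.0 / len(features)] * n for k in features}  # equal init
    for _ in range(n_iter):
        new_w = {}
        for k, vals in features.items():
            order = sorted(range(n), key=lambda s: vals[s], reverse=True)
            rank = {s: i + 1 for i, s in enumerate(order)}
            new_w[k] = [1.0 / rank[s] for s in range(n)]
        for s in range(n):  # keep the three weights summing to 1 per segment
            tot = sum(new_w[k][s] for k in features)
            for k in features:
                new_w[k][s] /= tot
        weights = new_w
    return [sum(weights[k][s] * features[k][s] for k in features)
            for s in range(n)]

def select_summary(saliency, n_segments):
    """Pick segments into the summary in descending saliency order."""
    order = sorted(range(len(saliency)), key=lambda s: saliency[s],
                   reverse=True)
    return order[:n_segments]
```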
The technical solution of this embodiment of the present invention first extracts low-level features such as color moment, wavelet texture, motion, and local keypoint features from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including the motion attention feature, the face attention feature considering depth information, and the semantic indicative feature of the video segment. Then, the three high-level features are fused using an iteratively re-weighted linear model to generate a video summary of the length required by the user.
Fig. 5 is a schematic diagram of the structure of the electronic device of Embodiment 1 of the present invention. As shown in Fig. 5, the electronic device includes:
an extraction unit 51, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit 52, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
a second processing unit 53, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 5 can be understood with reference to the foregoing description of the video processing method. The functions of the units shown in Fig. 5 may be realized by a program running on a processor, or by specific logic circuits.
Fig. 6 is a schematic diagram of the structure of the electronic device of Embodiment 2 of the present invention. As shown in Fig. 6, the electronic device includes:
an extraction unit 61, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit 62, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
a second processing unit 63, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
The first processing unit 62 includes:
a motion attention feature subunit 621, configured to calculate the motion attention feature from the motion feature in the first feature set;
a face attention feature subunit 622, configured to obtain the area and position of the face in each video frame by a face detection algorithm, and to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
The electronic device further includes:
a training unit 64, configured to train a support vector machine based on the color moment feature, the wavelet texture feature, and the local keypoint feature.
The electronic device further includes:
a text acquisition unit 65, configured to obtain text information related to the video content from the audio signal of the video frame using speech recognition technology; or, to obtain text information related to the video content from the subtitles of the video frame.
The first processing unit 62 includes:
a semantic indicative feature subunit 623, configured to perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness; to calculate the word semantic similarity based on the text information and the concept vocabulary information; and to calculate the semantic indicative feature based on the word semantic similarity and the concept closeness.
The second processing unit 63 includes:
a linear superposition subunit 631, configured to linearly superpose the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
a video summary subunit 632, configured to, according to a preset summary length, select video segments one by one as the video summary in descending order of their saliency scores.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 6 can be understood with reference to the foregoing description of the video processing method. The functions of the units shown in Fig. 6 may be realized by a program running on a processor, or by specific logic circuits.
The technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided there is no conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed method and smart device may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; other divisions are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connections between the components shown or discussed may be indirect coupling or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as actually needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can easily conceive of changes or replacements within the technical scope disclosed by the present invention, and these shall all be covered by the protection scope of the present invention.
Claims (10)
1. A video processing method, the method comprising:
extracting a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
obtaining a depth image corresponding to the video frame, text information related to the video content, and the area and position of a face in the video frame;
based on the motion feature, calculating a motion attention feature of a second feature set;
based on the area and position of the face and the depth image, calculating a face attention feature based on depth information in the second feature set;
obtaining concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
obtaining word semantic similarity based on the text information and concept vocabulary information;
based on the concept closeness and the word semantic similarity, calculating a semantic indicative feature of a video segment in the second feature set;
fusing the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
2. The video processing method according to claim 1, wherein obtaining the area and position of the face in the video frame comprises:
obtaining the area and position of the face in each video frame by a face detection algorithm;
correspondingly, calculating the face attention feature based on depth information in the second feature set based on the area and position of the face and the depth image comprises:
calculating the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
3. The video processing method according to claim 1, the method further comprising:
obtaining text information related to the video content from the audio signal of the video frame using speech recognition technology; or,
obtaining text information related to the video content from the subtitles of the video frame.
4. The video processing method according to claim 3, wherein obtaining the concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature comprises:
training an SVM based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
performing semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness.
5. The video processing method according to claim 1, wherein fusing the features in the second feature set using the iteratively re-weighted linear model to obtain the video summary comprises:
linearly superposing the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
according to a preset summary length, selecting video segments one by one into the video summary in descending order of their saliency scores.
6. An electronic device, the electronic device comprising:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
the extraction unit being further configured to obtain a depth image corresponding to the video frame and the area and position of a face in the video frame;
a text acquisition unit, configured to obtain text information related to the video content in the video frame;
a motion attention feature subunit, configured to calculate a motion attention feature of a second feature set based on the motion feature;
a face attention feature subunit, configured to calculate a face attention feature based on depth information in the second feature set, based on the area and position of the face and the depth image;
a semantic indicative feature subunit, configured to obtain concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature; to calculate word semantic similarity based on the text information and concept vocabulary information; and to calculate a semantic indicative feature of a video segment in the second feature set based on the concept closeness and the word semantic similarity;
a second processing unit, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
7. The electronic device according to claim 6, wherein:
the extraction unit is further configured to obtain the area and position of the face in each video frame by a face detection algorithm;
the face attention feature subunit is further configured to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
8. The electronic device according to claim 6, wherein the text acquisition unit is further configured to:
obtain text information related to the video content from the audio signal of the video frame using speech recognition technology; or,
obtain text information related to the video content from the subtitles of the video frame.
9. The electronic device according to claim 8, the electronic device further comprising:
a training unit, configured to train a support vector machine based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
wherein the semantic indicative feature subunit is further configured to perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness.
10. The electronic device according to claim 9, wherein the second processing unit comprises:
a linear superposition subunit, configured to linearly superpose the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
a video summary subunit, configured to, according to a preset summary length, select video segments one by one as the video summary in descending order of their saliency scores.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Publications (2)
Publication Number | Publication Date |
---|---|
CN105228033A CN105228033A (en) | 2016-01-06 |
CN105228033B true CN105228033B (en) | 2018-11-09 |
Family
ID=54996666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510535580.9A Active CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105228033B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9936239B2 (en) * | 2016-06-28 | 2018-04-03 | Intel Corporation | Multiple stream tuning |
CN106355171A (en) * | 2016-11-24 | 2017-01-25 | 深圳凯达通光电科技有限公司 | Video monitoring internetworking system |
CN106934397B (en) | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN107222795B (en) * | 2017-06-23 | 2020-07-31 | 南京理工大学 | Multi-feature fusion video abstract generation method |
CN107979764B (en) * | 2017-12-06 | 2020-03-31 | 中国石油大学(华东) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
CN109413510B (en) * | 2018-10-19 | 2021-05-18 | 深圳市商汤科技有限公司 | Video abstract generation method and device, electronic equipment and computer storage medium |
CN111327945B (en) | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
CN109932617B (en) * | 2019-04-11 | 2021-02-26 | 东南大学 | Self-adaptive power grid fault diagnosis method based on deep learning |
CN110347870A (en) * | 2019-06-19 | 2019-10-18 | 西安理工大学 | The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method |
CN110225368B (en) * | 2019-06-27 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video positioning method and device and electronic equipment |
CN111984820B (en) * | 2019-12-19 | 2023-10-27 | 重庆大学 | Video abstraction method based on double self-attention capsule network |
CN113158720B (en) * | 2020-12-15 | 2024-06-18 | 嘉兴学院 | Video abstraction method and device based on dual-mode feature and attention mechanism |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7409407B2 (en) * | 2004-05-07 | 2008-08-05 | Mitsubishi Electric Research Laboratories, Inc. | Multimedia event detection and summarization |
US8467610B2 (en) * | 2010-10-20 | 2013-06-18 | Eastman Kodak Company | Video summarization using sparse basis function combination |
-
2015
- 2015-08-27 CN CN201510535580.9A patent/CN105228033B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
Non-Patent Citations (2)
Title |
---|
Hierarchical 3D kernel descriptors for action recognition using depth sequences; Yu Kong et al.; 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition; 2015-05-08; full text * |
Multi-scale information maximization based visual attention modeling for video summarization; Naveed Ejaz et al.; 2012 6th International Conference on Next Generation Mobile Applications, Services and Technologies; 2012-09-14; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN105228033A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105228033B (en) | Video processing method and electronic device | |
Gao et al. | Human action monitoring for healthcare based on deep learning | |
Gao et al. | Discriminative multiple canonical correlation analysis for information fusion | |
US9176987B1 (en) | Automatic face annotation method and system | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
Yang et al. | Grounded semantic role labeling | |
US20110243452A1 (en) | Electronic apparatus, image processing method, and program | |
Eroglu Erdem et al. | BAUM-2: A multilingual audio-visual affective face database | |
Chanti et al. | Improving bag-of-visual-words towards effective facial expressive image classification | |
Haq et al. | Video summarization techniques: a review | |
Paleari et al. | Towards multimodal emotion recognition: a new approach | |
Abebe et al. | A long short-term memory convolutional neural network for first-person vision activity recognition | |
Abebe et al. | Inertial-vision: cross-domain knowledge transfer for wearable sensors | |
Lv et al. | Storyrolenet: Social network construction of role relationship in video | |
Prabhu et al. | Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism. | |
Rapantzikos et al. | Spatiotemporal features for action recognition and salient event detection | |
Nguyen et al. | Type-to-track: Retrieve any object via prompt-based tracking | |
Shin et al. | Dynamic Korean sign language recognition using pose estimation based and attention-based neural network | |
Shao et al. | TAMNet: two attention modules-based network on facial expression recognition under uncertainty | |
Afdhal et al. | Emotion recognition using the shapes of the wrinkles | |
Sun et al. | Camera-assisted video saliency prediction and its applications | |
CN114510942A (en) | Method for acquiring entity words, and method, device and equipment for training model | |
Ayache et al. | CLIPS-LSR Experiments at TRECVID 2006. | |
CN113821669A (en) | Searching method, searching device, electronic equipment and storage medium | |
Li et al. | Multi-feature hierarchical topic models for human behavior recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |