CN105228033B - A video processing method and electronic device - Google Patents

A video processing method and electronic device

Info

Publication number
CN105228033B
Authority
CN
China
Prior art keywords
feature
video
face
video frame
calculated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510535580.9A
Other languages
Chinese (zh)
Other versions
CN105228033A (en)
Inventor
董培
靳玉茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201510535580.9A priority Critical patent/CN105228033B/en
Publication of CN105228033A publication Critical patent/CN105228033A/en
Application granted granted Critical
Publication of CN105228033B publication Critical patent/CN105228033B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from video frames, the first feature set including a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature; calculating a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment; and fusing the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.

Description

A video processing method and electronic device
Technical field
The present invention relates to video processing technology, and in particular to a video processing method and an electronic device.
Background technology
Intelligent terminals such as smart phones have become constant companions in people's work and daily life, and users easily accumulate a large number of videos by downloading them or shooting them themselves. For phones equipped with a binocular (dual) camera in particular, the amount of data that needs to be stored is even larger. Given the relatively limited storage capacity of mobile phones, managing video files has become a problem that urgently needs to be solved.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided in an embodiment of the present invention includes:
extracting a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
calculating a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment;
fusing the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
The electronic device provided in an embodiment of the present invention includes:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
a first processing unit, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment;
a second processing unit, configured to fuse the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
In the technical solution of the embodiments of the present invention, a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature are extracted from video frames; then, based on these extracted features, a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment are calculated; and the motion attention feature, the face attention feature based on depth information and the semantic indicative feature are fused to obtain a video summary. In this way, semantically condensed and important video segments are extracted from the original video, which effectively reduces the amount of data that needs to be stored on the electronic device, improves memory utilisation and user experience, and also helps the user later locate the video they most want to find among a smaller number of video files. Moreover, the technical solution combines information from the visual modality and the textual modality and can therefore capture the high-level semantics of the video content more effectively. The face attention feature incorporates the depth information of objects in the scene, which helps grasp the high-level semantics from a more complete perspective. The technical solution does not depend on heuristic rules devised for a specific video type and is therefore applicable to a broad range of video genres.
Description of the drawings
Fig. 1 is a flow diagram of the video processing method according to Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the video processing method according to Embodiment 2 of the present invention;
Fig. 3 is an overall flow chart of the video summary extraction according to an embodiment of the present invention;
Fig. 4 is a flow chart of calculating the semantic indicative feature of a video segment according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the electronic device according to Embodiment 1 of the present invention;
Fig. 6 is a schematic structural diagram of the electronic device according to Embodiment 2 of the present invention.
Detailed description of the embodiments
To allow a more thorough understanding of the features and technical content of the embodiments of the present invention, the implementation of the embodiments is described in detail below with reference to the accompanying drawings. The drawings are provided for reference and illustration only and are not intended to limit the embodiments of the present invention.
In an era of information explosion, traditional ways of browsing and managing video data face unprecedented challenges. It is therefore of significant practical value to provide video users with a video summary that is brief and concentrates the key information of the original video. Video summaries are commonly divided into two types, dynamic and static: a dynamic video summary is a shortened version of the original video, which may contain a series of video segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Traditional video summaries are generated by extracting visual features or textual features from the video. However, most methods in this direction rely on heuristic rules or simple text analysis (for example, word-frequency statistics). In addition, traditional attention-model methods that use face features only consider information such as the planar position and size of the detected face in the scene, and make no use of depth information.
The technical solution of the embodiments of the present invention estimates the relative importance of video segments based on a user attention model, the semantic information of the video and the depth information of the video frames, using iterative re-weighting, so as to generate a dynamic video summary.
Fig. 1 is a flow diagram of the video processing method according to Embodiment 1 of the present invention. As shown in Fig. 1, the video processing method includes the following steps:
Step 101: extracting a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature.
Referring to Fig. 3, first, a first feature set is extracted from the video frames. The first feature set is a set of low-level features and includes four low-level features: the colour moment feature, the wavelet texture feature, the motion feature and the local key point feature.
The four low-level features in the first feature set are described in detail below.
(1) Colour moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first moment and the second and third central moments are calculated separately for the three channels of the Lab colour space. The colour moments of the 25 pixel blocks of the frame constitute the colour moment feature vector f_cm(i) of that frame.
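By way of illustration only, the block-wise moment computation described above may be sketched as follows (a minimal Python sketch; the 5 × 5 block layout, the Lab colour space and the three moments follow the text, while the function name, the use of OpenCV for the colour conversion and the absence of any further normalisation of the moments are assumptions):

```python
import numpy as np
import cv2  # assumed available for the RGB -> Lab conversion

def color_moment_feature(frame_bgr):
    """Sketch of the colour-moment descriptor: 5x5 non-overlapping blocks,
    first moment plus 2nd/3rd central moments per Lab channel
    (25 blocks * 3 channels * 3 moments = 225 dimensions)."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    h, w, _ = lab.shape
    feats = []
    for by in range(5):
        for bx in range(5):
            block = lab[by * h // 5:(by + 1) * h // 5,
                        bx * w // 5:(bx + 1) * w // 5]
            for c in range(3):                      # L, a, b channels
                px = block[..., c].ravel()
                mean = px.mean()                    # first moment
                var = ((px - mean) ** 2).mean()     # second central moment
                skew = ((px - mean) ** 3).mean()    # third central moment
                feats.extend([mean, var, skew])
    return np.asarray(feats)                        # f_cm(i)
```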
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, a three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical and diagonal directions are then calculated for each level. All wavelet coefficient variances of the video frame constitute the wavelet texture feature vector f_wt(i) of this frame.
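Illustratively, the per-block Haar decomposition may be sketched as follows (a minimal Python sketch assuming the PyWavelets package; the 3 × 3 block layout, the three decomposition levels and the per-orientation variances follow the text, while the function name is hypothetical):

```python
import numpy as np
import pywt  # PyWavelets, assumed available for the Haar decomposition

def wavelet_texture_feature(luma):
    """Sketch of the wavelet texture descriptor: 3x3 non-overlapping blocks,
    3-level Haar decomposition of the luminance, variance of the H/V/D detail
    coefficients at every level (9 blocks * 3 levels * 3 orientations = 81 dims)."""
    h, w = luma.shape
    feats = []
    for by in range(3):
        for bx in range(3):
            block = luma[by * h // 3:(by + 1) * h // 3,
                         bx * w // 3:(bx + 1) * w // 3].astype(np.float64)
            coeffs = pywt.wavedec2(block, 'haar', level=3)
            # coeffs[1:] are the (cH, cV, cD) detail tuples, coarsest level first
            for cH, cV, cD in coeffs[1:]:
                feats.extend([cH.var(), cV.var(), cD.var()])
    return np.asarray(feats)                        # f_wt(i)
```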
(3) Motion feature
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is calculated for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of this video frame.
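Illustratively, such a per-block motion field may be obtained as in the following Python sketch; the 16 × 16 block size follows the text, while the choice of Farneback dense optical flow (averaged per block) as the motion estimation algorithm is only an assumption, since the text does not name a specific algorithm:

```python
import numpy as np
import cv2  # assumed; dense optical flow stands in for the unspecified block-matching step

def motion_feature(prev_gray, curr_gray, block=16):
    """Sketch: one motion vector per 16x16 block, obtained by averaging
    a dense optical-flow field over each block."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    M, N = h // block, w // block
    vectors = np.zeros((M, N, 2))
    for m in range(M):
        for n in range(N):
            patch = flow[m * block:(m + 1) * block, n * block:(n + 1) * block]
            vectors[m, n] = patch.reshape(-1, 2).mean(axis=0)  # v(i, m, n)
    return vectors                                  # f_mv(i), an M x N motion field
```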
(4) Local key point feature
In semantic-level video analysis, the bag-of-features model (BoF) based on local key points can serve as a strong complement to features computed from global information. Therefore, a soft-weighted local key point feature is used to capture salient regions; this feature is defined over a vocabulary of 500 visual words built from key points. Specifically, the key points in the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The key point feature vector f_kp(i) is defined as the weighted similarity between the key points and the visual words under four nearest neighbours.
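Illustratively, the soft-weighted assignment of key points to visual words may be sketched as follows (a minimal Python sketch; the 500-word vocabulary and the four nearest neighbours follow the text, while the distance-based weighting and the normalisation are assumptions, since the exact soft-weighting formula is not reproduced here):

```python
import numpy as np

def keypoint_bof_feature(descriptors, vocabulary, k=4):
    """Sketch of the soft-weighted bag-of-features histogram: each SIFT
    descriptor votes for its k nearest visual words with a similarity weight.
    `descriptors` is (num_keypoints, 128) from a DoG+SIFT detector and
    `vocabulary` is the (500, 128) visual-word codebook (both assumed given)."""
    hist = np.zeros(len(vocabulary))
    for d in descriptors:
        dists = np.linalg.norm(vocabulary - d, axis=1)
        nearest = np.argsort(dists)[:k]                 # k = 4 nearest visual words
        weights = 1.0 / (1.0 + dists[nearest])          # assumed similarity weighting
        hist[nearest] += weights / weights.sum()
    return hist / max(len(descriptors), 1)              # f_kp(i)
```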
Step 102: calculating a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment.
Next, based on these low-level features, high-level visual and semantic features, referred to as the second feature set, are further calculated, including: the motion attention feature, the face attention feature based on depth information and the semantic indicative feature of the video segment.
More specifically, the high-level visual and semantic features are calculated from the above low-level features for each given video segment χ_s (starting at the i_1(s)-th frame and ending at the i_2(s)-th frame). The video segmentation is obtained by shot cut detection.
Each feature in the second feature set is described in detail below.
(1) Motion attention feature
The research on human attention in the field of psychology is an indispensable basis for attention modelling in computer vision. The cognitive mechanism of attention is crucial for analysing and understanding human thought and activity, and can therefore play a guiding role when selecting the relatively important content of the original video to compose the video summary. This scheme uses a motion attention model to calculate a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centred on the (m, n)-th pixel block of the i-th frame. The phase range [0, 2π) is divided evenly into 8 bins; a spatial phase histogram p_s(i, m, n, ζ) is accumulated over the spatial window and a temporal phase histogram p_t(i, m, n, ζ) over the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are then obtained from the following equations:
C_s(i, m, n) = −Σ_ζ p_s(ζ) log p_s(ζ)   (1a)
C_t(i, m, n) = −Σ_ζ p_t(ζ) log p_t(ζ)   (2a)
where p_s(ζ) and p_t(ζ) are the phase distributions within the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from the motion-vector magnitudes and these two consistency indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χ_s, the motion attention feature f_M(s) is obtained from the filtered single-frame features.
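Purely as an illustration, the consistency terms of Eqs. (1a) and (2a) and their aggregation over blocks may be sketched as follows in Python; the 8 phase bins and the 5 × 5 spatial window follow the text, while the way the magnitude and consistency terms are combined into the frame-level score is an assumption, since the frame-level and segment-level formulas are not reproduced here:

```python
import numpy as np

def phase_entropy(vectors):
    """Entropy of the motion-vector phase histogram over one window,
    matching C = -sum_z p(z) log p(z) in Eqs. (1a)/(2a). `vectors` is an
    array of (dx, dy) motion vectors from the spatial or temporal window."""
    phases = np.arctan2(vectors[:, 1], vectors[:, 0]) % (2 * np.pi)
    hist, _ = np.histogram(phases, bins=8, range=(0, 2 * np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def frame_motion_attention(motion_field):
    """Sketch of a per-frame motion attention score. A simple
    magnitude-weighted spatial-consistency sum is used here purely as an
    illustration of how the entropy terms could enter the score."""
    M, N, _ = motion_field.shape
    score = 0.0
    for m in range(M):
        for n in range(N):
            y0, y1 = max(m - 2, 0), min(m + 3, M)   # 5x5 spatial window
            x0, x1 = max(n - 2, 0), min(n + 3, N)
            window = motion_field[y0:y1, x0:x1].reshape(-1, 2)
            c_s = phase_entropy(window)
            mag = np.linalg.norm(motion_field[m, n])
            score += mag * c_s
    return score / (M * N)

# The per-frame scores would then be smoothed with a 9th-order median filter
# and aggregated over the frames of each segment to give f_M(s).
```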
(2) Face attention feature based on depth information
In a video, the appearance of faces usually indicates more important content. This scheme obtains, by means of a face detection algorithm, the area A_F(j) and the position of each face (indexed by the letter j) in each video frame. For the j-th detected face, a depth saliency D(j) is defined based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face,
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of the face in the video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention that this face receives from viewers (regions closer to the centre of the video frame have larger weights), as shown in Table 1:
Table 1. The face weights assigned to different regions of a video frame: the weight of the central region is large and the weight of the edge regions is small.
The face attention feature of the i-th frame may then be calculated,
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of inaccurate face detection on the overall scheme, the resulting face attention feature sequence is also smoothed by a 5th-order median filter. The face attention feature f_F(s) of the video segment χ_s is calculated from the smoothed features {FAC(i) | i = i_1(s), ..., i_2(s)}.
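Purely as an illustration, the per-frame face attention may be sketched as follows in Python; the area ratio A_F(j)/A_frm, the position weight w_fp(j) and the use of the depth map follow the text, while the exact forms of the depth saliency D(j) and of the final combination are assumptions, since those formulas are not reproduced here:

```python
import numpy as np

def face_attention_frame(faces, depth, frame_area):
    """Sketch of a per-frame face attention score. `faces` is a list of
    (bbox_area, pixel_mask, position_weight) triples from a face detector;
    `depth` is the depth map d_i aligned with the frame. A plausible
    'nearer faces are more salient' weighting stands in for D(j)."""
    d_max = float(depth.max()) or 1.0                   # D_max(i) = max_x d_i(x)
    score = 0.0
    for area, mask, w_pos in faces:
        mean_depth = float(depth[mask].mean())          # average depth over Λ(j)
        depth_saliency = 1.0 - mean_depth / d_max       # assumed form of D(j)
        score += w_pos * (area / frame_area) * depth_saliency
    return score

# The per-frame scores FAC(i) would then be smoothed with a 5th-order median
# filter and aggregated over the frames of each segment to give f_F(s).
```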
(3) Semantic indicative feature of the video segment
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of the video segment based on the 374 concepts of VIREO-374 and three SVMs (Support Vector Machines) per concept. The support vector machines are trained on the colour moment, wavelet texture and local key point features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of the video segment is shown in Fig. 4:
For the video segment χ_s, the colour moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local key point feature f_kp(i_m(s)) of its middle frame i_m(s) are extracted first, the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are then obtained by SVM prediction, and the concept closeness u(s, j) is calculated from them.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) of subtitle words and the set Γ_cp(j) of concept words, the textual semantic similarity is calculated through WordNet::Similarity, the similarity measurement tool of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω in WordNet::Similarity.
To reduce the influence of unrelated concepts, the following textual relatedness is defined:
where Q is a normalisation coefficient. Since the SVMs output probabilities of a two-class classification problem, the threshold 0.5 is naturally used in the above formula.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the weighted sum of u(s, j) with ρ(s, j) as the weights, i.e. the sum of ρ(s, j)·u(s, j) over all concepts.
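Purely as an illustration, the combination of the SVM outputs with the textual relatedness may be sketched as follows in Python; the 0.5 threshold, the normalisation by Q and the final weighted sum follow the text, while averaging the three SVM probabilities into the concept closeness u(s, j) and the exact relatedness expression are assumptions:

```python
import numpy as np

def semantic_indicative_feature(u_cm, u_wt, u_kp, sim, threshold=0.5):
    """Sketch of f_E(s) for one segment over 374 concepts. u_cm/u_wt/u_kp are
    the three per-concept SVM probability vectors for the middle frame; `sim`
    holds the WordNet-based similarity between the segment's subtitle words
    and each concept."""
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0
    sim = np.asarray(sim)
    # textual relatedness: keep only concepts the SVMs consider likely (> 0.5)
    rho = np.where(u > threshold, sim, 0.0)
    if rho.sum() > 0:
        rho = rho / rho.sum()          # normalisation coefficient Q
    return float((rho * u).sum())      # weighted sum of u(s, j) with weights rho(s, j)
```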
Step 103: fusing the features of the second feature set with a linear model using iterative re-weighting, to obtain the video summary.
Finally, the three high-level features are fused with a linear model using iterative re-weighting, and a video summary of the length required by the user is generated.
In the embodiments of the present invention, the video summary is ultimately determined by the saliency score of each video segment; the three high-level features are therefore fused with the following linear model, and the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s) f_M(s) + w_F(s) f_F(s) + w_E(s) f_E(s)   (12a)
where w_M(s), w_F(s) and w_E(s) are the feature weights. Before the linear fusion, each feature is normalised to the interval [0, 1].
The feature weights are calculated by an iterative re-weighting method. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macro factor α_#(s) and a micro factor β_#(s), i.e. w_#(s) = α_#(s) · β_#(s),
where r_#(s) is the rank of the feature f_#(s) when {f_#(s) | s = 1, 2, ..., N_S} is sorted in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of the video segments can then be calculated and sorted in descending order. According to the length required by the user, video segments are selected into the video summary one by one from high to low f_SAL(s).
Before the first iteration, the feature weights are initialised according to the equal-weight principle. The iterative process ends after 15 iterations.
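Purely as an illustration, the overall fusion loop may be sketched as follows in Python; the [0, 1] normalisation, the equal-weight initialisation, the 15 iterations and the linear combination of Eq. (12a) follow the text, while the particular forms of the macro factor α and the micro factor β are assumptions, since those formulas are not reproduced here:

```python
import numpy as np

def normalise(x):
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

def fuse_segments(f_M, f_F, f_E, n_iter=15):
    """Sketch of the iterative re-weighting fusion over all segments. A
    rank-based proxy stands in for the macro/micro factors, purely to
    illustrate the loop structure."""
    feats = {'M': normalise(f_M), 'F': normalise(f_F), 'E': normalise(f_E)}
    n = len(feats['M'])
    w = {k: np.full(n, 1.0 / 3.0) for k in feats}         # equal-weight initialisation
    for _ in range(n_iter):
        sal = sum(w[k] * feats[k] for k in feats)         # f_SAL(s), Eq. (12a)
        for k in feats:
            rank = np.argsort(np.argsort(-feats[k])) + 1  # r_#(s): 1 = largest value
            alpha = 1.0 / rank                            # assumed macro factor
            beta = feats[k] / (sal + 1e-9)                # assumed micro factor
            w[k] = alpha * beta
        total = sum(w[k] for k in feats) + 1e-9
        for k in feats:
            w[k] = w[k] / total                           # keep the weights comparable
    return sum(w[k] * feats[k] for k in feats)            # final saliency per segment

# Segments are then added to the summary in descending saliency order
# until the summary length requested by the user is reached.
```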
In summary, the technical solution of the embodiments of the present invention first extracts low-level features such as colour moments, wavelet texture, motion and local key points from the video frames. Next, based on these low-level features, high-level visual and semantic features are calculated, including the motion attention feature, the face attention feature that takes depth information into account and the semantic indicative feature of the video segment. Then the three high-level features are fused with a linear model using iterative re-weighting, and a video summary of the length required by the user is generated.
Fig. 2 is a flow diagram of the video processing method according to Embodiment 2 of the present invention. As shown in Fig. 2, the video processing method includes the following steps:
Step 201: extracting a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature.
Referring to Fig. 3, first, a first feature set is extracted from the video frames. The first feature set is a set of low-level features and includes four low-level features: the colour moment feature, the wavelet texture feature, the motion feature and the local key point feature.
The four low-level features in the first feature set are described in detail below.
(1) Colour moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks, and for each pixel block the first moment and the second and third central moments are calculated separately for the three channels of the Lab colour space. The colour moments of the 25 pixel blocks of the frame constitute the colour moment feature vector f_cm(i) of that frame.
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks, a three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical and diagonal directions are then calculated for each level. All wavelet coefficient variances of the video frame constitute the wavelet texture feature vector f_wt(i) of this frame.
(3) Motion feature
The human eye is highly sensitive to changes in visual content. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks, each block containing 16 × 16 pixels, and a motion vector v(i, m, n) is calculated for each block by a motion estimation algorithm. The M × N motion vectors constitute the motion feature f_mv(i) of this video frame.
(4) Local key point feature
In semantic-level video analysis, the bag-of-features model (BoF) based on local key points can serve as a strong complement to features computed from global information. Therefore, a soft-weighted local key point feature is used to capture salient regions; this feature is defined over a vocabulary of 500 visual words built from key points. Specifically, the key points in the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, represented by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into 500 visual words. The key point feature vector f_kp(i) is defined as the weighted similarity between the key points and the visual words under four nearest neighbours.
Step 202: calculating the motion attention feature according to the motion feature in the first feature set.
Next, based on these low-level features, high-level visual and semantic features, referred to as the second feature set, are further calculated, including: the motion attention feature, the face attention feature based on depth information and the semantic indicative feature of the video segment.
More specifically, the high-level visual and semantic features are calculated from the above low-level features for each given video segment χ_s (starting at the i_1(s)-th frame and ending at the i_2(s)-th frame). The video segmentation is obtained by shot cut detection.
The research on human attention in the field of psychology is an indispensable basis for attention modelling in computer vision. The cognitive mechanism of attention is crucial for analysing and understanding human thought and activity, and can therefore play a guiding role when selecting the relatively important content of the original video to compose the video summary. This scheme uses a motion attention model to calculate a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block in the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centred on the (m, n)-th pixel block of the i-th frame. The phase range [0, 2π) is divided evenly into 8 bins; a spatial phase histogram p_s(i, m, n, ζ) is accumulated over the spatial window and a temporal phase histogram p_t(i, m, n, ζ) over the temporal window. The spatial consistency indicator C_s(i, m, n) and the temporal consistency indicator C_t(i, m, n) are then obtained from the following equations:
C_s(i, m, n) = −Σ_ζ p_s(ζ) log p_s(ζ)   (1b)
C_t(i, m, n) = −Σ_ζ p_t(ζ) log p_t(ζ)   (2b)
where p_s(ζ) and p_t(ζ) are the phase distributions within the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from the motion-vector magnitudes and these two consistency indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χ_s, the motion attention feature f_M(s) is obtained from the filtered single-frame features.
Step 203: obtaining the area and the position of the face in each video frame by a face detection algorithm, and calculating the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
In a video, the appearance of faces usually indicates more important content. This scheme obtains, by means of a face detection algorithm, the area A_F(j) and the position of each face (indexed by the letter j) in each video frame. For the j-th detected face, a depth saliency D(j) is defined based on the depth image d_i corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face,
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of the face in the video frame, a position weight w_fp(j) is also defined to approximately reflect the relative attention that this face receives from viewers (regions closer to the centre of the video frame have larger weights), as shown in Table 1:
Table 1. The face weights assigned to different regions of a video frame: the weight of the central region is large and the weight of the edge regions is small.
The face attention feature of the i-th frame may then be calculated,
where A_frm is the area of the video frame and D_max(i) = max_x d_i(x). To reduce the influence of inaccurate face detection on the overall scheme, the resulting face attention feature sequence is also smoothed by a 5th-order median filter. The face attention feature f_F(s) of the video segment χ_s is calculated from the smoothed features {FAC(i) | i = i_1(s), ..., i_2(s)}.
Step 204: performing semantic concept detection on the colour moment feature, the wavelet texture feature and the local key point feature with the SVMs, to obtain the concept closeness.
In the embodiments of the present invention, the support vector machines are trained on the colour moment feature, the wavelet texture feature and the local key point feature. The LibSVM package is used: a radial basis function (RBF) kernel is used for the colour moment feature and the wavelet texture feature, and a chi-square kernel is used for the local key point feature.
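Purely as an illustration, this training set-up may be sketched as follows in Python; the RBF kernels for the colour moment and wavelet texture features and the chi-square kernel for the local key point feature follow the text, while the use of scikit-learn in place of the LibSVM package and the function name are assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel  # stand-in for LibSVM's chi-square kernel

def train_concept_svms(X_cm, X_wt, X_kp, y):
    """Sketch of the per-concept training set-up: RBF kernels for the
    colour-moment and wavelet-texture features, a chi-square kernel for the
    bag-of-features histograms; y holds binary labels for one VIREO-374 concept."""
    svm_cm = SVC(kernel='rbf', probability=True).fit(X_cm, y)
    svm_wt = SVC(kernel='rbf', probability=True).fit(X_wt, y)
    K_kp = chi2_kernel(X_kp, X_kp)                       # precomputed chi-square Gram matrix
    svm_kp = SVC(kernel='precomputed', probability=True).fit(K_kp, y)
    # at prediction time, svm_kp expects chi2_kernel(X_new, X_kp) as its input
    return svm_cm, svm_wt, svm_kp
```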
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of the video segment based on the 374 semantic concepts of VIREO-374 and three SVMs (Support Vector Machines) per concept. The SVMs are trained on the colour moment, wavelet texture and local key point features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of the video segment is shown in Fig. 4:
For the video segment χ_s, the colour moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local key point feature f_kp(i_m(s)) of its middle frame i_m(s) are extracted first, the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are then obtained by SVM prediction, and the concept closeness u(s, j) is calculated from them.
In the embodiments of the present invention, text information related to the video content is obtained from the audio signal of the video frames using speech recognition technology; alternatively,
text information related to the video content is obtained from the subtitles of the video frames.
Step 205: calculating the textual semantic similarity based on the text information and the concept word information.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) of subtitle words and the set Γ_cp(j) of concept words, the textual semantic similarity is calculated through WordNet::Similarity, the similarity measurement tool of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω in WordNet::Similarity.
To reduce the influence of unrelated concepts, the following textual relatedness is defined:
where Q is a normalisation coefficient. Since the SVMs output probabilities of a two-class classification problem, the threshold 0.5 is naturally used in the above formula.
Step 206: calculating the semantic indicative feature based on the textual semantic similarity and the concept closeness.
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of the video segment based on the 374 concepts of VIREO-374 and three SVMs (Support Vector Machines) per concept. The SVMs are trained on the colour moment, wavelet texture and local key point features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of the video segment is shown in Fig. 4:
For the video segment χ_s, the colour moment feature f_cm(i_m(s)), the wavelet texture feature f_wt(i_m(s)) and the local key point feature f_kp(i_m(s)) of its middle frame i_m(s) are extracted first, the probability values {u_cm(s, j), u_wt(s, j), u_kp(s, j) | j = 1, 2, ..., 374} are then obtained by SVM prediction, and the concept closeness u(s, j) is calculated from them.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γ_st(s) of subtitle words and the set Γ_cp(j) of concept words, the textual semantic similarity is calculated through WordNet::Similarity, the similarity measurement tool of the external dictionary WordNet:
where η(γ, ω) denotes the similarity value between the subtitle word γ and the concept word ω in WordNet::Similarity.
To reduce the influence of unrelated concepts, the following textual relatedness is defined:
where Q is a normalisation coefficient. Since the SVMs output probabilities of a two-class classification problem, the threshold 0.5 is naturally used in the above formula.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the weighted sum of u(s, j) with ρ(s, j) as the weights, i.e. the sum of ρ(s, j)·u(s, j) over all concepts.
Step 207: linearly combining the features of the second feature set according to the feature weights, to obtain the saliency score of each video segment.
Finally, the three high-level features are fused with a linear model using iterative re-weighting, and a video summary of the length required by the user is generated.
In the embodiments of the present invention, the video summary is ultimately determined by the saliency score of each video segment; the three high-level features are therefore fused with the following linear model, and the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s) f_M(s) + w_F(s) f_F(s) + w_E(s) f_E(s)   (12b)
where w_M(s), w_F(s) and w_E(s) are the feature weights. Before the linear fusion, each feature is normalised to the interval [0, 1].
The feature weights are calculated by an iterative re-weighting method. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macro factor α_#(s) and a micro factor β_#(s), i.e. w_#(s) = α_#(s) · β_#(s),
where r_#(s) is the rank of the feature f_#(s) when {f_#(s) | s = 1, 2, ..., N_S} is sorted in descending order, and N_S is the total number of video segments in the video. The saliency f_SAL(s) of the video segments can then be calculated and sorted in descending order. According to the length required by the user, video segments are selected into the video summary one by one from high to low f_SAL(s).
Before the first iteration, the feature weights are initialised according to the equal-weight principle. The iterative process ends after 15 iterations.
In summary, the technical solution of the embodiments of the present invention first extracts low-level features such as colour moments, wavelet texture, motion and local key points from the video frames. Next, based on these low-level features, high-level visual and semantic features are calculated, including the motion attention feature, the face attention feature that takes depth information into account and the semantic indicative feature of the video segment. Then the three high-level features are fused with a linear model using iterative re-weighting, and a video summary of the length required by the user is generated.
Fig. 5 is a schematic structural diagram of the electronic device according to Embodiment 1 of the present invention. As shown in Fig. 5, the electronic device includes:
an extraction unit 51, configured to extract a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
a first processing unit 52, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment;
a second processing unit 53, configured to fuse the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 5 can be understood with reference to the preceding description of the video processing method. The functions of the units of the electronic device shown in Fig. 5 may be implemented by a program running on a processor, or by specific logic circuits.
Fig. 6 is a schematic structural diagram of the electronic device according to Embodiment 2 of the present invention. As shown in Fig. 6, the electronic device includes:
an extraction unit 61, configured to extract a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
a first processing unit 62, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information and a semantic indicative feature of each video segment;
a second processing unit 63, configured to fuse the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
The first processing unit 62 includes:
a motion attention feature subunit 621, configured to calculate the motion attention feature according to the motion feature in the first feature set;
a face attention feature subunit 622, configured to obtain the area and the position of the face in each video frame by a face detection algorithm, and to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
The electronic device further includes:
a training unit 64, configured to train the support vector machines based on the colour moment feature, the wavelet texture feature and the local key point feature.
The electronic device further includes:
a text acquisition unit 65, configured to obtain text information related to the video content from the audio signal of the video frames using speech recognition technology; or to obtain text information related to the video content from the subtitles of the video frames.
The first processing unit 62 further includes:
a semantic indicative feature subunit 623, configured to perform semantic concept detection on the colour moment feature, the wavelet texture feature and the local key point feature with the SVMs to obtain the concept closeness; to calculate the textual semantic similarity based on the text information and the concept word information; and to calculate the semantic indicative feature based on the textual semantic similarity and the concept closeness.
The second processing unit 63 includes:
a linear superposition subunit 631, configured to linearly combine the features of the second feature set according to the feature weights, to obtain the saliency score of each video segment;
a video summary subunit 632, configured to select, according to a preset summary length, video segments one by one as the video summary in descending order of their saliency scores.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 6 can be understood with reference to the preceding description of the video processing method. The functions of the units of the electronic device shown in Fig. 6 may be implemented by a program running on a processor, or by specific logic circuits.
The technical solutions described in the embodiments of the present invention may be combined in any manner, provided there is no conflict between them.
In the several embodiments provided by the present application, it should be understood that the disclosed method and smart device may be implemented in other manners. The device embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and other division manners may be used in actual implementation, for example: multiple units or components may be combined, or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be indirect coupling or communication connection through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that can easily be conceived by those familiar with the technical field within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention.

Claims (10)

1. A video processing method, the method comprising:
extracting a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
obtaining a depth image corresponding to the video frame, text information related to the video content, and the area and the position of the face in the video frame;
calculating, based on the motion feature, the motion attention feature of a second feature set;
calculating, based on the area and the position of the face and the depth image, the face attention feature based on depth information of the second feature set;
obtaining the concept closeness based on the colour moment feature, the wavelet texture feature and the local key point feature;
obtaining the textual semantic similarity based on the text information and the concept word information;
calculating, based on the concept closeness and the textual semantic similarity, the semantic indicative feature of the video segment of the second feature set;
fusing the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
2. The video processing method according to claim 1, wherein obtaining the area and the position of the face in the video frame comprises:
obtaining the area and the position of the face in each video frame by a face detection algorithm;
correspondingly, calculating, based on the area and the position of the face and the depth image, the face attention feature based on depth information of the second feature set comprises:
calculating the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
3. The video processing method according to claim 1, the method further comprising:
obtaining text information related to the video content from the audio signal of the video frames using speech recognition technology; or
obtaining text information related to the video content from the subtitles of the video frames.
4. The video processing method according to claim 3, wherein obtaining the concept closeness based on the colour moment feature, the wavelet texture feature and the local key point feature comprises:
training SVMs based on the colour moment feature, the wavelet texture feature and the local key point feature;
performing semantic concept detection on the colour moment feature, the wavelet texture feature and the local key point feature with the SVMs, to obtain the concept closeness.
5. The video processing method according to claim 1, wherein fusing the features of the second feature set with a linear model using iterative re-weighting to obtain a video summary comprises:
linearly combining the features of the second feature set according to the feature weights, to obtain the saliency score of each video segment;
selecting, according to a preset summary length, video segments one by one as the video summary in descending order of their saliency scores.
6. An electronic device, the electronic device comprising:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a colour moment feature, a wavelet texture feature, a motion feature and a local key point feature;
the extraction unit being further configured to obtain a depth image corresponding to the video frame and the area and the position of the face in the video frame;
a text acquisition unit, configured to obtain text information related to the video content in the video frame;
a motion attention feature subunit, configured to calculate, based on the motion feature, the motion attention feature of a second feature set;
a face attention feature subunit, configured to calculate, based on the area and the position of the face and the depth image, the face attention feature based on depth information of the second feature set;
a semantic indicative feature subunit, configured to obtain the concept closeness based on the colour moment feature, the wavelet texture feature and the local key point feature; calculate the textual semantic similarity based on the text information and the concept word information; and calculate, based on the concept closeness and the textual semantic similarity, the semantic indicative feature of the video segment of the second feature set;
a second processing unit, configured to fuse the features of the second feature set with a linear model using iterative re-weighting, to obtain a video summary.
7. The electronic device according to claim 6, wherein:
the extraction unit is further configured to obtain the area and the position of the face in each video frame by a face detection algorithm;
the face attention feature subunit is further configured to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
8. The electronic device according to claim 6, wherein the text acquisition unit is further configured to:
obtain text information related to the video content from the audio signal of the video frames using speech recognition technology; or obtain text information related to the video content from the subtitles of the video frames.
9. The electronic device according to claim 8, the electronic device further comprising:
a training unit, configured to train the support vector machines based on the colour moment feature, the wavelet texture feature and the local key point feature;
the semantic indicative feature subunit being further configured to perform semantic concept detection on the colour moment feature, the wavelet texture feature and the local key point feature with the SVMs, to obtain the concept closeness.
10. The electronic device according to claim 9, wherein the second processing unit comprises:
a linear superposition subunit, configured to linearly combine the features of the second feature set according to the feature weights, to obtain the saliency score of each video segment;
a video summary subunit, configured to select, according to a preset summary length, video segments one by one as the video summary in descending order of their saliency scores.
CN201510535580.9A 2015-08-27 2015-08-27 A video processing method and electronic device Active CN105228033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510535580.9A CN105228033B (en) 2015-08-27 2015-08-27 A video processing method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510535580.9A CN105228033B (en) 2015-08-27 2015-08-27 A video processing method and electronic device

Publications (2)

Publication Number Publication Date
CN105228033A CN105228033A (en) 2016-01-06
CN105228033B true CN105228033B (en) 2018-11-09

Family

ID=54996666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510535580.9A Active CN105228033B (en) 2015-08-27 2015-08-27 A video processing method and electronic device

Country Status (1)

Country Link
CN (1) CN105228033B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9936239B2 (en) * 2016-06-28 2018-04-03 Intel Corporation Multiple stream tuning
CN106355171A (en) * 2016-11-24 2017-01-25 深圳凯达通光电科技有限公司 Video monitoring internetworking system
CN106934397B (en) 2017-03-13 2020-09-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment
CN107222795B (en) * 2017-06-23 2020-07-31 南京理工大学 Multi-feature fusion video abstract generation method
CN107979764B (en) * 2017-12-06 2020-03-31 中国石油大学(华东) Video subtitle generating method based on semantic segmentation and multi-layer attention framework
CN109413510B (en) * 2018-10-19 2021-05-18 深圳市商汤科技有限公司 Video abstract generation method and device, electronic equipment and computer storage medium
CN111327945B (en) 2018-12-14 2021-03-30 北京沃东天骏信息技术有限公司 Method and apparatus for segmenting video
CN109932617B (en) * 2019-04-11 2021-02-26 东南大学 Self-adaptive power grid fault diagnosis method based on deep learning
CN110347870A (en) * 2019-06-19 2019-10-18 西安理工大学 The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method
CN110225368B (en) * 2019-06-27 2020-07-10 腾讯科技(深圳)有限公司 Video positioning method and device and electronic equipment
CN111984820B (en) * 2019-12-19 2023-10-27 重庆大学 Video abstraction method based on double self-attention capsule network
CN113158720B (en) * 2020-12-15 2024-06-18 嘉兴学院 Video abstraction method and device based on dual-mode feature and attention mechanism

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1685344A (en) * 2002-11-01 2005-10-19 三菱电机株式会社 Method for summarizing unknown content of video
WO2007099496A1 (en) * 2006-03-03 2007-09-07 Koninklijke Philips Electronics N.V. Method and device for automatic generation of summary of a plurality of images
CN101743596A (en) * 2007-06-15 2010-06-16 皇家飞利浦电子股份有限公司 Method and apparatus for automatically generating summaries of a multimedia file
CN102880866A (en) * 2012-09-29 2013-01-16 宁波大学 Method for extracting face features
KR20130061058A (en) * 2011-11-30 2013-06-10 고려대학교 산학협력단 Video summary method and system using visual features in the video
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary
CN103210651A (en) * 2010-11-15 2013-07-17 华为技术有限公司 Method and system for video summarization
CN104508682A (en) * 2012-08-03 2015-04-08 柯达阿拉里斯股份有限公司 Identifying key frames using group sparsity analysis

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7409407B2 (en) * 2004-05-07 2008-08-05 Mitsubishi Electric Research Laboratories, Inc. Multimedia event detection and summarization
US8467610B2 (en) * 2010-10-20 2013-06-18 Eastman Kodak Company Video summarization using sparse basis function combination

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1685344A (en) * 2002-11-01 2005-10-19 三菱电机株式会社 Method for summarizing unknown content of video
WO2007099496A1 (en) * 2006-03-03 2007-09-07 Koninklijke Philips Electronics N.V. Method and device for automatic generation of summary of a plurality of images
CN101743596A (en) * 2007-06-15 2010-06-16 皇家飞利浦电子股份有限公司 Method and apparatus for automatically generating summaries of a multimedia file
CN103210651A (en) * 2010-11-15 2013-07-17 华为技术有限公司 Method and system for video summarization
KR20130061058A (en) * 2011-11-30 2013-06-10 고려대학교 산학협력단 Video summary method and system using visual features in the video
CN104508682A (en) * 2012-08-03 2015-04-08 柯达阿拉里斯股份有限公司 Identifying key frames using group sparsity analysis
CN102880866A (en) * 2012-09-29 2013-01-16 宁波大学 Method for extracting face features
CN103200463A (en) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 Method and device for generating video summary

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hierarchical 3D kernel descriptors for action recognition using depth sequences; Yu Kong et al.; 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition; 2015-05-08; entire document *
Multi-scale information maximization based visual attention modeling for video summarization; Naveed Ejaz et al.; 2012 6th International Conference on Next Generation Mobile Applications, Services and Technologies; 2012-09-14; entire document *

Also Published As

Publication number Publication date
CN105228033A (en) 2016-01-06

Similar Documents

Publication Publication Date Title
CN105228033B (en) A video processing method and electronic device
Gao et al. Human action monitoring for healthcare based on deep learning
Gao et al. Discriminative multiple canonical correlation analysis for information fusion
US9176987B1 (en) Automatic face annotation method and system
WO2020177673A1 (en) Video sequence selection method, computer device and storage medium
Yang et al. Grounded semantic role labeling
US20110243452A1 (en) Electronic apparatus, image processing method, and program
Eroglu Erdem et al. BAUM-2: A multilingual audio-visual affective face database
Chanti et al. Improving bag-of-visual-words towards effective facial expressive image classification
Haq et al. Video summarization techniques: a review
Paleari et al. Towards multimodal emotion recognition: a new approach
Abebe et al. A long short-term memory convolutional neural network for first-person vision activity recognition
Abebe et al. Inertial-vision: cross-domain knowledge transfer for wearable sensors
Lv et al. Storyrolenet: Social network construction of role relationship in video
Prabhu et al. Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism.
Rapantzikos et al. Spatiotemporal features for action recognition and salient event detection
Nguyen et al. Type-to-track: Retrieve any object via prompt-based tracking
Shin et al. Dynamic Korean sign language recognition using pose estimation based and attention-based neural network
Shao et al. TAMNet: two attention modules-based network on facial expression recognition under uncertainty
Afdhal et al. Emotion recognition using the shapes of the wrinkles
Sun et al. Camera-assisted video saliency prediction and its applications
CN114510942A (en) Method for acquiring entity words, and method, device and equipment for training model
Ayache et al. CLIPS-LSR Experiments at TRECVID 2006.
CN113821669A (en) Searching method, searching device, electronic equipment and storage medium
Li et al. Multi-feature hierarchical topic models for human behavior recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant