CN105228033B - A video processing method and electronic device - Google Patents
A video processing method and electronic device
- Publication number: CN105228033B
- Application number: CN201510535580.9A
- Authority
- CN
- China
- Prior art keywords
- feature
- video
- face
- video frame
- calculated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/60—Analysis of geometric attributes
Abstract
The invention discloses a video processing method and an electronic device. The method includes: extracting a first feature set from video frames, the first feature set including a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature; calculating a second feature set based on the first feature set, the second feature set including a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment; and performing fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
Description
Technical field
The present invention relates to video processing technology, and more particularly to a video processing method and an electronic device.
Background technology
Intelligent terminals such as smartphones have become constant companions in people's work and daily life, and by downloading videos and shooting their own, users easily accumulate large numbers of them. For phones equipped with binocular cameras in particular, the amount of data to be stored is even larger. Given the relatively limited storage capacity of mobile phones, the management of video files has become a problem in urgent need of a solution.
Summary of the invention
To solve the above technical problem, embodiments of the present invention provide a video processing method and an electronic device.
The video processing method provided by an embodiment of the present invention includes:
extracting a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
calculating a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment;
performing fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
The electronic device provided by an embodiment of the present invention includes:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment;
a second processing unit, configured to perform fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain a video summary.
In the technical solution of the embodiments of the present invention, a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature are first extracted from video frames. Based on these, a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment are calculated. These three features are then fused to obtain a video summary. In this way, semantically condensed and important video segments are extracted from the original video, which effectively reduces the amount of data to be stored on the electronic device, improves the utilization of the device's memory and the user experience, and makes it easier for the user to locate a desired video among a smaller number of files later. Moreover, the technical solution of the embodiments combines information from the visual modality and the textual modality, and can therefore capture the high-level semantics of video content more effectively. The face attention feature incorporates the depth of objects in the scene, which helps grasp high-level semantics from a more complete perspective. The technical solution does not depend on heuristic rules tailored to specific video genres, so it is applicable to a broad range of video types.
Description of the drawings
Fig. 1 is a schematic flowchart of the video processing method of the first embodiment of the present invention;
Fig. 2 is a schematic flowchart of the video processing method of the second embodiment of the present invention;
Fig. 3 is an overall flowchart of the video summary extraction of an embodiment of the present invention;
Fig. 4 is a flowchart of calculating the semantic indicative feature of a video segment according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of the electronic device of the first embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the electronic device of the second embodiment of the present invention.
Detailed description
To provide a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are provided for reference and illustration only and are not intended to limit the embodiments of the present invention.
In the era of information explosion, traditional ways of browsing and managing video data face unprecedented challenges. It is therefore of practical significance to provide video users with summaries that are brief and concentrate the key information of the original video. Video summaries generally fall into two types, dynamic and static: a dynamic video summary is a shortened version of the original video and may contain a series of segments extracted from the original long version, while a static video summary may consist of a set of key frames extracted from the original video.
Traditional video summaries are generated by extracting visual or textual features from the video. However, most methods in this direction rely on heuristic rules or simple text analysis (e.g., based on word-frequency statistics). In addition, traditional attention-model methods that use face features consider only information such as the planar position and size of the detected faces in the scene, and make no use of depth information.
The technical solution of the embodiments of the present invention estimates the relative importance of video segments based on a user attention model, the semantic information of the video, and the depth information of the video frames, using iterative re-weighting, so as to generate a dynamic video summary.
Fig. 1 is a schematic flowchart of the video processing method of the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature.
Referring to Fig. 3, the first feature set is first extracted from the video frames. The first feature set is a set of low-level features and comprises four of them: the color moment feature, the wavelet texture feature, the motion feature, and the local keypoint feature. The four low-level features of the first feature set are described in detail below.
(1) Color moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks. For each pixel block, the first-order moment and the second- and third-order central moments are computed for each of the three channels of the Lab color space. The color moments of the 25 pixel blocks of frame i together constitute its color moment feature vector fcm(i).
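The per-block moment computation can be sketched as follows. This is a minimal illustration of step (1), assuming the frame has already been converted to the Lab color space; the helper name and grid parameter are ours, not the patent's.

```python
import numpy as np

def color_moment_feature(lab_frame, grid=5):
    """Color moment feature for one frame, as described in step (1).

    `lab_frame` is assumed to be an H x W x 3 array already in Lab space.
    The frame is split into grid x grid non-overlapping blocks; for each
    block and each channel we compute the first-order moment (mean) and the
    second- and third-order central moments, giving grid*grid*3*3 values.
    """
    h, w, _ = lab_frame.shape
    bh, bw = h // grid, w // grid
    feats = []
    for r in range(grid):
        for c in range(grid):
            block = lab_frame[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw, :]
            pixels = block.reshape(-1, 3).astype(np.float64)
            mean = pixels.mean(axis=0)            # 1st-order moment
            centered = pixels - mean
            var = (centered ** 2).mean(axis=0)    # 2nd central moment
            skew = (centered ** 3).mean(axis=0)   # 3rd central moment
            feats.append(np.concatenate([mean, var, skew]))
    return np.concatenate(feats)  # length grid*grid*9 = 225 for grid=5
```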
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks. A three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical, and diagonal directions are computed at each level. All the wavelet coefficient variances of the video frame together constitute its wavelet texture feature vector fwt(i).
(3) Motion feature
The human eye discerns changes in visual content with great sensitivity. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks of 16 × 16 pixels each, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors together constitute the motion feature fmv(i) of the frame.
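The patent does not name a particular motion estimation algorithm; as one possible instantiation, exhaustive block matching with a sum-of-absolute-differences (SAD) criterion can be sketched as:

```python
import numpy as np

def block_motion_vectors(prev, curr, block=16, search=4):
    """Exhaustive block-matching motion estimation (an assumed algorithm).

    For each `block` x `block` block of the current frame, find the
    displacement within +/-`search` pixels that minimizes the SAD against
    the previous frame. Returns an (M, N, 2) array of (dy, dx) vectors.
    """
    h, w = curr.shape
    M, N = h // block, w // block
    mv = np.zeros((M, N, 2), dtype=np.int64)
    for m in range(M):
        for n in range(N):
            y0, x0 = m * block, n * block
            ref = curr[y0:y0 + block, x0:x0 + block].astype(np.int64)
            best, best_d = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = y0 + dy, x0 + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue  # candidate block leaves the frame
                    cand = prev[y:y + block, x:x + block].astype(np.int64)
                    sad = np.abs(ref - cand).sum()
                    if best is None or sad < best:
                        best, best_d = sad, (dy, dx)
            mv[m, n] = best_d
    return mv
```

In practice a phone implementation would reuse the encoder's motion vectors rather than recompute them, but the output has the same M × N layout either way.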
(4) Local keypoint feature
In semantic-level video analysis, a bag of features (BoF) built on local keypoints can serve as a strong complement to features computed from global information. A soft-weighted local keypoint feature is therefore used to capture salient regions; the feature is defined by the importance of the keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints of the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, described by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into the 500 visual words. The keypoint feature vector fkp(i) is defined as the weighted similarity between the keypoints and their four nearest visual words.
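The soft weighting over the four nearest visual words can be sketched as below. The exact weighting scheme is not spelled out in this text, so a common 1/2^rank decay is assumed; the function name and normalization are ours.

```python
import numpy as np

def soft_bof(descriptors, codebook, k=4):
    """Soft-weighted bag-of-features histogram over a visual vocabulary.

    `descriptors` is an (n, d) array of SIFT descriptors and `codebook` a
    (V, d) array of visual-word centroids (V = 500 in the scheme). Each
    descriptor votes for its `k` nearest words with weights decaying by rank.
    """
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dist = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dist)[:k]
        for rank, w in enumerate(nearest):
            hist[w] += 1.0 / (2.0 ** rank)  # assumed rank-decay weighting
    total = hist.sum()
    return hist / total if total > 0 else hist
```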
Step 102: calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of each video segment.
Next, based on these low-level features, higher-level visual and semantic features, referred to as the second feature set, are computed, including: the motion attention feature, the face attention feature based on depth information, and the semantic indicative feature of each video segment. For any given video segment χs (starting at frame i1(s) and ending at frame i2(s)), the higher-level visual and semantic features are computed from the low-level features above. Video segmentation is performed by shot-cut detection.
Each feature of the second feature set is described in detail below.
(1) Motion attention feature
Psychological research on human attention laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is crucial to analyzing and understanding human thought and activity, and can therefore guide the selection of the relatively important content of the original video into the summary. This scheme uses a motion attention model to compute a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block of the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centered on the (m, n)-th block of frame i. The phase range [0, 2π) is evenly divided into 8 bins; a spatial phase histogram is accumulated over the spatial window and a temporal phase histogram over the temporal window. The spatial consistency indicator Cs(i, m, n) and the temporal consistency indicator Ct(i, m, n) are then obtained as:
Cs(i, m, n) = -Σζ ps(ζ) log ps(ζ) (1a)
Ct(i, m, n) = -Σζ pt(ζ) log pt(ζ) (2a)
where ps(ζ) and pt(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from these indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χs, the motion attention feature is obtained by pooling the filtered single-frame features:
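The entropy of a phase histogram, which is the common form of Eqs. (1a) and (2a), can be sketched as below; the helper name and the (dy, dx) vector convention are assumptions.

```python
import numpy as np

def phase_entropy(vectors, bins=8):
    """Entropy of the motion-vector phase distribution inside a window,
    matching the form of Eqs. (1a)/(2a).

    `vectors` is an (n, 2) array of (dy, dx) motion vectors. Low entropy
    means the phases in the window are consistent; high entropy means the
    motion directions are scattered.
    """
    phases = np.arctan2(vectors[:, 0], vectors[:, 1]) % (2 * np.pi)
    hist, _ = np.histogram(phases, bins=bins, range=(0.0, 2 * np.pi))
    p = hist / hist.sum()
    p = p[p > 0]  # 0 * log 0 is taken as 0
    return float(-(p * np.log(p)).sum())
```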
(2) Face attention feature based on depth information
In a video, the appearance of faces usually signals more important content. This scheme uses a face detection algorithm to obtain the area AF(j) and position of each face (indexed by the letter j) in each video frame. For the j-th detected face, based on the depth image di corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, a depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of a face within the video frame, a position weight wfp(j) is also defined to approximate the relative attention the face receives from viewers (regions closer to the center of the frame have larger weights), as shown in Table 1:
Table 1: face weights assigned to the different regions of a video frame. Central regions have large weights; edge regions have small weights.
The face attention feature of the i-th frame can then be computed as:
where Afrm is the area of the video frame and Dmax(i) = maxx di(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting sequence of face attention features is also smoothed by a 5th-order median filter. The face attention feature of segment χs is computed from the smoothed features {FAC(i) | i = i1(s), …, i2(s)} by the following formula:
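The per-frame combination of face area, position weight, and depth can be sketched as below. The patent's exact formulas are given as images and do not survive in this text, so the combination here is an assumed, plausible instantiation only: each face contributes its relative area times its position weight times a depth term in which nearer faces (smaller mean depth relative to Dmax) score higher.

```python
import numpy as np

def face_attention(face_masks, depth, frame_area, weights):
    """Assumed sketch of the per-frame face attention feature FAC(i).

    `face_masks` is a list of boolean H x W masks (one per detected face,
    the pixel set Λ(j)), `depth` the frame's depth image d_i, `frame_area`
    A_frm, and `weights` the position weights w_fp(j) from Table 1.
    """
    d_max = depth.max()  # D_max(i) = max_x d_i(x)
    score = 0.0
    for mask, w_pos in zip(face_masks, weights):
        area = mask.sum()                     # face area, |Λ(j)| pixels
        mean_depth = depth[mask].mean()       # average depth over the face
        depth_sal = 1.0 - mean_depth / d_max  # nearer face -> more salient
        score += (area / frame_area) * w_pos * depth_sal
    return score
```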
(3) Semantic indicative feature of a video segment
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of a video segment based on the 374 concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture, and local keypoint features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of a video segment is shown in Fig. 4:
For segment χs, the color moment feature fcm(im(s)), the wavelet texture feature fwt(im(s)), and the local keypoint feature fkp(im(s)) of its middle frame im(s) are first extracted. SVM prediction then yields the probability values {ucm(s, j), uwt(s, j), ukp(s, j) | j = 1, 2, …, 374}, from which the concept closeness is computed:
Next, the caption information corresponding to the video segment is processed. Based on the set Γst(s) of subtitle words and the set Γcp(j) of concept words, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool built on the external dictionary WordNet:
where η(γ, ω) denotes the similarity value in WordNet::Similarity between subtitle word γ and concept word ω.
To reduce the influence of irrelevant concepts, the following textual relatedness is defined:
where Q is a normalization coefficient ensuring that the corresponding constraint holds. Since what an SVM provides is the probability of a two-class classification problem, the threshold 0.5 is used naturally in the formula above.
Finally, the semantic indicative feature fE(s) of the segment is defined as the weighted sum of ρ(s, j) with u(s, j) as the weights:
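The final combination of concept closeness and textual relatedness can be sketched as below. The closeness formula itself is image-only in this text, so averaging the three per-feature SVM probabilities is an assumption; the weighted sum matches the definition of fE(s) just given.

```python
import numpy as np

def semantic_indicative_feature(u_cm, u_wt, u_kp, relatedness):
    """Sketch of f_E(s) for one segment.

    `u_cm`, `u_wt`, `u_kp` are the 374 per-concept SVM probabilities from
    the three features; `relatedness` holds the textual relatedness values
    ρ(s, j). The concept closeness u(s, j) is assumed to be the mean of the
    three probabilities, and f_E(s) = Σ_j ρ(s, j) * u(s, j).
    """
    u = (np.asarray(u_cm) + np.asarray(u_wt) + np.asarray(u_kp)) / 3.0
    return float((np.asarray(relatedness) * u).sum())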
Step 103: perform fusion processing on the features of the second feature set using a linear model with iteratively re-estimated weights, so as to obtain the video summary.
Finally, the three high-level features are fused using a linear model whose weights are iteratively re-estimated, and a video summary of the length required by the user is generated.
In the embodiments of the present invention, the video summary is ultimately determined by the saliency score of each video segment. The three high-level features are therefore fused with the following linear model, the fusion result being the saliency score of the segment:
fSAL(s) = wM(s)fM(s) + wF(s)fF(s) + wE(s)fE(s) (12a)
where wM(s), wF(s), and wE(s) are the feature weights. Before the linear fusion, each feature is normalized to the interval [0, 1].
The feature weights are calculated by an iterative re-weighting method. In the k-th iteration, the weight w#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α#(s) and a microscopic factor β#(s), i.e., w#(s) = α#(s) · β#(s):
where r#(s) is the rank of feature f#(s) after {f#(s) | s = 1, 2, …, NS} is sorted in descending order, and NS is the total number of video segments in the video. The saliency fSAL(s) of each segment can then be computed and sorted in descending order. According to the length required by the user, segments are selected into the video summary one by one, from high fSAL(s) to low.
Before the first iteration, the feature weights are initialized according to the equal-weight principle. The iterative process terminates after 15 iterations.
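The fusion and selection stage of Eq. (12a) can be sketched as below. The update formulas for the macroscopic and microscopic factors are image-only in this text, so the weights are taken as inputs here rather than re-estimated; the min-max normalization and descending-saliency selection follow the text.

```python
import numpy as np

def select_summary(fM, fF, fE, wM, wF, wE, k):
    """Linear fusion per Eq. (12a) followed by summary selection.

    `fM`, `fF`, `fE` are per-segment feature arrays; `wM`, `wF`, `wE` the
    (here: scalar) fusion weights; `k` the number of segments the user's
    requested summary length allows. Returns the saliency scores and the
    indices of the selected segments in temporal order.
    """
    def norm(f):  # normalize each feature to [0, 1] before fusion
        f = np.asarray(f, dtype=np.float64)
        span = f.max() - f.min()
        return (f - f.min()) / span if span > 0 else np.zeros_like(f)
    sal = wM * norm(fM) + wF * norm(fF) + wE * norm(fE)  # f_SAL(s)
    order = np.argsort(-sal)        # descending saliency
    return sal, sorted(order[:k])   # keep the top-k segments, in time order
```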
In summary, the technical solution of the embodiments of the present invention first extracts low-level features such as color moments, wavelet textures, motion, and local keypoints from the video frames. Based on these low-level features, the higher-level visual and semantic features are then computed, including the motion attention feature, the face attention feature that considers depth information, and the semantic indicative feature of each video segment. Finally, the three high-level features are fused with an iteratively re-weighted linear model to generate a video summary of the length required by the user.
Fig. 2 is a schematic flowchart of the video processing method of the second embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step 201: extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature.
Referring to Fig. 3, the first feature set is first extracted from the video frames. The first feature set is a set of low-level features and comprises four of them: the color moment feature, the wavelet texture feature, the motion feature, and the local keypoint feature. The four low-level features of the first feature set are described in detail below.
(1) Color moment feature
A video frame is spatially divided into 5 × 5 (25 in total) non-overlapping pixel blocks. For each pixel block, the first-order moment and the second- and third-order central moments are computed for each of the three channels of the Lab color space. The color moments of the 25 pixel blocks of frame i together constitute its color moment feature vector fcm(i).
(2) Wavelet texture feature
Similarly, a video frame is divided into 3 × 3 (9 in total) non-overlapping pixel blocks. A three-level Haar wavelet decomposition is applied to the luminance component of each block, and the variances of the wavelet coefficients in the horizontal, vertical, and diagonal directions are computed at each level. All the wavelet coefficient variances of the video frame together constitute its wavelet texture feature vector fwt(i).
(3) Motion feature
The human eye discerns changes in visual content with great sensitivity. Based on this principle, a video frame is divided into M × N non-overlapping pixel blocks of 16 × 16 pixels each, and a motion vector v(i, m, n) is computed for each block by a motion estimation algorithm. The M × N motion vectors together constitute the motion feature fmv(i) of the frame.
(4) Local keypoint feature
In semantic-level video analysis, a bag of features (BoF) built on local keypoints can serve as a strong complement to features computed from global information. A soft-weighted local keypoint feature is therefore used to capture salient regions; the feature is defined by the importance of the keypoints with respect to a vocabulary of 500 visual words. Specifically, the keypoints of the i-th video frame are obtained by a Difference of Gaussians (DoG) detector, described by Scale-Invariant Feature Transform (SIFT) descriptors, and clustered into the 500 visual words. The keypoint feature vector fkp(i) is defined as the weighted similarity between the keypoints and their four nearest visual words.
Step 202: calculate the motion attention feature from the motion feature of the first feature set.
Next, based on these low-level features, higher-level visual and semantic features, referred to as the second feature set, are computed, including: the motion attention feature, the face attention feature based on depth information, and the semantic indicative feature of each video segment. For any given video segment χs (starting at frame i1(s) and ending at frame i2(s)), the higher-level visual and semantic features are computed from the low-level features above. Video segmentation is performed by shot-cut detection.
Psychological research on human attention laid an indispensable foundation for attention modeling in computer vision. The cognitive mechanism of attention is crucial to analyzing and understanding human thought and activity, and can therefore guide the selection of the relatively important content of the original video into the summary. This scheme uses a motion attention model to compute a high-level motion attention feature suitable for semantic analysis.
For the (m, n)-th pixel block of the i-th video frame, a spatial window containing the surrounding 5 × 5 (25 in total) pixel blocks and a temporal window containing 7 pixel blocks are constructed, both centered on the (m, n)-th block of frame i. The phase range [0, 2π) is evenly divided into 8 bins; a spatial phase histogram is accumulated over the spatial window and a temporal phase histogram over the temporal window. The spatial consistency indicator Cs(i, m, n) and the temporal consistency indicator Ct(i, m, n) are then obtained as:
Cs(i, m, n) = -Σζ ps(ζ) log ps(ζ) (1b)
Ct(i, m, n) = -Σζ pt(ζ) log pt(ζ) (2b)
where ps(ζ) and pt(ζ) are the phase distributions in the spatial window and the temporal window, respectively. The motion attention feature of the i-th frame is then defined from these indicators.
To suppress noise in the features of adjacent video frames, the resulting sequence of motion attention features is processed by a 9th-order median filter. For the s-th video segment χs, the motion attention feature is obtained by pooling the filtered single-frame features:
Step 203: obtain the area and position of each face in each video frame by a face detection algorithm, and calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
In a video, the appearance of faces usually signals more important content. This scheme uses a face detection algorithm to obtain the area AF(j) and position of each face (indexed by the letter j) in each video frame. For the j-th detected face, based on the depth image di corresponding to the video frame and the set of pixels {x | x ∈ Λ(j)} constituting the face, a depth saliency D(j) is defined as follows:
where |Λ(j)| is the number of pixels contained in the j-th face. According to the position of a face within the video frame, a position weight wfp(j) is also defined to approximate the relative attention the face receives from viewers (regions closer to the center of the frame have larger weights), as shown in Table 1:
Table 1: face weights assigned to the different regions of a video frame. Central regions have large weights; edge regions have small weights.
The face attention feature of the i-th frame can then be computed as:
where Afrm is the area of the video frame and Dmax(i) = maxx di(x). To reduce the influence of face detection inaccuracies on the overall scheme, the resulting sequence of face attention features is also smoothed by a 5th-order median filter. The face attention feature of segment χs is computed from the smoothed features {FAC(i) | i = i1(s), …, i2(s)} by the following formula:
Step 204: perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature by SVMs, and obtain the concept closeness.
In the embodiments of the present invention, SVMs are trained based on the color moment feature, the wavelet texture feature, and the local keypoint feature. The LibSVM package is used: a radial basis function (RBF) kernel is applied to the color moment and wavelet texture features, and a Chi-square kernel to the local keypoint feature.
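The Chi-square kernel named here for the keypoint histograms can be sketched as below; the RBF kernel for the other two features is built into LibSVM directly. The exponential form and the `gamma` parameter are a common convention (the text does not spell them out), and a kernel matrix computed this way would be passed to LibSVM as a precomputed kernel.

```python
import numpy as np

def chi_square_kernel(X, Y, gamma=1.0):
    """Chi-square kernel between two sets of histograms (assumed form):
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    with zero-sum bins skipped to avoid division by zero."""
    K = np.zeros((len(X), len(Y)))
    for a, x in enumerate(X):
        for b, y in enumerate(Y):
            s = x + y
            nz = s > 0
            d = ((x[nz] - y[nz]) ** 2 / s[nz]).sum()
            K[a, b] = np.exp(-gamma * d)
    return K
```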
Referring to Fig. 4, in order to mine semantic information, this scheme extracts the semantic indicative feature of a video segment based on the 374 semantic concepts of VIREO-374 and three Support Vector Machines (SVMs) per concept. The SVMs are trained on the color moment, wavelet texture, and local keypoint features described above, and at prediction time they estimate the probability that a given video frame is closely related to a concept. The flow of calculating the semantic indicative feature of a video segment is shown in Fig. 4:
For segment χs, the color moment feature fcm(im(s)), the wavelet texture feature fwt(im(s)), and the local keypoint feature fkp(im(s)) of its middle frame im(s) are first extracted. SVM prediction then yields the probability values {ucm(s, j), uwt(s, j), ukp(s, j) | j = 1, 2, …, 374}, from which the concept closeness is computed:
In the embodiments of the present invention, text information related to the video content is obtained from the audio signal of the video frames by speech recognition technology; alternatively, text information related to the video content is obtained from the subtitles of the video frames.
Step 205: calculate the textual semantic similarity based on the text information and the concept vocabulary.
Next, the subtitle information corresponding to the video segment is processed. Based on the set Γst(s) of subtitle words and the set Γcp(j) of concept words, the textual semantic similarity is computed with WordNet::Similarity, a similarity measurement tool built on the external dictionary WordNet:
where η(γ, ω) denotes the similarity value in WordNet::Similarity between subtitle word γ and concept word ω.
To reduce the influence of irrelevant concepts, the following textual relatedness is defined:
where Q is a normalization coefficient ensuring that the corresponding constraint holds. Since what an SVM provides is the probability of a two-class classification problem, the threshold 0.5 is used naturally in the formula above.
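The aggregation of per-word-pair scores η(γ, ω) into a segment-to-concept similarity can be sketched as below. The exact aggregation formula is image-only in this text, so taking the mean of each subtitle word's best-matching concept word is an assumption; `eta` stands in for a WordNet::Similarity lookup.

```python
def textual_semantic_similarity(subtitle_words, concept_words, eta):
    """Assumed sketch of the textual semantic similarity between a segment's
    subtitle vocabulary Γst(s) and a concept's vocabulary Γcp(j).

    `eta(gamma, omega)` should return the η(γ, ω) score, e.g. from a
    WordNet::Similarity lookup table. For each subtitle word we keep its
    best match among the concept words, then average over subtitle words.
    """
    if not subtitle_words or not concept_words:
        return 0.0
    best = [max(eta(g, w) for w in concept_words) for g in subtitle_words]
    return sum(best) / len(best)
```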
Step 206:It is spent closely based on the word semantic similarity and the concept, it is special that the semantic instruction is calculated
Sign.
With reference to Fig. 4, in order to excavate semantic information, the three of 374 concepts and each concept of this programme based on VIREO-374
Kind SVM (Support Vector Machine, abbreviation SVM) extracts the semantic indicative character of video-frequency band.Support vector
Machine is trained based on previously described colour moment, wavelet texture and local key point feature, it is estimated that one in prediction
The probability value of degree in close relations between a given video frame and concept.Calculate the flow of the semantic indicative character of video-frequency band
As shown in Figure 4:
For video-frequency band χs, its intermediate frame i is extracted firstm(s) colour moment feature fcm(im(s)), Wavelet Texture
fwt(imAnd local key point feature f (s))kp(im(s)), then by the prediction of SVM probability value { u is obtainedcm(s, j),
uwt(s, j), ukp(s, j) | j=1,2 ..., 374 }, and then calculate concept and spend closely:
Next, handling the corresponding caption information of video-frequency band.The set Γ constituted based on subtitle vocabularyst(s) with
The set Γ of concept vocabularycp(j), pass through the similarity measurement tool WordNet of external dictionary WordNet::Similarity,
Word semantic similarity is calculated:
where η(γ, ω) denotes the WordNet::Similarity score between subtitle word γ and concept word ω.
To reduce the influence of unrelated concepts, the following word relatedness is defined:
where Q is a normalization coefficient ensuring that the constraint holds. Since the SVM outputs probabilities for a two-class classification problem, 0.5 is the natural threshold in the formula above.
Finally, the semantic indicative feature f_E(s) of the video segment is defined as the weighted sum of ρ(s, j) with u(s, j) as weights:
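Under the weighted-sum definition above, f_E(s) reduces to a dot product of the relatedness and closeness vectors over the 374 concepts:

```python
def semantic_indicative_feature(rho, u):
    """f_E(s) = sum over j of rho(s, j) * u(s, j): the word relatedness
    and the concept closeness multiplied term by term and summed."""
    return sum(r * c for r, c in zip(rho, u))
```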
Step 207: The features in the second feature set are linearly superposed according to the feature weight values to obtain the saliency score of each video segment.
Finally, the three high-level features are fused using an iteratively re-weighted linear model, and a video summary of the length required by the user is generated.
In this embodiment of the present invention, the video summary is ultimately determined by the saliency score of each video segment, so the following linear model is adopted to fuse the three high-level features; the fusion result is the saliency score of the video segment:
f_SAL(s) = w_M(s)·f_M(s) + w_F(s)·f_F(s) + w_E(s)·f_E(s)    (12b)
where w_M(s), w_F(s), and w_E(s) are the feature weights. Before linear fusion, each feature is separately normalized to the interval [0, 1].
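As an illustrative sketch (not the patent's reference implementation), the normalization and the linear superposition of equation (12b) can be written as:

```python
def normalize(values):
    """Min-max normalize one feature over all segments to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def fuse_saliency(f_m, f_f, f_e, w_m, w_f, w_e):
    """Equation (12b): f_SAL(s) = w_M(s) f_M(s) + w_F(s) f_F(s) + w_E(s) f_E(s),
    with each feature normalized to [0, 1] beforehand."""
    f_m, f_f, f_e = normalize(f_m), normalize(f_f), normalize(f_e)
    return [w_m[s] * f_m[s] + w_f[s] * f_f[s] + w_e[s] * f_e[s]
            for s in range(len(f_m))]

# Two segments, equal weights of 1/3 for every feature:
w = [1.0 / 3.0, 1.0 / 3.0]
sal = fuse_saliency([0.2, 0.8], [0.5, 0.1], [0.3, 0.9], w, w, w)
```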
The feature weights are calculated by the following iterative re-weighting method. In the k-th iteration, the weight w_#(s) (# ∈ {M, F, E}) is determined by the product of a macroscopic factor α_#(s) and a microscopic factor β_#(s), i.e. w_#(s) = α_#(s)·β_#(s):
where r_#(s) is the rank of feature f_#(s) after {f_#(s) | s = 1, 2, ..., N_S} is sorted in descending order, and N_S is the total number of video segments in the video. Next, the saliency f_SAL(s) of each video segment can be calculated and sorted in descending order. According to the length required by the user, video segments can be selected into the video summary one by one, from the highest f_SAL(s) to the lowest.
Before the first iteration, the feature weights are initialized to equal values. The iterative process terminates after 15 iterations.
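The loop structure can be sketched as follows. The patent's formulas for the macroscopic factor α_#(s) and microscopic factor β_#(s) are not reproduced in this text, so a hypothetical rank-based factor stands in for their product; the equal-weight initialization, the 15 iterations, and the descending-saliency selection follow the description above:

```python
def iterative_reweighted_fusion(features, n_iter=15):
    """Iteratively re-weighted linear fusion (sketch).

    `features` maps 'M', 'F', 'E' to per-segment scores already
    normalized to [0, 1]. The patent sets w_#(s) = alpha_#(s) * beta_#(s),
    but the factor formulas are omitted from this text; here a rank-based
    stand-in (1 / rank of f_#(s) in descending order) replaces the
    product. Weights start equal and the loop runs 15 times.
    """
    n = len(next(iter(features.values())))
    weights = {k: [1.0 / len(features)] * n for k in features}  # equal init
    for _ in range(n_iter):
        new_w = {}
        for k, vals in features.items():
            order = sorted(range(n), key=lambda s: vals[s], reverse=True)
            rank = {s: i + 1 for i, s in enumerate(order)}
            new_w[k] = [1.0 / rank[s] for s in range(n)]
        for s in range(n):  # keep the three weights summing to 1 per segment
            tot = sum(new_w[k][s] for k in features)
            for k in features:
                new_w[k][s] /= tot
        weights = new_w
    return [sum(weights[k][s] * features[k][s] for k in features)
            for s in range(n)]

def select_summary(saliency, n_segments):
    """Pick segments into the summary in descending saliency order."""
    order = sorted(range(len(saliency)), key=lambda s: saliency[s],
                   reverse=True)
    return order[:n_segments]
```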
The technical solution of this embodiment of the present invention first extracts low-level features such as color moment, wavelet texture, motion, and local keypoint features from video frames. Next, based on these low-level features, high-level visual and semantic features are computed, including the motion attention feature, the face attention feature considering depth information, and the semantic indicative feature of the video segment. Then, the three high-level features are fused using an iteratively re-weighted linear model to generate a video summary of the length required by the user.
Fig. 5 is a schematic diagram of the structure of the electronic device of Embodiment 1 of the present invention. As shown in Fig. 5, the electronic device includes:
an extraction unit 51, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit 52, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
a second processing unit 53, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 5 can be understood with reference to the foregoing description of the video processing method. The functions of the units shown in Fig. 5 may be realized by a program running on a processor, or by specific logic circuits.
Fig. 6 is a schematic diagram of the structure of the electronic device of Embodiment 2 of the present invention. As shown in Fig. 6, the electronic device includes:
an extraction unit 61, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
a first processing unit 62, configured to calculate a second feature set based on the first feature set, the second feature set including: a motion attention feature, a face attention feature based on depth information, and a semantic indicative feature of a video segment;
a second processing unit 63, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
The first processing unit 62 includes:
a motion attention feature subunit 621, configured to calculate the motion attention feature from the motion feature in the first feature set;
a face attention feature subunit 622, configured to obtain the area and position of the face in each video frame by a face detection algorithm, and to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
The electronic device further includes:
a training unit 64, configured to train a support vector machine based on the color moment feature, the wavelet texture feature, and the local keypoint feature.
The electronic device further includes:
a text acquisition unit 65, configured to obtain text information related to the video content from the audio signal of the video frame using speech recognition technology; or, to obtain text information related to the video content from the subtitles of the video frame.
The first processing unit 62 includes:
a semantic indicative feature subunit 623, configured to perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness; to calculate the word semantic similarity based on the text information and the concept vocabulary information; and to calculate the semantic indicative feature based on the word semantic similarity and the concept closeness.
The second processing unit 63 includes:
a linear superposition subunit 631, configured to linearly superpose the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
a video summary subunit 632, configured to, according to a preset summary length, select video segments one by one as the video summary in descending order of their saliency scores.
Those skilled in the art will understand that the functions implemented by the units of the electronic device shown in Fig. 6 can be understood with reference to the foregoing description of the video processing method. The functions of the units shown in Fig. 6 may be realized by a program running on a processor, or by specific logic circuits.
The technical solutions described in the embodiments of the present invention may be combined arbitrarily, provided there is no conflict.
In the several embodiments provided by the present invention, it should be understood that the disclosed method and smart device may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function; other divisions are possible in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connections between the components shown or discussed may be indirect coupling or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described above as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as actually needed to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing unit, each unit may serve separately as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the art can easily conceive of changes or replacements within the technical scope disclosed by the present invention, and these shall all be covered by the protection scope of the present invention.
Claims (10)
1. A video processing method, the method comprising:
extracting a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
obtaining a depth image corresponding to the video frame, text information related to the video content, and the area and position of a face in the video frame;
based on the motion feature, calculating a motion attention feature of a second feature set;
based on the area and position of the face and the depth image, calculating a face attention feature based on depth information in the second feature set;
obtaining concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
obtaining word semantic similarity based on the text information and concept vocabulary information;
based on the concept closeness and the word semantic similarity, calculating a semantic indicative feature of a video segment in the second feature set;
fusing the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
2. The video processing method according to claim 1, wherein obtaining the area and position of the face in the video frame comprises:
obtaining the area and position of the face in each video frame by a face detection algorithm;
correspondingly, calculating the face attention feature based on depth information in the second feature set based on the area and position of the face and the depth image comprises:
calculating the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
3. The video processing method according to claim 1, the method further comprising:
obtaining text information related to the video content from the audio signal of the video frame using speech recognition technology; or,
obtaining text information related to the video content from the subtitles of the video frame.
4. The video processing method according to claim 3, wherein obtaining the concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature comprises:
training an SVM based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
performing semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness.
5. The video processing method according to claim 1, wherein fusing the features in the second feature set using the iteratively re-weighted linear model to obtain the video summary comprises:
linearly superposing the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
according to a preset summary length, selecting video segments one by one into the video summary in descending order of their saliency scores.
6. An electronic device, the electronic device comprising:
an extraction unit, configured to extract a first feature set from video frames, the first feature set including: a color moment feature, a wavelet texture feature, a motion feature, and a local keypoint feature;
the extraction unit being further configured to obtain a depth image corresponding to the video frame and the area and position of a face in the video frame;
a text acquisition unit, configured to obtain text information related to the video content in the video frame;
a motion attention feature subunit, configured to calculate a motion attention feature of a second feature set based on the motion feature;
a face attention feature subunit, configured to calculate a face attention feature based on depth information in the second feature set, based on the area and position of the face and the depth image;
a semantic indicative feature subunit, configured to obtain concept closeness based on the color moment feature, the wavelet texture feature, and the local keypoint feature; to calculate word semantic similarity based on the text information and concept vocabulary information; and to calculate a semantic indicative feature of a video segment in the second feature set based on the concept closeness and the word semantic similarity;
a second processing unit, configured to fuse the features in the second feature set using an iteratively re-weighted linear model to obtain a video summary.
7. The electronic device according to claim 6, wherein:
the extraction unit is further configured to obtain the area and position of the face in each video frame by a face detection algorithm;
the face attention feature subunit is further configured to calculate the face attention feature based on depth information from the depth image corresponding to the video frame and the set of pixels constituting the face.
8. The electronic device according to claim 6, wherein the text acquisition unit is further configured to:
obtain text information related to the video content from the audio signal of the video frame using speech recognition technology; or,
obtain text information related to the video content from the subtitles of the video frame.
9. The electronic device according to claim 8, the electronic device further comprising:
a training unit, configured to train a support vector machine based on the color moment feature, the wavelet texture feature, and the local keypoint feature;
wherein the semantic indicative feature subunit is further configured to perform semantic concept detection on the color moment feature, the wavelet texture feature, and the local keypoint feature with the SVM to obtain the concept closeness.
10. The electronic device according to claim 9, wherein the second processing unit comprises:
a linear superposition subunit, configured to linearly superpose the features in the second feature set according to feature weight values to obtain the saliency score of each video segment;
a video summary subunit, configured to, according to a preset summary length, select video segments one by one as the video summary in descending order of their saliency scores.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510535580.9A CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Publications (2)
Publication Number | Publication Date |
---|---|
CN105228033A CN105228033A (en) | 2016-01-06 |
CN105228033B true CN105228033B (en) | 2018-11-09 |
Family
ID=54996666
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510535580.9A Active CN105228033B (en) | 2015-08-27 | 2015-08-27 | Video processing method and electronic device
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105228033B (en) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9936239B2 (en) * | 2016-06-28 | 2018-04-03 | Intel Corporation | Multiple stream tuning |
CN106355171A (en) * | 2016-11-24 | 2017-01-25 | 深圳凯达通光电科技有限公司 | Video monitoring internetworking system |
CN106934397B (en) | 2017-03-13 | 2020-09-01 | 北京市商汤科技开发有限公司 | Image processing method and device and electronic equipment |
CN107222795B (en) * | 2017-06-23 | 2020-07-31 | 南京理工大学 | Multi-feature fusion video abstract generation method |
CN107979764B (en) * | 2017-12-06 | 2020-03-31 | 中国石油大学(华东) | Video subtitle generating method based on semantic segmentation and multi-layer attention framework |
CN109413510B (en) * | 2018-10-19 | 2021-05-18 | 深圳市商汤科技有限公司 | Video abstract generation method and device, electronic equipment and computer storage medium |
CN111327945B (en) | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
CN109932617B (en) * | 2019-04-11 | 2021-02-26 | 东南大学 | Self-adaptive power grid fault diagnosis method based on deep learning |
CN110347870A (en) * | 2019-06-19 | 2019-10-18 | 西安理工大学 | The video frequency abstract generation method of view-based access control model conspicuousness detection and hierarchical clustering method |
CN110225368B (en) * | 2019-06-27 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Video positioning method and device and electronic equipment |
CN111984820B (en) * | 2019-12-19 | 2023-10-27 | 重庆大学 | Video abstraction method based on double self-attention capsule network |
CN113158720B (en) * | 2020-12-15 | 2024-06-18 | 嘉兴学院 | Video abstraction method and device based on dual-mode feature and attention mechanism |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7409407B2 (en) * | 2004-05-07 | 2008-08-05 | Mitsubishi Electric Research Laboratories, Inc. | Multimedia event detection and summarization |
US8467610B2 (en) * | 2010-10-20 | 2013-06-18 | Eastman Kodak Company | Video summarization using sparse basis function combination |
-
2015
- 2015-08-27 CN CN201510535580.9A patent/CN105228033B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1685344A (en) * | 2002-11-01 | 2005-10-19 | 三菱电机株式会社 | Method for summarizing unknown content of video |
WO2007099496A1 (en) * | 2006-03-03 | 2007-09-07 | Koninklijke Philips Electronics N.V. | Method and device for automatic generation of summary of a plurality of images |
CN101743596A (en) * | 2007-06-15 | 2010-06-16 | 皇家飞利浦电子股份有限公司 | Method and apparatus for automatically generating summaries of a multimedia file |
CN103210651A (en) * | 2010-11-15 | 2013-07-17 | 华为技术有限公司 | Method and system for video summarization |
KR20130061058A (en) * | 2011-11-30 | 2013-06-10 | 고려대학교 산학협력단 | Video summary method and system using visual features in the video |
CN104508682A (en) * | 2012-08-03 | 2015-04-08 | 柯达阿拉里斯股份有限公司 | Identifying key frames using group sparsity analysis |
CN102880866A (en) * | 2012-09-29 | 2013-01-16 | 宁波大学 | Method for extracting face features |
CN103200463A (en) * | 2013-03-27 | 2013-07-10 | 天脉聚源(北京)传媒科技有限公司 | Method and device for generating video summary |
Non-Patent Citations (2)
Title |
---|
Hierarchical 3D kernel descriptors for action recognition using depth sequences; Yu Kong et al.; 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition; 2015-05-08; full text * |
Multi-scale information maximization based visual attention modeling for video summarization; Naveed Ejaz et al.; 2012 6th International Conference on Next Generation Mobile Applications, Services and Technologies; 2012-09-14; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN105228033A (en) | 2016-01-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105228033B (en) | Video processing method and electronic device | |
Gao et al. | Human action monitoring for healthcare based on deep learning | |
Gao et al. | Discriminative multiple canonical correlation analysis for information fusion | |
US9176987B1 (en) | Automatic face annotation method and system | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
Yang et al. | Grounded semantic role labeling | |
US20110243452A1 (en) | Electronic apparatus, image processing method, and program | |
Eroglu Erdem et al. | BAUM-2: A multilingual audio-visual affective face database | |
Chanti et al. | Improving bag-of-visual-words towards effective facial expressive image classification | |
Haq et al. | Video summarization techniques: a review | |
Paleari et al. | Towards multimodal emotion recognition: a new approach | |
Abebe et al. | A long short-term memory convolutional neural network for first-person vision activity recognition | |
Abebe et al. | Inertial-vision: cross-domain knowledge transfer for wearable sensors | |
Lv et al. | Storyrolenet: Social network construction of role relationship in video | |
Prabhu et al. | Facial Expression Recognition Using Enhanced Convolution Neural Network with Attention Mechanism. | |
Rapantzikos et al. | Spatiotemporal features for action recognition and salient event detection | |
Nguyen et al. | Type-to-track: Retrieve any object via prompt-based tracking | |
Shin et al. | Dynamic Korean sign language recognition using pose estimation based and attention-based neural network | |
Shao et al. | TAMNet: two attention modules-based network on facial expression recognition under uncertainty | |
Afdhal et al. | Emotion recognition using the shapes of the wrinkles | |
Sun et al. | Camera-assisted video saliency prediction and its applications | |
CN114510942A (en) | Method for acquiring entity words, and method, device and equipment for training model | |
Ayache et al. | CLIPS-LSR Experiments at TRECVID 2006. | |
CN113821669A (en) | Searching method, searching device, electronic equipment and storage medium | |
Li et al. | Multi-feature hierarchical topic models for human behavior recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |