CN101539925A - Audio/video file-abstracting method based on attention-degree analysis - Google Patents

Audio/video file-abstracting method based on attention-degree analysis

Info

Publication number
CN101539925A
Authority
CN
China
Prior art keywords
audio
attention
classification
video
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810102344A
Other languages
Chinese (zh)
Inventor
郑轶佳
黄庆明
蒋树强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN200810102344A priority Critical patent/CN101539925A/en
Publication of CN101539925A publication Critical patent/CN101539925A/en
Pending legal-status Critical Current

Abstract

The invention provides an audio file-abstracting method based on attention-degree analysis and an audio/video file-abstracting method built on it. The audio file-abstracting method comprises the following steps: 1) classify the audio according to the typical sound types in the audio file using a binary hierarchical classification tree algorithm, where the classification tree selects different features and a different classifier at each classification layer; 2) from the audio classification results, build an attention-degree analysis model for each typical sound type to obtain its attention-degree curve; and 3) determine the abstract from the attention-degree curves of the typical sound types. Combined with visual-modality analysis of the audio/video file, the method produces better audio/video abstracts. It classifies audio well, and because the models are built from features that accord with human subjective emotional laws, the framework is highly extensible and can be widely applied to many kinds of audio/video files, such as sports, movies/TV, news and interviews.

Description

An audio/video file-abstracting method based on attention-degree analysis
Technical field
The present invention relates to the field of audio/video analysis, and more particularly to content-based audio/video summarization; it concerns an audio file-abstracting method based on attention-degree analysis and an audio/video file-abstracting method based on that method.
Background technology
As a dynamic, intuitive and vivid digital medium, audio/video data carries a large amount of rich semantic information and appears in more and more information services and applications. How to mine effective content information from large volumes of audio/video data automatically and efficiently, build audio/video abstracts and extract the highlight segments in them has become a frontier problem of current content-based audio/video analysis.
Audio/video data is stored in unstructured form. Building audio/video abstracts and extracting the highlight segments makes it convenient, on the one hand, for users to perform semantics-based quick retrieval and browsing of audio/video databases, which helps the construction of multimedia digital libraries. On the other hand, with the wide application of multimedia technology in personal digital mobile devices (mobile phones, palm PCs, e-commerce terminals and so on), it also satisfies people's growing demand to browse audio/video data anytime and anywhere. Moreover, the limited bandwidth of wireless transmission equipment requires that the most valuable, i.e. the most exciting, information be obtained within that bandwidth to save download cost; audio/video abstraction technology satisfies this customized demand of mobile users.
Current content-based audio/video analysis falls mainly into two classes. One is the understanding of objective facts in the video: the analyzed objects have objective attributes that do not change with people's emotions or with differences between individuals. The other is the understanding of the emotional information conveyed in the audio/video file, based on people's subjective perception of the audio and video. The present invention focuses on the latter. Some fragments in an audio/video file usually attract the audience's attention more than the rest of the content, arouse the audience's sympathy and influence their emotions; such fragments have a higher emotional attention degree (attention). Emotional attention-degree analysis, which can also be called attention analysis, aims to obtain these fragments of higher attention automatically from the audio/video file, so as to help generate the audio/video abstract and facilitate audio/video transmission and personalized customization.
In the prior art, although there is some work analyzing audio attention in audio/video files, research on the attention of the auditory modality is still insufficient. A comparatively typical work on audio attention analysis is the method described in the article "A generic framework of user attention model and its application in video summarization", Yu-Fei Ma, Xian-Sheng Hua, Lie Lu, Hong-Jiang Zhang, IEEE Transactions on Multimedia, 2005. The method is rather simple; its main content is as follows:
First, a single classifier, trained and tested on audio low-level features, is used to classify the audio in the video file;
Then, according to the audio classification results, attention-degree analysis models are built for the typical sound types in the video, giving an attention-degree curve for each sound type;
From the low-level feature angle, the main factors affecting user attention in the audio, the volume and the volume change, are modeled bottom-up:

M_as = Ē_avr · Ē_peak,  Ē_avr = E_avr / MaxE_avr,  Ē_peak = E_peak / MaxE_peak

where Ē_avr and Ē_peak are respectively the normalized audio average energy and the normalized audio average energy peak; E_avr and E_peak are the audio average energy and the average energy peak; MaxE_avr and MaxE_peak are the maxima of the two.
The mid-level affective features M_speech and M_music model top-down the speech and music factors that affect user attention in the audio:

M_speech = N^w_speech / N^w_total

M_music = N^w_music / N^w_total

where M_speech and M_music are the mid-level models of the speech and music factors affecting user attention, and N^w_speech, N^w_music and N^w_total are respectively the numbers of speech subsegments, music subsegments and all sound subsegments within a moving window w.
Finally, the abstract is determined from the attention curves of the above sound types. The above models are fused by linear weighting to obtain the final user attention model, from which the abstract is determined:

M = λ_1 × M_as + λ_2 × M_speech + λ_3 × M_music
On the attention timing curve of the audio/video file formed by this model, the peak segments exceeding a preset threshold are selected as the highlight abstract fragments of the audio/video file.
This analysis approach classifies audio poorly, so the attention curves of the subsequent sound types have low precision; it is only applicable to audio attention analysis in particular types of audio/video files and does not analyze the other factors that influence user attention in the audio, so its scope of application is narrow.
Summary of the invention
The objective of the invention is to overcome the poor audio classification and narrow applicability of existing attention-degree-based abstracting methods, and thereby provide an audio/video abstract generation method that classifies audio effectively and is applicable to content analysis of all kinds of audio/video.
To achieve this objective, according to one aspect of the present invention, an audio file-abstracting method based on attention-degree analysis is provided, comprising the following steps:
1) classifying the audio according to the typical sound types in the audio file using a binary hierarchical classification tree algorithm, wherein the binary hierarchical classification tree algorithm selects different features and a different classifier at each classification layer;
2) according to the audio classification results, building an attention-degree analysis model for each typical sound type and obtaining the attention-degree curve of each typical sound type;
3) determining the abstract according to the attention-degree curves of the typical sound types.
According to a further aspect of the invention, the above step 1) comprises the following steps:
11) segmenting the audio file into audio examples;
12) classifying the audio examples according to the typical sound types using the binary hierarchical classification tree algorithm.
According to another aspect of the invention, adjacent audio examples overlap by 50%.
According to another aspect of the invention, in the above step 2) the attention-degree analysis models of the typical sound types are built from the following factors: energy, pitch and average zero-crossing rate.
According to another aspect of the invention, the above step 2) further comprises the step of normalizing the results computed by the attention-degree analysis models to the interval [0, 1].
According to another aspect of the invention, the normalization adopts the Gaussian normalization criterion.
According to another aspect of the invention, the above step 3) fuses the attention-degree curves with a sequential decision fusion method and then determines the abstract.
According to another aspect of the invention, the above typical sound types comprise synchronous highlight sounds and asynchronous highlight sounds.
According to another aspect of the invention, the above step 3) comprises the following steps:
coarsely locating the right boundary of a highlight segment using the asynchronous highlight sound curves;
precisely locating the boundaries of the highlight segment using speech boundary detection.
In accordance with a further aspect of the present invention, a step of pre-emphasizing the audio file is further included before step 1).
In accordance with a further aspect of the present invention, an audio/video file-abstracting method based on attention-degree analysis is also provided, comprising the following steps:
A) classifying the audio according to the typical sound types in the audio file using a binary hierarchical classification tree algorithm, wherein the binary hierarchical classification tree algorithm selects different features and a different classifier at each classification layer;
B) according to the audio classification results, building an attention-degree analysis model for each typical sound type and obtaining the attention-degree curve of each typical sound type;
C) modeling the temporal attention and spatial attention of the video file and obtaining the visual excitement curves;
D) determining the abstract according to the attention-degree curves of the typical sound types and the visual excitement curves.
The present invention adopts an audio classification tree algorithm based on a binary hierarchical structure with multi-classifier selection, which classifies audio well. Features that accord with human subjective emotional laws are chosen for modeling; by analyzing the main factors that influence audience attention in the audio/video file, the variation of emotional attention over the audio/video is obtained and the abstract is then generated. The framework is highly extensible and can be widely applied to all kinds of audio/video files such as sports, movies/TV, news and interviews. The asynchronous highlight factors in the unified model are combined by a nonlinear fusion method, which gives the method robustness and predictive power.
Description of drawings
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the audio abstracting algorithm based on emotional attention-degree analysis.
Fig. 2 is a schematic diagram of the audio classification algorithm based on a binary hierarchical structure and multi-classifier selection.
Fig. 3 is a flowchart of the audio/video abstracting algorithm based on emotional attention-degree analysis.
Fig. 4 is a schematic diagram of the sequential decision fusion algorithm.
Fig. 5 is a schematic diagram of the emotional attention curve of a segment of a video file.
Fig. 6 is a schematic diagram of the method for determining the boundaries of highlight candidate segments.
Embodiment
The present invention analyzes audio/video files from the angle of human subjective emotional cognition, chooses the most effective features for modeling, and proposes a modeling method that accords with the laws of human subjective perception.
Fig. 1 is an algorithm flowchart according to an embodiment of the invention. The concrete steps of the method are as follows:
First, the typical sound types in the audio file, particularly the sound types that can convey emotion, are selected; the binary hierarchical classification tree algorithm is used to classify according to these typical sound types, and the audio file is labeled along the timeline with the different typical sound types.
Each class of audio file has its own representative typical sound types, which usually carry richer semantic information and more readily attract the audience's attention. For example, in an interview, speech, silence, the audience's laughter and applause are the typical sound types, and a highlight is generally followed immediately by the audience's laughter or cheering. In the audio of movie files, speech, silence, music and similar sounds are the typical sound types. In the audio of a sports broadcast, the spectators' cheers, the announcer's commentary and the sounds of the game itself are the typical sound types: the highlight of a goal or score is generally followed by spectators' cheers or excited commentary, and a goal is always accompanied by the sound of hitting the ball. Prominent sound types that occur simultaneously with a highlight are called synchronous highlight sounds, for example the hitting sound above; sound types that closely follow a highlight are called asynchronous highlight sounds, for example the laughter and cheers after a highlight mentioned above. A synchronous highlight sound model is a model corresponding to a synchronous highlight sound type, and an asynchronous highlight sound model is a model corresponding to an asynchronous highlight sound type.
This step is described by taking the audio file of a sports match as an example. To reduce the influence of sharp noise and boost the high-frequency signal, the original audio data is first pre-emphasized. Let x(n) be the original signal and y(n) the processed signal; then:

y(n) = x(n) − 0.97 · x(n−1)    formula (1)

The processed audio file is then divided into fixed-length audio examples (audio samples) with 50% overlap between adjacent audio examples; these overlapping audio examples are the elementary units of subsequent processing such as classification.
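For illustration, the pre-emphasis of formula (1) and the 50%-overlap segmentation can be sketched in Python as follows (a minimal sketch; the 1-second example length, the 16 kHz sample rate and the raw-PCM reading in the usage comment are assumptions, not values fixed by this description):

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """Formula (1): y(n) = x(n) - 0.97 * x(n-1)."""
    x = np.asarray(x, dtype=float)
    y = np.copy(x)
    y[1:] -= alpha * x[:-1]
    return y

def split_into_examples(signal, sample_rate=16000, example_sec=1.0):
    """Cut the audio into fixed-length audio examples with 50% overlap
    between adjacent examples; length and rate are illustrative."""
    length = int(example_sec * sample_rate)
    hop = length // 2                      # 50% overlap
    return [signal[s:s + length]
            for s in range(0, len(signal) - length + 1, hop)]

# usage sketch (file name and format are hypothetical):
# audio = np.fromfile("match.pcm", dtype=np.int16)
# examples = split_into_examples(pre_emphasize(audio))
```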
For audio classification, prior-art algorithms basically use no hierarchical structure: they either use a single classifier (support vector machine SVM, hidden Markov model HMM, AdaBoost classifier, etc.) with fixed classification features, or simply use a single-layer structure that combines several classifiers by voting, and their classification effect is relatively poor. The invention provides a method that classifies the audio examples with a binary hierarchical classification tree algorithm. After classification, the whole audio file has been divided along the timeline into fragments of the individual sound types; this classification result serves as the basis for modeling each representative sound type and also as the basis for subsequently determining the boundaries of highlight candidate segments. The method is described in detail as follows:
Training and testing for audio classification are carried out layer by layer. Each layer can select the low-level audio features that best discriminate the two sound classes of that layer, and can use a different classifier, as shown in Fig. 2.
Regarding the selection of low-level audio features: the first layer uses two features, energy and zero-crossing rate, to distinguish silence from non-silence. The second layer uses MFCC (12 dimensions), pitch, silence ratio, low-frequency energy ratio and high zero-crossing-rate ratio to distinguish speech from non-speech. MFCC are spectral coefficients extracted in the Mel-scale frequency domain; they describe the nonlinear characteristics of the human ear's frequency perception and are commonly used in speech recognition and speaker identification. Pitch is the tonal feature of speech and one of the important features for distinguishing speech from non-speech. The silence ratio is an audio-example feature defined as follows:
silencerate = silence / all    formula (2)
That is, the percentage of sampling points in an audio example that are silent. Because speech contains more pauses than other types of sound, the silence ratio is a good feature for distinguishing speech from other sound types. The low-frequency energy ratio is a frequency-domain audio-example feature: in non-silent audio, speech contains more silence than other sound types, so the proportion of the speech signal whose frequency-domain energy is below a certain threshold is higher than for other types; this feature is therefore also a notable feature for distinguishing speech from non-speech. The low-frequency energy ratio is defined as:
LERate = (1 / 2N) · Σ_{n=0}^{N−1} [ sgn( avg(E)/2 − E(n) ) + 1 ]    formula (3)
High zero-crossing rate ratio is defined as:
ZCRRate = (1 / 2N) · Σ_{n=0}^{N−1} [ sgn( ZCR(n) − 1.5 · avgZCR ) + 1 ]    formula (4)
In the two formulas above, N is the number of frames in the audio example, E(n) is the frequency-domain energy of frame n, avg denotes averaging, and sgn is the sign function. The third layer uses short-time average energy, zero-crossing rate and bandwidth to discriminate cheering from non-cheering; in cheering the rate of change of the zero-crossing rate is lower than in other audio types, so this audio-example feature is one of the good features for identifying cheering. The fourth layer uses sub-band energy, bandwidth, zero-crossing rate and frequency-centroid features to classify ball-hitting sounds against other sound types. The above merely takes the audio data of a sports match as the processing example; the choice of classification features can be extended and updated for other audio data along the same lines.
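The following sketch shows one way some of the simpler per-layer features could be computed (a minimal sketch, not the patented implementation; the frame length, hop size and silence threshold are assumed values, the signal is assumed normalized to [-1, 1], and the sgn-based formulas (2)-(4) are realized here as simple fraction counts):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=200):
    """Split an audio example into short frames (frame/hop sizes illustrative)."""
    return np.array([x[i:i + frame_len]
                     for i in range(0, len(x) - frame_len + 1, hop)])

def zero_crossing_rate(frame):
    """Average number of sign changes per sample."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))))) / 2.0

def silence_ratio(x, threshold=0.01):
    """Formula (2): fraction of silent samples, using an assumed amplitude threshold."""
    return float(np.mean(np.abs(x) < threshold))

def low_energy_ratio(frames):
    """Formula (3), read as the fraction of frames whose energy is below
    half the average frame energy."""
    energy = np.sum(frames.astype(float) ** 2, axis=1)
    return float(np.mean(energy < 0.5 * energy.mean()))

def high_zcr_ratio(frames):
    """Formula (4), read as the fraction of frames whose ZCR exceeds 1.5
    times the average ZCR."""
    zcr = np.array([zero_crossing_rate(f) for f in frames])
    return float(np.mean(zcr > 1.5 * zcr.mean()))
```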
Because of differences in sampling, sample distribution and feature extraction methods, different classifiers each show a preference for particular features or particular classification problems; combining the strengths of several classifiers through classifier selection and decision therefore improves classification accuracy and achieves better performance than a single classifier. The present invention takes as candidate classifiers several different classifiers that are widely used in audio classification and classify well, such as the support vector machine (SVM), the hidden Markov model (HMM) and the Gaussian mixture model (GMM). Let the set of candidate classifiers be F = {F_1, F_2, ..., F_l} and the training sample set of layer i be X_i = {X_i1, X_i2, ..., X_in}, the audio categories of layer i being the two sound classes of that layer. The classifier F_j selected for layer i is:

F_j = arg max_j { P_i(F_j) · R_i(F_j) / ( P_i(F_j) + R_i(F_j) ) }    formula (5)

where P_i(F_j) and R_i(F_j), defined by formulas (6) and (7), measure the classification performance of classifier F_j on the two sound classes of layer i over the training set X_i.
Here max takes the maximum of the bracketed expression and arg returns the value of the parameter j that attains it, so the meaning of formula (5) is that the classifier chosen as optimal for the two sound classes of this layer is the one that maximizes the bracketed function. Using this classifier to classify unknown data at test time reduces computational complexity and improves operating efficiency.
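A sketch of the per-layer classifier selection of formula (5) is given below. The use of scikit-learn, the SVM/AdaBoost stand-ins for the SVM/HMM/GMM candidates named above, the held-out split, and the reading of P and R as precision and recall are all assumptions of the sketch:

```python
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

def select_layer_classifier(X, y, candidates=None):
    """For one layer of the binary classification tree, pick the candidate
    classifier maximizing P*R/(P+R) as in formula (5).  X, y are the layer's
    feature vectors and binary labels (0/1); the candidate models and the
    held-out split are illustrative stand-ins."""
    if candidates is None:
        candidates = {"svm": SVC(kernel="rbf"), "adaboost": AdaBoostClassifier()}
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
    best = (None, None, -1.0)
    for name, clf in candidates.items():
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_va)
        p = precision_score(y_va, pred, zero_division=0)
        r = recall_score(y_va, pred, zero_division=0)
        score = p * r / (p + r) if (p + r) > 0 else 0.0
        if score > best[2]:
            best = (name, clf, score)
    return best  # (name, fitted classifier, score)
```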
On the basis of accurate classification of the representative sound types, the present invention builds an attention-degree analysis model for each representative sound type in the audio file and obtains the attention curve of each sound type.
The principal factors influencing user attention in the audio are energy (Energy_i), pitch (Pitch_i) and the average zero-crossing rate (avgZCR): the magnitude of the energy measures how strong each kind of sound is, the height of the pitch measures how sharp the speech is, and the average zero-crossing rate measures how urgent the music is. These are the preferred factors; other factors, such as bandwidth, linear prediction coefficients and sub-band energy, can also be incorporated. The typical sound types in the audio of, for example, a tennis match include speech, applause, laughter, music and ball-hitting sounds; the present invention expresses the attention models of these typical sound types with the above factors as follows:
M_spe = (Σ_{i=1}^{n} Energy_i) · (Σ_{i=1}^{n} Pitch_i) / n² × 100%    formula (8)

M_app = (Σ_{i=1}^{p} Energy_i) / p × 100%    formula (9)

M_lau = (Σ_{i=1}^{q} Energy_i) / q × 100%    formula (10)

M_mus = (Σ_{i=1}^{r} Energy_i) · avgZCR / r × 100%    formula (11)

M_hit = (Σ_{i=1}^{k} Energy_i) / k × 100%    formula (12)

where M_spe, M_app, M_lau, M_mus and M_hit are respectively the attention models for speech, applause, laughter, music and hitting sound, and n, p, q, r, k are the numbers of sampling points in the corresponding audio examples.
The values computed by each sound-type attention model above are normalized to the interval [0, 1], for example using the Gaussian normalization criterion. For a given audio file, connecting the attention values of each sound type over successive audio examples yields several attention curves over time: the speech attention curve C_spec, laughter attention curve C_lauc, applause attention curve C_appc, music attention curve C_mus and hitting-sound attention curve C_hit. These curves reflect, from different aspects, how the audience's attention varies while listening to the file.
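As an illustration, formulas (8)-(12) and the normalization step might be sketched as follows (the squared-sample energy, the precomputed pitch track, and the sigmoid-of-z-score reading of "Gaussian normalization" are assumptions of this sketch):

```python
import numpy as np

def attention_value(samples, pitch=None, sound_type="speech"):
    """Per-example attention following formulas (8)-(12).  samples is one
    audio example (assumed normalized floats); pitch is an assumed
    precomputed per-sample pitch track, needed only for speech; Energy_i is
    taken here as the squared sample value (an assumption)."""
    energy = np.asarray(samples, dtype=float) ** 2
    n = len(samples)
    if sound_type == "speech":                          # formula (8)
        return float(energy.sum() * np.sum(pitch) / (n * n))
    if sound_type in ("applause", "laughter", "hit"):   # formulas (9), (10), (12)
        return float(energy.sum() / n)
    if sound_type == "music":                           # formula (11)
        zcr = float(np.mean(np.abs(np.diff(np.sign(samples))))) / 2.0
        return float(energy.sum() * zcr / n)
    raise ValueError("unknown sound type: %s" % sound_type)

def gaussian_normalize(values):
    """Map raw attention values to [0, 1]; a sigmoid of the z-score is used
    here as one possible reading of the Gaussian normalization criterion."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / (v.std() + 1e-9)
    return 1.0 / (1.0 + np.exp(-z))
```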
The attention curves of all sound types are then fused to determine how the excitement of the audio file varies over time, represented by a highlight attention timing curve. The attention models of the different sound types could be fused by the known linear-weighting approach, which however ignores the special role of the asynchronous highlight sound models. The present invention therefore provides a preferred sequential decision fusion method, a nonlinear fusion that better matches human subjective perception and has stronger robustness and predictive power. The highlight attention of an audio example obtained with the sequential decision fusion algorithm is:

M_a = ( λ_spe · M_spe + λ_mus · M_mus + λ_hit · M_hit ) · e^{Σ_{i=1}^{p} M_app} · e^{Σ_{i=1}^{q} M_lau} * G(n)    formula (13)

where λ_spe, λ_mus and λ_hit are the weights of the synchronous highlight sound models, all greater than 0 with λ_spe + λ_mus + λ_hit = 1; p and q are the durations (in seconds) of the asynchronous highlight sound clips (applause and laughter); and G(n) is a Gaussian smoothing window with smoothing parameter n, preferably n = 60, applied by convolution. The curve formed by the highlight attention of the audio file is its highlight attention timing curve.
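A minimal sketch of this sequential decision fusion, assuming all model outputs have been sampled on a common per-example time axis, could look like this (the weights, the use of scipy's Gaussian filter, and treating the smoothing parameter as the filter's sigma are illustrative choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def fuse_audio_attention(m_spe, m_mus, m_hit, app_sum, lau_sum,
                         weights=(0.5, 0.3, 0.2), sigma=60):
    """Formula (13), sketched.  m_spe/m_mus/m_hit are the synchronous model
    values per audio example; app_sum/lau_sum hold the summed applause and
    laughter attention of the asynchronous clip following each example
    (0 where there is none).  weights are the lambdas (sum to 1)."""
    w_spe, w_mus, w_hit = weights
    sync = (w_spe * np.asarray(m_spe) + w_mus * np.asarray(m_mus)
            + w_hit * np.asarray(m_hit))
    boosted = sync * np.exp(np.asarray(app_sum)) * np.exp(np.asarray(lau_sum))
    return gaussian_filter1d(boosted, sigma=sigma)   # smoothing with G(n)
```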
As in the known technique, the segments of the audio file's highlight attention timing curve that exceed a threshold can be selected directly as highlight candidate segments (the threshold value l can be set as needed). This approach is particularly suitable for audio files in which no asynchronous highlight factor exists, for example horror films or documentaries.
If asynchronous highlight factors do exist in the audio file, for example in sitcoms or talk shows, the preferred scheme described below can also be adopted. Taking the sports match above as an example, a burst of cheering generally occurs immediately after a highlight.
First, the asynchronous highlight sound curves C_lauc and C_appc are used to coarsely locate the highlight positions occurring before the applause or cheering. On the basis of the accurate classification of the whole audio file, the left boundary of the asynchronous highlight sound (laughter, etc.) is taken as the right boundary of the highlight candidate segment. Checking backwards from there, if the length spe of the speech segment before it is greater than a preset threshold thr, the start of that speech segment is set as the left boundary of the highlight candidate; otherwise the search continues backwards to the start of the previous speech segment until the accumulated length is greater than or equal to thr.
Speech boundary detection (silence detection) is then used to locate the left and right boundaries of these highlight segments precisely. Because a short pause follows the end of a complete utterance in a speech segment, these pause points in the speech must be found when finalizing the boundaries, so as not to destroy the integrity of the video. The video clip between the left and right boundaries is the final abstract, as shown in Fig. 3.
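The coarse boundary search described above might be sketched as follows (the (label, start, end) segment representation, the 5-second threshold thr, and the omission of the final silence-detection refinement are assumptions of the sketch):

```python
def locate_highlight(segments, async_index, thr=5.0):
    """Coarse boundaries of one highlight candidate.  segments is the
    classified audio as (label, start, end) tuples in temporal order (times
    in seconds); async_index points at the asynchronous highlight sound
    (laughter/applause) that follows the highlight; thr is an assumed
    minimum speech length in seconds."""
    right = segments[async_index][1]   # right boundary: where laughter/applause begins
    left = right
    accumulated = 0.0
    i = async_index - 1
    # walk backwards through the preceding speech segments until enough
    # speech has been accumulated; silence-detection refinement is omitted
    while i >= 0 and segments[i][0] == "speech":
        _, start, end = segments[i]
        accumulated += end - start
        left = start
        if accumulated >= thr:
            break
        i -= 1
    return left, right
```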
The above is the audio abstracting method; it can process not only pure audio files but also the auditory modality of audio/video files. For the latter, video attention analysis is added on top of this method, so that factors from both the auditory and the visual modality are combined to abstract all kinds of audio/video files more completely; the combined audio/video attention-degree analysis flow is shown in Fig. 4.
The concrete steps of the video attention-degree analysis are as follows:
Image features in the video file such as color, texture and shape can be computed from a single frame and are called "intra-frame features". Correspondingly, image features that must be obtained from at least two frames are called "inter-frame features". Because a highlight in a video file usually lasts many frames, the particular condition of a single frame usually has little influence on the whole video. Therefore, for efficiency, the present invention builds the visual attention criteria on the "inter-frame features", which are closely related to the highlights.
The visual modality contains not only spatial information but also temporal information, and both influence the user's attention. The present invention expresses attention separately for the spatial and the temporal information of the visual modality. The average motion vector usually characterizes the motion between video frames well: when the average motion vector within one second is large, the scene has large motion intensity and attracts the audience's attention more easily. Although motion vectors sometimes do not truly reflect the motion in the video, using this feature reduces computational complexity and gives correct results in most cases. The present invention expresses the visual spatial attention M_spa as:

M_spa = (Σ_{i=1}^{k} MV_i) / k × 100%    formula (14)

where MV_i is the motion vector of frame i obtained from the decoding process and k is the video frame rate (for example 25 frames per second).
In the temporal dimension, the shot change rate is normally used to describe camera motion. When shots switch frequently, the video content is usually at its most intense and the audience's attention is more easily attracted. The visual temporal attention M_tem is expressed as:

M_tem = e^{(1 − (n(k) − p(k)) / δ)} × 100%    formula (15)

where p(k) and n(k) are respectively the frame numbers of the shot boundaries nearest to frame k on its left and right; the parameter δ is a constant, determined by n(k) − p(k), that keeps the value of M_tem distributed between 0% and 100%.
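Formulas (14) and (15) might be computed roughly as below (the per-frame motion-vector magnitudes and the shot-boundary frame list are assumed to be available from the decoder and a shot-detection step, respectively):

```python
import numpy as np

def spatial_attention(motion_vec_mags):
    """Formula (14): mean motion-vector magnitude over the k frames of one
    second; motion_vec_mags holds the per-frame magnitudes of that second,
    assumed to come from the decoder."""
    return float(np.mean(motion_vec_mags))

def temporal_attention(frame_idx, shot_boundaries, delta, total_frames):
    """Formula (15): attention rises when the nearest shot boundaries around
    frame k are close together.  shot_boundaries is a sorted list of boundary
    frame numbers from a shot-detection step (assumed available)."""
    prev_b = max((b for b in shot_boundaries if b <= frame_idx), default=0)
    next_b = min((b for b in shot_boundaries if b > frame_idx), default=total_frames)
    return float(np.exp(1.0 - (next_b - prev_b) / delta))
```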
Similarly, the range of each visual excitement criterion can be limited to the interval [0, 1] with the Gaussian normalization criterion. For a given video file, formulas (14) and (15) yield two visual excitement curves over time: the visual spatial attention curve C_sc and the visual temporal attention curve C_tc. Together with the speech attention curve C_spec, laughter attention curve C_lauc, applause attention curve C_appc, music attention curve C_mus and hitting-sound attention curve C_hit, these curves are fused with the sequential decision fusion algorithm into the final attention timing curve of the audio/video file, as shown in Fig. 5. The emotional attention curve of a segment of an audio/video file is shown in Fig. 6.
Similarly to the audio abstract, the sequential decision fusion algorithm for the audio/video abstract is expressed as follows: the synchronous models M_spa, M_tem, M_spe, M_mus and M_hit are combined with the asynchronous models M_app and M_lau to determine how the excitement of the video file varies over time. The excitement of the video file obtained with this sequential decision fusion algorithm is:
M_a = ( λ_spa · M_spa + λ_tem · M_tem + λ_spe · M_spe + λ_mus · M_mus + λ_hit · M_hit ) · e^{Σ_{i=1}^{p} M_app} · e^{Σ_{i=1}^{q} M_lau} * G(n)    formula (16)

where λ_spa, λ_tem, λ_spe, λ_mus and λ_hit are the weights of the synchronous highlight models, all greater than 0 with λ_spa + λ_tem + λ_spe + λ_mus + λ_hit = 1; p and q are the durations (in seconds) of the asynchronous highlight sound models, the applause model and the laughter model; and G(n) is a Gaussian smoothing window with smoothing parameter n (for example n = 60).
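Formula (16) follows the same pattern as the audio-only fusion of formula (13); a minimal sketch, again with illustrative weights and Gaussian smoothing, is:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def fuse_av_attention(m_spa, m_tem, m_spe, m_mus, m_hit, app_sum, lau_sum,
                      weights=(0.2, 0.2, 0.3, 0.2, 0.1), sigma=60):
    """Formula (16), sketched: five synchronous models (visual spatial and
    temporal plus the audio speech/music/hit models), weighted and boosted
    by the asynchronous applause/laughter terms, then Gaussian-smoothed."""
    sync = sum(w * np.asarray(m) for w, m in
               zip(weights, (m_spa, m_tem, m_spe, m_mus, m_hit)))
    boosted = sync * np.exp(np.asarray(app_sum)) * np.exp(np.asarray(lau_sum))
    return gaussian_filter1d(boosted, sigma=sigma)
```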
The present invention is applicable to abstracting all types of audio and audio/video files; different file types require only minor adjustments when building the attention models, while the overall method stays unchanged. The method has low computational complexity, and the abstract fragments it obtains accord with the laws of human subjective perception; the audio and audio/video abstracts generated with the method in experiments achieved good results.
It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed technical solution is not limited by any particular exemplary teaching given.

Claims (13)

1. An audio file-abstracting method based on attention-degree analysis, comprising the following steps:
1) classifying the audio according to the typical sound types in the audio file using a binary hierarchical classification tree algorithm, wherein the binary hierarchical classification tree algorithm selects different features and a different classifier at each classification layer;
2) according to the audio classification results, building an attention-degree analysis model for each of the typical sound types and obtaining the attention-degree curve of each typical sound type;
3) determining the abstract according to the attention-degree curves of the typical sound types.
2. The method according to claim 1, characterized in that step 1) comprises the following steps:
11) segmenting the audio file into audio examples;
12) classifying the audio examples according to the typical sound types using the binary hierarchical classification tree algorithm.
3. The method according to claim 2, characterized in that adjacent audio examples overlap by 50%.
4. The method according to claim 2, characterized in that, in step 12), each classification layer adopts the audio features that best classify the two typical sound classes of that layer.
5. The method according to claim 2, characterized in that, in step 12), the classifier of each classification layer is selected according to the probability that the classifier classifies correctly.
6. The method according to claim 1, characterized in that, in step 2), the attention-degree analysis models of the typical sound types are built from the following factors: energy, pitch and average zero-crossing rate.
7. The method according to claim 1, characterized in that step 2) further comprises the step of normalizing the results computed by the attention-degree analysis models to the interval [0, 1].
8. The method according to claim 7, characterized in that the normalization adopts the Gaussian normalization criterion.
9. The method according to claim 1, characterized in that step 3) fuses the attention-degree curves with a sequential decision fusion method and then determines the abstract.
10. The method according to claim 1, characterized in that the typical sound types comprise synchronous highlight sounds and asynchronous highlight sounds.
11. The method according to claim 1, characterized in that step 3) comprises the following steps:
coarsely locating the right boundary of a highlight segment using the asynchronous highlight sound curves;
precisely locating the boundaries of the highlight segment using speech boundary detection.
12. The method according to claim 1, characterized in that a step of pre-emphasizing the audio file is further included before step 1).
13. An audio/video file-abstracting method based on attention-degree analysis, comprising the following steps:
A) classifying the audio according to the typical sound types in the audio file using a binary hierarchical classification tree algorithm, wherein the binary hierarchical classification tree algorithm selects different features and a different classifier at each classification layer;
B) according to the audio classification results, building an attention-degree analysis model for each of the typical sound types and obtaining the attention-degree curve of each typical sound type;
C) modeling the temporal attention and spatial attention of the video file and obtaining the visual excitement curves;
D) determining the abstract according to the attention-degree curves of the typical sound types and the visual excitement curves.
CN200810102344A 2008-03-20 2008-03-20 Audio/video file-abstracting method based on attention-degree analysis Pending CN101539925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810102344A CN101539925A (en) 2008-03-20 2008-03-20 Audio/video file-abstracting method based on attention-degree analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810102344A CN101539925A (en) 2008-03-20 2008-03-20 Audio/video file-abstracting method based on attention-degree analysis

Publications (1)

Publication Number Publication Date
CN101539925A (en) 2009-09-23

Family

ID=41123115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810102344A Pending CN101539925A (en) 2008-03-20 2008-03-20 Audio/video file-abstracting method based on attention-degree analysis

Country Status (1)

Country Link
CN (1) CN101539925A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102385861A (en) * 2010-08-31 2012-03-21 国际商业机器公司 System and method for generating text content summary from speech content
CN102385861B (en) * 2010-08-31 2013-07-31 国际商业机器公司 System and method for generating text content summary from speech content
US8868419B2 (en) 2010-08-31 2014-10-21 Nuance Communications, Inc. Generalizing text content summary from speech content
CN103428406B (en) * 2012-05-23 2017-11-07 中兴通讯股份有限公司 Monitoring video analysis method and device
CN103428406A (en) * 2012-05-23 2013-12-04 中兴通讯股份有限公司 Method and device for analyzing monitoring video
CN102750383B (en) * 2012-06-28 2014-11-26 中国科学院软件研究所 Spiral abstract generation method oriented to video content
CN102750383A (en) * 2012-06-28 2012-10-24 中国科学院软件研究所 Spiral abstract generation method oriented to video content
CN103942247A (en) * 2014-02-25 2014-07-23 华为技术有限公司 Information providing method and device of multimedia resources
CN104469547A (en) * 2014-12-10 2015-03-25 西安理工大学 Video abstraction generation method based on arborescence moving target trajectory
CN104469547B (en) * 2014-12-10 2017-06-06 西安理工大学 A kind of video abstraction generating method based on tree-shaped movement objective orbit
CN106506448A (en) * 2016-09-26 2017-03-15 北京小米移动软件有限公司 Live display packing, device and terminal
CN106506448B (en) * 2016-09-26 2021-04-23 北京小米移动软件有限公司 Live broadcast display method and device and terminal
CN108307250A (en) * 2018-01-23 2018-07-20 浙江大华技术股份有限公司 A kind of method and device generating video frequency abstract
CN110532422A (en) * 2019-08-07 2019-12-03 北京三快在线科技有限公司 Cover generating means and method, electronic equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101539925A (en) Audio/video file-abstracting method based on attention-degree analysis
US8793127B2 (en) Method and apparatus for automatically determining speaker characteristics for speech-directed advertising or other enhancement of speech-controlled devices or services
US20110320197A1 (en) Method for indexing multimedia information
CN103700370A (en) Broadcast television voice recognition method and system
KR20060020114A (en) System and method for providing music search service
WO2007114796A1 (en) Apparatus and method for analysing a video broadcast
CN107274916A (en) The method and device operated based on voiceprint to audio/video file
CN102073636A (en) Program climax search method and system
CN100508587C (en) News video retrieval method based on speech classifying identification
Cotton et al. Soundtrack classification by transient events
CN106302987A (en) A kind of audio frequency recommends method and apparatus
CN107480152A (en) A kind of audio analysis and search method and system
CN110019961A (en) Method for processing video frequency and device, for the device of video processing
Foucard et al. Multi-scale temporal fusion by boosting for music classification.
Ghosal et al. Speech/music classification using occurrence pattern of zcr and ste
Jang et al. Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel
Ghosal et al. Automatic male-female voice discrimination
Stappen et al. MuSe 2020--The First International Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop
Stappen et al. Muse 2020 challenge and workshop: Multimodal sentiment analysis, emotion-target engagement and trustworthiness detection in real-life media: Emotional car reviews in-the-wild
Iwan et al. Temporal video segmentation: detecting the end-of-act in circus performance videos
Ellis et al. Accessing minimal-impact personal audio archives
Bajpai et al. Combining evidence from subsegmental and segmental features for audio clip classification
Shao et al. Automatic summarization of music videos
Saz et al. Background-tracking acoustic features for genre identification of broadcast shows
Roach et al. Video genre verification using both acoustic and visual modes

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
C10 Entry into substantive examination
RJ01 Rejection of invention patent application after publication

Open date: 20090923

C12 Rejection of a patent application after its publication