CN101576955A

CN101576955A - Method and system for detecting advertisement in audio/video

Info

Publication number: CN101576955A
Application number: CNA2009100874283A
Authority: CN
Inventors: 李新辉; 王向东; 高扬; 钱跃良; 林守勋
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2009-06-22
Filing date: 2009-06-22
Publication date: 2009-11-11
Anticipated expiration: 2029-06-22
Also published as: CN101576955B

Abstract

The invention relates to a method and a system for detecting advertisements in audio/video. The method comprises: step 1, extracting the audio from the audio/video to be detected, and extracting the short-time energy and Mel-frequency cepstral coefficient (MFCC) features of each frame from the audio; and step 2, using the short-time energy and MFCC features of the frames to find, in the audio, two groups of frames whose mutual similarity satisfies a predetermined condition, where the frames within each group occupy consecutive positions in the audio; the audio/video segment corresponding to each group of frames in the audio/video to be detected is an advertisement. Compared with the prior art, the method and the system detect advertisement segments in audio/video more accurately and efficiently.

Description

Method and system for detecting advertisements in audio/video

Technical field

The present invention relates to the field of audio/video detection, and in particular to a method and system for detecting advertisements in audio/video.

Background technology

Advertisement detection means locating and marking the positions at which advertisements occur in video and audio programs. Automatic advertisement detection uses a computer to detect advertisement segments in a video or audio stream automatically and to locate each segment accurately.

Common automatic advertisement detection methods at present include rule-based methods, logo-based methods, shot-classification-based methods, and recognition-based methods.

Rule-based methods use a series of features and rules to distinguish advertisements from ordinary broadcast television programs, that is, non-advertisement content. Advertisements usually appear in groups, and each group is called an advertisement block. Advertisements and advertisement blocks have direct and indirect measurable features. The direct features include: limited length, with a single advertisement usually no longer than 30 seconds and an advertisement block no longer than 6 minutes; separation by 3 to 5 black frames, usually found between advertisements and non-advertisement programs and between consecutive advertisements; and volume, which is generally higher for advertisements than for television programs. The indirect features include: advertisements usually have a higher shot-change frequency and richer color variation than non-advertisement programs; and advertisements contain many still images, in particular the last scene, which is often a still image showing the product or company name. The problems of rule-based methods include: it is difficult to find unified rules for all categories of programs; the features selected to represent advertisements are sometimes not stable or reliable enough; and it is difficult to build a unified detection system on those features. For example, many rule-based methods detect advertisements by black frames, but many television stations no longer use black frames, and programs such as films may themselves contain many black frames. Moreover, a transition between a program segment and an advertisement segment does not necessarily contain black frames, and black frames may be inserted arbitrarily for editing purposes; these situations directly cause black-frame-based detection to fail. Therefore, rule-based methods mainly concentrate on detecting advertisements in programs of particular types, such as news programs.

Logo-based methods detect advertisements through the station logo of the television station. Where a station automatically hides its logo during commercial breaks, the presence of advertisements can be inferred from the absence of the logo, and edge detection can be used to determine whether the logo is present. The problem is that more and more television stations no longer hide the logo during commercial breaks, so this approach fails. For example, the relevant regulations of China's State Administration of Radio, Film and Television explicitly require all advertisements to carry the station logo. In addition, station logos are becoming increasingly complex and are sometimes translucent, which makes them difficult to detect.

Based on the method for shot classification, be camera lens with video slicing, and from camera lens, extract correlated characteristic, utilize these features that television cameras are divided into general programs camera lens and advertisement camera lens then.But the just simple usually classification of this method does not have to consider how to eliminate the influence of wrong story board, does not have to consider how to merge the problem that the advertisement camera lens obtains advertising segment simultaneously yet.The problem of this method maximum is, the difference on the feature of do not exist between non-advertising programme and the advertisement significantly, determining, so this kind method is difficult to all programs are reached the performance of very high detection.In addition,, but when detecting the blanking camera lens or being fade-in fade-out camera lens, will run into some problems, cause the testing result mistake even said method has good effect aspect the detection shearing lens.

Recognition-based methods require a large and complete advertisement database to be available in advance. The database stores features of predefined advertisement segments, which are then used to recognize advertisement segments embedded in television programs. The drawback is that a database containing a large number of advertisements is hard to obtain; extracting and labeling advertisements from programs manually consumes enormous human and material resources. Moreover, such methods cannot detect advertisement segments that are not in the database. In addition, detection efficiency decreases as the database grows.

All of the above methods process video data. Because of the nature of video, the amount of data to be processed is large and the features are highly complex, so computation is slow.

Summary of the invention

To solve the above technical problem, the invention provides a method and system for detecting advertisements in audio/video that can detect advertisement segments in audio/video more accurately and efficiently than the prior art.

The invention discloses a method for detecting advertisements in audio/video, the method comprising:

Step 1: extract the audio from the audio/video to be detected, and extract the short-time energy and Mel-frequency cepstral coefficient (MFCC) features of each frame from the audio;

Step 2: according to the short-time energy and MFCC features of the frames, find in the audio two groups of frames whose mutual similarity satisfies a preset condition, where the frames within each group occupy consecutive positions in the audio; the audio/video segment corresponding to each group of frames in the audio/video to be detected is an advertisement.

Step 2 further comprises:

Step 21: divide the audio into energy envelope units according to the short-time energy of the frames;

Step 22: according to the short-time energy of the frames and the lengths of the energy envelope units, find among the energy envelope units two groups of consecutively positioned energy envelope units whose mutual energy envelope shape similarity satisfies a preset shape similarity condition; each group of energy envelope units forms an energy envelope sequence;

Step 23: according to the MFCC features of the frames in the energy envelope sequences, judge whether the semantic similarity between the energy envelope sequences satisfies a preset semantic similarity condition; if so, the audio/video segments corresponding to the energy envelope sequences in the audio/video to be detected are advertisements.

When the semantic similarity between the energy envelope sequences satisfies the preset semantic similarity condition, the method further comprises, after step 23:

Step 31: for the frames preceding the respective start frames of the two energy envelope sequences that satisfy the semantic similarity condition, judge successively whether the semantic similarity between corresponding frames satisfies the semantic similarity condition; the frame that follows, in the audio, the first frame failing the condition is the start position of the advertisement.

Step 41: for the frames following the respective end frames of the two energy envelope sequences that satisfy the semantic similarity condition, judge successively whether the semantic similarity between corresponding frames satisfies the semantic similarity condition; the frame that precedes, in the audio, the first frame failing the condition is the end position of the advertisement.
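Steps 31 and 41 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes that two frames are "semantically similar" when the Euclidean distance between their MFCC vectors is below a threshold (borrowed from steps 81 and 82), and the function names, the inclusive frame ranges, and the guards that keep the two segments from overlapping are all hypothetical additions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def extend_boundaries(mfcc, s1, e1, s2, e2, dist_threshold):
    """Grow two matched frame ranges [s1, e1] and [s2, e2] (inclusive, e1 < s2).

    mfcc is a list of per-frame MFCC vectors. The per-frame similarity test
    (distance < dist_threshold) is an assumption, not taken from steps 31/41.
    """
    # Step 31: walk backwards until the first pair of frames that fails;
    # the frame after that failing frame becomes the start position.
    while (s1 - 1 >= 0 and s2 - 1 > e1
           and euclidean(mfcc[s1 - 1], mfcc[s2 - 1]) < dist_threshold):
        s1 -= 1
        s2 -= 1
    # Step 41: walk forwards until the first pair of frames that fails;
    # the frame before that failing frame becomes the end position.
    while (e2 + 1 < len(mfcc) and e1 + 1 < s2
           and euclidean(mfcc[e1 + 1], mfcc[e2 + 1]) < dist_threshold):
        e1 += 1
        e2 += 1
    return (s1, e1), (s2, e2)

# Toy data: the same 4-frame "advertisement" occupies frames 1..4 and 7..10.
frames = [[0.0], [10.0], [11.0], [12.0], [13.0], [50.0],
          [70.0], [10.0], [11.0], [12.0], [13.0], [99.0]]
first, second = extend_boundaries(frames, 2, 3, 8, 9, dist_threshold=0.5)
```

Starting from the partial match (frames 2..3 and 8..9), the sketch widens both ranges frame by frame in lockstep, which mirrors the "corresponding frames" wording of the two steps.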

Step 1 further comprises:

Step 51: smooth the short-time energy of the frames, and use the smoothed short-time energy as the short-time energy of the frames.

Step 21 further comprises:

Step 61: according to the short-time energy of the frames, take each frame that lies on a rising edge of the energy curve and whose energy rise amplitude exceeds a preset amplitude value as a division point of the energy envelope units;

Step 62: divide the audio into energy envelope units at the division points.

Step 22 further comprises:

Step 71: find among the energy envelope units two groups of consecutively positioned energy envelope units that satisfy a length similarity condition; each group of energy envelope units forms a candidate energy envelope sequence; the length similarity condition is that the difference between the lengths of the energy envelope units at the same position in the two groups is less than a preset length difference;

Step 72: according to the short-time energy of the frames in the candidate energy envelope sequences, judge whether the energy-jump correlation between the candidate energy envelope sequences is greater than a preset energy-jump correlation threshold; if so, the candidate energy envelope sequences are the energy envelope sequences.

Step 23 further comprises:

Step 81: compute the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences;

Step 82: judge whether the number of frames whose Euclidean distance is less than a preset distance threshold is greater than a preset count threshold; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

Step 61 further comprises:

Step 91: for each frame in the audio, judge whether the short-time energy of the frame is less than the short-time energy of the next frame; if so, the frame lies on a rising edge of the energy curve;

Step 92: for each frame on a rising edge of the energy curve, compute the energy rise amplitude of the frame by the following formula,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where DF is the energy rise amplitude of the frame, k is the index of the frame in the audio, STEN is the smoothed short-time energy of a frame, and m is a preset number of frames to compare;

If DF is greater than the preset amplitude value, the frame is taken as a division point of the energy envelope.

The following step is further included between step 71 and step 72:

Step 101: judge whether the length of each candidate energy envelope sequence is greater than or equal to a preset advertisement length threshold; if so, perform step 72.

Step 72 further comprises:

Step 111: compute the energy rise amplitude of the frames in the candidate energy envelope sequences by the following formula,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where DF is the energy rise amplitude of a frame in the candidate energy envelope sequences, k is the index of the frame in the audio, STEN is the smoothed short-time energy of a frame, and m is a preset number of frames to compare;

Step 112: compute the energy-jump correlation between the candidate energy envelope sequences from the energy rise amplitudes of the frames; if the energy-jump correlation between the candidate energy envelope sequences is greater than the preset energy-jump correlation threshold, the candidate energy envelope sequences are the energy envelope sequences.

Step 81 further comprises:

Step 121: the i-th frame of the first energy envelope sequence corresponds to the (i+e)-th frame of the second energy envelope sequence, where e is an integer whose range of values is preset;

Step 122: for each value of e, compute the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences; the Euclidean distances computed for the same value of e form one Euclidean distance group;

Step 82 further comprises:

Step 123: for each Euclidean distance group, count the Euclidean distances whose value is less than the preset distance threshold; take the largest count over all Euclidean distance groups as the count value of the energy envelope sequences;

Step 124: judge whether the count value of the energy envelope sequences is greater than the preset count threshold; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.
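The offset-tolerant comparison of steps 121 to 124 can be sketched in Python as follows. This is only an illustration under stated assumptions: the text says the range of e is preset but does not give it, so the symmetric range -max_shift..max_shift is a hypothetical choice, as are the function names and the toy one-dimensional "MFCC" vectors.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_match_count(seq1, seq2, max_shift, dist_threshold):
    """Largest number of close frame pairs over all tested offsets e.

    For each offset e, frame i of seq1 is paired with frame i + e of seq2
    (step 121); the distances for one e form one "distance group" (step 122);
    the count of distances below the threshold is taken per group and the
    maximum over groups is returned (step 123).
    """
    best = 0
    for e in range(-max_shift, max_shift + 1):  # assumed symmetric range
        count = 0
        for i, frame in enumerate(seq1):
            j = i + e
            if 0 <= j < len(seq2) and euclidean(frame, seq2[j]) < dist_threshold:
                count += 1
        best = max(best, count)
    return best

def is_advertisement(seq1, seq2, max_shift, dist_threshold, count_threshold):
    """Step 124: compare the best count with the preset count threshold."""
    return best_match_count(seq1, seq2, max_shift, dist_threshold) > count_threshold

# seq_b is seq_a delayed by one frame, so only the offset e = 1 aligns them.
seq_a = [[1.0], [2.0], [3.0], [4.0], [5.0]]
seq_b = [[9.0], [1.0], [2.0], [3.0], [4.0]]
aligned = best_match_count(seq_a, seq_b, max_shift=2, dist_threshold=0.5)
```

Allowing a small offset makes the comparison tolerant of the two sequences starting at slightly different frames, which a strict frame-by-frame comparison (e fixed at 0) would miss.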

The invention also discloses a system for detecting advertisements in audio/video, the system comprising:

a parameter extraction module, configured to extract the audio from the audio/video to be detected and to extract the short-time energy and MFCC features of each frame from the audio;

an advertisement search module, configured to find in the audio, according to the short-time energy and MFCC features of the frames, two groups of frames whose mutual similarity satisfies a preset condition, where the frames within each group occupy consecutive positions in the audio; the audio/video segment corresponding to each group of frames in the audio/video to be detected is an advertisement.

The advertisement search module further comprises:

a unit division module, configured to divide the audio into energy envelope units according to the short-time energy of the frames;

a shape similarity search module, configured to find among the energy envelope units, according to the short-time energy of the frames and the lengths of the energy envelope units, two groups of consecutively positioned energy envelope units whose mutual energy envelope shape similarity satisfies a preset shape similarity condition; each group of energy envelope units forms an energy envelope sequence;

a semantic similarity search module, configured to judge, according to the MFCC features of the frames in the energy envelope sequences, whether the semantic similarity between the energy envelope sequences satisfies a preset semantic similarity condition; if so, the audio/video segments corresponding to the energy envelope sequences in the audio/video to be detected are advertisements.

When the semantic similarity between the energy envelope sequences satisfies the preset semantic similarity condition, the semantic similarity search module is further configured to judge successively, for the frames preceding the start frame of each energy envelope sequence, whether the semantic similarity between the frame and the corresponding frame of the other energy envelope sequence satisfies the semantic similarity condition; the frame that follows, in the audio, the first frame failing the condition is the start position of the advertisement.

When the semantic similarity between the energy envelope sequences satisfies the preset semantic similarity condition, the semantic similarity search module is further configured to judge successively, for the frames following the end frame of each energy envelope sequence, whether the semantic similarity between the frame and the corresponding frame of the other energy envelope sequence satisfies the semantic similarity condition; the frame that precedes, in the audio, the first frame failing the condition is the end position of the advertisement.

The parameter extraction module is further configured to smooth the short-time energy of the frames and to use the smoothed short-time energy as the short-time energy of the frames.

The unit division module is further configured to take, according to the short-time energy of the frames, each frame that lies on a rising edge of the energy curve and whose energy rise amplitude exceeds a preset amplitude value as a division point of the energy envelope units, and to divide the audio into energy envelope units at the division points.

The shape similarity search module is further configured to find among the energy envelope units two groups of consecutively positioned energy envelope units that satisfy a length similarity condition, each group forming a candidate energy envelope sequence, where the length similarity condition is that the difference between the lengths of the energy envelope units at the same position in the two groups is less than a preset length difference; and to judge, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy-jump correlation between the candidate energy envelope sequences is greater than a preset energy-jump correlation threshold; if so, the candidate energy envelope sequences are the energy envelope sequences.

The semantic similarity search module is further configured to compute the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences, and to judge whether the number of frames whose Euclidean distance is less than a preset distance threshold is greater than a preset count threshold; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

When taking, according to the short-time energy of the frames, each frame that lies on a rising edge of the energy curve and whose energy rise amplitude exceeds the preset amplitude value as a division point of the energy envelope units,

the unit division module is further configured to judge, for each frame in the audio, whether the short-time energy of the frame is less than the short-time energy of the next frame, in which case the frame lies on a rising edge of the energy curve; and, for each frame on a rising edge of the energy curve, to compute the energy rise amplitude of the frame by the following formula,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where DF is the energy rise amplitude of the frame, k is the index of the frame in the audio, STEN is the smoothed short-time energy of a frame, and m is a preset number of frames to compare; if DF is greater than the preset amplitude value, the frame is taken as a division point of the energy envelope.

The shape similarity search module is further configured to judge whether the length of each candidate energy envelope sequence is greater than or equal to a preset advertisement length threshold and, if so, to proceed to judge, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy-jump correlation between the candidate energy envelope sequences is greater than the preset energy-jump correlation threshold.

When judging, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy-jump correlation between the candidate energy envelope sequences is greater than the preset energy-jump correlation threshold,

the shape similarity search module is further configured to compute the energy rise amplitude of the frames in the candidate energy envelope sequences by the following formula,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where DF is the energy rise amplitude of a frame in the candidate energy envelope sequences, k is the index of the frame in the audio, STEN is the smoothed short-time energy of a frame, and m is a preset number of frames to compare; and to compute the energy-jump correlation between the candidate energy envelope sequences from the energy rise amplitudes of the frames; if the energy-jump correlation between the candidate energy envelope sequences is greater than the preset energy-jump correlation threshold, the candidate energy envelope sequences are the energy envelope sequences.

When computing the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences, the semantic similarity search module is further configured to pair the i-th frame of the first energy envelope sequence with the (i+e)-th frame of the second energy envelope sequence, where e is an integer whose range of values is preset; and, for each value of e, to compute the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences, the Euclidean distances computed for the same value of e forming one Euclidean distance group;

When judging whether the number of frames whose Euclidean distance is less than the preset distance threshold is greater than the preset count threshold,

the semantic similarity search module is further configured to count, for each Euclidean distance group, the Euclidean distances whose value is less than the preset distance threshold, and to take the largest count over all Euclidean distance groups as the count value of the energy envelope sequences; and to judge whether the count value of the energy envelope sequences is greater than the preset count threshold; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

The beneficial effects of the present invention are as follows. Advertisement segments in the audio/video are found by similarity search based on the short-time energy and MFCC features of the audio of the audio/video to be detected, so advertisements can be found by operating on the audio alone, which improves detection speed; determining similarity from both the audio short-time energy and the MFCC features improves detection accuracy. Further, searching for similar segments by dividing the audio into energy envelope units and comparing envelope shape similarity and semantic similarity allows segment similarity to be compared more accurately; and the start position of an advertisement segment can be determined accurately from semantic similarity.

Description of drawings

Fig. 1 is a flowchart of the method of the present invention for detecting advertisements in audio/video;

Fig. 2 is a structural diagram of the system of the present invention for detecting advertisements in audio/video.

Embodiment

The present invention is described in further detail below with reference to the accompanying drawings.

Step S100: extract the audio from the audio/video to be detected, and extract the short-time energy and MFCC (Mel-frequency cepstral coefficient) features of each frame from the audio.

Step S200: according to the short-time energy and MFCC features of the frames, find in the audio two groups of frames whose mutual similarity satisfies a preset condition, where the frames within each group occupy consecutive positions in the audio; the audio/video segment corresponding to each group of frames in the audio/video to be detected is an advertisement.

The embodiment of step S100 is described below.

Short-time energy is the energy of a short segment of a speech signal and is a commonly used feature in the speech signal field.

MFCC features are commonly used in speech recognition and speaker recognition. They are obtained by filtering the spectrum, computed from the speech signal by Fourier transform, with a bank of triangular filters whose frequency axis has been converted to the Mel scale, so as to better match the characteristics of human hearing.

The prior art offers several ways to compute short-time energy; in this embodiment the short-time energy of each frame is computed by the following formula.

STE_{n} = \sum_{m = n - N + 1}^{n} [x(m) w(n - m)]^{2}

where STE_{n} denotes the short-time energy of the n-th frame, n is the index of the frame in the audio, x(m) is the speech signal, w(m) is the window function, and N is the number of samples in a frame.
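The short-time energy formula can be sketched directly in Python. The sketch makes two assumptions that the text does not state: n is treated as the index of the last sample covered by the window, and samples before the start of the signal are simply skipped. The function name and the rectangular window in the example are hypothetical; any window of N weights can be passed in.

```python
def short_time_energy(x, n, window):
    """STE_n = sum over m = n-N+1 .. n of [x(m) * w(n - m)]^2, with N = len(window).

    x is the list of audio samples and window holds the N window weights
    w(0) .. w(N-1). Indices below 0 are skipped (an edge-handling assumption).
    """
    N = len(window)
    total = 0.0
    for m in range(max(0, n - N + 1), n + 1):
        total += (x[m] * window[n - m]) ** 2
    return total

# Example with a rectangular window of N = 3 samples.
samples = [1, 2, 3, 4]
rect = [1.0, 1.0, 1.0]
energy_at_3 = short_time_energy(samples, 3, rect)  # uses samples 2, 3, 4
```

With a rectangular window the result is just the sum of squares of the last N samples, which makes the formula easy to check by hand before switching to a tapered window.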

To eliminate the influence of factors such as noise, the short-time energy is smoothed. A real function θ(x) that satisfies ∫θ(x)dx = 1 and converges to 0 at infinity is called a smoothing function.

The smoothed energy is:

STEN(x)＝STE(x)×θ(x)

Here θ(x) is a smoothing function, that is, a real function that satisfies ∫θ(x)dx = 1 and converges to 0 at infinity. STEN(x) is the smoothed short-time energy, and x is the audio signal.
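One discrete reading of this smoothing step is sketched below in Python. It rests on an assumption: the smoothing function is applied as a sliding weighted average (a convolution with a symmetric kernel whose weights sum to 1, the discrete analogue of ∫θ(x)dx = 1), which is the usual way such a function is used but is not confirmed by the formula as printed. The function name, the moving-average kernel, and the edge clamping are hypothetical.

```python
def smooth_energy(ste, kernel):
    """Smooth a short-time energy sequence with a symmetric, normalized kernel.

    The kernel weights must sum to 1. Frames beyond either end of the
    sequence are replaced by the nearest valid frame (edge clamping).
    """
    if abs(sum(kernel) - 1.0) > 1e-9:
        raise ValueError("kernel weights must sum to 1")
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(ste)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = min(max(i + j - half, 0), len(ste) - 1)
            acc += weight * ste[idx]
        smoothed.append(acc)
    return smoothed

# A 3-point moving average spreads an isolated energy spike over 3 frames.
spiky = [0.0, 0.0, 9.0, 0.0, 0.0]
sten = smooth_energy(spiky, [1 / 3, 1 / 3, 1 / 3])
```

Because the weights sum to 1, a constant energy sequence is left unchanged, while isolated spikes caused by noise are flattened.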

In this embodiment the MFCC features are extracted as follows.

Step S111: convert the actual frequency to the Mel frequency by the formula Mel(f) = 2595 lg(1 + f/700), where f is the frequency of the audio signal.
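The conversion in step S111 is short enough to write out in Python. The forward function follows the formula as given (lg is the base-10 logarithm); the inverse function is not in the text and is included only as a convenience for checking the mapping. Both function names are hypothetical.

```python
import math

def hz_to_mel(f):
    """Mel(f) = 2595 * lg(1 + f / 700), with f in Hz."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

mel_at_1khz = hz_to_mel(1000.0)
```

The constants make 1000 Hz map to roughly 1000 Mel; below that the scale is close to linear and above it the scale becomes increasingly compressed.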

Step S112: compute the output of each triangular filter from the audio signal |X_{n}(k)|:

m(l) = \sum_{k = o(l)}^{h(l)} W_{l}(k) |X_{n}(k)|,

where

W_{l}(k) = \begin{cases} \frac{k - o(l)}{c(l) - o(l)}, & o(l) \le k \le c(l) \\ \frac{h(l) - k}{h(l) - c(l)}, & c(l) \le k \le h(l) \end{cases},

Here o(l), c(l) and h(l) are the lower-limit, center and upper-limit frequencies of the l-th triangular filter, with c(l) = h(l-1) = o(l+1). X_{n}(k) is the sampled data of the audio, k is the sample index, m(l) denotes the output of the l-th filter, and l is the index of the filter.
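The filter outputs m(l) of step S112 can be sketched in Python as below. The sketch assumes the filter edges are already expressed as integer bin indices and that adjacent filters share edges exactly as the relation c(l) = h(l-1) = o(l+1) requires; the function names and the `edges` list are hypothetical, and computing the magnitude spectrum itself is left out.

```python
def triangular_weight(k, o, c, h):
    """Weight W_l(k) of a triangular filter with lower edge o, center c, upper edge h."""
    if k == c:
        return 1.0
    if o <= k < c:
        return (k - o) / (c - o)  # rising side
    if c < k <= h:
        return (h - k) / (h - c)  # falling side
    return 0.0

def filter_outputs(magnitude, edges):
    """m(l) = sum over k = o(l) .. h(l) of W_l(k) * |X_n(k)| for every filter l.

    magnitude is the list |X_n(k)|; edges holds L + 2 increasing bin indices,
    and filter l (0-based) uses edges[l], edges[l + 1], edges[l + 2] as its
    lower edge, center and upper edge, so neighbouring filters overlap.
    """
    outputs = []
    for l in range(len(edges) - 2):
        o, c, h = edges[l], edges[l + 1], edges[l + 2]
        outputs.append(sum(triangular_weight(k, o, c, h) * magnitude[k]
                           for k in range(o, h + 1)))
    return outputs

# Flat spectrum and two filters: the output is the area under each triangle.
flat_spectrum = [1.0] * 9
outputs = filter_outputs(flat_spectrum, edges=[0, 2, 4, 8])
```

In a Mel filter bank the edges would be spaced evenly on the Mel scale and then mapped back to bins, which is why the second triangle in the example is wider than the first.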

Step S113: take the logarithm of all filter outputs and then apply the discrete cosine transform (DCT) to obtain the MFCC features:

C_{mfcc}(i) = \sqrt{\frac{2}{L}} \sum_{l = 1}^{L} \log m(l) \cos \left[ \left( l - \frac{1}{2} \right) \frac{i\pi}{L} \right]

where L is the number of filters and C_{mfcc}(i) denotes the i-th parameter of the MFCC feature.
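The logarithm and DCT of step S113 can be written out in a few lines of Python. Two points are assumptions of this sketch rather than statements of the text: the logarithm is taken as the natural logarithm, and the coefficient index i runs from 1 to a caller-chosen count. The function name is hypothetical, and all filter outputs must be strictly positive for the logarithm to exist.

```python
import math

def mfcc_from_filter_outputs(m, num_coeffs):
    """C_mfcc(i) = sqrt(2/L) * sum_{l=1..L} log m(l) * cos((l - 1/2) * i * pi / L)."""
    L = len(m)
    log_m = [math.log(value) for value in m]  # natural log: an assumption
    scale = math.sqrt(2.0 / L)
    coeffs = []
    for i in range(1, num_coeffs + 1):
        total = sum(log_m[l - 1] * math.cos((l - 0.5) * i * math.pi / L)
                    for l in range(1, L + 1))
        coeffs.append(scale * total)
    return coeffs

# Two filters with outputs 1 and e: log values are 0 and 1.
two_filter = mfcc_from_filter_outputs([1.0, math.e], num_coeffs=1)
```

A useful sanity check is that identical filter outputs (a flat log spectrum) give coefficients of zero for i = 1 .. L-1, because the cosine terms cancel.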

The embodiment of step S200 is described below and comprises steps S210 to S230.

Step S210: divide the audio into energy envelope units according to the short-time energy of the frames.

According to the short-time energy of the frames, each frame that lies on a rising edge of the energy curve and whose energy rise amplitude exceeds a preset amplitude value is taken as a division point of the energy envelope units; the audio is divided into energy envelope units at the division points.

The specific implementation is as follows.

The Slope value of each frame is computed by the following formula.

Slope_{k} = (STEN_{k+1} - STEN_{k}) / 2

where k is the index of the frame in the audio and STEN is the smoothed short-time energy of a frame.

The DF value of each frame is computed by Formula 1 below; DF corresponds to the energy rise amplitude of the frame.

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where DF is the energy rise amplitude of the frame, k is the index of the frame in the audio, STEN is the smoothed short-time energy of a frame, and m is a preset number of frames to compare, for example m = 10.

The division criterion for the energy envelope is DF > T and Slope > 0, where T is the preset amplitude value. The granularity of the division can be adjusted through the preset value of T; according to experimental statistics, a value of T = 1.25 gives a granularity that is favorable for advertisement retrieval. Slope > 0 indicates that the frame lies on a rising edge of the energy curve, and DF > T indicates that the degree of energy jump meets the preset division requirement for the energy envelope.
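The division rule (Slope > 0 and DF > T) can be sketched in Python as follows. The defaults m = 10 and T = 1.25 come from the text; everything else is an assumption of the sketch: the function names, truncating the comparison window at the end of the audio, treating a zero base energy as an unbounded rise, and letting each division-point frame start a new unit.

```python
def rise_amplitude(sten, k, m=10):
    """DF for frame k: the largest squared relative rise over the next m frames."""
    last = min(k + m, len(sten) - 1)  # truncate at the end of the audio
    if last == k:
        return 0.0
    base_sq = sten[k] ** 2
    if base_sq == 0.0:
        # A zero base makes the ratio undefined; treat any rise as unbounded.
        return float("inf") if any(sten[j] != 0 for j in range(k + 1, last + 1)) else 0.0
    return max((sten[j] - sten[k]) ** 2 / base_sq for j in range(k + 1, last + 1))

def division_points(sten, T=1.25, m=10):
    """Frames with Slope_k > 0 (rising edge) and DF > T."""
    points = []
    for k in range(len(sten) - 1):
        slope = (sten[k + 1] - sten[k]) / 2
        if slope > 0 and rise_amplitude(sten, k, m) > T:
            points.append(k)
    return points

def split_units(num_frames, points):
    """Turn division points into (start, end) frame ranges, end exclusive."""
    bounds = [0] + [p for p in points if p > 0] + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

smoothed = [1, 1, 1, 5, 4, 3, 1, 1, 6, 2]
points = division_points(smoothed, T=1.25, m=2)
units = split_units(len(smoothed), points)
```

In the toy sequence the two division points fall on the frames just before each energy burst, so every unit begins at the foot of a rising edge. On a steep multi-frame rise, several consecutive frames may qualify as division points; how to merge them is not addressed by this sketch.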

Step S220: according to the short-time energy of the frames and the lengths of the energy envelope units, find among the energy envelope units two groups of consecutively positioned energy envelope units whose mutual energy envelope shape similarity satisfies a preset shape similarity condition; each group of energy envelope units forms an energy envelope sequence.

This step finds similar energy envelope unit sequences in the audio according to shape similarity and thereby determines the objects of the semantic similarity judgment. Because computing semantic similarity is relatively expensive, adding this step is faster than judging similarity by semantic similarity directly; and because the shape similarity judgment is added, the two groups of energy envelope unit sequences that are determined are more similar, so the judgment is more accurate.

The embodiment of step S220 is as follows.

Step S221: find among the energy envelope units two groups of consecutively positioned energy envelope units that satisfy the length similarity condition; each group of energy envelope units forms a candidate energy envelope sequence; the length similarity condition is that the difference between the lengths of the energy envelope units at the same position in the two groups is less than the preset length difference.

The length of an energy envelope unit is the number of frames in the unit; d_i denotes the length of the i-th energy envelope unit. Among all energy envelope units obtained by the division, find an i-th unit and a j-th unit, i < j, that satisfy |d_j - d_i| ≤ T₃, where T₃ is the preset length difference, 5 in this embodiment. Starting from the i-th and j-th units, check successively whether |d_{j+1} - d_{i+1}| ≤ T₃, |d_{j+2} - d_{i+2}| ≤ T₃, and so on, stopping when |d_{j+k} - d_{i+k}| > T₃ is found. The i-th to (i+k-1)-th energy envelope units then form one candidate energy envelope sequence, and the j-th to (j+k-1)-th units form the other.
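The run extension described in step S221 can be sketched as follows. The function name `match_run_length` and the sample lengths are hypothetical, and the guard that stops the first run before it reaches the start of the second (`i + k < j`) is an assumption that the text does not state.

```python
def match_run_length(d, i, j, t3=5):
    """Return k, the number of consecutive unit pairs (i+n, j+n) with |d[j+n] - d[i+n]| <= t3."""
    k = 0
    # Stop at the first mismatch, at the end of the list,
    # or (assumption) when the first run would overlap the second.
    while j + k < len(d) and i + k < j and abs(d[j + k] - d[i + k]) <= t3:
        k += 1
    return k


# Hypothetical unit lengths in frames: units 0..2 resemble units 4..6.
lengths = [10, 20, 30, 99, 12, 18, 33, 50]
k = match_run_length(lengths, 0, 4)
```

With these lengths the two candidate sequences would be units 0 to 2 and units 4 to 6, since the fourth pair (99 and 50) differs by more than T₃.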

Step S222: judge whether the length of each candidate energy envelope sequence is greater than or equal to a preset advertisement length threshold; if so, execute step S223.

The length of a candidate energy envelope sequence is the number of frames it contains, that is, the sum of the lengths of all energy envelope units in the sequence.

According to statistics, an advertisement is longer than 5 seconds, which corresponds to 125 frames, so the advertisement length threshold is 125 in the preferred embodiment. If the length of any candidate energy envelope sequence is less than the advertisement length threshold, then although the candidate sequences are close to each other in length, they lack the duration characteristic of an advertisement; the candidate sequences are therefore all discarded and step S221 is executed again. If step S221 has been executed for all frames in the audio without finding candidate energy envelope sequences that satisfy the condition, it is concluded that the audio/video to be detected contains no repeated advertisement.

Step S223: according to the short-time energy of the frames in the candidate energy envelope sequences, judge whether the energy jump correlation between the candidate sequences is greater than a preset energy jump correlation threshold; if so, the candidate energy envelope sequences are energy envelope sequences with similar shape.

The energy jump correlation is the degree of similarity of the energy jumps.

The energy jump correlation between candidate energy envelope sequences can be expressed in several different ways, and each way has its own energy jump correlation threshold.

Embodiment one

Take the mean of the energy rise amplitudes DF, calculated by Formula 1, of all frames in a candidate energy envelope sequence as the jump degree of that sequence, and take the difference between the jump degrees of the candidate sequences as the energy jump correlation between them.

Embodiment two

To simplify the calculation, the jump degree of a candidate energy envelope sequence in Embodiment one is reduced to the mean of the energy rise amplitudes of the start frame and the end frame of the sequence.

Embodiment three

In Embodiments one and two, the energy rise amplitude of the frames is used linearly as the energy jump correlation of the candidate energy envelope sequences, which introduces a binarizing effect. The present invention therefore proposes a preferred embodiment.

The energy jump degree of each energy envelope unit in the candidate energy envelope sequence, expressed in probability form, is calculated from the energy rise amplitudes of the start frame and the end frame of the unit.

Here d_i denotes the i-th energy envelope unit of the candidate energy envelope sequence, and the two inputs are the energy rise amplitude of the start frame of the i-th energy envelope unit and the energy rise amplitude of the end frame of the i-th energy envelope unit. T₁ is a first threshold, with an experimentally optimized value of 2.25; T₂ is a second threshold, with an experimentally optimized value of 4.

An energy envelope unit is expressed as (d_i, p_i), where i is the sequence number of the unit in the audio, d_i is the length of the unit, and p_i is the energy jump degree of the unit expressed in probability form. A candidate energy envelope sequence composed of k consecutive energy envelope units is expressed as

SS_{dP_{i}} = \{(d_{i}, p_{i}), (d_{i+1}, p_{i+1}), \ldots, (d_{i+k-1}, p_{i+k-1})\},

where i is the sequence number in the audio of the first energy envelope unit of the candidate sequence. The two candidate energy envelope sequences found by steps S221 and S222 are expressed as

SS_{dP_{i}} = \{(d_{i}, p_{i}), (d_{i+1}, p_{i+1}), \ldots, (d_{i+k-1}, p_{i+k-1})\}

and

SS_{dP_{j}} = \{(d_{j}, p_{j}), (d_{j+1}, p_{j+1}), \ldots, (d_{j+k-1}, p_{j+k-1})\}.

Take

P_{dP_{i}} = \sum_{n=0}^{k-1} p_{i+n}

as the marginal probability of SS_{dP_{i}}, and take

P_{dP_{j}} = \sum_{n=0}^{k-1} p_{j+n}

as the marginal probability of SS_{dP_{j}}. The joint probability of SS_{dP_{i}} and SS_{dP_{j}} is

P_{dP_{ij}} = \sum_{n=0}^{k-1} \min(p_{i+n}, p_{j+n}).

The energy jump correlation between the two candidate energy envelope sequences is calculated as follows.

P_{ij} = \frac{2 \cdot P_{dP_{ij}}}{P_{dP_{i}} + P_{dP_{j}}}

When P_{ij} is greater than the threshold T₄, the two candidate energy envelope sequences are each regarded as an energy envelope sequence. T₄ is the energy jump correlation threshold; based on extensive experimental statistics, its value is 0.8.

Step S230: according to the MFCC features of the frames in the energy envelope sequences, judge whether the semantic similarity between the energy envelope sequences satisfies a preset semantic similarity condition; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

There are several ways to express semantic similarity with MFCC features. The mean of the differences between the MFCC parameters of corresponding frames in the candidate energy envelope sequences can be used as the semantic similarity. Alternatively, the MFCC parameters can be converted to probability form as in step S223, with the two quantities used there replaced by the first parameter and the second parameter of the MFCC feature of the i-th frame, and the semantic similarity of the candidate energy envelope sequences then calculated from these probabilities by the method of step S223.

The following describes in detail the case in which the MFCC Euclidean distance between frames of the candidate energy envelope sequences is used as the semantic similarity between energy envelope sequences.

The candidate energy envelope sequences are expressed as (a_{i1}, a_{i2}, ..., a_{im}) and (b_{i1}, b_{i2}, ..., b_{im}), where a_{i1}, ..., a_{im} are the frames of the first candidate energy envelope sequence and b_{i1}, ..., b_{im} are the frames of the second.

Embodiment one

The MFCC Euclidean distance between the j-th pair of frames of the candidate energy envelope sequences is calculated as follows.

D_{j} = \sqrt{\sum_{k=1}^{12} (M_{a_{ij}}(k) - M_{b_{ij}}(k))^{2}}, \quad j = 1, 2, \ldots, m

where D_j is the MFCC Euclidean distance between the j-th pair of frames, M_{a_{ij}} is the MFCC of frame a_{ij}, M_{b_{ij}} is the MFCC of frame b_{ij}, and k denotes the k-th parameter of the MFCC feature.

Count the MFCC Euclidean distances that are less than the threshold T₅; according to statistical observation, T₅ = 4.5 best distinguishes whether the audio content is similar. If the number of MFCC Euclidean distances less than T₅ is greater than the preset minimum number of frames of an advertisement, 125 in this embodiment, the semantic similarity of the candidate energy envelope sequences is considered to satisfy the semantic similarity condition; the candidate energy envelope sequences are then energy envelope sequences, and the audio/video segments corresponding to them are advertisements.
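The frame-by-frame check of Embodiment one can be sketched as below. The names `mfcc_distance`, `count_close_frames` and `is_semantically_similar` are hypothetical, each frame is assumed to be given as a list of 12 MFCC parameters, and the MFCC extraction itself is outside the sketch.

```python
import math


def mfcc_distance(ma, mb):
    """Euclidean distance between two MFCC vectors (12 parameters each in the text)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))


def count_close_frames(seq_a, seq_b, t5=4.5):
    """Number of frame pairs at the same position whose MFCC distance is below t5."""
    return sum(1 for ma, mb in zip(seq_a, seq_b) if mfcc_distance(ma, mb) < t5)


def is_semantically_similar(seq_a, seq_b, t5=4.5, min_frames=125):
    """Embodiment one: similar when more than min_frames pairs are closer than t5."""
    return count_close_frames(seq_a, seq_b, t5) > min_frames


# Tiny hypothetical sequences of three frames each.
seq_a = [[0.0] * 12, [0.0] * 12, [0.0] * 12]
seq_b = [[1.0] * 12, [2.0] * 12, [0.0] * 12]
close = count_close_frames(seq_a, seq_b)
```

In this toy input the first and third pairs are closer than 4.5 and the second is not, so `close` is 2; real sequences would need more than 125 such pairs.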

Embodiment two

In practice, a frame in one candidate energy envelope sequence does not necessarily correspond to the frame at the same position in the other candidate sequence; the positions of corresponding frames may be offset forward or backward, so the method of Embodiment one has a certain error. To correct this error, Embodiment two calculates the MFCC Euclidean distances for multiple groups of frame correspondences: the i-th frame of the first energy envelope sequence corresponds to the (i+e)-th frame of the second energy envelope sequence, where e is an integer within a preset range, and one group of inter-frame MFCC Euclidean distances is calculated for each value of e.

For example, e takes the values 0, 1, ..., 9, 10. The candidate energy envelope sequences are expressed as (a_{i1}, a_{i2}, ..., a_{im}) and (b_{i1}, b_{i2}, ..., b_{im}). The group of MFCC Euclidean distances for a given e is expressed as an m-dimensional vector D_e = {d_{e1}, d_{e2}, ..., d_{em}}.

d_{ej} = \begin{cases} \sqrt{\sum_{k=1}^{12} (M_{a_{i(j+e)}}(k) - M_{b_{ij}}(k))^{2}}, & e \leq 5,\ j = 1, 2, \ldots, m \\ \sqrt{\sum_{k=1}^{12} (M_{a_{ij}}(k) - M_{b_{i(j+e-5)}}(k))^{2}}, & e > 5,\ j = 1, 2, \ldots, m \end{cases}

Formula 2

where M_{a_{ij}} is the MFCC of frame a_{ij}, M_{b_{ij}} is the MFCC of frame b_{ij}, and k denotes the k-th parameter of the MFCC feature.

For each vector D_e, count the MFCC Euclidean distances that are less than the threshold T₅; according to statistical observation, T₅ = 4.5 best distinguishes whether the audio content is similar. Take the maximum of these counts. If the maximum is greater than the preset quantity threshold, which in this embodiment is the minimum number of frames of an advertisement, 125, the semantic similarity of the candidate energy envelope sequences is considered to satisfy the semantic similarity condition; the candidate energy envelope sequences are then energy envelope sequences, and the audio/video segments corresponding to them are advertisements.
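A minimal sketch of the offset search of Embodiment two, following the two cases of Formula 2: for e ≤ 5 the first sequence is read e frames ahead, and for e > 5 the second sequence is read e - 5 frames ahead. The name `best_shifted_match_count` is hypothetical, and skipping frame pairs that run past the end of a sequence is an assumption, since the text does not say how the edges are handled.

```python
import math


def mfcc_distance(ma, mb):
    """Euclidean distance between two MFCC vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ma, mb)))


def best_shifted_match_count(seq_a, seq_b, t5=4.5, max_shift=5):
    """Largest count of frame pairs closer than t5 over the offsets e = 0 .. 2 * max_shift."""
    m = min(len(seq_a), len(seq_b))
    best = 0
    for e in range(2 * max_shift + 1):
        # Formula 2: e <= 5 offsets sequence a by e; e > 5 offsets sequence b by e - 5.
        shift_a, shift_b = (e, 0) if e <= max_shift else (0, e - max_shift)
        count = 0
        for j in range(m):
            ja, jb = j + shift_a, j + shift_b
            if ja >= len(seq_a) or jb >= len(seq_b):
                break  # assumption: pairs beyond the end of a sequence are skipped
            if mfcc_distance(seq_a[ja], seq_b[jb]) < t5:
                count += 1
        best = max(best, count)
    return best


# Hypothetical data: seq_a is seq_b delayed by two frames.
seq_b = [[10.0 * t] * 12 for t in range(20)]
seq_a = [[10.0 * (t - 2)] * 12 for t in range(20)]
best = best_shifted_match_count(seq_a, seq_b)
```

In this toy input no frame matches without an offset, while the offset e = 2 aligns 18 frame pairs, which is the situation Embodiment two is meant to recover.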

Embodiment three

The start and end positions of the advertisement segment obtained by the method of Embodiment two are not accurate enough. Embodiment three therefore adds a step that determines the exact position of the advertisement segment; the other processes are the same as in Embodiment two.

For the frames before the start frame of each energy envelope sequence, judge in turn whether the semantic similarity between the frame and the other energy envelope sequence satisfies the semantic similarity condition; the frame that follows, in the audio, the first frame not satisfying the condition is the start position of the advertisement. Likewise, for the frames after the end frame of each energy envelope sequence, judge in turn whether the semantic similarity between the frame and the other energy envelope sequence satisfies the semantic similarity condition; the frame that precedes, in the audio, the first frame not satisfying the condition is the end position of the advertisement.

Moving backward from the start frame of the energy envelope sequence, calculate by Formula 2 the semantic similarity for the frame at each value of e. If the semantic similarity calculated for every e is less than the threshold T₅, the frame is added to the advertisement segment and the preceding frame is examined next. If for some value of e the semantic similarity is not less than T₅, the frame is a boundary frame of the advertisement segment; if this frame is the n-th frame of the audio, the (n+1)-th frame of the audio is the advertisement start frame. The exact position of the advertisement end frame is found in the same way.

After a pair of repeated advertisement segments has been found, the audio/video to be detected is searched for other envelope sequences whose length is similar to that of either segment of the pair.

A system for detecting advertisements in audio/video according to the present invention, as shown in Figure 2, comprises the following modules.

A parameter extraction module 201, configured to extract the audio from the audio/video to be detected, and to extract the short-time energy and Mel-frequency cepstral coefficient (MFCC) features of the frames from the audio.

The parameter extraction module 201 is further configured to smooth the short-time energy of the frames and to use the smoothed short-time energy as the short-time energy of the frames.

An advertisement search module 202, configured to find in the audio, according to the short-time energy and MFCC features of the frames, two groups of frames whose mutual similarity satisfies a preset condition; the frames within each group are consecutive in the audio, and the audio/video segment corresponding to each group of frames in the audio/video to be detected is an advertisement.

The advertisement search module 202 further comprises a unit division module, a shape similarity search module, and a semantic similarity search module.

The unit division module is configured to divide the audio into energy envelope units according to the short-time energy of the frames.

The unit division module is further configured to take, according to the short-time energy of the frames, each frame that lies on a rising edge of the energy curve and whose energy rise amplitude exceeds a preset amplitude value as a division point of the energy envelope units, and to divide the audio into energy envelope units at the division points.

When taking frames that lie on a rising edge of the energy curve and whose energy rise amplitude exceeds the preset amplitude value as division points of the energy envelope units, the unit division module calculates the energy rise amplitude DF of a frame as

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

The shape similarity search module is configured to find among the energy envelope units, according to the short-time energy of the frames and the lengths of the energy envelope units, two groups of positionally consecutive energy envelope units whose mutual energy envelope shape similarity satisfies a preset shape similarity condition; each group of energy envelope units forms an energy envelope sequence.

The shape similarity search module is further configured to find among the energy envelope units two groups of positionally consecutive energy envelope units that satisfy a length similarity condition, each group forming a candidate energy envelope sequence, where the length similarity condition is that the difference between the lengths of the energy envelope units at the same position in the two groups is less than a preset length difference; and to judge, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy jump correlation between the candidate sequences is greater than a preset energy jump correlation threshold, and if so, the candidate energy envelope sequences are the energy envelope sequences.

The shape similarity search module is also configured to judge whether the length of each candidate energy envelope sequence is greater than or equal to a preset advertisement length threshold, and if so, to carry out the judgment, according to the short-time energy of the frames in the candidate energy envelope sequences, of whether the energy jump correlation between the candidate sequences is greater than the preset energy jump correlation threshold.

When judging, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy jump correlation between the candidate sequences is greater than the preset energy jump correlation threshold, the shape similarity search module uses the energy rise amplitude DF of a frame, calculated as

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

The semantic similarity search module is configured to judge, according to the MFCC features of the frames in the energy envelope sequences, whether the semantic similarity between the energy envelope sequences satisfies a preset semantic similarity condition; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

When the semantic similarity between the energy envelope sequences satisfies the preset semantic similarity condition, the semantic similarity search module is also configured, for the frames before the start frame of each energy envelope sequence, to judge in turn whether the semantic similarity between the frame and the other energy envelope sequence satisfies the semantic similarity condition; the frame that follows, in the audio, the first frame not satisfying the condition is the start position of the advertisement.

When the semantic similarity between the energy envelope sequences satisfies the preset semantic similarity condition, the semantic similarity search module is also configured, for the frames after the end frame of each energy envelope sequence, to judge in turn whether the semantic similarity between the frame and the other energy envelope sequence satisfies the semantic similarity condition; the frame that precedes, in the audio, the first frame not satisfying the condition is the end position of the advertisement.

The semantic similarity search module is further configured to calculate the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences, and to judge whether the number of frames whose Euclidean distance is less than a preset distance threshold is greater than a preset quantity threshold; if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

When calculating the Euclidean distance between the MFCCs of each pair of corresponding frames of the energy envelope sequences, the semantic similarity search module is further configured to make the i-th frame of the first energy envelope sequence correspond to the (i+e)-th frame of the second energy envelope sequence, where e is an integer within a preset range; for each value of e, the Euclidean distance between the MFCCs of each pair of corresponding frames is calculated, and the Euclidean distances calculated for the same value of e form one Euclidean distance group.

When judging whether the number of frames whose Euclidean distance is less than the preset distance threshold is greater than the preset quantity threshold, the semantic similarity search module is further configured, for each Euclidean distance group, to count the Euclidean distances in the group that are less than the preset distance threshold, and to take the largest count among all Euclidean distance groups as the count for the energy envelope sequences; it then judges whether this count is greater than the preset quantity threshold, and if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.

The implementation of the audio-repetition-based advertisement detection method of the present invention is described in detail below, taking as an example the detection of advertisements in a 10-minute broadcast television programme. The whole process is divided into four stages: separation of the audio stream and extraction of audio features; division into energy envelope units; detection of repeated matching pairs with similar energy envelope shape; and verification of the similar matching pairs on audio semantic content, with accurate location of the start and end positions of the repeated segments.

In the audio stream separation and feature extraction stage, the audio stream is separated from the 10-minute broadcast television programme segment, and features are then extracted from this 10-minute audio stream. The extracted features are MFCC and short-time energy; the frame length is 40 ms and the frame shift is 40 ms.

For example, this 10-minute television programme contains one advertisement: new * * *. It appears twice, at 10-25 seconds and at 123-138 seconds.
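With a 40 ms frame shift, times convert to frame numbers by simple division. The short sketch below, with the hypothetical helper `seconds_to_frame`, checks the figures used in this example: a 10-minute programme has 15000 frames, 5 seconds correspond to 125 frames, and the positions 10-25 seconds and 123-138 seconds correspond to frames 250-625 and 3075-3450.

```python
FRAME_SHIFT_MS = 40  # frame shift stated in the text


def seconds_to_frame(seconds):
    """Index of the frame that starts at the given time, using integer milliseconds."""
    return int(seconds * 1000) // FRAME_SHIFT_MS


total_frames = seconds_to_frame(10 * 60)
min_ad_frames = seconds_to_frame(5)
first_occurrence = (seconds_to_frame(10), seconds_to_frame(25))
second_occurrence = (seconds_to_frame(123), seconds_to_frame(138))
```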

In the energy envelope division stage, the envelope unit detection functions Slope and DF are calculated from the smoothed short-time energy. A frame is an energy envelope division point if DF > T and Slope > 0: DF > T indicates that the energy jump degree satisfies the division condition of the energy envelope, and Slope > 0 indicates that the energy envelope is on a rising edge. T is set to 1.25 based on extensive experimental statistics.

The detection functions Slope and DF are calculated as follows.

The Slope value of the k-th frame:

Slope_{k} = (STEN_{k+1} - STEN_{k}) / 2

The DF value of the k-th frame:

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+10} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

where STEN is the smoothed short-time energy.
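The division rule can be sketched as below, using the Slope and DF definitions exactly as given here. The function names `division_points` and `unit_lengths` and the toy energy curve are hypothetical; the sketch also assumes non-zero energies and treats the span between two successive division points as one energy envelope unit.

```python
def division_points(sten, t=1.25, m=10):
    """Frames k with DF > t and Slope > 0, per the formulas in the text."""
    points = []
    for k in range(len(sten) - 1):
        slope = (sten[k + 1] - sten[k]) / 2
        # Look-ahead window of up to m frames, clipped at the end (assumption).
        window = sten[k + 1 : k + 1 + m]
        df = max((s - sten[k]) ** 2 / sten[k] ** 2 for s in window)
        if df > t and slope > 0:
            points.append(k)
    return points


def unit_lengths(points):
    """Length in frames of each unit between successive division points."""
    return [b - a for a, b in zip(points, points[1:])]


# Toy smoothed energy curve with two sharp rises.
sten = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
points = division_points(sten)
lengths = unit_lengths(points)
```

Frames 0 and 1 already see the rise inside their look-ahead window, so their DF exceeds T, but their Slope is 0; only the frames directly before each rise become division points.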

Dividing the 10-minute television programme into energy envelopes gives, near 10-25 seconds, the energy envelope units (55, 1.51), (45, 2.51), (51, 2.77), (56, 3.10), (74, 2.63), (40, 2.96), (60, 3.54), (33, 4.12), (22, 6.32), and near 123-138 seconds the energy envelope units (31, 4.23), (43, 2.45), (55, 2.71), (55, 3.05), (76, 2.55), (40, 3.02), (62, 3.55), (34, 4.30), (41, 4.13). In each pair (d, DF), d is the envelope length in frames.

In the stage of detecting repeated matching pairs with similar energy envelope shape, the energy envelope units obtained by the division are used: the unit lengths of the two segments and the probability matching function between units are calculated in order to detect envelope units that are similar in energy envelope shape.

When the i-th and j-th energy envelope units satisfy |d_j - d_i| ≤ T₃, check successively whether |d_{j+1} - d_{i+1}| ≤ T₃, |d_{j+2} - d_{i+2}| ≤ T₃, and so on, until |d_{j+k} - d_{i+k}| > T₃; extensive experiments show that T₃ = 5 gives good results. Calculate d_i + d_{i+1} + ... + d_{i+k-1} and d_j + d_{j+1} + ... + d_{j+k-1}; when the smaller of the two is greater than 125, the two envelope unit sequences are considered similar in time span. In the 10-minute segment above, the sequence (45, 2.51), (51, 2.77), (56, 3.10), (74, 2.63), (40, 2.96), (60, 3.54), (33, 4.12) and the sequence (43, 2.45), (55, 2.71), (55, 3.05), (76, 2.55), (40, 3.02), (62, 3.55), (34, 4.30) satisfy the advertisement energy envelope length similarity condition.
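The length figures of this example can be checked directly. The sketch below only re-does the arithmetic on the unit lengths listed above; the variable names are hypothetical.

```python
T3 = 5               # preset length difference
MIN_AD_FRAMES = 125  # advertisement length threshold in frames

# Unit lengths (in frames) of the two matched sequences from the example.
first = [45, 51, 56, 74, 40, 60, 33]
second = [43, 55, 55, 76, 40, 62, 34]

diffs = [abs(a - b) for a, b in zip(first, second)]
lengths_similar = all(diff <= T3 for diff in diffs)
long_enough = min(sum(first), sum(second)) > MIN_AD_FRAMES

# The neighbouring units on either side break the run.
before_diff = abs(55 - 31)
after_diff = abs(22 - 41)
```

Every pairwise difference is at most 5 frames, the two sequences span 359 and 365 frames, and the units just before and just after differ by 24 and 19 frames, which is why the run covers exactly these seven units.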

For the sequence pair above that satisfies the length similarity condition, the DFP value is calculated according to the following formula:

T₁ is a first threshold, with an experimentally optimized value of 2.25; T₂ is a second threshold, with an experimentally optimized value of 4.

After the DFP values are calculated, the two energy envelope sequences are (45, 0.22), (51, 0.39), (56, 0.35), (74, 0.31), (40, 0.57), (60, 0.90), (33, 1) and (43, 0.19), (55, 0.36), (55, 0.31), (76, 0.30), (40, 0.59), (62, 0.95), (34, 1).

Take

P_{dP_{i}} = \sum_{n=0}^{k-1} p_{i+n}

as the marginal probability of the first sequence, and take

P_{dP_{j}} = \sum_{n=0}^{k-1} p_{j+n}

as the marginal probability of the second sequence. The joint probability of the two sequences is

P_{dP_{ij}} = \sum_{n=0}^{k-1} \min(p_{i+n}, p_{j+n}).

The energy jump correlation between the two candidate energy envelope sequences is calculated as follows:

P_{ij} = \frac{2 \cdot P_{dP_{ij}}}{P_{dP_{i}} + P_{dP_{j}}}

When P_{ij} is greater than the threshold T₄, the two sequences are considered similar in energy envelope shape. Based on extensive experimental statistics, T₄ is set to 0.8.

For the two sequences above, the marginal probability of the first sequence is P₁ = 3.74, the marginal probability of the second sequence is P₂ = 3.7, and their joint probability is P₁₂ = 3.63.

The energy jump correlation of the two is P = 0.976, which is greater than 0.8, so the two energy envelope sequences are considered similar in energy envelope shape.
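The figures of this example can be reproduced with a few lines of Python. The function name `energy_jump_correlation` is hypothetical; the probability values are the DFP values listed above.

```python
def energy_jump_correlation(p_i, p_j):
    """P_ij = 2 * joint / (marginal_i + marginal_j), with joint = sum of pairwise minima."""
    marginal_i = sum(p_i)
    marginal_j = sum(p_j)
    joint = sum(min(a, b) for a, b in zip(p_i, p_j))
    return 2 * joint / (marginal_i + marginal_j)


# DFP values of the two matched sequences from the example.
p_first = [0.22, 0.39, 0.35, 0.31, 0.57, 0.90, 1.0]
p_second = [0.19, 0.36, 0.31, 0.30, 0.59, 0.95, 1.0]

correlation = energy_jump_correlation(p_first, p_second)
similar_shape = correlation > 0.8  # threshold T4
```

The marginal sums are 3.74 and 3.70 and the joint sum is 3.63, so the correlation is 7.26 / 7.44, about 0.976, in agreement with the value stated in the example.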

In the stage of verifying similar matching pairs on audio semantic content and accurately locating the start and end positions of the repeated segments, MFCC features and Euclidean distance are used to verify whether matching units that are similar in envelope shape are also similar in audio semantic content; a matching pair that also satisfies semantic content similarity is considered a repeated segment.

For the two segments (255, 256, ..., 620) and (3079, 3080, ..., 3450) that match in energy envelope shape, the pairwise Euclidean distances of the MFCC features are calculated between each of (251, 256, ..., 620), (252, 256, ..., 620), (253, 256, ..., 620), (254, 256, ..., 620), (255, 256, ..., 620) and each of (3074, 3080, ..., 3450), (3075, 3080, ..., 3450), (3076, 3080, ..., 3450), (3077, 3080, ..., 3450), (3078, 3080, ..., 3450), (3079, 3080, ..., 3450). The calculation shows that for (251, 256, ..., 620) and (3076, 3080, ..., 3450) the number of Euclidean distances less than 4.5 is greater than 125, which satisfies the semantic content similarity condition, so the two segments are repeated segments.

Those skilled in the art may make various modifications to the above without departing from the spirit and scope of the invention as defined by the claims. The scope of the invention is therefore not limited to the above description but is determined by the scope of the claims.

Claims

1. A method for detecting advertisements in audio/video, characterized in that the method comprises:

2. The method for detecting advertisements in audio/video according to claim 1, characterized in that step 2 further comprises:

3. The method for detecting advertisements in audio/video according to claim 2, characterized in that,

4. The method for detecting advertisements in audio/video according to claim 2, characterized in that,

5. The method for detecting advertisements in audio/video according to claim 2, characterized in that step 1 further comprises,

6. The method for detecting advertisements in audio/video according to claim 5, characterized in that step 21 further comprises,

7. The method for detecting advertisements in audio/video according to claim 5, characterized in that step 22 further comprises,

8. The method for detecting advertisements in audio/video according to claim 5, characterized in that step 23 further comprises,

9. The method for detecting advertisements in audio/video according to claim 6, characterized in that step 61 further comprises,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

10. The method for detecting advertisements in audio/video according to claim 7, characterized in that the following is further included between step 71 and step 72,

11. The method for detecting advertisements in audio/video according to claim 7, characterized in that step 72 further comprises,

DF = \max \left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

12. The method for detecting advertisements in audio/video according to claim 8, characterized in that,

step 81 further comprises,

step 82 further comprises,

13. A system for detecting advertisements in audio/video, characterized in that the system comprises:

14. The system for detecting advertisements in audio/video according to claim 13, characterized in that the advertisement search module further comprises:

15. The system for detecting advertisements in audio/video according to claim 14, characterized in that,

16. The system for detecting advertisements in audio/video according to claim 14, characterized in that,

17. the system that detects advertisement from audio frequency and video as claimed in claim 14 is characterized in that described parameter extraction module also is used for smoothing processing is carried out in the short-time energy of frame, with the short-time energy of the short-time energy after the smoothing processing as frame.

18. the system that from audio frequency and video, detects advertisement as claimed in claim 17, it is characterized in that, described dividing elements module is further used for the short-time energy according to frame, will be positioned at energy trace rising edge and energy ascensional range and surpass the division points of the frame of default range value as the energy envelope unit; From described division points audio frequency is divided into the energy envelope unit.

19. the system that from audio frequency and video, detects advertisement as claimed in claim 17, it is characterized in that, the similar module of searching of described shape is further used for finding out the continuous energy envelope unit, two groups of positions that satisfies length similarity condition from described energy envelope unit, the candidate energies envelope sequence is formed in every group of energy envelope unit, and the difference of the length of the energy envelope unit of same position was less than the preset length difference between described length similarity condition was every group; Judge that according to the short-time energy of frame in the described candidate energies envelope sequence whether the energy jump degree of correlation between the candidate energies envelope sequence is greater than the energy jump degree of correlation threshold values of presetting, if then described candidate energies envelope sequence is described energy envelope sequence.

20. The system for detecting advertisements in audio/video as claimed in claim 17, characterized in that the semantic-similarity search module is further configured to calculate the Euclidean distance between the Mel frequency cepstrum coefficients of each pair of corresponding frames in the energy envelope sequences; and to determine whether the number of frames whose Euclidean distance is less than a preset distance threshold is greater than a preset number threshold, and if so, the audio/video segments corresponding to the energy envelope sequences are advertisements.
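The frame-wise comparison of claim 20 can be sketched as follows. The function name `sequences_match` and the representation of each frame as a list of Mel frequency cepstrum coefficients are assumptions; the two threshold arguments stand in for the preset values, which the claim does not give.

```python
import math


def sequences_match(mfcc_a, mfcc_b, dist_threshold, count_threshold):
    """Count corresponding frame pairs whose MFCC Euclidean distance is below
    dist_threshold and compare that count with count_threshold.

    ASSUMPTION: mfcc_a and mfcc_b are equal-length lists of per-frame
    coefficient vectors; frames are paired by position.
    """
    close_frames = 0
    for vec_a, vec_b in zip(mfcc_a, mfcc_b):
        dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(vec_a, vec_b)))
        if dist < dist_threshold:
            close_frames += 1
    # The segments are treated as advertisements only if enough frames agree.
    return close_frames > count_threshold
```

A return value of `True` corresponds to the case in which the audio/video segments of the two energy envelope sequences are judged to be advertisements.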

21. The system for detecting advertisements in audio/video as claimed in claim 18, characterized in that,

DF = \max\left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}
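The formula for DF can be evaluated directly once the short-time energies are available. The sketch below assumes that both subscripted symbols in the formula denote the same short-time energy sequence STEN, and that k and m are a frame index and a look-ahead length; the claim text that would define them is missing here, so this reading and the function name `energy_jump_df` are assumptions.

```python
def energy_jump_df(sten, k, m):
    """Largest squared relative change of short-time energy over the m frames
    following frame k, normalised by the squared energy of frame k.

    ASSUMPTION: sten is the per-frame short-time energy; k and m are
    interpreted as a frame index and a look-ahead length.
    """
    base = sten[k]
    if base == 0:
        raise ValueError("STEN_k must be non-zero")
    if m < 1 or k + m >= len(sten):
        raise ValueError("need m >= 1 and k + m inside the sequence")
    return max((sten[k + i] - base) ** 2 / base ** 2 for i in range(1, m + 1))
```

For example, `energy_jump_df([2.0, 4.0, 6.0, 3.0], 0, 3)` evaluates the three ratios 1.0, 4.0, and 0.25 and returns 4.0.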

22. The system for detecting advertisements in audio/video as claimed in claim 19, characterized in that the shape-similarity search module is also configured to determine whether the length of the candidate energy envelope sequences is greater than or equal to a preset advertisement length threshold, and if so, to perform the step of determining, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy jump correlation between the candidate energy envelope sequences is greater than the preset energy jump correlation threshold.

23. The system for detecting advertisements in audio/video as claimed in claim 19, characterized in that, when the shape-similarity search module determines, according to the short-time energy of the frames in the candidate energy envelope sequences, whether the energy jump correlation between the candidate energy envelope sequences is greater than the preset energy jump correlation threshold,

DF = \max\left\{ \frac{(STEN_{k+1} - STEN_{k})^{2}}{STEN_{k}^{2}}, \ldots, \frac{(STEN_{k+m} - STEN_{k})^{2}}{STEN_{k}^{2}} \right\}

24. The system for detecting advertisements in audio/video as claimed in claim 20, characterized in that,