CN102024033B - Method for automatically detecting audio templates and segmenting video into chapters - Google Patents

Method for automatically detecting audio templates and segmenting video into chapters

Info

Publication number
CN102024033B
CN102024033B CN201010567970.1A CN201010567970A
Authority
CN
China
Prior art keywords
fragment
audio
template
frame
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010567970.1A
Other languages
Chinese (zh)
Other versions
CN102024033A (en)
Inventor
董远
王乐滋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201010567970.1A priority Critical patent/CN102024033B/en
Publication of CN102024033A publication Critical patent/CN102024033A/en
Application granted granted Critical
Publication of CN102024033B publication Critical patent/CN102024033B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for automatically detecting audio templates and segmenting video into chapters. Using one week of program audio data, it quickly learns content-repeating fragments from voiceprint features, merges and clusters the fragments into candidate templates, determines template type and screens template files from statistics on fragment length, occurrence count, and time distribution, and then uses the templates to segment new programs into chapters automatically. Because the method retrieves on audio and builds the template library dynamically, it overcomes the heavy computation and slow detection of video-based methods, handles program fragments that have identical audio content but different picture content, and also solves the problem of "static" templates in the database.

Description

Method for automatically detecting audio templates and segmenting video into chapters
Technical field
The invention belongs to the field of copy detection on the audio content of video programs and automatic chapter segmentation of programs, and specifically relates to a method for automatically detecting audio templates and segmenting video into chapters.
Background technology
Chapter segmentation of a video program means marking specific fragments (such as advertisements or program special effects) in a large, long-duration video program so that users can browse it conveniently.
Current traditional methods extract features from video frames and process them; they are built on images. Common examples are station-logo detection and video recognition.
Video recognition can use the template information in a database to locate and mark fragments quickly and accurately, but in current methods the database templates are added manually, the information in the database is fairly fixed, and data not in the database cannot be detected. In addition, some program fragments have identical audio content but different picture content over a long duration, such as the review section of a news program; for such fragments the common image-based detection methods do not apply. As for station-logo detection, more and more video uses the same logo across parts that should be assigned to different chapters (for example, advertisements and the program itself), so logo detection fails.
The video-based methods above also suffer from heavy computation and slow detection. Moreover, current audio-based chapter-segmentation methods all rely on predefined templates: templates are added to a database manually and the audio under test is compared against them. Their defect is likewise that the database templates are "static", so data not in the database cannot be detected.
Summary of the invention
To overcome the deficiencies of these two classes of methods, video-based detection and template-based audio detection, the present invention proposes a method for automatically detecting audio templates and segmenting video into chapters. It can learn audio templates quickly and robustly from very large volumes of audio, and uses the templates to segment new video into chapters accurately.
The invention provides a method for automatically detecting audio templates and segmenting video into chapters, comprising a template-learning stage and a video chapter-segmentation stage.
The template-learning stage comprises the following steps:
1) Take the audio data of the past week as training samples and preprocess the 7 days (7×24 hours) of 5513 Hz audio data. Divide the full 7×24 hours of audio into files of one hour each. Using the Kullback-Leibler distance of the audio, segment each one-hour file at cut points to obtain short audio fragments. To prevent over-segmentation, cluster these fragments, check the duration of each, and splice fragments shorter than 3 seconds onto adjacent short fragments. Then, for the 5513 Hz audio, take frames with a window length of 0.37 s and a step of 40 ms and judge whether each frame is silent. The energy of each frame is eFr and the energy threshold is TE, according to the formulas:

eFr = Σ_w x_i² − mean_W
TE = (Σe)/(α·n) + β·e_min

where w is the number of sample points in the window, n is the number of frames in the whole file, x_i is the value of each sample point, mean_W is the window mean, Σe is the summed frame energy, e_min is the minimum frame energy, and α, β are preset parameters. If eFr ≤ TE, the frame is judged silent; if silent frames make up more than half of an audio fragment, the fragment is defined as a silent fragment.
2) With a window length of 0.37 seconds and a step of 40 ms, apply a discrete Fourier transform (DFT) to the 5513 Hz audio file and, according to the Mel frequency formula

Mel(f) = 2595·lg(1 + f/700)

convert the 20 Hz to 3000 Hz part of the actual band to the Mel scale and divide it into 17 sub-bands. Compute the energy difference between each pair of adjacent sub-bands; if the difference is at least the set threshold, output 1, otherwise 0. This extracts a 16-bit binary string as the feature value of each frame.
3) Build a hash table from the data of all frames of the week's audio. The key of the hash table is the 16-bit feature value; the value stores the frame numbers that have this feature value and the fragments they belong to. For every non-silent audio fragment A, look up in the hash table, for each of its frames, the neighbor frames with the same key. From the lookup results of each frame and the numbering of the fragments in which the neighbor frames lie, take as candidate matching fragments of A those fragments in which at least half of A's frames find neighbor frames. Then compute the similarity between A and each candidate fragment one by one. For two fragments A and B, arrange the frames in which matching features can be found in chronological order; let a_1, a_2, ..., a_m be the numbers of the frames in A that find matching partners in B, and b_1, b_2, ..., b_n the numbers of the frames in B matched by features in A. Compute two coefficients s1, s2:

s1 = (m + n) / (2·min(N_A, N_B))

s2 = (Σ_{i=1..m} χ(a_i) + Σ_{i=1..n} χ(b_i)) / (2·min(N_A, N_B))

where N_A and N_B are the total frame counts of A and B, χ(·) is an indicator that is 1 when the gap between consecutive matched frame numbers is below the set threshold t, and 0 otherwise. From s1 and s2 compute the similarity of the two fragments, S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients, usually w1 < w2. Candidates whose similarity S exceeds the threshold T1 are kept as matching fragments of A.
4) Keep each audio fragment A whose number of matching fragments exceeds a threshold T2, and check whether, within a certain time interval, there are other fragments whose matching-fragment count also exceeds the set threshold; if so, keep the fragment, otherwise delete it. The result is a series of audio fragments that repeat within the week.
5) Using the temporal information of the fragments, splice and merge the retained fragments that belong to the same day. The pairwise merging rule is: for two same-day fragments A and B, let A start at Tas and end at Tae, and B start at Tbs and end at Tbe, with Tae < Tbs; if |Tae − Tbs| < TDur, then A, B, and the gap between them are merged into one fragment that starts at Tas and ends at Tbe.
6) Classify the merged fragments. The rule is: if two merged fragments partly match each other, they are put in the same class. Classes also satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
7) For the fragments of each content-repeating class, compute 3 indices and judge the program type. The rules are as follows:

Index 1: Dur = N_k² / max_k(N_k²)

Index 2: Distrb = σ_k² / max_k(σ_k²)

Index 3: T_k

where N_k = (1/n)·Σ t_i is the average length of the fragments in class K, n is the number of fragments in class K, and t_i is the duration of the i-th fragment; σ_k² = (1/n)·Σ (c_i − c)² describes the time distribution of class K fragments, where c is the central instant of the week and c_i is the central instant of fragment i in class K; T_k is the number of times class K occurred during the week. The 3 indices are fused:

Type = c1·Dur + c2·Distrb + c3·T

where C1, C2, C3 are 3 set weights. If Type < T1, the class is judged to be a program special effect; if T1 ≤ Type < T2, a station promo; if Type ≥ T2, an advertisement.
8) After type judgment is complete, screen the audio fragments to build the template library: within each audio class, among the repeating fragments that match each other, keep the longest one, and store its features together with the judged program type in the template library to generate a template file.
The video chapter-segmentation stage:
For a new program, the system uses the template library files to do copy detection on the new program, finds the fragments in the program that have the same content as a template file, and marks their time and type. It comprises the following steps:
1) Using the methods of steps 2) and 3) of the template-learning stage, extract features from the new program's audio, build a hash table, and match the template library files against the new video program one by one.
2) For template A, look up in the hash table, for each of its 16-bit frame features, the audio features that match it.
3) feature in A is alignd in time with the feature of its coupling, and calculation template file and and its time upper equitant program audio part between Hamming distance hi frame by frame, again by distance divided by overlap part frame number in the hope of similarity distance mark Dsore wherein overlap is the frame number that program and template think lap.
4) Take program audio parts whose score is below the set threshold as candidate matching fragments of the template; the one with the lowest score is the best matching fragment. Any other candidate whose time interval from the best one exceeds the time-interval threshold, and whose score Dscore differs from the best score by less than the set score-shift threshold, is still regarded as a matching fragment. Mark the start time and duration of each overlapping part and use the template type to label that part of the program.
The beneficial effects of the invention are: it uses the fact that specific fragments repeat in content within a week as the point of attack, and uses voiceprint features to find these repeating fragments quickly in a large amount of data; the similarity decision method and the stability of the repetition guarantee search accuracy; the program type of an audio template is decided from the duration, repetition count, and time-distribution variance of the determined audio fragments; in addition, the learned audio templates are used to segment new programs into chapters automatically, guaranteeing segmentation speed and accurate temporal localization. Because the method retrieves on audio and builds the template library dynamically, it overcomes the heavy computation and slow detection of video-based methods, handles program fragments that have identical audio content but different picture content, and also solves the problem of "static" templates in the database.
Description of the drawings
Fig. 1 is a flowchart of the template-learning part of the method for automatically detecting audio templates and segmenting video into chapters;
Fig. 2 is a flowchart of the video chapter-segmentation part of the method;
Fig. 3 is the overall architecture of the method and system;
Fig. 4 is a schematic diagram of the window length and step used in audio feature extraction;
Fig. 5 is a schematic diagram of the distance-score computation between template audio fragments and program audio in the video chapter-segmentation stage.
Embodiment
The invention is further elaborated below with reference to the drawings and specific embodiments, so that those skilled in the art can understand and implement the proposed technical scheme without expending creative labor.
The technical problems to be solved by the present invention include:
1. Using past program audio data, let the machine learn template files from a large amount of data and build the template library dynamically;
2. Splitting the audio files and extracting voiceprint features that are robust and suited to fast search and matching;
3. Similarity matching between two audio fragments based on the extracted features;
4. Clustering audio fragments, judging the program type of each audio class, and picking the template file out of each class;
5. Using the template library files to match incoming programs and then segment them into chapters.
Addressing the above technical problems, the present invention proposes a method for automatically detecting audio templates and segmenting video into chapters, comprising two stages: template learning and video chapter segmentation.
With reference to Fig. 1, the template-learning stage of the method comprises the following steps:
Step 101: Preferably, the invention takes the program data of the past week as training data and learns template files from it; every week, new templates are learned from the previous week's program data and added to the template library. Preprocess the 7 days (7×24 hours) of 5513 Hz audio data and divide the full 7×24 hours of audio into files of one hour each. Using the Kullback-Leibler distance of the audio, segment each one-hour file at cut points to obtain short audio fragments. To prevent over-segmentation, cluster these fragments, check the duration of each, and splice fragments shorter than 3 seconds onto adjacent short fragments. Then, for the 5513 Hz audio, take frames with a window length of 0.37 s and a step of 40 ms and judge whether each frame is silent. The energy of each frame is eFr and the energy threshold is TE, according to the formulas:

eFr = Σ_w x_i² − mean_W
TE = (Σe)/(α·n) + β·e_min

where w is the number of sample points in the window, n is the number of frames in the whole file, x_i is the value of each sample point, mean_W is the window mean, Σe is the summed frame energy, e_min is the minimum frame energy, and α, β are preset parameters. If eFr ≤ TE, the frame is judged silent; if silent frames make up more than half of an audio fragment, the fragment is defined as a silent fragment.
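For illustration only, the following Python sketch implements this silence test. The exact form of the adaptive threshold TE is an assumed reading of the formula above, and the values of alpha and beta are placeholders, not taken from the patent.

```python
import numpy as np

def is_silent_fragment(samples, sr=5513, win_s=0.37, step_s=0.04,
                       alpha=1.0, beta=0.5):
    """Judge a fragment silent when more than half of its frames fall below TE.

    Sketch of step 101; TE = sum(eFr) / (alpha * n) + beta * min(eFr)
    is an assumed reconstruction of the patent's threshold formula.
    """
    samples = np.asarray(samples, dtype=float)
    win, step = int(sr * win_s), int(sr * step_s)
    frames = [samples[i:i + win]
              for i in range(0, len(samples) - win + 1, step)]
    # Per-frame energy eFr, with the window mean subtracted off.
    e = np.array([np.sum(f ** 2) - f.mean() for f in frames])
    n = len(e)
    te = e.sum() / (alpha * n) + beta * e.min()   # adaptive threshold TE
    return np.count_nonzero(e <= te) > n / 2
```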
Step 102: With a window length of 0.37 seconds and a step of 40 ms, apply a discrete Fourier transform (DFT) to the 5513 Hz audio file and, according to the Mel frequency formula

Mel(f) = 2595·lg(1 + f/700)

convert the 20 Hz to 3000 Hz part of the actual band to the Mel scale and divide it into 17 sub-bands. Compute the energy difference between each pair of adjacent sub-bands; if the difference is at least the set threshold, output 1, otherwise 0. This extracts a 16-bit binary string as the feature value of each frame.
As shown in Fig. 4, frame 1 uses the sample data from 0 to 0.37 seconds: this part is given a discrete Fourier transform, the 20 Hz to 3000 Hz part of its actual band is converted to the Mel scale and divided into 17 sub-bands, and the energy difference between each pair of adjacent sub-bands is computed; if the difference is at least the set threshold the output is 1, otherwise 0, extracting a 16-bit binary string as the feature value of frame 1. The window then slides 40 milliseconds, i.e., the sample points from 40 milliseconds to 0.41 seconds are used to repeat the above steps and extract the 16-bit feature value of frame 2, and so on until all audio frames have features.
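A minimal Python sketch of this per-frame fingerprint follows. Spacing the 17 sub-bands uniformly on the Mel scale and comparing the band-energy differences against a zero threshold are assumptions; the patent only specifies 17 sub-bands and "a set threshold".

```python
import numpy as np

def frame_fingerprint(frame, sr=5513, n_bands=17, thresh=0.0):
    """16-bit fingerprint from adjacent Mel sub-band energy differences."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    mel = lambda f: 2595 * np.log10(1 + f / 700)   # the patent's Mel formula
    # 17 sub-bands spaced uniformly on the Mel scale between 20 Hz and 3000 Hz.
    edges_hz = 700 * (10 ** (np.linspace(mel(20), mel(3000), n_bands + 1) / 2595) - 1)
    band_e = [spectrum[(freqs >= lo) & (freqs < hi)].sum()
              for lo, hi in zip(edges_hz[:-1], edges_hz[1:])]
    bits = (np.diff(band_e) >= thresh).astype(int)   # 16 ones and zeros
    return int("".join(map(str, bits)), 2)           # 16-bit feature value
```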
Step 103: Build a hash table from the data of all frames of the week's audio. The key of the hash table is the 16-bit feature value; the value stores the frame numbers that have this feature value and the fragments they belong to. For every non-silent audio fragment A, look up in the hash table, for each of its frames, the neighbor frames with the same key. From the lookup results of each frame and the numbering of the fragments in which the neighbor frames lie, take as candidate matching fragments of A those fragments in which at least half of A's frames find neighbor frames; then compute the similarity between A and each candidate fragment one by one.

For fragment A and one of its candidate matching fragments B, arrange the frames in which matching features can be found in chronological order; let a_1, a_2, ..., a_m be the numbers of the frames in A that find matching partners in B, and b_1, b_2, ..., b_n the numbers of the frames in B matched by features in A. Compute two coefficients s1, s2:

s1 = (m + n) / (2·min(N_A, N_B))

s2 = (Σ_{i=1..m} χ(a_i) + Σ_{i=1..n} χ(b_i)) / (2·min(N_A, N_B))

where N_A and N_B are the total frame counts of A and B, and χ(·) is an indicator that is 1 when the gap between consecutive matched frame numbers is below the set threshold t; preferably t is 3. From s1 and s2 compute the similarity of the two fragments, S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients, usually w1 < w2; preferably w1 = 1/3 and w2 = 2/3. Candidates whose similarity S exceeds the threshold T1 are kept as matching fragments of A; preferably the invention sets T1 = 0.5.
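The inverted-index search and similarity score of step 103 can be sketched in Python as follows, assuming each fragment is a list of 16-bit frame fingerprints. The reading of χ as a continuity indicator with gap threshold t is an assumption, and the helper names are illustrative, not from the patent.

```python
from collections import defaultdict

def build_index(fragments):
    """Hash table: 16-bit fingerprint -> list of (fragment_id, frame_no)."""
    index = defaultdict(list)
    for frag_id, fps in enumerate(fragments):
        for frame_no, fp in enumerate(fps):
            index[fp].append((frag_id, frame_no))
    return index

def candidates(frag_id, fragments, index):
    """Fragments in which at least half of fragment frag_id's frames hit a neighbor."""
    hits = defaultdict(set)   # candidate id -> frames of A that found a neighbor there
    for frame_no, fp in enumerate(fragments[frag_id]):
        for other_id, _ in index[fp]:
            if other_id != frag_id:
                hits[other_id].add(frame_no)
    half = len(fragments[frag_id]) / 2
    return [c for c, frames in hits.items() if len(frames) >= half]

def similarity(a_idx, b_idx, n_a, n_b, t=3, w1=1/3, w2=2/3):
    """S = w1*s1 + w2*s2; a_idx/b_idx are sorted matched frame numbers in A and B."""
    def chi_sum(idx):   # frames whose gap to the next matched frame is below t
        return sum(1 for i, j in zip(idx, idx[1:]) if j - i < t)
    m, n = len(a_idx), len(b_idx)
    denom = 2 * min(n_a, n_b)
    s1 = (m + n) / denom
    s2 = (chi_sum(a_idx) + chi_sum(b_idx)) / denom
    return w1 * s1 + w2 * s2   # keep the candidate when S > T1 = 0.5
```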
Step 104: Keep each audio fragment A whose number of matching fragments exceeds a threshold T2; when one week of data is used as training samples, the invention preferably sets T2 = 7. Also check whether, within a certain time interval of A, there are other fragments whose matching-fragment count also exceeds the set threshold T2; if so, keep the fragment, otherwise delete it. The result is a series of audio fragments that repeat within the week.
Step 105: Using the temporal information of the fragments, splice and merge the retained fragments that belong to the same day. The pairwise merging rule is: for two same-day fragments A and B, let A start at Tas and end at Tae, and B start at Tbs and end at Tbe, with Tae < Tbs; if |Tae − Tbs| < TDur (preferably TDur is set to 10 seconds), then A, B, and the gap between them are merged into one fragment that starts at Tas and ends at Tbe.
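A short sketch of this same-day merging rule, assuming fragments are (start, end) pairs in seconds:

```python
def merge_same_day(frags, t_dur=10.0):
    """Merge time-sorted (start, end) fragments whose gap is below t_dur seconds."""
    merged = []
    for start, end in sorted(frags):
        if merged and start - merged[-1][1] < t_dur:
            merged[-1] = (merged[-1][0], end)   # fuse with the previous fragment
        else:
            merged.append((start, end))
    return merged

# merge_same_day([(0, 30), (35, 60), (200, 230)]) -> [(0, 60), (200, 230)]
```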
Step 106: Classify the merged fragments. The rule is: if two merged fragments partly match each other, they are classed together; that is, if some data in fragment A was judged in step 104 to match part of fragment B, then A and B belong to the same class. Classes also satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
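The transitivity rule makes this clustering a connected-components problem. The patent does not prescribe an algorithm, but a standard union-find pass, sketched below, realizes it:

```python
def cluster(n_frags, match_pairs):
    """Group fragments into classes: matched pairs share a class, transitively."""
    parent = list(range(n_frags))

    def find(x):   # root of x's class, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_pairs:   # union the classes of each matched pair
        parent[find(a)] = find(b)

    classes = {}
    for i in range(n_frags):
        classes.setdefault(find(i), []).append(i)
    return list(classes.values())
```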
Step 107: For the fragments of each content-repeating class, compute 3 indices and judge the program type. The rules are as follows:

Index 1: Dur = N_k² / max_k(N_k²)

Index 2: Distrb = σ_k² / max_k(σ_k²)

Index 3: T_k

where N_k = (1/n)·Σ t_i is the average length of the fragments in class K, n is the number of fragments in class K, and t_i is the duration of the i-th fragment; σ_k² = (1/n)·Σ (c_i − c)² describes the time distribution of class K fragments, where c is the central instant of the week and c_i is the central instant of fragment i in class K; T_k is the number of times class K occurred during the week. The 3 indices are fused:

Type = c1·Dur + c2·Distrb + c3·T

where C1, C2, C3 are 3 set weights. If Type < T1, the class is judged to be a program special effect; if T1 ≤ Type < T2, a station promo; if Type ≥ T2, an advertisement.
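A sketch of the three indices and their fusion follows. The weights c1, c2, c3, the thresholds t1, t2, and the normalization of the occurrence count T_k are placeholders, since the patent leaves these values open.

```python
import statistics

def classify(classes, week_center, c1=0.4, c2=0.3, c3=0.3, t1=0.3, t2=0.6):
    """classes: list of fragment lists; each fragment is a (start, end) pair in seconds."""
    avg_len = [statistics.mean(e - s for s, e in cls) for cls in classes]
    spread = [statistics.mean(((s + e) / 2 - week_center) ** 2 for s, e in cls)
              for cls in classes]
    count = [len(cls) for cls in classes]
    max_len2 = max(a ** 2 for a in avg_len)
    max_spread = max(spread) or 1.0
    labels = []
    for k in range(len(classes)):
        dur = avg_len[k] ** 2 / max_len2      # index 1: normalized squared avg length
        distrb = spread[k] / max_spread       # index 2: normalized time-distribution variance
        t = count[k] / max(count)             # index 3: normalized occurrence count
        score = c1 * dur + c2 * distrb + c3 * t
        labels.append("program effect" if score < t1
                      else "station promo" if score < t2 else "advertisement")
    return labels
```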
Step 108: After type judgment is complete, screen the audio fragments to build the template library: within each audio class, among the repeating fragments that match each other, keep the longest one, and store its features together with the judged program type in the template library to generate a template file.
With reference to Fig. 2, in the video chapter-segmentation stage of the method, for a new program, the system uses the template library files to do copy detection on the new program, finds the fragments in the program that have the same content as a template file, and marks their time and type, comprising the following steps:
Step 201: In the same way as the methods described in steps 102 and 103 of the template-learning stage, extract features from the new program's audio and build a hash table, then match the template library files against the new video program one by one; the matching proceeds as described in steps 202, 203, and 204 below.
Step 202: For a template audio fragment A, look up in the hash table, for each of its 16-bit frame features, the audio features that match it.
Step 203: Align the features of A in time with their matching features, compute the frame-by-frame Hamming distances between the template file and the program audio part that overlaps it in time, and divide the summed distance by the number of overlapping frames to get the similarity score.
As shown in Fig. 5, suppose that in step 202 frame 3 in the middle of template audio fragment A and frame 6 in the new program's audio file are detected as a matching pair; then frame 3 of A is aligned in time with frame 6 of the program, and the frame-by-frame Hamming distance h_i between A and the overlapping program part is computed, i.e., frames 1 to m of A against frames 4 to m+3 of the program. The per-frame Hamming distances are then used to compute the distance score Dscore = (Σ h_i)/overlap, where overlap is the number of frames in the overlapping part of the program and the template; in this example overlap equals the frame count m of A.
Step 204: Take program audio parts whose score is below the set threshold as candidate matching fragments of the template; the one with the lowest score is the best matching fragment. Any other candidate whose time interval from the best one exceeds the time-interval threshold (preferably set to 1.2 times the template fragment duration), and whose score differs from the best score by less than the set score-shift threshold (preferably set to 2), is still regarded as a matching fragment. Mark the start time and duration of each overlapping part and use the template type to label that part of the program.
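A sketch of the Dscore matching in steps 202 to 204, assuming per-frame 16-bit fingerprints for both template and program. Scoring every alignment offset, as done here, is a brute-force stand-in for the hash-anchored alignment of step 202, and score_th is a placeholder value.

```python
def dscore(template, program, offset):
    """Mean per-frame Hamming distance between the template and the program slice."""
    pairs = zip(template, program[offset:offset + len(template)])
    dists = [bin(a ^ b).count("1") for a, b in pairs]   # Hamming distance h_i
    return sum(dists) / len(dists)                       # divided by overlap frames

def match_template(template, program, score_th=4.0, shift_th=2.0, frame_s=0.04):
    """Best-scoring alignment plus distant candidates whose score stays close to it."""
    offsets = range(len(program) - len(template) + 1)
    cands = [(dscore(template, program, off), off) for off in offsets]
    cands = [(s, off) for s, off in cands if s < score_th]
    if not cands:
        return []
    best_s, best_off = min(cands)
    gap_th = 1.2 * len(template)   # time-interval threshold: 1.2x template length
    kept = [(best_s, best_off)] + [
        (s, off) for s, off in cands
        if abs(off - best_off) > gap_th and s - best_s < shift_th]
    return [(off * frame_s, len(template) * frame_s)   # (start time, duration) in s
            for _, off in kept]
```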

Claims (1)

1. A method for automatically detecting audio templates and segmenting a video program into chapters, characterized in that it takes the fact that specific fragments repeat in content within a week as its point of attack to learn audio templates quickly and robustly from one week of audio data, and uses the templates to segment new programs into chapters accurately, comprising a template-learning stage and a video chapter-segmentation stage, wherein the template-learning stage comprises the following steps:
Step 1: preprocess the program audio files of one week and judge silent fragments;
Step 2: for each audio fragment, extract robust voiceprint features;
Step 3: using the week's audio data features, build a hash table and search for matching fragments;
Step 4: among the fragments obtained in step 3, keep each audio fragment A whose number of matching fragments exceeds a threshold, and check whether, within a certain time interval, there are other fragments whose matching-fragment count also exceeds the set threshold; if so, keep the fragment, otherwise delete it; the result is a series of audio fragments whose content repeats within the week;
Step 5: among the fragments screened out in step 4, for two same-day fragments A and B, let A start at Tas and end at Tae, and B start at Tbs and end at Tbe, with Tae < Tbs; if |Tae − Tbs| < TDur, then A, B, and the gap between them are merged into one fragment that starts at Tas and ends at Tbe;
Step 6: cluster the fragments merged in step 5 into several audio classes, the classification rule being: if two merged fragments partly match each other, they are put in the same class; classes also satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class;
Step 7: for each class collated in step 6, judge its program type;
Step 8: within each audio class, among the repeating fragments that match each other, keep the longest one, and store its features together with the judged program type in the template library to generate a template file;
Wherein said step 1 specifically comprises: taking the audio data of the past week as training samples and dividing this 5513 Hz audio data into files of one hour each; using the Kullback-Leibler distance of the audio, segmenting each one-hour file at cut points to obtain short audio fragments; to prevent over-segmentation, clustering these fragments, checking the duration of each, and splicing fragments shorter than 3 seconds onto adjacent short fragments; then, for the 5513 Hz audio, taking frames with a window length of 0.37 s and a step of 40 ms and judging whether each frame is silent, the energy of each frame being eFr and the energy threshold TE, according to the formulas eFr = Σ_w x_i² − mean_W and TE = (Σe)/(α·n) + β·e_min, where w is the number of sample points in the window, n is the number of frames in the whole file, x_i is the value of each sample point, and α, β are preset parameters; if eFr ≤ TE, the frame is judged silent; if silent frames make up more than half of an audio fragment, the fragment is defined as a silent fragment;
Wherein said step 2 specifically comprises: with a window length of 0.37 seconds and a step of 40 ms, applying a discrete Fourier transform (DFT) to the 5513 Hz audio file and, according to the Mel frequency formula Mel(f) = 2595·lg(1 + f/700), converting the 20 Hz to 3000 Hz part of the actual band to the Mel scale and dividing it into 17 sub-bands; computing the energy difference between each pair of adjacent sub-bands; if the difference is at least the set threshold, the output is 1, otherwise 0; extracting a 16-bit binary string as the feature value of each frame of audio data;
Wherein said step 3 specifically comprises: building a hash table from the data of all frames of the week's audio, the key of the hash table being the 16-bit feature value and the value storing the frame numbers that have this feature value and the fragments they belong to; for every non-silent audio fragment A, looking up in the hash table, for each of its frames, the neighbor frames with the same key; from the lookup results of each frame and the numbering of the fragments in which the neighbor frames lie, taking as candidate matching fragments of A those fragments in which at least half of A's frames find neighbor frames; then computing the similarity between A and each candidate fragment one by one, and keeping candidates whose similarity exceeds the threshold as matching fragments of A;
In the video chapter-segmentation stage, for a new program, the system uses the template library files to do copy detection on the new program, finds the fragments in the program that have the same content as a template file, and marks their time and type, comprising the following steps:
Step 1: in the same way as the methods of steps 2 and 3 of the template-learning stage, extract features from the new program's audio and build a hash table;
Step 2: match the template library files against the new video program one by one; for each template, look up in the hash table, for each of its 16-bit frame features, the audio features that match it;
Step 3: compute the similarity distance score Dscore between the template file and the new program's fragment data;
Step 4: select and label the fragments of the new program that match the template file;
Step 3 of the video chapter-segmentation stage specifically comprises: aligning the features of the template file in time with the features of the matching program file, computing the frame-by-frame Hamming distances h_i between the template file and the program audio part that overlaps it in time, and dividing the summed distance by the number of overlapping frames to get the similarity distance score Dscore = (Σ h_i)/overlap, where overlap is the number of frames in the overlapping part of the program and the template;
Step 4 of the video chapter-segmentation stage specifically comprises: taking program audio parts whose score is below the set threshold as candidate matching fragments of the template, the one with the lowest score being the best matching fragment; any other candidate whose time interval from the best matching fragment exceeds the time-interval threshold, and whose similarity distance score Dscore computed in step 3 differs from that of the best matching fragment by less than the set score-shift threshold, is still regarded as a matching fragment, wherein the time-interval threshold equals 1.2 times the template duration and the score-shift threshold equals 2; marking the start time and duration of each overlapping part and using the template type to label that part of the program;
In the above method for automatically detecting audio templates and segmenting a video program into chapters, the template-learning stage is further characterized in that the similarity between two fragments A and B in step 3 is decided as follows: arrange the frames in which matching features can be found in chronological order; let a_1, a_2, ..., a_m be the numbers of the frames in A that find matching partners in B, and b_1, b_2, ..., b_n the numbers of the frames in B matched by features in A; compute two coefficients s1, s2:

s1 = (m + n) / (2·min(N_A, N_B))

s2 = (Σ_{i=1..m} χ(a_i) + Σ_{i=1..n} χ(b_i)) / (2·min(N_A, N_B))

where t is the set threshold; from s1 and s2 compute the similarity of the two fragments,
S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients; candidates whose similarity S exceeds the threshold T1 are kept as matching fragments of A;
In the above method for automatically detecting audio templates and segmenting a video program into chapters, the template-learning stage is further characterized in that step 7 comprises the computation of 3 indices and the program-type judgment of each audio class:

Index 1: Dur = N_k² / max_k(N_k²)

Index 2: Distrb = σ_k² / max_k(σ_k²)

Index 3: T_k

where N_k = (1/n)·Σ t_i is the average length of the fragments in class K, n is the number of fragments in class K, and t_i is the duration of the i-th fragment; σ_k² = (1/n)·Σ (c_i − c)² describes the time distribution of class K fragments, where c is the central instant of the week and c_i is the central instant of fragment i in class K; T_k is the number of times class K occurred during the week; the 3 indices are then fused to judge the program type;

The fusion of the above 3 indices and the program-type judgment of the template file specifically comprise computing the fusion coefficient Type and comparing it with the set thresholds:

Type = c1·Dur + c2·Distrb + c3·T

where C1, C2, C3 are 3 set weights; if Type < T1, the class is judged to be a program special effect; if T1 ≤ Type < T2, a station promo; if Type ≥ T2, an advertisement.
CN201010567970.1A 2010-12-01 2010-12-01 Method for automatically detecting audio templates and segmenting video into chapters Expired - Fee Related CN102024033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting audio templates and segmenting video into chapters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting audio templates and segmenting video into chapters

Publications (2)

Publication Number Publication Date
CN102024033A CN102024033A (en) 2011-04-20
CN102024033B true CN102024033B (en) 2016-01-20

Family

ID=43865330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010567970.1A Expired - Fee Related CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting audio templates and segmenting video into chapters

Country Status (1)

Country Link
CN (1) CN102024033B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103379364B (en) * 2012-04-26 2018-08-03 腾讯科技(深圳)有限公司 Processing method, device, video server and the system of video file
CN103021440B (en) 2012-11-22 2015-04-22 腾讯科技(深圳)有限公司 Method and system for tracking audio streaming media
CN103237233B (en) * 2013-03-28 2017-01-25 深圳Tcl新技术有限公司 Rapid detection method and system for television commercials
CN104091598A (en) * 2013-04-18 2014-10-08 腾讯科技(深圳)有限公司 Audio file similarity calculation method and device
CN105185401B (en) * 2015-08-28 2019-01-01 广州酷狗计算机科技有限公司 The method and device of synchronized multimedia listed files
CN106548793A (en) * 2015-09-16 2017-03-29 中兴通讯股份有限公司 Storage and the method and apparatus for playing audio file
CN106331844A (en) * 2016-08-17 2017-01-11 北京金山安全软件有限公司 Method and device for generating subtitles of media file and electronic equipment
CN108253977B (en) * 2016-12-28 2020-11-24 沈阳美行科技有限公司 Generation method and generation device of incremental data for updating navigation data
CN107609149B (en) * 2017-09-21 2020-06-19 北京奇艺世纪科技有限公司 Video positioning method and device
CN108513140B (en) * 2018-03-05 2020-10-16 北京明略昭辉科技有限公司 Method for screening repeated advertisement segments in audio and generating wool audio
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108763492A (en) * 2018-05-29 2018-11-06 四川远鉴科技有限公司 A kind of audio template extracting method and device
CN112863547B (en) * 2018-10-23 2022-11-29 腾讯科技(深圳)有限公司 Virtual resource transfer processing method, device, storage medium and computer equipment
CN109547850B (en) * 2018-11-22 2021-04-06 杭州秋茶网络科技有限公司 Video shooting error correction method and related product
CN110400559B (en) * 2019-06-28 2020-09-29 北京达佳互联信息技术有限公司 Audio synthesis method, device and equipment
CN110717063B (en) * 2019-10-18 2022-02-11 上海华讯网络系统有限公司 Method and system for verifying and selectively archiving IP telephone recording file
CN111883139A (en) * 2020-07-24 2020-11-03 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for screening target voices
CN111863023B (en) * 2020-09-22 2021-01-08 深圳市声扬科技有限公司 Voice detection method and device, computer equipment and storage medium
CN115205635B (en) * 2022-09-13 2022-12-02 有米科技股份有限公司 Weak supervision self-training method and device of image-text semantic alignment model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420618A (en) * 2008-12-02 2009-04-29 西安交通大学 Adaptive telescopic video encoding and decoding construction design method based on interest zone
CN101594527A (en) * 2009-06-30 2009-12-02 成都艾索语音技术有限公司 The dual stage process of high Precision Detection template from audio and video streams

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420618A (en) * 2008-12-02 2009-04-29 西安交通大学 Adaptive telescopic video encoding and decoding construction design method based on interest zone
CN101594527A (en) * 2009-06-30 2009-12-02 成都艾索语音技术有限公司 The dual stage process of high Precision Detection template from audio and video streams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于时空融合的视频分割算法研究 (Research on video segmentation algorithms based on spatio-temporal fusion); 李宏 et al.; 《信号处理》 (Signal Processing); 2009-01-31; Vol. 25, No. 1; pp. 72-76 *

Also Published As

Publication number Publication date
CN102024033A (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN102024033B (en) Method for automatically detecting audio templates and segmenting video into chapters
CN110322738B (en) Course optimization method, device and system
CN101616264B (en) Method and system for cataloging news video
CN101710490B (en) Method and device for compensating noise for voice assessment
CN102890778A (en) Content-based video detection method and device
CN101221760B (en) Audio matching method and system
CN104731954A (en) Music recommendation method and system based on group perspective
CN110213670A (en) Method for processing video frequency, device, electronic equipment and storage medium
CN101159834A (en) Method and system for detecting repeatable video and audio program fragment
TW201432674A (en) Audio identifying method and audio identification device using the same
CN104778230B (en) A kind of training of video data segmentation model, video data cutting method and device
CN110210294A (en) Evaluation method, device, storage medium and the computer equipment of Optimized model
CN107609149B (en) Video positioning method and device
CN103871424A (en) Online speaking people cluster analysis method based on bayesian information criterion
CN106098079A (en) Method and device for extracting audio signal
CN101727441B (en) Evaluating method and evaluating system targeting Chinese name identifying system
CN104505101A (en) Real-time audio comparison method
CN102623007B (en) Audio characteristic classification method based on variable duration
CN109712642A (en) It is a kind of that precisely quickly monitoring method is broadcasted in advertisement
Harb et al. Robust speech music discrimination using spectrum's first order statistics and neural networks
CN109857842A (en) A kind of method and device of report barrier text identification
Martens et al. The COST278 broadcast news segmentation and speaker clustering evaluation-overview, methodology, systems, results
CN108717851B (en) Voice recognition method and device
CN105843957A (en) Depth sorting method and system for microblogs
CN111382302A (en) Audio sample retrieval method based on variable speed template

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160120

Termination date: 20211201