CN102024033B - Method for automatically detecting an audio template and chaptering video - Google Patents

Info

Publication number
CN102024033B
Authority
CN
China
Prior art keywords
audio
template
segment
program
frame
Prior art date
Legal status
Expired - Fee Related
Application number
CN201010567970.1A
Other languages
Chinese (zh)
Other versions
CN102024033A (en)
Inventor
董远
王乐滋
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201010567970.1A priority Critical patent/CN102024033B/en
Publication of CN102024033A publication Critical patent/CN102024033A/en
Application granted granted Critical
Publication of CN102024033B publication Critical patent/CN102024033B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for automatically detecting audio templates and chaptering video. Using one week of program audio data, it rapidly learns segments with repeated content from voiceprint features, merges and clusters the segments into candidate templates, uses statistics on segment length, occurrence count, and temporal distribution to determine each template's type and to screen template files, and then uses the templates to chapter new programs automatically. Because the method is based on audio retrieval and builds the template library dynamically, it overcomes the heavy computation and slow detection of video-based methods, handles program segments whose audio content is identical while the picture content differs, and at the same time solves the problem of "static" templates in the database.

Description

Method for automatically detecting an audio template and chaptering video
Technical Field
The invention belongs to the field of copy detection of audio content in video programs and automatic program chaptering, and particularly relates to a method for automatically detecting an audio template and chaptering video.
Background
Chaptering a video program means marking specific segments (such as advertisements and program special effects) in a program of large data volume and long duration so that users can browse it conveniently.
At present, the traditional approach extracts and processes features of the video frames, i.e., it is image-based; station-logo detection and video identification are the common techniques.
Video identification can use the template information in a database to locate and mark segments quickly and accurately, but in current methods the templates are added manually, the database contents are relatively fixed, and data not in the database cannot be detected. Moreover, some program segments share the same audio content while their picture content differs each time, and such segments can be long, for example the review portion of a news program; common image-based detection methods do not apply to them. As for logo detection, more and more videos use the same logo in portions that should be treated as different chapters (e.g., commercials and programs), which renders logo detection ineffective.
Video-based methods also suffer from heavy computation and slow detection. Current audio-based chaptering methods are template-matching methods: templates are defined manually in a database in advance and then compared against the test audio data. Their drawback is likewise that the templates in the database are "static", so data not in the database cannot be detected.
Disclosure of Invention
To overcome the defects of the two approaches above, video-based detection and template-based audio detection, the invention provides a method for automatically detecting an audio template and chaptering video; it can learn audio templates quickly and robustly from audio files of large data volume and use the templates to chapter new videos accurately.
The invention provides a method for automatically detecting an audio template and chaptering video, which comprises a template learning stage and a video chaptering stage.
The template learning phase comprises the following steps:
1) Take the audio data of the past week as training samples and preprocess 7 days (7 × 24 hours) of 5513 Hz audio data: divide the whole 7 × 24 hours of audio into files of 1 hour each; segment each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, cluster the audio fragments, check the duration of each fragment, and splice fragments shorter than 3 seconds onto the shorter of their adjacent fragments. Then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judge whether each frame is a mute frame; the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where the sum runs over the $w$ sampling points in the window, $\mathrm{mean}_W$ is the mean over the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters. If $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment.
2) With a window length of 0.37 s and a step of 40 ms, perform a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula

$$\mathrm{Mel}(f) = 2595 \lg(1 + f/700),$$

convert the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and divide it equally into 17 sub-bands; compute the energy difference between each pair of adjacent sub-bands; output 1 if the difference is greater than or equal to a set threshold and 0 otherwise; this yields a 16-bit binary string as the feature value of each frame.
3) Build a hash table from the data of all frames of the week's audio; the key of the hash table is the 16-bit feature value, and the value stores the frame numbers having that feature and the segments they belong to. For each non-silent audio segment A, every frame hashes to the neighbouring frames that share its key; based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A. Then compute the similarity between A and each candidate one by one: for segments A and B, list the frames for which matching features can be found in chronological order; let the frame numbers of the frames in A that find a matching pair in B be $a_1, a_2, \ldots, a_m$, let the frame numbers of the frames in B matched by features in A be $b_1, b_2, \ldots, b_n$, and compute two coefficients $s_1$, $s_2$:

$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}$$

$$s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$

where $\chi(\cdot)$ is an indicator controlled by a set threshold $T$, and $N_A$, $N_B$ are the frame counts of the two segments. The similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients, typically $w_1 < w_2$; candidate segments with similarity $S$ greater than a threshold $T_1$ are kept as matching segments of A.
4) Keep each audio segment A whose number of found matching segments exceeds a threshold $T_2$; also judge whether, within a certain time interval of A, other segments find more matching segments than the set threshold. If so, the audio clip is kept; if not, it is deleted. This finally yields a series of audio segments that repeat within the week.
5) Using the start and end time information of the clips, splice and fuse the retained audio clips that belong to the same day. The pairwise fusion rule is: for two same-day segments A and B with start and end times Tas, Tae and Tbs, Tbe respectively, where Tae < Tbs, if |Tae − Tbs| < TDur, then A, B, and the interval between them are fused into one segment with start time Tas and end time Tbe.
6) Classify the fused segments. The classification principle is: if parts of two fused segments form a matched pair, the two segments belong to one class; classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
7) For the repeated segments of each content class, compute three indexes and judge the program type. The indexes are:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week. The three indexes are fused:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

where $c_1$, $c_2$, $c_3$ are three set weights. If Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
8) After the types are judged, screen the audio clips to build the template library; store the segment features together with the judged program type in the template library to generate template files.
Video chaptering stage:
For a new program, the system uses the files in the template library to perform copy detection on the program, finds the segments whose content matches a template file, and marks their time and type, in the following steps:
1) Extract features from the new audio program by the method of steps 2 and 3 of the template learning stage, build a hash table, and match the files in the template library against the new video program one by one;
2) For a template A, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
3) Align the features in A in time with the matched features, compute the frame-by-frame Hamming distance $h_i$ between the template file and the part of the program audio that overlaps it in time, and divide the summed distance by the number of overlapping frames to obtain the similarity distance score

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

where overlap is the number of frames in the overlapping part of the program and the template.
4) Take the program audio parts whose score is below a set threshold as candidate matching segments of the template, the one with the smallest score being the best match; any other candidate segment whose time interval from the best segment is greater than the time-interval threshold and whose score $D_{score}$ differs from the best score by less than the set score-offset threshold is also regarded as a matching segment. Mark the start time and duration of each overlapping part and label the part with the template type.
The method takes the weekly repetition of specific content segments as its point of leverage and uses voiceprint features to find the repeated segments quickly in a large amount of data; the similarity judgment method and the stability of the repetition guarantee search accuracy; the program type of each audio template is judged from the segment durations, repetition counts, and the variance of their distribution in time. Furthermore, the learned audio templates are used to chapter new programs automatically, guaranteeing both speed and accurate time positioning. Being based on audio retrieval with a dynamically built template library, the invention overcomes the heavy computation and slow detection of video-based methods and their failure when program segments have identical audio content but different picture content, and it also solves the problem of "static" templates in the database.
Drawings
FIG. 1 is a flowchart of the template learning part of the method for automatically detecting an audio template and chaptering video according to the present invention;
FIG. 2 is a flowchart of the video chaptering part of the method according to the present invention;
FIG. 3 is the overall architecture of the method and system according to the present invention;
FIG. 4 is a schematic diagram of the window length and step used for audio feature extraction;
FIG. 5 is a schematic diagram of the calculation of the distance score between a template audio clip and the program audio in the video chaptering stage.
Detailed Description
The invention is further described in the following with reference to the drawings and the detailed description, which will enable those skilled in the art to understand and implement the technical solution proposed by the invention without any creative effort.
The technical problems to be solved by the invention include:
1. learning template files from a large amount of past program audio data and dynamically building a template library;
2. dividing an audio file and extracting robust voiceprint features which are beneficial to fast searching and matching;
3. measuring the similarity between two audio segments from the extracted features;
4. clustering audio segments, judging the program type of each audio class and selecting a template file from each audio class;
5. matching the new program against the files in the template library and then chaptering the program.
In view of the above technical problems, the present invention provides a method for automatically detecting an audio template and video chaptering, which includes two stages of template learning and video chaptering.
With reference to fig. 1, the template learning stage of the method for automatically detecting an audio template and video chapters includes the following steps:
Step 101: preferably, the invention takes the program data of the past week as training data and learns the template files from it; every week, new templates are learned from the most recent week of program data and added to the template library. Preprocess the 7 days (7 × 24 hours) of 5513 Hz audio data: divide the whole 7 × 24 hours of audio into files of 1 hour each; segment each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, cluster the audio fragments, check the duration of each fragment, and splice fragments shorter than 3 seconds onto the shorter of their adjacent fragments. Then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judge whether each frame is a mute frame; the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where the sum runs over the $w$ sampling points in the window, $\mathrm{mean}_W$ is the mean over the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters. If $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment.
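For concreteness, the following Python sketch implements the silence test of step 101 on a 5513 Hz mono sample array. The frame layout (0.37 s window, 40 ms step) follows the text; the exact algebraic form of the threshold is reconstructed from the garbled formula above, and the values of alpha and beta are illustrative assumptions, since the patent only calls them "set parameters".

```python
import numpy as np

def silence_mask(x, sr=5513, win_s=0.37, step_s=0.04, alpha=4.0, beta=2.0):
    """Flag each 0.37 s frame (40 ms step) of a 5513 Hz signal as mute.

    alpha and beta are the patent's unspecified 'set parameters';
    the values here are illustrative only.
    """
    win, step = int(win_s * sr), int(step_s * sr)
    frames = [x[s:s + win] for s in range(0, len(x) - win + 1, step)]
    # Frame energy with the window mean removed (reconstruction of the
    # garbled eFr formula in the text).
    e = np.array([np.sum((f - f.mean()) ** 2) for f in frames])
    n = len(e)
    # Threshold: scaled average energy plus a multiple of the minimum.
    te = e.sum() / (alpha * n) + beta * e.min()
    return e <= te   # True where the frame is judged mute
```

A segment is then declared mute when more than half of its frames are flagged.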
Step 102: with a window length of 0.37 s and a step of 40 ms, perform a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula

$$\mathrm{Mel}(f) = 2595 \lg(1 + f/700),$$

convert the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and divide it equally into 17 sub-bands; compute the energy difference between each pair of adjacent sub-bands; output 1 if the difference is greater than or equal to a set threshold and 0 otherwise; extract the resulting 16-bit binary string as the feature value of each frame.
As shown in FIG. 4, frame 1 uses the sampled data from 0 to 0.37 seconds for the discrete Fourier transform; the 20 Hz to 3000 Hz part of its actual frequency band is converted to the Mel band and divided equally into 17 sub-bands, and the energy difference between each pair of adjacent sub-bands is computed; the output is 1 if the difference is greater than or equal to the set threshold and 0 otherwise, yielding a 16-bit binary string as the feature value of frame 1. The window is then slid by 40 ms, i.e., the steps above are repeated with the sampled data from 0.04 s to 0.41 s to extract the 16-bit binary string of frame 2, and so on until features have been extracted for all audio frames.
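The voiceprint of step 102 can be sketched as follows; the Mel formula and the 17 equal Mel sub-bands over 20 Hz to 3000 Hz come from the text, while the use of a plain power spectrum without windowing, and the clipping of the 3000 Hz edge to the Nyquist frequency of a 5513 Hz signal, are simplifying assumptions.

```python
import numpy as np

def mel(f):
    # Mel(f) = 2595 * lg(1 + f / 700), as given in the text
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_fingerprint(frame, sr=5513, f_lo=20.0, f_hi=3000.0,
                      n_bands=17, thresh=0.0):
    """16-bit fingerprint of one 0.37 s frame: for each of the 16 pairs
    of adjacent Mel sub-bands, emit 1 if the energy difference is >= the
    set threshold, else 0."""
    # The text's 3000 Hz upper edge slightly exceeds the Nyquist
    # frequency of a 5513 Hz signal, so clip it to sr / 2.
    f_hi = min(f_hi, sr / 2.0)
    spec = np.abs(np.fft.rfft(frame)) ** 2              # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # 17 bands equally spaced on the Mel scale between f_lo and f_hi.
    edges = inv_mel(np.linspace(mel(f_lo), mel(f_hi), n_bands + 1))
    band_e = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                       for lo, hi in zip(edges[:-1], edges[1:])])
    bits = (np.diff(band_e) >= thresh).astype(int)      # 16 bits
    return int("".join(map(str, bits)), 2)              # pack to an int
```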
Step 103: build a hash table from the data of all frames of the week's audio; the key of the hash table is the 16-bit feature value, and the value stores the frame numbers having that feature and the segments they belong to. For each non-silent audio segment A, every frame hashes to the neighbouring frames that share its key; based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A. The similarity between A and the candidate matching segments is then computed one by one.
For segment A and one of its candidate matching segments B, list the frames for which matching features can be found in chronological order; let the frame numbers of the frames in A that find a matching pair in B be $a_1, a_2, \ldots, a_m$, let the frame numbers of the frames in B matched by features in A be $b_1, b_2, \ldots, b_n$, and compute two coefficients $s_1$, $s_2$:
$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}$$

$$s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$
where $\chi(\cdot)$ is an indicator controlled by a set threshold $T$, and $N_A$, $N_B$ are the frame counts of the two segments; preferably, $T$ takes the value 3. The similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients with $w_1 < w_2$; preferably $w_1 = 1/3$ and $w_2 = 2/3$. Candidate segments with similarity $S$ greater than the threshold $T_1$ are kept as matching segments of A; preferably $T_1$ is set to 0.5.
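A minimal sketch of the hash-table lookup and the s1/s2 similarity of step 103 follows. The continuity indicator chi is a reconstruction (a matched frame counts when the next matched frame number lies within the set threshold T), since its defining formula is not reproduced in the text, and the helper names are hypothetical.

```python
from collections import defaultdict

def build_hash_table(segments):
    """segments: {seg_id: [fp, fp, ...]}, one 16-bit fingerprint per
    frame. Returns {fingerprint: [(seg_id, frame_idx), ...]}."""
    table = defaultdict(list)
    for seg_id, fps in segments.items():
        for i, fp in enumerate(fps):
            table[fp].append((seg_id, i))
    return table

def chi(hit_frames, T=3):
    # Reconstructed continuity indicator: a matched frame counts when
    # the next matched frame number is within the set threshold T.
    return sum(1 for a, b in zip(hit_frames, hit_frames[1:]) if b - a <= T)

def similarity(fps_a, fps_b, T=3, w1=1/3, w2=2/3):
    """S = w1*s1 + w2*s2 of step 103 (w1 < w2; 1/3 and 2/3 per the text)."""
    set_a, set_b = set(fps_a), set(fps_b)
    a_hits = [i for i, fp in enumerate(fps_a) if fp in set_b]  # a_1..a_m
    b_hits = [i for i, fp in enumerate(fps_b) if fp in set_a]  # b_1..b_n
    denom = 2 * min(len(fps_a), len(fps_b))                    # 2*min(NA, NB)
    s1 = (len(a_hits) + len(b_hits)) / denom
    s2 = (chi(a_hits, T) + chi(b_hits, T)) / denom
    return w1 * s1 + w2 * s2
```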
Step 104: keep each audio segment A whose number of found matching segments exceeds a threshold $T_2$; for the one-week training sample used here, $T_2$ is preferably set to 7. Also judge whether, within a certain time interval of A, other segments find more matching segments than the set threshold $T_2$; if so, the audio clip is kept, otherwise it is deleted. This finally yields a series of audio segments that repeat within the week.
Step 105: using the start and end time information of the clips, splice and fuse the retained audio clips that belong to the same day. The pairwise fusion rule is: for two same-day segments A and B with start and end times Tas, Tae and Tbs, Tbe respectively, where Tae < Tbs, if |Tae − Tbs| < TDur (TDur preferably set to 10 seconds), then A, B, and the interval between them are fused into one segment with start time Tas and end time Tbe.
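The pairwise fusion rule of step 105 reduces to a single pass over the day's segments sorted by start time, as in this sketch (TDur = 10 s per the text):

```python
def fuse_same_day(clips, t_dur=10.0):
    """clips: list of (start, end) times in seconds for one day's
    retained segments. Merges A and B (and the gap between them)
    whenever B starts less than t_dur seconds after A ends."""
    merged = []
    for start, end in sorted(clips):
        if merged and start - merged[-1][1] < t_dur:
            merged[-1][1] = max(merged[-1][1], end)  # absorb B into A
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]
```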
Step 106: classify the fused segments. The classification principle is: if parts of two fused segments form a matched pair, the two segments belong to one class; that is, if some data in segment A and some data in segment B were judged to match in step 104, A and B are put in one class. Classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
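The class rule of step 106 is a transitive closure over matched pairs, which union-find computes directly; a sketch under the assumption that segments are indexed 0..n−1:

```python
from collections import defaultdict

def cluster_segments(n, match_pairs):
    """Transitive closure of 'matched' pairs over segments 0..n-1:
    if A~B and B~C then A, B, C end up in one class (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in match_pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(list)
    for i in range(n):
        classes[find(i)].append(i)
    return list(classes.values())
```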
Step 107: for the repeated segments of each content class, compute three indexes and judge the program type. The indexes are:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week. The three indexes are fused:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

where $c_1$, $c_2$, $c_3$ are three set weights. If Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
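A sketch of the three indexes and the fused Type score of step 107; the weights c1..c3 and the thresholds T1, T2 are unspecified in the text, so the values below are placeholders, and normalizing the occurrence count T_k like the other two indexes is an added assumption.

```python
import numpy as np

def classify_classes(classes, c=(0.4, 0.3, 0.3), t1=0.3, t2=0.6):
    """classes: list of classes, each a list of (start, end) times in
    seconds from the start of the week. c, t1, t2 are placeholders."""
    avg_len = np.array([np.mean([e - s for s, e in cl]) for cl in classes])
    week_center = 3.5 * 24 * 3600.0          # central time of the week
    var = np.array([np.mean([((s + e) / 2 - week_center) ** 2
                             for s, e in cl]) for cl in classes])
    count = np.array([len(cl) for cl in classes], dtype=float)

    dur = avg_len ** 2 / (avg_len ** 2).max()     # index 1
    distrb = var / var.max()                      # index 2
    t = count / count.max()                       # index 3, normalized here
    score = c[0] * dur + c[1] * distrb + c[2] * t

    return ["program effect" if v < t1
            else "station promo" if v < t2
            else "advertisement" for v in score]
```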
Step 108: after the types are judged, screen the audio clips to build the template library; store the segment features together with the judged program type in the template library to generate template files.
With reference to FIG. 2, in the video chaptering stage of the method, for a new program the system uses the files in the template library to perform copy detection on the program, finds the segments whose content matches a template file, and marks their time and type, in the following steps:
Step 201: by the same method as steps 102 and 103 of the template learning stage, extract features from the audio of the new program and build a hash table; then match the files in the template library against the new video program one by one, as described in steps 202, 203, and 204;
Step 202: for a template audio segment A, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
Step 203: align the features in A in time with the matched features, compute the frame-by-frame Hamming distance between the template file and the part of the program audio that overlaps it in time, and divide the summed distance by the number of overlapping frames to obtain the similarity score;
As shown in FIG. 5, suppose that in step 202 frame 3 of the template audio segment A is found to match frame 6 of the new program's audio file. Frame 3 of A is then aligned in time with frame 6 of the program, and the frame-by-frame Hamming distance between A and the overlapping part of the program is computed, i.e., Hamming distances $h_i$ are computed between frames 1 to m of A and frames 4 to m+3 of the program. The distance score is then

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

where overlap is the number of frames in the overlapping part of the program and the template; in this example overlap equals m, the number of frames of A.
Step 204: take the program audio parts whose score is below a set threshold as candidate matching segments of the template, the one with the smallest score being the best match. Then, for any other candidate segment whose time interval from the best segment is greater than the time-interval threshold (preferably set to 1.2 times the duration of the template segment) and whose score differs from the best score by less than the set score-offset threshold (preferably set to 2), that candidate is still regarded as a matching segment. Mark the start time and duration of each overlapping part and label the part with the template type.
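Steps 202 to 204 can be sketched as follows: candidate alignments come from hash hits, each alignment is scored by the mean per-frame Hamming distance over the overlap, and candidates are kept by the rules of step 204. The score threshold d_max is a placeholder; the 1.2 × template-length interval threshold and the score offset of 2 follow the text.

```python
def hamming16(a, b):
    return bin(a ^ b).count("1")   # differing bits of two 16-bit values

def d_score(template_fps, program_fps, offset):
    """Mean per-frame Hamming distance over the frames the template
    overlaps when its first frame is aligned at program frame offset."""
    overlap = min(len(template_fps), len(program_fps) - offset)
    total = sum(hamming16(template_fps[i], program_fps[offset + i])
                for i in range(overlap))
    return total / overlap

def match_template(template_fps, program_fps, offsets,
                   d_max=3.0, d_off=2.0):
    """offsets: candidate alignments from the hash-table hits.
    Keeps the best-scoring alignment, plus any candidate far enough
    away in time whose score is within d_off of the best."""
    scored = sorted((d_score(template_fps, program_fps, o), o)
                    for o in set(offsets)
                    if 0 <= o < len(program_fps))
    if not scored or scored[0][0] >= d_max:
        return []
    best_score, best_off = scored[0]
    keep = [best_off]
    gap = int(1.2 * len(template_fps))   # time-interval threshold
    for s, o in scored[1:]:
        if (s < d_max and s - best_score < d_off
                and all(abs(o - k) > gap for k in keep)):
            keep.append(o)
    return keep
```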

Claims (1)

1. A method for automatically detecting an audio template and chaptering a video program, characterized in that, taking as its point of leverage the information that specific segments repeat in content within one week, audio templates can be learned quickly and robustly from one week of audio data and used to chapter new programs accurately; the method comprises a template learning stage and a video chaptering stage, the template learning stage comprising the following steps:
firstly, preprocessing a program audio file of a week and judging a mute segment;
step two, extracting robust voiceprint characteristics for each audio segment;
thirdly, establishing a hash table by using the characteristics of the audio data of one week, and searching for a matched segment;
step four, reserving the audio segments A with the number of the matching segments larger than the threshold value in the segments obtained in the step three, and judging whether other segments can find the matching segments with the number larger than the set threshold value within a certain time interval; if yes, the audio clip is reserved, and if not, the audio clip is deleted; finally, a series of audio clips which repeatedly appear on the content in one week are obtained;
step five, in the segments screened in the step four, for two segments A, B on the same day, the starting time of A is Tas, the ending time is Tae, the starting time and the ending time of B are Tbs and Tbe respectively, wherein Tae is less than Tbs, if | Tae-Tbs | < TDur, the segment A, B and the interval part of the two segments are fused into one segment, the starting time is Tas, and the ending time is Tbe;
step six, clustering the fragments fused in step five into a number of audio classes, the classification principle being: if parts of two fused fragments form a matched pair, the two fragments are classified into one class; classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class;
step seven, judging the program type of each class sorted in the step six;
step eight, within each audio class, keeping the longest of the repeated audio clips that form matched pairs, and storing its features together with the judged program type in the template library to generate a template file;
wherein step one specifically comprises: taking the audio data of the past week as training samples and dividing the 5513 Hz audio data into files of 1 hour each; segmenting each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, clustering the audio fragments, checking the duration of each fragment, and splicing fragments shorter than 3 seconds onto the shorter of their adjacent fragments; then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judging whether each frame is a mute frame, wherein the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where $w$ is the number of sampling points in the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters; if $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment;
wherein step two specifically comprises: with a window length of 0.37 s and a step of 40 ms, performing a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula $\mathrm{Mel}(f) = 2595 \lg(1 + f/700)$, converting the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and dividing it equally into 17 sub-bands; computing the energy difference between each pair of adjacent sub-bands; outputting 1 if the difference is greater than or equal to a set threshold and 0 otherwise; extracting the resulting 16-bit binary string as the feature value of each frame of audio data;
wherein step three specifically comprises: building a hash table from the data of all frames of the week's audio, the key of the hash table being the 16-bit feature value and the value storing the frame numbers having that feature and the segments they belong to; every frame of each non-silent audio segment A hashes to the neighbouring frames sharing its key, and, based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A; then computing the similarity between segment A and the candidate matching segments one by one and keeping the candidates whose similarity exceeds a threshold as the matching segments of A;
in the video chaptering stage, for a new program, the system uses the files in the template library to perform copy detection on the new program, finds the segments whose content matches a template file, and marks their time and type, comprising the following steps:
step one, extracting features from the new audio program and building a hash table, consistent with the methods of step two and step three of the template learning stage;
step two, matching the files in the template library against the new video program one by one; for each template, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
step three, calculating a similarity distance score Dscore between the template file and the data of the new program part;
selecting and calibrating a segment matched with the template file from the new program;
the third step of the video chaptering stage specifically comprises: aligning the features of the template file in time with the matched features in the program file, computing the frame-by-frame Hamming distance $h_i$ between the template file and the temporally overlapping part of the program audio, and dividing the summed distance by the number of overlapping frames to obtain the similarity distance score

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

wherein overlap is the number of frames in the overlapping part of the program and the template;
the fourth step of the video chaptering stage specifically comprises: taking the program audio parts whose score is below a set threshold as candidate matching segments of the template, the segment with the smallest score being the best matching segment; then, if a candidate segment's time interval from the best matching segment is greater than the time-interval threshold and the difference between its similarity distance score $D_{score}$ computed in step three and that of the best matching segment is smaller than the set score-offset threshold, the candidate is still regarded as a matching segment, wherein the time-interval threshold equals 1.2 times the template duration and the score-offset threshold equals 2; marking the start time and duration of the overlapping part and labeling that part of the program with the template type;
in the above method for automatically detecting an audio template and chaptering a video program, the template learning stage is characterized in that the similarity between two segments A and B in step three is judged as follows: for the two segments A and B, the frames for which matching features can be found are listed in chronological order; the frame numbers of the frames in A that find a matching pair in B are $a_1, a_2, \ldots, a_m$, and the frame numbers of the frames in B matched by features in A are $b_1, b_2, \ldots, b_n$; two coefficients $s_1$, $s_2$ are computed:

$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}, \qquad s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$

wherein $\chi(\cdot)$ is an indicator controlled by a set threshold $T$; the similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients; candidate segments with similarity $S$ greater than the threshold $T_1$ are kept as matching segments of segment A;
in the above method for automatically detecting an audio template and chaptering video programs, the template learning stage is further characterized in that step seven comprises computing three indexes and judging the program type of each audio class:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week; the three indexes are then fused and the program type judged;

the fusion of the three indexes and the judgment of the program type of the template file specifically comprise computing the fusion coefficient Type and comparing it with set thresholds:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

wherein $c_1$, $c_2$, $c_3$ are three set weights; if Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
CN201010567970.1A 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video Expired - Fee Related CN102024033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Publications (2)

Publication Number Publication Date
CN102024033A CN102024033A (en) 2011-04-20
CN102024033B true CN102024033B (en) 2016-01-20

Family

ID=43865330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010567970.1A Expired - Fee Related CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Country Status (1)

Country Link
CN (1) CN102024033B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103379364B (en) * 2012-04-26 2018-08-03 腾讯科技(深圳)有限公司 Processing method, device, video server and the system of video file
CN103021440B (en) * 2012-11-22 2015-04-22 腾讯科技(深圳)有限公司 Method and system for tracking audio streaming media
CN103237233B (en) * 2013-03-28 2017-01-25 深圳Tcl新技术有限公司 Rapid detection method and system for television commercials
CN104091598A (en) * 2013-04-18 2014-10-08 腾讯科技(深圳)有限公司 Audio file similarity calculation method and device
CN105185401B (en) * 2015-08-28 2019-01-01 广州酷狗计算机科技有限公司 The method and device of synchronized multimedia listed files
CN106548793A (en) * 2015-09-16 2017-03-29 中兴通讯股份有限公司 Storage and the method and apparatus for playing audio file
CN106331844A (en) * 2016-08-17 2017-01-11 北京金山安全软件有限公司 Method and device for generating subtitles of media file and electronic equipment
CN108253977B (en) * 2016-12-28 2020-11-24 沈阳美行科技有限公司 Generation method and generation device of incremental data for updating navigation data
CN107609149B (en) * 2017-09-21 2020-06-19 北京奇艺世纪科技有限公司 Video positioning method and device
CN108513140B (en) * 2018-03-05 2020-10-16 北京明略昭辉科技有限公司 Method for screening repeated advertisement segments in audio and generating wool audio
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108763492A (en) * 2018-05-29 2018-11-06 四川远鉴科技有限公司 A kind of audio template extracting method and device
CN109087669B (en) * 2018-10-23 2021-03-02 腾讯科技(深圳)有限公司 Audio similarity detection method and device, storage medium and computer equipment
CN109547850B (en) * 2018-11-22 2021-04-06 杭州秋茶网络科技有限公司 Video shooting error correction method and related product
CN110400559B (en) * 2019-06-28 2020-09-29 北京达佳互联信息技术有限公司 Audio synthesis method, device and equipment
CN110717063B (en) * 2019-10-18 2022-02-11 上海华讯网络系统有限公司 Method and system for verifying and selectively archiving IP telephone recording file
CN111883139A (en) * 2020-07-24 2020-11-03 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for screening target voices
CN111863023B (en) * 2020-09-22 2021-01-08 深圳市声扬科技有限公司 Voice detection method and device, computer equipment and storage medium
CN115205635B (en) * 2022-09-13 2022-12-02 有米科技股份有限公司 Weak supervision self-training method and device of image-text semantic alignment model


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420618A (en) * 2008-12-02 2009-04-29 西安交通大学 Adaptive telescopic video encoding and decoding construction design method based on interest zone
CN101594527A (en) * 2009-06-30 2009-12-02 成都艾索语音技术有限公司 The dual stage process of high Precision Detection template from audio and video streams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Video Segmentation Algorithms Based on Spatio-Temporal Fusion; Li Hong et al.; Signal Processing; 2009-01-31; Vol. 25, No. 1; pp. 72-76 *

Also Published As

Publication number Publication date
CN102024033A (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN102024033B (en) Method for automatically detecting an audio template and chaptering video
CN102799605B (en) A kind of advertisement detecting method and system
WO2021000909A1 (en) Curriculum optimisation method, apparatus, and system
US11983919B2 (en) Video anomaly detection method based on human-machine cooperation
CN107305541B (en) Method and device for segmenting speech recognition text
Zhang et al. Automatic parsing and indexing of news video
US7765574B1 (en) Automated segmentation and information extraction of broadcast news via finite state presentation model
CN106878632B (en) Video data processing method and device
Snoek et al. Multimedia event-based video indexing using time intervals
Qi et al. Integrating visual, audio and text analysis for news video
CN101821734B (en) Detection and classification of matches between time-based media
CN107515934B (en) Movie semantic personalized tag optimization method based on big data
CN109446376B (en) Method and system for classifying voice through word segmentation
CN106792005B (en) Content detection method based on audio and video combination
CN102436483A (en) Video advertisement detecting method based on explicit type sharing subspace
CN107609149B (en) Video positioning method and device
CN112699787A (en) Method and device for detecting advertisement insertion time point
Hanjalic et al. Semiautomatic news analysis, indexing, and classification system based on topic preselection
CN113194332B (en) Multi-policy-based new advertisement discovery method, electronic device and readable storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
KR101389730B1 (en) Method to create split position accordance with subjects for the video file
CN114048335A (en) Knowledge base-based user interaction method and device
CN111723235A (en) Music content identification method, device and equipment
CN117725194A (en) Personalized pushing method, system, equipment and storage medium for futures data
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160120

Termination date: 20211201