CN102024033B - Method for automatically detecting an audio template and chaptering video - Google Patents

Info

Publication number
CN102024033B
Authority
CN
China
Prior art keywords
audio
template
segment
program
frame
Prior art date
Legal status
Expired - Fee Related
Application number
CN201010567970.1A
Other languages
Chinese (zh)
Other versions
CN102024033A (en)
Inventor
董远
王乐滋
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201010567970.1A priority Critical patent/CN102024033B/en
Publication of CN102024033A publication Critical patent/CN102024033A/en
Application granted granted Critical
Publication of CN102024033B publication Critical patent/CN102024033B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for automatically detecting audio templates and chaptering video. Using one week of program audio data, it rapidly learns segments with repeated content from voiceprint features, merges and clusters the segments into candidate templates, uses statistics on segment length, occurrence count, and temporal distribution to determine each template's type and to screen template files, and then uses the templates to chapter new programs automatically. Because the method is based on audio retrieval and builds the template library dynamically, it overcomes the heavy computation and slow detection of video-based methods, handles program segments whose audio content is identical while the picture content differs, and at the same time solves the problem of "static" templates in the database.

Description

Method for automatically detecting an audio template and chaptering video
Technical Field
The invention belongs to the field of copy detection of audio content in video programs and automatic program chaptering, and particularly relates to a method for automatically detecting an audio template and chaptering video.
Background
Chaptering a video program means marking specific segments (such as advertisements and program special effects) in a program of large data volume and long duration so that users can browse it conveniently.
At present, the traditional approach extracts and processes features of the video frames, i.e., it is image-based; station-logo detection and video identification are the common techniques.
Video identification can use the template information in a database to locate and mark segments quickly and accurately, but in current methods the templates are added manually, the database contents are relatively fixed, and data not in the database cannot be detected. Moreover, some program segments share the same audio content while their picture content differs each time, and such segments can be long, for example the review portion of a news program; common image-based detection methods do not apply to them. As for logo detection, more and more videos use the same logo in portions that should be treated as different chapters (e.g., commercials and programs), which renders logo detection ineffective.
Video-based methods also suffer from heavy computation and slow detection. Current audio-based chaptering methods are template-matching methods: templates are defined manually in a database in advance and then compared against the test audio data. Their drawback is likewise that the templates in the database are "static", so data not in the database cannot be detected.
Disclosure of Invention
To overcome the defects of the two approaches above, video-based detection and template-based audio detection, the invention provides a method for automatically detecting an audio template and chaptering video; it can learn audio templates quickly and robustly from audio files of large data volume and use the templates to chapter new videos accurately.
The invention provides a method for automatically detecting an audio template and chaptering video, which comprises a template learning stage and a video chaptering stage.
The template learning phase comprises the following steps:
1) Take the audio data of the past week as training samples and preprocess 7 days (7 × 24 hours) of 5513 Hz audio data: divide the whole 7 × 24 hours of audio into files of 1 hour each; segment each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, cluster the audio fragments, check the duration of each fragment, and splice fragments shorter than 3 seconds onto the shorter of their adjacent fragments. Then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judge whether each frame is a mute frame; the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where the sum runs over the $w$ sampling points in the window, $\mathrm{mean}_W$ is the mean over the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters. If $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment.
2) With a window length of 0.37 s and a step of 40 ms, perform a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula

$$\mathrm{Mel}(f) = 2595 \lg(1 + f/700),$$

convert the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and divide it equally into 17 sub-bands; compute the energy difference between each pair of adjacent sub-bands; output 1 if the difference is greater than or equal to a set threshold and 0 otherwise; this yields a 16-bit binary string as the feature value of each frame.
3) Build a hash table from the data of all frames of the week's audio; the key of the hash table is the 16-bit feature value, and the value stores the frame numbers having that feature and the segments they belong to. For each non-silent audio segment A, every frame hashes to the neighbouring frames that share its key; based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A. Then compute the similarity between A and each candidate one by one: for segments A and B, list the frames for which matching features can be found in chronological order; let the frame numbers of the frames in A that find a matching pair in B be $a_1, a_2, \ldots, a_m$, let the frame numbers of the frames in B matched by features in A be $b_1, b_2, \ldots, b_n$, and compute two coefficients $s_1$, $s_2$:

$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}$$

$$s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$

where $\chi(\cdot)$ is an indicator controlled by a set threshold $T$, and $N_A$, $N_B$ are the frame counts of the two segments. The similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients, typically $w_1 < w_2$; candidate segments with similarity $S$ greater than a threshold $T_1$ are kept as matching segments of A.
4) Keep each audio segment A whose number of found matching segments exceeds a threshold $T_2$; also judge whether, within a certain time interval of A, other segments find more matching segments than the set threshold. If so, the audio clip is kept; if not, it is deleted. This finally yields a series of audio segments that repeat within the week.
5) Using the start and end time information of the clips, splice and fuse the retained audio clips that belong to the same day. The pairwise fusion rule is: for two same-day segments A and B with start and end times Tas, Tae and Tbs, Tbe respectively, where Tae < Tbs, if |Tae − Tbs| < TDur, then A, B, and the interval between them are fused into one segment with start time Tas and end time Tbe.
6) Classify the fused segments. The classification principle is: if parts of two fused segments form a matched pair, the two segments belong to one class; classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
7) For the repeated segments of each content class, compute three indexes and judge the program type. The indexes are:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week. The three indexes are fused:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

where $c_1$, $c_2$, $c_3$ are three set weights. If Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
8) After the types are judged, screen the audio clips to build the template library; store the segment features together with the judged program type in the template library to generate template files.
Video chaptering stage:
For a new program, the system uses the files in the template library to perform copy detection on the program, finds the segments whose content matches a template file, and marks their time and type, in the following steps:
1) Extract features from the new audio program by the method of steps 2 and 3 of the template learning stage, build a hash table, and match the files in the template library against the new video program one by one;
2) For a template A, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
3) Align the features in A in time with the matched features, compute the frame-by-frame Hamming distance $h_i$ between the template file and the part of the program audio that overlaps it in time, and divide the summed distance by the number of overlapping frames to obtain the similarity distance score

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

where overlap is the number of frames in the overlapping part of the program and the template.
4) Take the program audio parts whose score is below a set threshold as candidate matching segments of the template, the one with the smallest score being the best match; any other candidate segment whose time interval from the best segment is greater than the time-interval threshold and whose score $D_{score}$ differs from the best score by less than the set score-offset threshold is also regarded as a matching segment. Mark the start time and duration of each overlapping part and label the part with the template type.
The method takes the weekly repetition of specific content segments as its point of leverage and uses voiceprint features to find the repeated segments quickly in a large amount of data; the similarity judgment method and the stability of the repetition guarantee search accuracy; the program type of each audio template is judged from the segment durations, repetition counts, and the variance of their distribution in time. Furthermore, the learned audio templates are used to chapter new programs automatically, guaranteeing both speed and accurate time positioning. Being based on audio retrieval with a dynamically built template library, the invention overcomes the heavy computation and slow detection of video-based methods and their failure when program segments have identical audio content but different picture content, and it also solves the problem of "static" templates in the database.
Drawings
FIG. 1 is a flowchart of the template learning part of the method for automatically detecting an audio template and chaptering video according to the present invention;
FIG. 2 is a flowchart of the video chaptering part of the method according to the present invention;
FIG. 3 is the overall architecture of the method and system according to the present invention;
FIG. 4 is a schematic diagram of the window length and step used for audio feature extraction;
FIG. 5 is a schematic diagram of the calculation of the distance score between a template audio clip and the program audio in the video chaptering stage.
Detailed Description
The invention is further described in the following with reference to the drawings and the detailed description, which will enable those skilled in the art to understand and implement the technical solution proposed by the invention without any creative effort.
The technical problems to be solved by the invention include:
1. learning template files from a large amount of past program audio data and dynamically building a template library;
2. dividing an audio file and extracting robust voiceprint features which are beneficial to fast searching and matching;
3. measuring the similarity between two audio segments from the extracted features;
4. clustering audio segments, judging the program type of each audio class and selecting a template file from each audio class;
5. matching the new program against the files in the template library and then chaptering the program.
In view of the above technical problems, the present invention provides a method for automatically detecting an audio template and video chaptering, which includes two stages of template learning and video chaptering.
With reference to fig. 1, the template learning stage of the method for automatically detecting an audio template and video chapters includes the following steps:
Step 101: preferably, the invention takes the program data of the past week as training data and learns the template files from it; every week, new templates are learned from the most recent week of program data and added to the template library. Preprocess the 7 days (7 × 24 hours) of 5513 Hz audio data: divide the whole 7 × 24 hours of audio into files of 1 hour each; segment each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, cluster the audio fragments, check the duration of each fragment, and splice fragments shorter than 3 seconds onto the shorter of their adjacent fragments. Then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judge whether each frame is a mute frame; the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where the sum runs over the $w$ sampling points in the window, $\mathrm{mean}_W$ is the mean over the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters. If $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment.
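For concreteness, the following Python sketch implements the silence test of step 101 on a 5513 Hz mono sample array. The frame layout (0.37 s window, 40 ms step) follows the text; the exact algebraic form of the threshold is reconstructed from the garbled formula above, and the values of alpha and beta are illustrative assumptions, since the patent only calls them "set parameters".

```python
import numpy as np

def silence_mask(x, sr=5513, win_s=0.37, step_s=0.04, alpha=4.0, beta=2.0):
    """Flag each 0.37 s frame (40 ms step) of a 5513 Hz signal as mute.

    alpha and beta are the patent's unspecified 'set parameters';
    the values here are illustrative only.
    """
    win, step = int(win_s * sr), int(step_s * sr)
    frames = [x[s:s + win] for s in range(0, len(x) - win + 1, step)]
    # Frame energy with the window mean removed (reconstruction of the
    # garbled eFr formula in the text).
    e = np.array([np.sum((f - f.mean()) ** 2) for f in frames])
    n = len(e)
    # Threshold: scaled average energy plus a multiple of the minimum.
    te = e.sum() / (alpha * n) + beta * e.min()
    return e <= te   # True where the frame is judged mute
```

A segment is then declared mute when more than half of its frames are flagged.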
Step 102: with a window length of 0.37 s and a step of 40 ms, perform a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula

$$\mathrm{Mel}(f) = 2595 \lg(1 + f/700),$$

convert the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and divide it equally into 17 sub-bands; compute the energy difference between each pair of adjacent sub-bands; output 1 if the difference is greater than or equal to a set threshold and 0 otherwise; extract the resulting 16-bit binary string as the feature value of each frame.
As shown in FIG. 4, frame 1 uses the sampled data from 0 to 0.37 seconds for the discrete Fourier transform; the 20 Hz to 3000 Hz part of its actual frequency band is converted to the Mel band and divided equally into 17 sub-bands, and the energy difference between each pair of adjacent sub-bands is computed; the output is 1 if the difference is greater than or equal to the set threshold and 0 otherwise, yielding a 16-bit binary string as the feature value of frame 1. The window is then slid by 40 ms, i.e., the steps above are repeated with the sampled data from 0.04 s to 0.41 s to extract the 16-bit binary string of frame 2, and so on until features have been extracted for all audio frames.
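The voiceprint of step 102 can be sketched as follows; the Mel formula and the 17 equal Mel sub-bands over 20 Hz to 3000 Hz come from the text, while the use of a plain power spectrum without windowing, and the clipping of the 3000 Hz edge to the Nyquist frequency of a 5513 Hz signal, are simplifying assumptions.

```python
import numpy as np

def mel(f):
    # Mel(f) = 2595 * lg(1 + f / 700), as given in the text
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_fingerprint(frame, sr=5513, f_lo=20.0, f_hi=3000.0,
                      n_bands=17, thresh=0.0):
    """16-bit fingerprint of one 0.37 s frame: for each of the 16 pairs
    of adjacent Mel sub-bands, emit 1 if the energy difference is >= the
    set threshold, else 0."""
    # The text's 3000 Hz upper edge slightly exceeds the Nyquist
    # frequency of a 5513 Hz signal, so clip it to sr / 2.
    f_hi = min(f_hi, sr / 2.0)
    spec = np.abs(np.fft.rfft(frame)) ** 2              # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # 17 bands equally spaced on the Mel scale between f_lo and f_hi.
    edges = inv_mel(np.linspace(mel(f_lo), mel(f_hi), n_bands + 1))
    band_e = np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                       for lo, hi in zip(edges[:-1], edges[1:])])
    bits = (np.diff(band_e) >= thresh).astype(int)      # 16 bits
    return int("".join(map(str, bits)), 2)              # pack to an int
```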
Step 103: build a hash table from the data of all frames of the week's audio; the key of the hash table is the 16-bit feature value, and the value stores the frame numbers having that feature and the segments they belong to. For each non-silent audio segment A, every frame hashes to the neighbouring frames that share its key; based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A. The similarity between A and the candidate matching segments is then computed one by one.
For segment A and one of its candidate matching segments B, list the frames for which matching features can be found in chronological order; let the frame numbers of the frames in A that find a matching pair in B be $a_1, a_2, \ldots, a_m$, let the frame numbers of the frames in B matched by features in A be $b_1, b_2, \ldots, b_n$, and compute two coefficients $s_1$, $s_2$:
$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}$$

$$s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$
where $\chi(\cdot)$ is an indicator controlled by a set threshold $T$, and $N_A$, $N_B$ are the frame counts of the two segments; preferably, $T$ takes the value 3. The similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients with $w_1 < w_2$; preferably $w_1 = 1/3$ and $w_2 = 2/3$. Candidate segments with similarity $S$ greater than the threshold $T_1$ are kept as matching segments of A; preferably $T_1$ is set to 0.5.
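A minimal sketch of the hash-table lookup and the s1/s2 similarity of step 103 follows. The continuity indicator chi is a reconstruction (a matched frame counts when the next matched frame number lies within the set threshold T), since its defining formula is not reproduced in the text, and the helper names are hypothetical.

```python
from collections import defaultdict

def build_hash_table(segments):
    """segments: {seg_id: [fp, fp, ...]}, one 16-bit fingerprint per
    frame. Returns {fingerprint: [(seg_id, frame_idx), ...]}."""
    table = defaultdict(list)
    for seg_id, fps in segments.items():
        for i, fp in enumerate(fps):
            table[fp].append((seg_id, i))
    return table

def chi(hit_frames, T=3):
    # Reconstructed continuity indicator: a matched frame counts when
    # the next matched frame number is within the set threshold T.
    return sum(1 for a, b in zip(hit_frames, hit_frames[1:]) if b - a <= T)

def similarity(fps_a, fps_b, T=3, w1=1/3, w2=2/3):
    """S = w1*s1 + w2*s2 of step 103 (w1 < w2; 1/3 and 2/3 per the text)."""
    set_a, set_b = set(fps_a), set(fps_b)
    a_hits = [i for i, fp in enumerate(fps_a) if fp in set_b]  # a_1..a_m
    b_hits = [i for i, fp in enumerate(fps_b) if fp in set_a]  # b_1..b_n
    denom = 2 * min(len(fps_a), len(fps_b))                    # 2*min(NA, NB)
    s1 = (len(a_hits) + len(b_hits)) / denom
    s2 = (chi(a_hits, T) + chi(b_hits, T)) / denom
    return w1 * s1 + w2 * s2
```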
Step 104: keep each audio segment A whose number of found matching segments exceeds a threshold $T_2$; for the one-week training sample used here, $T_2$ is preferably set to 7. Also judge whether, within a certain time interval of A, other segments find more matching segments than the set threshold $T_2$; if so, the audio clip is kept, otherwise it is deleted. This finally yields a series of audio segments that repeat within the week.
Step 105: using the start and end time information of the clips, splice and fuse the retained audio clips that belong to the same day. The pairwise fusion rule is: for two same-day segments A and B with start and end times Tas, Tae and Tbs, Tbe respectively, where Tae < Tbs, if |Tae − Tbs| < TDur (TDur preferably set to 10 seconds), then A, B, and the interval between them are fused into one segment with start time Tas and end time Tbe.
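The pairwise fusion rule of step 105 reduces to a single pass over the day's segments sorted by start time, as in this sketch (TDur = 10 s per the text):

```python
def fuse_same_day(clips, t_dur=10.0):
    """clips: list of (start, end) times in seconds for one day's
    retained segments. Merges A and B (and the gap between them)
    whenever B starts less than t_dur seconds after A ends."""
    merged = []
    for start, end in sorted(clips):
        if merged and start - merged[-1][1] < t_dur:
            merged[-1][1] = max(merged[-1][1], end)  # absorb B into A
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]
```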
Step 106: classify the fused segments. The classification principle is: if parts of two fused segments form a matched pair, the two segments belong to one class; that is, if some data in segment A and some data in segment B were judged to match in step 104, A and B are put in one class. Classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class.
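The class rule of step 106 is a transitive closure over matched pairs, which union-find computes directly; a sketch under the assumption that segments are indexed 0..n−1:

```python
from collections import defaultdict

def cluster_segments(n, match_pairs):
    """Transitive closure of 'matched' pairs over segments 0..n-1:
    if A~B and B~C then A, B, C end up in one class (union-find)."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in match_pairs:
        parent[find(a)] = find(b)

    classes = defaultdict(list)
    for i in range(n):
        classes[find(i)].append(i)
    return list(classes.values())
```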
Step 107: for the repeated segments of each content class, compute three indexes and judge the program type. The indexes are:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week. The three indexes are fused:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

where $c_1$, $c_2$, $c_3$ are three set weights. If Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
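A sketch of the three indexes and the fused Type score of step 107; the weights c1..c3 and the thresholds T1, T2 are unspecified in the text, so the values below are placeholders, and normalizing the occurrence count T_k like the other two indexes is an added assumption.

```python
import numpy as np

def classify_classes(classes, c=(0.4, 0.3, 0.3), t1=0.3, t2=0.6):
    """classes: list of classes, each a list of (start, end) times in
    seconds from the start of the week. c, t1, t2 are placeholders."""
    avg_len = np.array([np.mean([e - s for s, e in cl]) for cl in classes])
    week_center = 3.5 * 24 * 3600.0          # central time of the week
    var = np.array([np.mean([((s + e) / 2 - week_center) ** 2
                             for s, e in cl]) for cl in classes])
    count = np.array([len(cl) for cl in classes], dtype=float)

    dur = avg_len ** 2 / (avg_len ** 2).max()     # index 1
    distrb = var / var.max()                      # index 2
    t = count / count.max()                       # index 3, normalized here
    score = c[0] * dur + c[1] * distrb + c[2] * t

    return ["program effect" if v < t1
            else "station promo" if v < t2
            else "advertisement" for v in score]
```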
Step 108: after the types are judged, screen the audio clips to build the template library; store the segment features together with the judged program type in the template library to generate template files.
With reference to FIG. 2, in the video chaptering stage of the method, for a new program the system uses the files in the template library to perform copy detection on the program, finds the segments whose content matches a template file, and marks their time and type, in the following steps:
Step 201: by the same method as steps 102 and 103 of the template learning stage, extract features from the audio of the new program and build a hash table; then match the files in the template library against the new video program one by one, as described in steps 202, 203, and 204;
Step 202: for a template audio segment A, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
Step 203: align the features in A in time with the matched features, compute the frame-by-frame Hamming distance between the template file and the part of the program audio that overlaps it in time, and divide the summed distance by the number of overlapping frames to obtain the similarity score;
As shown in FIG. 5, suppose that in step 202 frame 3 of the template audio segment A is found to match frame 6 of the new program's audio file. Frame 3 of A is then aligned in time with frame 6 of the program, and the frame-by-frame Hamming distance between A and the overlapping part of the program is computed, i.e., Hamming distances $h_i$ are computed between frames 1 to m of A and frames 4 to m+3 of the program. The distance score is then

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

where overlap is the number of frames in the overlapping part of the program and the template; in this example overlap equals m, the number of frames of A.
Step 204: take the program audio parts whose score is below a set threshold as candidate matching segments of the template, the one with the smallest score being the best match. Then, for any other candidate segment whose time interval from the best segment is greater than the time-interval threshold (preferably set to 1.2 times the duration of the template segment) and whose score differs from the best score by less than the set score-offset threshold (preferably set to 2), that candidate is still regarded as a matching segment. Mark the start time and duration of each overlapping part and label the part with the template type.
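Steps 202 to 204 can be sketched as follows: candidate alignments come from hash hits, each alignment is scored by the mean per-frame Hamming distance over the overlap, and candidates are kept by the rules of step 204. The score threshold d_max is a placeholder; the 1.2 × template-length interval threshold and the score offset of 2 follow the text.

```python
def hamming16(a, b):
    return bin(a ^ b).count("1")   # differing bits of two 16-bit values

def d_score(template_fps, program_fps, offset):
    """Mean per-frame Hamming distance over the frames the template
    overlaps when its first frame is aligned at program frame offset."""
    overlap = min(len(template_fps), len(program_fps) - offset)
    total = sum(hamming16(template_fps[i], program_fps[offset + i])
                for i in range(overlap))
    return total / overlap

def match_template(template_fps, program_fps, offsets,
                   d_max=3.0, d_off=2.0):
    """offsets: candidate alignments from the hash-table hits.
    Keeps the best-scoring alignment, plus any candidate far enough
    away in time whose score is within d_off of the best."""
    scored = sorted((d_score(template_fps, program_fps, o), o)
                    for o in set(offsets)
                    if 0 <= o < len(program_fps))
    if not scored or scored[0][0] >= d_max:
        return []
    best_score, best_off = scored[0]
    keep = [best_off]
    gap = int(1.2 * len(template_fps))   # time-interval threshold
    for s, o in scored[1:]:
        if (s < d_max and s - best_score < d_off
                and all(abs(o - k) > gap for k in keep)):
            keep.append(o)
    return keep
```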

Claims (1)

1. A method for automatically detecting an audio template and chaptering a video program, characterized in that, taking as its point of leverage the information that specific segments repeat in content within one week, audio templates can be learned quickly and robustly from one week of audio data and used to chapter new programs accurately; the method comprises a template learning stage and a video chaptering stage, the template learning stage comprising the following steps:
firstly, preprocessing a program audio file of a week and judging a mute segment;
step two, extracting robust voiceprint characteristics for each audio segment;
thirdly, establishing a hash table by using the characteristics of the audio data of one week, and searching for a matched segment;
step four, reserving the audio segments A with the number of the matching segments larger than the threshold value in the segments obtained in the step three, and judging whether other segments can find the matching segments with the number larger than the set threshold value within a certain time interval; if yes, the audio clip is reserved, and if not, the audio clip is deleted; finally, a series of audio clips which repeatedly appear on the content in one week are obtained;
step five, in the segments screened in the step four, for two segments A, B on the same day, the starting time of A is Tas, the ending time is Tae, the starting time and the ending time of B are Tbs and Tbe respectively, wherein Tae is less than Tbs, if | Tae-Tbs | < TDur, the segment A, B and the interval part of the two segments are fused into one segment, the starting time is Tas, and the ending time is Tbe;
step six, clustering the fragments fused in step five into a number of audio classes, the classification principle being: if parts of two fused fragments form a matched pair, the two fragments are classified into one class; classes additionally satisfy transitivity: if A and B are in the same class and B and C are in the same class, then A and C are in the same class;
step seven, judging the program type of each class sorted in the step six;
step eight, within each audio class, keeping the longest of the repeated audio clips that form matched pairs, and storing its features together with the judged program type in the template library to generate a template file;
wherein step one specifically comprises: taking the audio data of the past week as training samples and dividing the 5513 Hz audio data into files of 1 hour each; segmenting each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain short audio fragments; to prevent over-fragmentation, clustering the audio fragments, checking the duration of each fragment, and splicing fragments shorter than 3 seconds onto the shorter of their adjacent fragments; then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judging whether each frame is a mute frame, wherein the energy of each frame is $e_{Fr}$ and the energy threshold $T_E$ is determined according to

$$e_{Fr} = \sum_{i \in w} x_i^2 - \mathrm{mean}_W, \qquad T_E = \frac{\sum e}{\alpha \cdot n} + \beta \cdot e_{\min}$$

where $w$ is the number of sampling points in the window, $n$ is the number of frames of the entire file, $x_i$ is the value of each sampling point, and $\alpha$ and $\beta$ are set parameters; if $e_{Fr} \le T_E$, the frame is judged to be a mute frame; if mute frames occupy more than half of an audio segment, the segment is defined as a mute segment;
wherein step two specifically comprises: with a window length of 0.37 s and a step of 40 ms, performing a discrete Fourier transform on the 5513 Hz audio file and, according to the Mel frequency formula $\mathrm{Mel}(f) = 2595 \lg(1 + f/700)$, converting the 20 Hz to 3000 Hz part of the actual frequency band to the Mel band and dividing it equally into 17 sub-bands; computing the energy difference between each pair of adjacent sub-bands; outputting 1 if the difference is greater than or equal to a set threshold and 0 otherwise; extracting the resulting 16-bit binary string as the feature value of each frame of audio data;
wherein step three specifically comprises: building a hash table from the data of all frames of the week's audio, the key of the hash table being the 16-bit feature value and the value storing the frame numbers having that feature and the segments they belong to; every frame of each non-silent audio segment A hashes to the neighbouring frames sharing its key, and, based on each frame's hits and the segment numbers of those neighbouring frames, any audio segment reached by at least half of the frames in A becomes a candidate matching segment of A; then computing the similarity between segment A and the candidate matching segments one by one and keeping the candidates whose similarity exceeds a threshold as the matching segments of A;
in the video chaptering stage, for a new program, the system uses the files in the template library to perform copy detection on the new program, finds the segments whose content matches a template file, and marks their time and type, comprising the following steps:
step one, extracting features from the new audio program and building a hash table, consistent with the methods of step two and step three of the template learning stage;
step two, matching the files in the template library against the new video program one by one; for each template, the 16-bit feature of each frame is hashed in the hash table to find the matching audio features;
step three, calculating a similarity distance score Dscore between the template file and the data of the new program part;
selecting and calibrating a segment matched with the template file from the new program;
the third step of the video chaptering stage specifically comprises: aligning the features of the template file in time with the matched features in the program file, computing the frame-by-frame Hamming distance $h_i$ between the template file and the temporally overlapping part of the program audio, and dividing the summed distance by the number of overlapping frames to obtain the similarity distance score

$$D_{score} = \frac{\sum_i h_i}{\mathrm{overlap}}$$

wherein overlap is the number of frames in the overlapping part of the program and the template;
the fourth step of the video chaptering stage specifically comprises: taking the program audio parts whose score is below a set threshold as candidate matching segments of the template, the segment with the smallest score being the best matching segment; then, if a candidate segment's time interval from the best matching segment is greater than the time-interval threshold and the difference between its similarity distance score $D_{score}$ computed in step three and that of the best matching segment is smaller than the set score-offset threshold, the candidate is still regarded as a matching segment, wherein the time-interval threshold equals 1.2 times the template duration and the score-offset threshold equals 2; marking the start time and duration of the overlapping part and labeling that part of the program with the template type;
in the above method for automatically detecting an audio template and chaptering a video program, the template learning stage is characterized in that the similarity between two segments A and B in step three is judged as follows: for the two segments A and B, the frames for which matching features can be found are listed in chronological order; the frame numbers of the frames in A that find a matching pair in B are $a_1, a_2, \ldots, a_m$, and the frame numbers of the frames in B matched by features in A are $b_1, b_2, \ldots, b_n$; two coefficients $s_1$, $s_2$ are computed:

$$s_1 = \frac{m + n}{2 \cdot \min(N_A, N_B)}, \qquad s_2 = \frac{\sum_{i=1}^{m} \chi(a_i) + \sum_{i=1}^{n} \chi(b_i)}{2 \cdot \min(N_A, N_B)}$$

wherein $\chi(\cdot)$ is an indicator controlled by a set threshold $T$; the similarity of the two segments is computed from $s_1$ and $s_2$ as $S = w_1 \cdot s_1 + w_2 \cdot s_2$, where $w_1$ and $w_2$ are set constant coefficients; candidate segments with similarity $S$ greater than the threshold $T_1$ are kept as matching segments of segment A;
in the above method for automatically detecting an audio template and chaptering video programs, the template learning stage is further characterized in that step seven comprises computing three indexes and judging the program type of each audio class:

Index 1: $\mathrm{Dur} = \dfrac{N_k^2}{\max_{\forall k}(N_k^2)}$

Index 2: $\mathrm{Distrb} = \dfrac{\sigma_k^2}{\max_{\forall k}(\sigma_k^2)}$

Index 3: $T_k$

where $N_k = \frac{1}{n}\sum_{i=1}^{n} t_i$ is the average length of the fragments in class $k$, $n$ is the number of fragments in class $k$, and $t_i$ is the duration of the $i$-th fragment; $\sigma_k^2 = \frac{1}{n}\sum_{i=1}^{n}(C_i - C)^2$ is the temporal distribution of the class-$k$ fragments, $C$ is the central time of the week, and $C_i$ is the central time of fragment $i$ in class $k$; $T_k$ is the number of occurrences of class $k$ in one week; the three indexes are then fused and the program type judged;

the fusion of the three indexes and the judgment of the program type of the template file specifically comprise computing the fusion coefficient Type and comparing it with set thresholds:

$$\mathrm{Type} = c_1 \cdot \mathrm{Dur} + c_2 \cdot \mathrm{Distrb} + c_3 \cdot T$$

wherein $c_1$, $c_2$, $c_3$ are three set weights; if Type < T1, the segment is judged to be a program special effect; if T1 ≤ Type < T2, a station promotional clip; if Type ≥ T2, an advertisement.
CN201010567970.1A 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video Expired - Fee Related CN102024033B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010567970.1A CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Publications (2)

Publication Number Publication Date
CN102024033A CN102024033A (en) 2011-04-20
CN102024033B true CN102024033B (en) 2016-01-20

Family

ID=43865330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010567970.1A Expired - Fee Related CN102024033B (en) 2010-12-01 2010-12-01 Method for automatically detecting an audio template and chaptering video

Country Status (1)

Country Link
CN (1) CN102024033B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103379364B (en) * 2012-04-26 2018-08-03 腾讯科技(深圳)有限公司 Processing method, device, video server and the system of video file
CN103021440B (en) * 2012-11-22 2015-04-22 腾讯科技(深圳)有限公司 Method and system for tracking audio streaming media
CN103237233B (en) * 2013-03-28 2017-01-25 深圳Tcl新技术有限公司 Rapid detection method and system for television commercials
CN104091598A (en) * 2013-04-18 2014-10-08 腾讯科技(深圳)有限公司 Audio file similarity calculation method and device
CN105185401B (en) * 2015-08-28 2019-01-01 广州酷狗计算机科技有限公司 The method and device of synchronized multimedia listed files
CN106548793A (en) * 2015-09-16 2017-03-29 中兴通讯股份有限公司 Storage and the method and apparatus for playing audio file
CN106331844A (en) * 2016-08-17 2017-01-11 北京金山安全软件有限公司 Method and device for generating subtitles of media file and electronic equipment
CN108253977B (en) * 2016-12-28 2020-11-24 沈阳美行科技有限公司 Generation method and generation device of incremental data for updating navigation data
CN107609149B (en) * 2017-09-21 2020-06-19 北京奇艺世纪科技有限公司 Video positioning method and device
CN108513140B (en) * 2018-03-05 2020-10-16 北京明略昭辉科技有限公司 Method for screening repeated advertisement segments in audio and generating wool audio
CN108447501B (en) * 2018-03-27 2020-08-18 中南大学 Pirated video detection method and system based on audio words in cloud storage environment
CN108763492A (en) * 2018-05-29 2018-11-06 四川远鉴科技有限公司 A kind of audio template extracting method and device
CN109087669B (en) * 2018-10-23 2021-03-02 腾讯科技(深圳)有限公司 Audio similarity detection method and device, storage medium and computer equipment
CN109547850B (en) * 2018-11-22 2021-04-06 杭州秋茶网络科技有限公司 Video shooting error correction method and related product
CN110400559B (en) * 2019-06-28 2020-09-29 北京达佳互联信息技术有限公司 Audio synthesis method, device and equipment
CN110717063B (en) * 2019-10-18 2022-02-11 上海华讯网络系统有限公司 Method and system for verifying and selectively archiving IP telephone recording file
CN111883139A (en) * 2020-07-24 2020-11-03 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for screening target voices
CN111863023B (en) * 2020-09-22 2021-01-08 深圳市声扬科技有限公司 Voice detection method and device, computer equipment and storage medium
CN115205635B (en) * 2022-09-13 2022-12-02 有米科技股份有限公司 Weak supervision self-training method and device of image-text semantic alignment model


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101420618A (en) * 2008-12-02 2009-04-29 西安交通大学 Adaptive telescopic video encoding and decoding construction design method based on interest zone
CN101594527A (en) * 2009-06-30 2009-12-02 成都艾索语音技术有限公司 The dual stage process of high Precision Detection template from audio and video streams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Video Segmentation Algorithms Based on Spatio-Temporal Fusion; Li Hong et al.; Signal Processing; 2009-01-31; Vol. 25, No. 1; pp. 72-76 *

Also Published As

Publication number Publication date
CN102024033A (en) 2011-04-20

Similar Documents

Publication Publication Date Title
CN102024033B (en) Method for automatically detecting an audio template and chaptering video
CN102799605B (en) A kind of advertisement detecting method and system
WO2021000909A1 (en) Curriculum optimisation method, apparatus, and system
US11983919B2 (en) Video anomaly detection method based on human-machine cooperation
CN107305541B (en) Method and device for segmenting speech recognition text
Zhang et al. Automatic parsing and indexing of news video
US7765574B1 (en) Automated segmentation and information extraction of broadcast news via finite state presentation model
CN106878632B (en) Video data processing method and device
Snoek et al. Multimedia event-based video indexing using time intervals
Qi et al. Integrating visual, audio and text analysis for news video
CN101821734B (en) Detection and classification of matches between time-based media
CN107515934B (en) Movie semantic personalized tag optimization method based on big data
CN109446376B (en) Method and system for classifying voice through word segmentation
CN106792005B (en) Content detection method based on audio and video combination
CN102436483A (en) Video advertisement detecting method based on explicit type sharing subspace
CN107609149B (en) Video positioning method and device
CN112699787A (en) Method and device for detecting advertisement insertion time point
Hanjalic et al. Semiautomatic news analysis, indexing, and classification system based on topic preselection
CN113194332B (en) Multi-policy-based new advertisement discovery method, electronic device and readable storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
KR101389730B1 (en) Method to create split position accordance with subjects for the video file
CN114048335A (en) Knowledge base-based user interaction method and device
CN111723235A (en) Music content identification method, device and equipment
CN117725194A (en) Personalized pushing method, system, equipment and storage medium for futures data
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160120

Termination date: 20211201