CN102024033B - Method for automatically detecting an audio template and chaptering video - Google Patents
Method for automatically detecting an audio template and chaptering video
- Publication number
- CN102024033B (application CN201010567970.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
A method for automatically detecting audio templates and chaptering video. Using one week of program audio data, it quickly learns segments whose content repeats by means of voiceprint (audio fingerprint) features, then fuses and clusters the segments into candidate templates; statistics on segment duration, number of occurrences, and temporal distribution are used to determine each template's type and to select the template files, and the templates are then used to chapter new programs automatically. Because the invention retrieves on audio and builds the template library dynamically, it overcomes the heavy computation and slow detection of video-based methods, handles program segments whose audio content is identical while their picture content differs, and at the same time solves the problem of "static" templates in the database.
Description
Technical Field
The invention belongs to the field of copy detection of audio contents of video programs and automatic chaptering of the programs, and particularly relates to a method for automatically detecting an audio template and chaptering the video.
Background
The video program is divided into chapters, which means that specific segments (such as advertisements and program special effects) of the video program with large data volume and long duration are marked so as to facilitate browsing of users.
At present, the traditional approach is image-based: features are extracted from the video frames and processed. Station-logo detection and video copy detection are the common techniques.
Video copy detection can indeed use the templates stored in a database to locate and mark segments quickly and accurately, but in current methods those templates are added by hand, so the information in the database is relatively fixed and any data not in the database cannot be detected. Moreover, some program segments share the same audio content while their picture content differs from broadcast to broadcast and lasts a long time, such as the review portion of a news program; common image-based detection methods are not applicable to such segments. As for logo detection, more and more videos use the same logo in portions that should be treated as different chapters (e.g., commercials and programs), which renders the method ineffective.
Video-based methods also suffer from heavy computation and slow detection. Existing audio-based chaptering methods are template-driven detection: templates are defined by hand in a database in advance and the test audio data is compared against them. These methods share the same drawback: the templates in the database are "static", and data not in the database cannot be detected.
Disclosure of Invention
The invention provides a method for automatically detecting an audio template and chaptering videos in order to overcome the defects of the two methods above, video-based detection and template-based audio detection; the method can quickly and robustly learn audio templates from a large volume of audio data and use the templates to chapter new videos accurately.
The invention provides a method for automatically detecting an audio template and chaptering videos, which comprises a template learning stage and a video chaptering stage.
The template learning phase comprises the following steps:
1) using the audio data of the past week as the training sample, preprocess the 7 days (7 x 24 hours) of 5513 Hz audio data: divide the whole 7 x 24 hours of audio into audio files of 1 hour each; segment each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain fragmentary audio segments; to prevent over-fragmentation, cluster the audio segments, check the duration of each segment, and splice any segment shorter than 3 seconds onto the shorter of its adjacent segments; then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judge whether each frame is a silent frame, where the energy of each frame is eFr and the energy threshold TE is determined by:
TE = α · (1/(n·w)) · Σ x_i
where w is the number of sampling points in a window, n is the number of frames of the whole file, x_i is the energy value of each sampling point, and α is a set parameter; that is, TE is α times the mean sample energy of the whole file;
if eFr ≤ TE, the frame is judged to be a silent frame; if silent frames occupy more than half of an audio segment, the segment is defined as a silent segment.
2) with a window length of 0.37 seconds and a step of 40 ms, apply the discrete Fourier transform to the 5513 Hz audio file, and according to the Mel frequency formula
Mel(f) = 2595·lg(1 + f/700)
convert the 20 Hz to 3000 Hz part of the actual frequency band onto the Mel scale and divide it equally into 17 sub-bands; calculate the energy difference between each pair of adjacent sub-bands; output 1 if the difference is greater than or equal to a set threshold and 0 otherwise; the 16 resulting bits form the binary string extracted as the feature value of each frame;
3) build a hash table from the data of all frames of the week's audio, where the key of the hash table is the 16-bit feature value and the value stores the frame number carrying that feature and the segment it belongs to; for every non-silent audio segment A, look up all of its frames in the hash table to find neighbor frames with the same key; based on each frame's lookup result and the segment numbers of those neighbor frames, any segment in which neighbors are found for at least half of the frames of A becomes a candidate matching segment of A; then compute the similarity between segment A and each candidate matching segment one by one: arrange the frames for which matching features are found in the two segments A and B in time order, denote the frame numbers of the frames in A that find a matching pair in B as a_1, a_2, ..., a_m and the frame numbers of the frames in B matched by features in A as b_1, b_2, ..., b_n, and compute 2 coefficients s1, s2 according to the formulas, where T is a set threshold; the similarity of the two segments is then calculated from s1 and s2 as S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients, typically w1 < w2; candidate segments whose similarity S is greater than the threshold T1 are kept as matching segments of segment A.
4) Keeping the audio segments A with the number of the found matching segments larger than a certain threshold T2; judging whether other fragments can find the matching fragments with the number larger than the set threshold value within a certain time interval; if yes, the audio clip is reserved, and if not, the audio clip is deleted; finally, a series of audio segments which repeatedly appear in a week are obtained.
5) Splicing and fusing the reserved audio clips belonging to the same day by using the start and end time information of the clips; the rule of fusion of the fragments in pairs is as follows: for 2 segments A, B on the same day, the starting time of A is Tas, the ending time is Tae, the starting time and the ending time of B are Tbs and Tbe respectively, wherein Tae is less than Tbs, if | Tae-Tbs | < TDur, the segment A, B and the two segment interval parts are fused into a segment, the starting time is Tas, and the ending time is Tbe;
6) classifying the segments after the fusion is completed, wherein the classification principle is as follows: if some of the 2 fused segments are matched segments, the 2 segments are classified into one class; the additional classes also satisfy the criteria: if A and B are the same class and B and C are the same class, then A and C are the same class;
7) for each class of repeated content, calculate 3 indexes and judge the program type, with the following definitions:
index 1: Dur_K = (1/N) Σ t_i, the average duration of the segments in class K, where N is the number of segments in class K and t_i is the duration of the i-th segment;
index 2: Distrb_K = (1/N) Σ (C_i − C)², the temporal distribution of the class K segments, where C is the central time of the week and C_i is the central time of segment i in class K;
index 3: T_K, the number of occurrences of class K in one week;
the 3 indexes are fused as:
Type = c1·Dur + c2·Distrb + c3·T
where c1, c2, and c3 are 3 set weights; if Type < T1, the class is judged to be a program special effect; if T1 ≤ Type < T2, it is judged to be a station promotional clip; if Type ≥ T2, it is judged to be an advertisement.
8) After the type is judged, audio clips are screened to establish a template library; and storing the segment characteristics and the judged program type information into a template library together to generate a template file.
Video chaptering stage:
for a new section of program, the system uses the file in the template library to make copy detection for the new program, finds out the segment with the same content as the template file in the program, and specifies the time and the type, including the following steps:
1) extracting characteristics of the new audio program by using the method in the step 2 and the step 3 of the template generation stage, establishing a hash table and matching files in the template library with the new video program one by one;
2) for a template A, look up the 16-bit feature of each of its frames in the hash table to find matching features;
3) align the features in A with the matched features in time, calculate the frame-by-frame Hamming distance hi between the template file and the part of the program audio that overlaps it in time, and divide the summed distance by the number of overlapping frames to obtain the similarity distance score Dscore = (Σ hi)/overlap, where overlap is the number of frames in the overlapping part of the program and the template.
4) take each program audio part whose score is below a set threshold as a candidate matching segment of the template, and take the segment with the smallest score as the best matching segment; any other candidate segment whose time interval from the best segment is greater than the time-interval threshold and whose score Dscore differs from the best score by less than the set score-offset threshold is also kept as a matching segment; the start time and duration of each overlapping portion are marked, and the portion is labeled with the template's type.
The beneficial effects of the method are as follows: it uses the fact that specific segments repeat in content within one week as its point of leverage, and finds the repeated segments quickly in a large volume of data using voiceprint features; the similarity judgment and the stability of the repetition guarantee the accuracy of the search; the program type of each audio template is judged from the segments' duration, repetition count, and the variance of their distribution in time; in addition, the invention chapters new programs automatically with the learned audio templates, ensuring both speed and accurate temporal positioning; being based on audio retrieval with a dynamically built template library, the invention overcomes the heavy computation and slow detection of video-based methods, handles program segments whose audio content is identical while their picture content differs, and solves the problem of "static" templates in the database.
Drawings
FIG. 1 is a flow chart of a template learning portion of a method for automatically detecting an audio template and video chapters in accordance with the present invention;
FIG. 2 is a flow chart of the video chaptering portion of the method for automatically detecting an audio template and chaptering video according to the present invention;
FIG. 3 is a general architecture of the method and system for automatically detecting audio templates and video chapters according to the present invention;
FIG. 4 is a schematic diagram of window length and step for audio feature extraction;
FIG. 5 is a diagram illustrating the calculation of distance scores between the audio clips of the template and the program audio in the video chapter-dividing stage.
Detailed Description
The invention is further described in the following with reference to the drawings and the detailed description, which will enable those skilled in the art to understand and implement the technical solution proposed by the invention without any creative effort.
The technical problems to be solved by the invention include:
1. the method comprises the following steps of (1) learning a template file from a large amount of data by using past program audio data, and dynamically establishing a template library;
2. dividing an audio file and extracting robust voiceprint features which are beneficial to fast searching and matching;
3. according to the extracted features, the similarity between the two sections of audio segments is matched;
4. clustering audio segments, judging the program type of each audio class and selecting a template file from each audio class;
5. and matching the new program by using the file in the template library, and then carrying out chapter separation on the program.
In view of the above technical problems, the present invention provides a method for automatically detecting an audio template and video chaptering, which includes two stages of template learning and video chaptering.
With reference to fig. 1, the template learning stage of the method for automatically detecting an audio template and video chapters includes the following steps:
Step 101: preferably, the invention takes the program data of the past week as training data and learns the template files from it; every week, new templates are learned from the last week's program data and added to the template library. The 7 days (7 x 24 hours) of 5513 Hz audio data are preprocessed: the whole 7 x 24 hours of audio is divided into audio files of 1 hour each; each 1-hour file is segmented at cut points using the Kullback-Leibler distance of the audio to obtain fragmentary audio segments; to prevent over-fragmentation, the audio segments are clustered, the duration of each segment is checked, and any segment shorter than 3 seconds is spliced onto the shorter of its adjacent segments; then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, each frame is judged to be silent or not, where the energy of each frame is eFr and the energy threshold TE is determined by:
TE = α · (1/(n·w)) · Σ x_i
where w is the number of sampling points in a window, n is the number of frames of the entire file, x_i is the energy value of each sampling point, and α is a set parameter; if eFr ≤ TE, the frame is judged to be a silent frame; if silent frames occupy more than half of an audio segment, the segment is defined as a silent segment.
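The silence decision of step 101 can be sketched as follows. Since the threshold formula appears only as an image in the source, this is a minimal sketch under stated assumptions: eFr is taken as the mean sample energy of a frame, TE as α times the mean sample energy of the whole file, and α = 0.5 is an arbitrary illustrative value.

```python
import numpy as np

def silent_frames(samples: np.ndarray, sr: int = 5513,
                  win_s: float = 0.37, hop_s: float = 0.04,
                  alpha: float = 0.5) -> np.ndarray:
    """Return a boolean mask marking silent frames (True = silent)."""
    win = int(win_s * sr)                    # w: samples per window
    hop = int(hop_s * sr)                    # 40 ms step
    energy = samples.astype(np.float64) ** 2  # x_i: per-sample energy
    starts = range(0, max(1, len(samples) - win + 1), hop)
    # frame energies eFr: mean sample energy inside each 0.37 s window
    efr = np.array([energy[s:s + win].mean() for s in starts])
    te = alpha * energy.mean()               # TE: alpha * file mean energy
    return efr <= te

def is_silent_segment(mask: np.ndarray) -> bool:
    """A segment is silent if silent frames occupy more than half of it."""
    return mask.mean() > 0.5
```

A quiet half followed by a loud half, for instance, yields silent frames only at the start.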
Step 102: the window length is 0.37 second, 40ms is step to carry out discrete Fourier transform on 5513HZ audio file, and according to the Miller frequency formula
Mel(f)=2595lg(1+f/700)
Converting a 20 HZ-3000 HZ part in an actual frequency band into a Mel frequency band and equally dividing the Mel frequency band into 17 character frequency bands; calculating the energy difference between two adjacent frequency bands; if the difference is larger than or equal to the set threshold, the output is 1, otherwise, the output is 0; extracting a binary character string of 16 bits as a characteristic value of each frame;
as shown in fig. 5, frame 1 uses the sampling point data of 0 to 0.37 seconds to perform discrete fourier transform, then converts the 20 HZ-3000 HZ portion in its actual frequency band into the mel frequency band and equally divides it into 17 word frequency bands, and calculates the energy difference between two adjacent frequency bands; if the difference is larger than or equal to the set threshold, the output is 1, otherwise, the output is 0; extracting a binary character string of 16 bits as a characteristic value of the frame 1; the window is then slid by 40ms, i.e. the above steps are repeated with 40ms to 0.41 s sampled data points to extract the 16Bit binary string as the feature value for frame 2, and so on until all audio frames have extracted features.
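The per-frame feature of step 102 is a 16-bit spectral-difference fingerprint. A sketch under stated assumptions: band edges equally spaced on the Mel scale via the formula above; the FFT windowing details and the threshold `thr` are illustrative choices not fixed by the text.

```python
import numpy as np

def mel(f):
    """Mel(f) = 2595 * lg(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_fingerprint(frame: np.ndarray, sr: int = 5513,
                      n_bands: int = 17, thr: float = 0.0) -> int:
    """Map one 0.37 s frame of samples to a 16-bit integer fingerprint."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # 17 bands equally spaced on the Mel scale between 20 Hz and 3000 Hz
    edges = inv_mel(np.linspace(mel(20.0), mel(3000.0), n_bands + 1))
    band_e = [spec[(freqs >= lo) & (freqs < hi)].sum()
              for lo, hi in zip(edges[:-1], edges[1:])]
    bits = 0
    for a, b in zip(band_e[:-1], band_e[1:]):  # 16 adjacent differences
        bits = (bits << 1) | int(b - a >= thr)
    return bits  # fits in 16 bits, usable as a hash-table key
```

The same frame always maps to the same key, which is what makes the hash-table lookup of step 103 possible.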
Step 103: establishing a hash table by using data of all frames of the audio in one week, wherein a keyword key of the hash table is a characteristic value of 16 bits; the value of the hash table stores the frame number with the characteristic value and the position of the fragment; all frames in each non-silent audio fragment A are hashed in the hash table to form adjacent frames with the same key; according to the searching condition of each frame and the number of the audio segment where the adjacent frame is located, searching half of the number of frames in the audio segment A to obtain the audio segment of the adjacent frame as a candidate matching segment of the audio A; and then calculating the similarity between the segment A and the candidate matching segments one by one.
For segment A and one of its candidate matching segments B, the frames for which matching features are found are arranged in time order in each segment; the frame numbers of the frames in A that find a matching pair in B are a_1, a_2, ..., a_m, and the frame numbers of the frames in B matched by features in A are b_1, b_2, ..., b_n. Two coefficients s1, s2 are calculated according to the formulas, where T is a set threshold; preferably, T takes the value 3. The similarity of the two segments is calculated from s1 and s2 as S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients with w1 < w2; preferably, w1 is 1/3 and w2 is 2/3. Candidate segments whose similarity S is greater than the threshold T1 are retained as matching segments of segment A; preferably, the threshold T1 is set to 0.5.
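The s1/s2 formulas appear only as images in the source, so they cannot be reproduced exactly; the sketch below combines them as the text describes (S = w1·s1 + w2·s2 with w1 = 1/3, w2 = 2/3, acceptance threshold T1 = 0.5), using assumed stand-ins for the two coefficients: s1 as match coverage and s2 as temporal-order consistency under the threshold T = 3.

```python
def segment_similarity(a_idx, b_idx, len_a, len_b,
                       T=3, w1=1/3, w2=2/3):
    """Combine two match coefficients into S = w1*s1 + w2*s2.

    a_idx/b_idx: frame numbers of matched pairs in segments A and B,
    in time order; len_a/len_b: total frame counts of A and B.
    s1 and s2 here are illustrative stand-ins: s1 is the fraction of
    frames that found a match, s2 the fraction of consecutive matched
    pairs whose frame-number offsets agree within T.
    """
    if not a_idx:
        return 0.0
    s1 = (len(a_idx) + len(b_idx)) / (len_a + len_b)
    offsets = [b - a for a, b in zip(a_idx, b_idx)]
    if len(offsets) > 1:
        stable = sum(abs(o2 - o1) <= T
                     for o1, o2 in zip(offsets[:-1], offsets[1:]))
        s2 = stable / (len(offsets) - 1)
    else:
        s2 = 1.0
    return w1 * s1 + w2 * s2

def is_match(S, T1=0.5):
    """Keep B as a matching segment of A when S exceeds T1."""
    return S > T1
```

A perfectly aligned pair of segments scores S = 1 and is kept; a segment with no matched frames scores 0.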
Step 104: each audio segment A for which the number of matching segments found exceeds a threshold T2 is retained; for the one week of data used as the training sample, T2 is preferably set to 7. Likewise, it is judged whether each other segment can find more than T2 matching segments within a certain time interval of A; if so, that audio segment is retained, otherwise it is deleted. Finally, a series of audio segments that appear repeatedly within the week is obtained.
Step 105: splicing and fusing the reserved audio clips belonging to the same day by using the start and end time information of the clips; the rule of fusion of the fragments in pairs is as follows: for 2 segments A, B on the same day, the starting time of A is Tas, the ending time is Tae, the starting time and the ending time of B are Tbs and Tbe respectively, wherein Tae is less than Tbs, if | Tae-Tbs | < TDur, preferably TDur is set to 10 seconds, segment A, B and the two-segment interval part are fused into one segment, the starting time is Tas, and the ending time is Tbe;
step 106: classifying the segments after the fusion is completed, wherein the classification principle is as follows: if some of the 2 fused segments are matched segment pairs, the 2 segments are classified into one class, that is, if some of the data in segment a and some of the data in segment B are judged to be matched in step 104, segment a and segment B are classified into one class; the additional classes also satisfy the criteria: if A and B are the same class and B and C are the same class, then A and C are the same class;
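The classification principle of step 106 (matched pairs share a class, and same-class is transitive: A~B and B~C implies A~C) is exactly the transitive closure computed by union-find; a minimal sketch with segments referred to by index:

```python
def cluster(n_segments, match_pairs):
    """Group segment indices 0..n_segments-1 into classes.

    match_pairs: iterable of (i, j) index pairs judged to match.
    Transitivity falls out of the union-find structure.
    """
    parent = list(range(n_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_pairs:
        parent[find(a)] = find(b)          # union the two classes

    classes = {}
    for i in range(n_segments):
        classes.setdefault(find(i), []).append(i)
    return list(classes.values())
```

With pairs (0,1) and (1,2), segments 0, 1, 2 end up in one class and segment 3 in its own.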
Step 107: for each class of repeated content, 3 indexes are calculated and the program type is judged, with the following definitions:
index 1: Dur_K = (1/N) Σ t_i, the average duration of the segments in class K, where N is the number of segments in class K and t_i is the duration of the i-th segment;
index 2: Distrb_K = (1/N) Σ (C_i − C)², the temporal distribution of the class K segments, where C is the central time of the week and C_i is the central time of segment i in class K;
index 3: T_K, the number of occurrences of class K in one week;
the 3 indexes are fused as:
Type = c1·Dur + c2·Distrb + c3·T
where c1, c2, and c3 are 3 set weights; if Type < T1, the class is judged to be a program special effect; if T1 ≤ Type < T2, it is judged to be a station promotional clip; if Type ≥ T2, it is judged to be an advertisement.
Step 108: after the type is judged, audio clips are screened to establish a template library; and storing the segment characteristics and the judged program type information into a template library together to generate a template file.
With reference to fig. 2, in the video chapter-dividing stage of the method for automatically detecting an audio template and video chapters, for a new program, the system uses the files in the template library to perform copy detection on the new program, finds out the segments with the same content as the template file in the program, and specifies the time and the type, including the following steps:
step 201: the same method as described in the step 102, 103 of the template generation phase, the audio of the new program is extracted with features and a hash table is established, and then the files in the template library are matched with the new video program one by one, and the matching work is as described in the following steps 202, 203, 204;
Step 202: for a template audio segment A, the 16-bit feature of each of its frames is looked up in the hash table to obtain the matching audio features;
step 203: aligning the characteristics in the A with the matched characteristics in time, calculating a frame-by-frame Hamming distance between the template file and the time-overlapped program audio part, and dividing the distance by the number of the overlapped parts to obtain a similarity score;
As shown in fig. 5, suppose that in step 202 frame 3 of the template audio segment A is detected as a matching pair with frame 6 of the new program's audio file; frame 3 of A is then aligned in time with frame 6 of the program, and the frame-by-frame Hamming distance between A and the part of the program that overlaps it is calculated, i.e., the Hamming distance hi is computed frame by frame between frames 1 to m of A and frames 4 to m+3 of the program. A distance score Dscore is then calculated from the per-frame Hamming distances: Dscore = (Σ hi)/overlap, where overlap is the number of frames in the overlapping part of the program and the template; in this example, overlap equals the number m of frames of A.
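The score of step 203 can be sketched as below, assuming the per-frame fingerprints are 16-bit integers (as extracted in step 102) and hi is the Hamming distance between corresponding frames at a given alignment offset:

```python
def dscore(template_fp, program_fp, offset):
    """Average per-frame Hamming distance between a template and the
    program frames it overlaps when aligned at frame `offset`.

    template_fp, program_fp: lists of 16-bit integer fingerprints.
    Returns Dscore = (sum of hi) / overlap.
    """
    overlap = min(len(template_fp), len(program_fp) - offset)
    total = 0
    for i in range(overlap):
        # hi: number of differing bits between the two fingerprints
        total += bin(template_fp[i] ^ program_fp[offset + i]).count("1")
    return total / overlap
```

Identical overlapping fingerprints give a score of 0; every differing bit raises the average.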
Step 204: taking the program audio part with the score smaller than a set threshold value as a candidate matching segment of the template, wherein the segment with the smallest score is set as the best matching segment; then, if the time interval between the candidate segment and the best segment is greater than the time interval threshold, preferably, the time interval threshold is set to be 1.2 times the time length of the template segment, and the difference between the score and the best score is less than the set score offset threshold, preferably, the score offset threshold is set to be 2, then the candidate segment is still regarded as the matching segment; the start time and duration of the overlapping portion are marked and the type of the portion is marked with the template type.
Claims (1)
1. A method for automatically detecting an audio template and chaptering a video program, characterized in that, using as its point of leverage the information that specific segments repeat in content within one week, audio templates can be learned quickly and robustly from one week of audio data, and new programs are chaptered accurately using the templates; the method comprises a template learning stage and a video chaptering stage, wherein the template learning stage comprises the following steps:
firstly, preprocessing a program audio file of a week and judging a mute segment;
step two, extracting robust voiceprint characteristics for each audio segment;
thirdly, establishing a hash table by using the characteristics of the audio data of one week, and searching for a matched segment;
step four, reserving the audio segments A with the number of the matching segments larger than the threshold value in the segments obtained in the step three, and judging whether other segments can find the matching segments with the number larger than the set threshold value within a certain time interval; if yes, the audio clip is reserved, and if not, the audio clip is deleted; finally, a series of audio clips which repeatedly appear on the content in one week are obtained;
step five, in the segments screened in the step four, for two segments A, B on the same day, the starting time of A is Tas, the ending time is Tae, the starting time and the ending time of B are Tbs and Tbe respectively, wherein Tae is less than Tbs, if | Tae-Tbs | < TDur, the segment A, B and the interval part of the two segments are fused into one segment, the starting time is Tas, and the ending time is Tbe;
step six, clustering the fragments fused in the step five to obtain a plurality of audio classes, wherein the classification principle is as follows: if some of the two fused fragments are matched fragments, the two fragments are classified into one class; the additional classes also satisfy the criteria: if A and B are the same class and B and C are the same class, then A and C are the same class;
step seven, judging the program type of each class sorted in the step six;
step eight, a section with the longest time is reserved in the repeated audio clips belonging to the matched pair in each type of audio clip, and the characteristics of the clip and the judged program type information are stored into a template library together to generate a template file;
wherein the first step specifically comprises: taking the audio data of the past week as training samples, dividing the 5513 Hz audio data into audio files of 1 hour each; segmenting each 1-hour file at cut points using the Kullback-Leibler distance of the audio to obtain fragmentary audio segments; to prevent over-fragmentation, clustering the audio segments, checking the duration of each segment, and splicing any segment shorter than 3 seconds onto the shorter of its adjacent segments; then, for the 5513 Hz audio file, with a window length of 0.37 s and a step of 40 ms per frame, judging whether each frame is a silent frame, wherein the energy of each frame is eFr and the energy threshold TE is determined by:
TE = α · (1/(n·w)) · Σ x_i
where w is the number of sampling points in a window, n is the number of frames of the entire file, x_i is the energy value of each sampling point, and α is a set parameter; if eFr ≤ TE, the frame is judged to be a silent frame; if silent frames occupy more than half of an audio segment, the segment is defined as a silent segment;
wherein the second step specifically comprises: with a window length of 0.37 seconds and a step of 40 ms, performing the discrete Fourier transform on the 5513 Hz audio file, and according to the Mel frequency formula Mel(f) = 2595·lg(1 + f/700), converting the 20 Hz to 3000 Hz part of the actual frequency band onto the Mel scale and dividing it equally into 17 sub-bands; calculating the energy difference between each pair of adjacent sub-bands; outputting 1 if the difference is greater than or equal to a set threshold and 0 otherwise; and extracting the resulting 16-bit binary string as the feature value of each frame of audio data;
wherein the third step specifically comprises: establishing a hash table by using data of all frames of audio in a week, wherein a key of the hash table is a characteristic value of 16 bits, a value of the hash table stores a frame number with the characteristic value and a position of a segment where the frame number is located, all frames in each non-silent audio segment A hash adjacent frames with the same key in the hash table, and according to the search condition of each frame and the number of the audio segment where the adjacent frame is located, the audio segment of which the number is half of the number of the frames in A is searched to the adjacent frame to serve as a candidate matching segment of the audio A; then calculating the similarity between the segment A and the candidate matching segments one by one, and reserving the candidate segments with the similarity larger than a threshold value as the matching segments of the segment A;
in the video chaptering stage, for a new program, the system uses the files in the template library to perform copy detection on the new program, finds the segments in the program whose content is the same as a template file, and specifies their time and type, comprising the following steps:
step one, consistent with steps two and three of the template learning stage, extracting features from the new audio program and establishing a hash table;
step two, matching the files in the template library against the new video program one by one; for each template, the 16-bit feature of each frame is looked up in the hash table to obtain the audio features that match it;
step three, calculating a similarity distance score Dscore between the template file and the matched portion of the new program;
step four, selecting and marking the segments in the new program that match the template files;
the third step of the video chaptering stage specifically includes: temporally aligning the features of the template file with the features in the program file to which it is matched, calculating the frame-by-frame Hamming distance h_i between the template file and the temporally overlapping portion of the program audio, and dividing the summed distance by the number of overlapping frames to obtain the similarity distance score Dscore = (h_1 + h_2 + … + h_overlap)/overlap, where overlap is the number of frames in the overlapping part of the program and the template;
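A sketch of the Dscore computation, assuming fingerprints are held as 16-bit integers so the Hamming distance is a popcount of the XOR:

```python
def hamming16(a, b):
    """Hamming distance between two 16-bit fingerprints stored as ints."""
    return bin(a ^ b).count('1')

def dscore(template, program_window):
    """Dscore = (1/overlap) * sum_i hamming(template_i, program_i),
    computed over the temporally aligned overlapping frames."""
    overlap = min(len(template), len(program_window))
    total = sum(hamming16(template[i], program_window[i]) for i in range(overlap))
    return total / overlap
```

A lower Dscore means fewer differing fingerprint bits per frame, i.e. a closer match between template and program audio.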
the fourth step of the video chaptering stage specifically includes: taking the program audio parts whose score is smaller than a set threshold as candidate matching segments of the template, with the segment having the smallest score set as the best matching segment; a further candidate segment is also regarded as a matching segment if its time interval from the best matching segment is greater than the time interval threshold and the difference between its similarity distance score Dscore calculated in step three and that of the best matching segment is smaller than the set score offset threshold; the time interval threshold is equal to 1.2 times the template length in time, and the score offset threshold is equal to 2; the start time and duration of the overlapping part are marked, and that part of the program is labelled with the template type;
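The candidate-selection rule can be sketched as below; representing each candidate as a `(start_time_s, dscore)` pair is an assumption for illustration, while the factor 1.2 and the offset 2 come from the claim:

```python
def select_matches(candidates, template_len_s, interval_factor=1.2, score_offset=2):
    """candidates: (start_time_s, dscore) pairs already below the score
    threshold. Keep the best-scoring one; keep any other candidate whose
    distance in time from the best exceeds interval_factor * template
    length and whose score is within score_offset of the best score."""
    best = min(candidates, key=lambda c: c[1])
    interval = interval_factor * template_len_s
    kept = [best]
    for c in candidates:
        if c is best:
            continue
        if abs(c[0] - best[0]) > interval and (c[1] - best[1]) < score_offset:
            kept.append(c)
    return kept
```

The time-interval test prevents overlapping windows around the same occurrence from being counted as separate matches, while the score-offset test admits genuinely repeated occurrences elsewhere in the program.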
in the above method for automatically detecting an audio template and separating a video program, the template learning stage is characterized in that the similarity between the two segments A and B in step three is determined as follows: for the two segments A and B, the frames for which matching features can be found are respectively arranged in chronological order; the frame numbers of the frames in A that find a matching pair in B are a_1, a_2, …, a_m, and the frame numbers of the frames in B that are matched by features in A are b_1, b_2, …, b_n; two coefficients s1 and s2 are calculated according to the formulas, in which t is a set threshold; the similarity of the two segments is then computed from s1 and s2 as
S = w1·s1 + w2·s2, where w1 and w2 are set constant coefficients; the candidate segments whose similarity S is greater than the threshold T1 are retained as matching segments of segment A;
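The patent's formulas for s1 and s2 are not reproduced in this text. The sketch below therefore assumes, purely for illustration, that s1 is the fraction of A's frames that find a match and s2 measures the temporal consistency of the matched frame numbers (consecutive gaps below the threshold t); only the fusion S = w1·s1 + w2·s2 is taken directly from the claim:

```python
def match_coefficients(a_frames, len_a, t=5):
    """a_frames: sorted frame numbers in A that found a match in B.
    s1 (assumed): fraction of A's frames that are matched.
    s2 (assumed): fraction of consecutive matched pairs whose gap < t,
    i.e. how temporally contiguous the matches are."""
    s1 = len(a_frames) / len_a
    gaps = [a_frames[i + 1] - a_frames[i] for i in range(len(a_frames) - 1)]
    s2 = sum(g < t for g in gaps) / len(gaps) if gaps else 0.0
    return s1, s2

def similarity(s1, s2, w1=0.5, w2=0.5):
    # S = w1*s1 + w2*s2, per the claim; w1, w2 here are placeholder weights
    return w1 * s1 + w2 * s2
```

Whatever the exact definitions, the fusion step is a plain weighted sum compared against the threshold T1.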
in the above method for automatically detecting an audio template and separating video programs, the template learning stage further comprises a seventh step of calculating 3 indexes and judging the program type of each audio class:
index 1: Dur, based on N_k, the average length of the segments in class K, N_k = (t_1 + t_2 + … + t_n)/n, where n is the number of segments in class K and t_i is the time length of the i-th segment;
index 2: Distrb, the temporal distribution of the class-K segments, computed from C, the central time of the week, and C_i, the central time of segment i in class K;
index 3: T_k, the number of times the K-th class occurs in one week;
the 3 indexes are then fused and the program type is judged;
the specific operation of fusing the 3 indexes and judging the program type of the template file comprises calculating a fusion coefficient Type and comparing the coefficient with set thresholds:
Type = c1·Dur + c2·Distrb + c3·T
c1, c2 and c3 are 3 set weights; if Type is less than T1, the segment is judged to be a special program effect; if Type is greater than or equal to T1 and less than T2, it is judged to be a television station promotional film; if Type is greater than or equal to T2, it is judged to be an advertisement.
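The fusion and three-way decision can be sketched as follows; the patent sets the weights c1..c3 and thresholds T1, T2 but does not give their numeric values here, so the defaults below are placeholders:

```python
def classify_template(dur, distrb, t_count, c=(1.0, 1.0, 1.0), t1=5.0, t2=10.0):
    """Type = c1*Dur + c2*Distrb + c3*T, then compare against T1 < T2.
    Weights c and thresholds t1, t2 are illustrative placeholders."""
    type_score = c[0] * dur + c[1] * distrb + c[2] * t_count
    if type_score < t1:
        return "special program effect"
    if type_score < t2:
        return "station promotional film"
    return "advertisement"
```

With suitable weights, short frequently repeated segments score high (advertisements) while long, rarely repeated ones score low (program effects).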
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201010567970.1A CN102024033B (en) | 2010-12-01 | 2010-12-01 | A kind of automatic detection audio template also divides the method for chapter to video |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102024033A CN102024033A (en) | 2011-04-20 |
CN102024033B true CN102024033B (en) | 2016-01-20 |
Family
ID=43865330
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201010567970.1A Expired - Fee Related CN102024033B (en) | 2010-12-01 | 2010-12-01 | A kind of automatic detection audio template also divides the method for chapter to video |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102024033B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103379364B (en) * | 2012-04-26 | 2018-08-03 | 腾讯科技(深圳)有限公司 | Processing method, device, video server and the system of video file |
CN103021440B (en) * | 2012-11-22 | 2015-04-22 | 腾讯科技(深圳)有限公司 | Method and system for tracking audio streaming media |
CN103237233B (en) * | 2013-03-28 | 2017-01-25 | 深圳Tcl新技术有限公司 | Rapid detection method and system for television commercials |
CN104091598A (en) * | 2013-04-18 | 2014-10-08 | 腾讯科技(深圳)有限公司 | Audio file similarity calculation method and device |
CN105185401B (en) * | 2015-08-28 | 2019-01-01 | 广州酷狗计算机科技有限公司 | The method and device of synchronized multimedia listed files |
CN106548793A (en) * | 2015-09-16 | 2017-03-29 | 中兴通讯股份有限公司 | Storage and the method and apparatus for playing audio file |
CN106331844A (en) * | 2016-08-17 | 2017-01-11 | 北京金山安全软件有限公司 | Method and device for generating subtitles of media file and electronic equipment |
CN108253977B (en) * | 2016-12-28 | 2020-11-24 | 沈阳美行科技有限公司 | Generation method and generation device of incremental data for updating navigation data |
CN107609149B (en) * | 2017-09-21 | 2020-06-19 | 北京奇艺世纪科技有限公司 | Video positioning method and device |
CN108513140B (en) * | 2018-03-05 | 2020-10-16 | 北京明略昭辉科技有限公司 | Method for screening repeated advertisement segments in audio and generating wool audio |
CN108447501B (en) * | 2018-03-27 | 2020-08-18 | 中南大学 | Pirated video detection method and system based on audio words in cloud storage environment |
CN108763492A (en) * | 2018-05-29 | 2018-11-06 | 四川远鉴科技有限公司 | A kind of audio template extracting method and device |
CN109087669B (en) * | 2018-10-23 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Audio similarity detection method and device, storage medium and computer equipment |
CN109547850B (en) * | 2018-11-22 | 2021-04-06 | 杭州秋茶网络科技有限公司 | Video shooting error correction method and related product |
CN110400559B (en) * | 2019-06-28 | 2020-09-29 | 北京达佳互联信息技术有限公司 | Audio synthesis method, device and equipment |
CN110717063B (en) * | 2019-10-18 | 2022-02-11 | 上海华讯网络系统有限公司 | Method and system for verifying and selectively archiving IP telephone recording file |
CN111883139A (en) * | 2020-07-24 | 2020-11-03 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for screening target voices |
CN111863023B (en) * | 2020-09-22 | 2021-01-08 | 深圳市声扬科技有限公司 | Voice detection method and device, computer equipment and storage medium |
CN115205635B (en) * | 2022-09-13 | 2022-12-02 | 有米科技股份有限公司 | Weak supervision self-training method and device of image-text semantic alignment model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101420618A (en) * | 2008-12-02 | 2009-04-29 | 西安交通大学 | Adaptive telescopic video encoding and decoding construction design method based on interest zone |
CN101594527A (en) * | 2009-06-30 | 2009-12-02 | 成都艾索语音技术有限公司 | The dual stage process of high Precision Detection template from audio and video streams |
Non-Patent Citations (1)
Title |
---|
Research on Video Segmentation Algorithm Based on Spatio-temporal Fusion; Li Hong et al.; Signal Processing; 2009-01-31; Vol. 25, No. 1; pp. 72-76 *
Also Published As
Publication number | Publication date |
---|---|
CN102024033A (en) | 2011-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102024033B (en) | A kind of automatic detection audio template also divides the method for chapter to video | |
CN102799605B (en) | A kind of advertisement detecting method and system | |
WO2021000909A1 (en) | Curriculum optimisation method, apparatus, and system | |
US11983919B2 (en) | Video anomaly detection method based on human-machine cooperation | |
CN107305541B (en) | Method and device for segmenting speech recognition text | |
Zhang et al. | Automatic parsing and indexing of news video | |
US7765574B1 (en) | Automated segmentation and information extraction of broadcast news via finite state presentation model | |
CN106878632B (en) | Video data processing method and device | |
Snoek et al. | Multimedia event-based video indexing using time intervals | |
Qi et al. | Integrating visual, audio and text analysis for news video | |
CN101821734B (en) | Detection and classification of matches between time-based media | |
CN107515934B (en) | Movie semantic personalized tag optimization method based on big data | |
CN109446376B (en) | Method and system for classifying voice through word segmentation | |
CN106792005B (en) | Content detection method based on audio and video combination | |
CN102436483A (en) | Video advertisement detecting method based on explicit type sharing subspace | |
CN107609149B (en) | Video positioning method and device | |
CN112699787A (en) | Method and device for detecting advertisement insertion time point | |
Hanjalic et al. | Semiautomatic news analysis, indexing, and classification system based on topic preselection | |
CN113194332B (en) | Multi-policy-based new advertisement discovery method, electronic device and readable storage medium | |
CN115580758A (en) | Video content generation method and device, electronic equipment and storage medium | |
KR101389730B1 (en) | Method to create split position accordance with subjects for the video file | |
CN114048335A (en) | Knowledge base-based user interaction method and device | |
CN111723235A (en) | Music content identification method, device and equipment | |
CN117725194A (en) | Personalized pushing method, system, equipment and storage medium for futures data | |
Haloi et al. | Unsupervised story segmentation and indexing of broadcast news video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20160120 Termination date: 20211201 |