CN100508587C

CN100508587C - News video retrieval method based on speech classifying identification

Info

Publication number: CN100508587C
Application number: CNB2006100079659A
Authority: CN
Inventors: 彭宇新; 房翠华; 陈晓鸥; 吴於茜
Original assignee: BEIDA FANGZHENG TECHN INST Co Ltd BEIJING; Peking University; Peking University Founder Group Co Ltd
Current assignee: New Founder Holdings Development Co ltd; Peking University; Peking University Founder Research and Development Center
Priority date: 2006-02-24
Filing date: 2006-02-24
Publication date: 2009-07-01
Anticipated expiration: 2026-02-24
Also published as: CN1825936A

Abstract

This invention relates to a news video retrieval method based on speech classification and recognition. The method automatically segments all standard-speech segments from a news video and then recognizes them with a speech recognition system. Because the standard speech conveys the main content of the video, this makes news retrieval from text to video practical.

Description

A news video retrieval method based on speech classification and recognition

Technical field

The invention belongs to the fields of computer speech recognition and video retrieval, and specifically relates to a news video retrieval method based on speech classification and recognition.

Background technology

Speech recognition technology is now widely applied, not only to audio but also to video, because video also contains audio information. If the speech content of a video can be recognized, speech recognition can strongly support video retrieval by enabling retrieval from recognized speech text to video content. Existing video retrieval techniques generally extract low-level features such as color and texture from the video and retrieve on those features. This approach has two problems: (1) people search for video by high-level semantic concepts, such as a football match, the war in Iraq, or bird flu, which are far removed from the low-level features such as color and texture that a computer describes, and the two cannot be reconciled; (2) existing video retrieval methods do not support text-to-video retrieval well, and their query mode does not match the way people habitually search, which makes them inconvenient to use. In existing methods the user generally submits a query shot or query clip, and the system returns results similar to that example, which raises the question of how the user obtains a query example in the first place. Moreover, most users are accustomed to typing query text and having the system return video data relevant to it. For example, a user who enters "the war in Iraq" expects the system to return video related to the war in Iraq. This resembles current search engines such as Google and Baidu, except that the input is text while the retrieved result is video data.

To realize this retrieval from speech text to video content, the textual information in the video must be obtained, and one feasible way to obtain it is to use speech recognition to recognize the spoken words in the video. However, to recognize the speech of different people, existing speech recognition systems often need to be trained by the speaker first and only then can recognize that speaker's speech. This is hard to apply to audio containing many speakers, because it is difficult to get every speaker to train the system. Even for audio with only a few speakers, the speakers often cannot be found for training; for news video, for example, it is impossible to have each speaker train the recognizer. Furthermore, even after training, non-standard speech remains hard to recognize and its recognition rate is very low. If a speech recognition system is instead applied to news video directly, without training, the result is worse still, because a news program generally contains the following kinds of sound: (1) program previews with background music; (2) advertisements; (3) weather forecasts; (4) non-standard speech, such as the dialect of an interviewee; (5) standard speech. Among these, the recognition rate for non-standard speech is very low, and for (1) to (3) it is lower still, so that they are essentially unrecognizable. Consequently, if a speech recognition system is applied indiscriminately to the entire news video, it recognizes every kind of sound the video contains, and the output mixes correct results (mainly from the standard speech in (5)) with wrong results (mainly from the other sounds in (1) to (4)). The computer cannot tell which results are correct and which are wrong, so video retrieval based on this output, for example a query for video matching "the war in Iraq", returns many wrong results.

Summary of the invention

In view of the deficiencies of the prior art, the object of the invention is to propose a news video retrieval method based on speech classification and recognition. The method can automatically recognize standard speech in news video, such as standard Mandarin, without speaker training, and thereby enables news retrieval from text to video.

To achieve this object, the technical solution adopted by the invention is a news video retrieval method based on speech classification and recognition, comprising the following steps:

(1) use an audio classifier to segment the standard-speech segments from the news video, where standard speech means speech with standard pronunciation;

(2) use a speech recognition system to recognize the standard-speech segments of the news video and convert them into text;

(3) retrieve the corresponding video data according to the text obtained in step (2), thereby realizing retrieval from speech text to news video.
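As a rough illustration of step (3), the sketch below searches recognized text for a query string and returns the matching video time spans. The `TranscribedSegment` record, the `search_news` function, the sample data, and the use of plain substring matching are all assumptions made for illustration; the source does not specify how retrieval is implemented.

```python
from dataclasses import dataclass


@dataclass
class TranscribedSegment:
    """Hypothetical record: recognized text of one standard-speech segment."""
    video_id: str
    start_s: float  # segment start within the video, in seconds
    end_s: float    # segment end within the video, in seconds
    text: str       # output of the speech recognition system


def search_news(segments, query):
    """Return the segments whose recognized text contains the query.

    Case-insensitive substring matching is an assumption; a production
    system would more likely use an inverted index or a search engine.
    """
    needle = query.strip().lower()
    if not needle:
        return []
    return [seg for seg in segments if needle in seg.text.lower()]


# Made-up sample data, for illustration only.
segments = [
    TranscribedSegment("news_001", 12.0, 45.5, "Latest reports on the war in Iraq"),
    TranscribedSegment("news_001", 80.0, 95.0, "New bird flu cases were confirmed"),
    TranscribedSegment("news_002", 5.0, 30.0, "Highlights of the football match"),
]

hits = search_news(segments, "the war in Iraq")
for hit in hits:
    print(hit.video_id, hit.start_s, hit.end_s)
```

Because each hit carries a video identifier and a time span, a text query leads directly to a playable position in the video, which is the text-to-video behavior the method aims for.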

Further, the standard speech of the invention is preferably Mandarin with standard pronunciation.

Further, in step (1), audio classification uses a classification model based on support vector machines (SVM) and consists of two parts: classifier model training and classification prediction. The audio feature extracted for classification is a 13-dimensional feature vector composed of log energy and Mel-frequency cepstral coefficients (MFCC).

Further, to improve the effect of the invention, the classifier model training in step (1) proceeds as follows: first select training samples, then extract from each sample the audio features composed of log energy and MFCC, write all of these features into a feature file, and use a support vector machine to generate the classifier model. The training samples cover the following 5 classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech. Classification is performed per frame: each audio frame is assigned one class, the training samples are likewise labeled per frame, and the model is trained on the labeled classes.

Further, to improve the effect of the invention, the classification prediction in step (1) proceeds as follows: for the news video to be classified, extract the audio features composed of log energy and MFCC from the news audio, then label the classes automatically using the classifier model trained with the support vector machine.

Further, to improve the effect of the invention, in step (1) the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification results, one or M isolated frames of a different class (M being a positive integer) suddenly appear within a run of consecutive frames of the same class, these frames are judged to be misclassified, and these orphan frames are relabeled to the class of the surrounding run. The rationale is that a handful of frames of another class are unlikely to occur sporadically inside a run of frames of one class.

Further, to improve the effect of the invention, when correcting the preliminarily segmented standard Mandarin segments of the news video, M is in practice chosen to be less than or equal to 10; that is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of one class, these frames are judged to be misclassified.

The effect of the invention is as follows: compared with existing methods, the invention can automatically recognize the standard speech in news video without speaker training, thereby obtaining text that reflects the most important content of the news video, and then realizes news retrieval from text to video through that text, so that audio analysis and retrieval technology can play its full role in information retrieval.

The invention achieves this effect for the following reason:

The announcer's standard pronunciation in a news video reflects the main content of that video, and existing speech recognition systems can recognize standard speech reasonably well without speaker training. The invention therefore first automatically segments all standard-speech segments from the news video, then recognizes the segmented standard speech with a speech recognition system to obtain text reflecting the main content of the news video, thereby realizing news retrieval from text to video.

Description of drawings

Fig. 1 is a schematic flow chart of the invention.

Embodiment

The present invention is described in further detail below in conjunction with the drawings and specific embodiments.

As shown in Fig. 1, a news video retrieval method based on speech classification and recognition comprises the following steps:

(1) use an audio classifier to segment the standard-speech segments from the news video; in this embodiment, standard Mandarin is used as the example of standard speech;

Audio classification uses a classification model based on support vector machines (SVM) and consists of two parts: classifier model training and classification prediction. The audio feature is a 13-dimensional feature vector composed of log energy and Mel-frequency cepstral coefficients (MFCC).
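The log-energy component of the feature vector can be sketched as below, using the frame size of 512 samples at 22050 Hz that the embodiment states. The non-overlapping framing, the natural logarithm of the summed squared samples, and the energy floor are assumptions, as are the function names. The source does not say how the 13 dimensions are divided between log energy and MFCC, and the MFCC computation itself is omitted here.

```python
import math

SAMPLE_RATE = 22050  # Hz, as stated in the embodiment
FRAME_SIZE = 512     # samples per frame, as stated in the embodiment


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Cut samples into non-overlapping frames (assumption: no overlap).

    A trailing partial frame is dropped.
    """
    count = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(count)]


def log_energy(frame, floor=1e-10):
    """Natural log of the summed squared samples of one frame.

    The floor avoids log(0) on silent frames; its value is an assumption.
    """
    energy = sum(sample * sample for sample in frame)
    return math.log(max(energy, floor))


frame_ms = FRAME_SIZE / SAMPLE_RATE * 1000
print(f"frame duration: {frame_ms:.2f} ms")

# Made-up signal: 1124 samples of constant amplitude 1.0.
frames = split_into_frames([1.0] * 1124)
print(len(frames), [round(log_energy(f), 3) for f in frames])
```

The frame duration works out to roughly 23 ms, which matches the figure quoted in the embodiment.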

In this embodiment, classifier model training proceeds as follows: first select training samples, then extract from each sample the audio features composed of log energy and MFCC, write all of these features into a feature file, and use a support vector machine to generate the classifier model. The training samples cover the following 5 classes: 1) standard Mandarin; 2) music; 3) background noise; 4) silence; 5) non-standard Mandarin. Classification is performed per frame: each audio frame is assigned one class, and the training samples are likewise labeled per frame. However, each audio frame lasts only 23 milliseconds (at a sampling frequency of 22050 Hz, a frame contains 512 samples), so accurate manual labeling at this granularity is not feasible. The invention therefore selects a stretch of audio that, judged by ear, belongs to a single class with no other class clearly present, assigns that class to all frames in the stretch, and trains the model on the classes labeled in this way.
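The stretch-level labeling described above can be sketched as follows: a hand-labeled stretch, given by start and end times, is expanded into one label per frame. The function name `label_frames`, the label strings, the sample stretches, and the choice to label only frames lying entirely inside a stretch are assumptions for illustration.

```python
import math

SAMPLE_RATE = 22050  # Hz, as stated in the embodiment
FRAME_SIZE = 512     # samples per frame, as stated in the embodiment


def label_frames(labeled_stretches, sample_rate=SAMPLE_RATE, frame_size=FRAME_SIZE):
    """Map frame index -> class label for hand-labeled audio stretches.

    Each stretch is (start_s, end_s, label). Only frames lying entirely
    inside a stretch receive its label (an assumption, chosen so that no
    frame straddling a stretch boundary is labeled).
    """
    frame_labels = {}
    for start_s, end_s, label in labeled_stretches:
        first = math.ceil(start_s * sample_rate / frame_size)
        stop = math.floor(end_s * sample_rate / frame_size)
        for index in range(first, stop):
            frame_labels[index] = label
    return frame_labels


# Made-up stretches judged by ear to contain a single class each.
stretches = [
    (0.0, 1.0, "standard_mandarin"),
    (2.0, 3.0, "music"),
]
frame_labels = label_frames(stretches)
print(len(frame_labels), "labeled frames")
```

Frames between the two stretches stay unlabeled, so they would simply be left out of the training set.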

In this embodiment, classification prediction proceeds as follows: for the news video to be classified, extract the audio features composed of log energy and MFCC from the news audio, then label the classes automatically using the classifier model trained with the support vector machine.

In this embodiment, the classification results are corrected as follows. The preliminarily segmented standard Mandarin segments of the news video are corrected because labels are assigned per frame and each frame is only 23 milliseconds long, so the classification results can contain one or M isolated frames of a different class (M being a positive integer) within a run of consecutive frames of the same class. Since a handful of frames of another class are unlikely to occur sporadically inside such a run, this embodiment judges these frames to be misclassified and relabels these orphan frames to the class of the surrounding run. In practice, M can be chosen to be less than or equal to 10; that is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of one class, these frames can be judged to be misclassified.
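A minimal sketch of this correction step is given below. The function name `correct_orphan_frames` and the label strings are hypothetical. The source does not say how long the surrounding run must be, so the sketch assumes that a run counts as an orphan only when it has at most M frames, is flanked on both sides by the same class, and is shorter than both flanking runs.

```python
from itertools import groupby


def correct_orphan_frames(labels, max_orphan=10):
    """Relabel short isolated runs of frames to the surrounding class.

    A run is treated as an orphan when it has at most max_orphan frames,
    both neighboring runs share one class, and both neighbors are longer
    than the run itself (the last condition is an assumption).
    Decisions are made on the original runs, so corrections do not cascade.
    """
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    corrected = []
    for i, (label, length) in enumerate(runs):
        if 0 < i < len(runs) - 1:
            prev_label, prev_len = runs[i - 1]
            next_label, next_len = runs[i + 1]
            if (length <= max_orphan and prev_label == next_label
                    and prev_len > length and next_len > length):
                label = prev_label
        corrected.extend([label] * length)
    return corrected


# Made-up frame labels: two "music" frames inside a standard Mandarin run.
noisy = ["standard_mandarin"] * 5 + ["music"] * 2 + ["standard_mandarin"] * 4
print(correct_orphan_frames(noisy, max_orphan=2))
```

With the default `max_orphan=10` and 23 ms frames, runs of up to about 0.23 s are eligible for correction.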

(2) use a speech recognition system to recognize the standard Mandarin segments of the news video and convert them into text;

Recognition uses existing speech recognition software: the standard Mandarin speech segmented by the invention is fed in as input, and the software then recognizes it.
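Before recognition, the per-frame class labels have to be turned into time spans that can be cut out and passed to the recognizer. The sketch below shows one way this might be done; the function name, the label string `"standard_mandarin"`, and the assumption of non-overlapping 512-sample frames at 22050 Hz are illustrative, and the call to the speech recognition software itself is not shown because the source does not name it.

```python
from itertools import groupby

FRAME_S = 512 / 22050  # seconds per frame, from the embodiment's figures


def standard_speech_segments(labels, target="standard_mandarin", frame_s=FRAME_S):
    """Return (start_s, end_s) spans of consecutive frames labeled target."""
    segments = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label == target:
            segments.append((index * frame_s, (index + length) * frame_s))
        index += length
    return segments


# Made-up frame labels, for illustration only.
labels = (["music"] * 2 + ["standard_mandarin"] * 3
          + ["silence"] + ["standard_mandarin"] * 2)
for start_s, end_s in standard_speech_segments(labels):
    print(f"{start_s:.3f} s to {end_s:.3f} s")
```

Only the returned spans would be handed to the recognizer; music, noise, silence, and non-standard speech are skipped.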

The experimental results below show that, compared with existing methods, the invention not only recognizes the speech of news video without speaker training but also obtains text reflecting the most important content of the news video, thereby providing strong support for text-to-video news retrieval. This demonstrates that the invention achieves good results in news video retrieval based on speech classification and recognition.

This embodiment uses a news program lasting 1 hour 4 minutes 6 seconds, which mainly contains the following sounds: 1) program previews with background music; 2) advertisements; 3) weather forecasts; 4) non-standard Mandarin, such as the dialect of an interviewee; 5) standard Mandarin. The recognition rates for 1) to 4) are very low. Taking this news program as an example, existing speech recognition software achieves a correct recognition rate of only 6.2% on 4) non-standard Mandarin, and the rates for 1) to 3) are lower still, so that they are essentially unrecognizable.

The goal of the invention is to segment the standard Mandarin segments from the news video and then recognize that standard Mandarin with speech recognition software to obtain the main text content of the news video.

1. Segmentation results for standard Mandarin

This embodiment uses the following 2 measures to evaluate how well the invention segments the standard Mandarin in the news video:

Precision = standard Mandarin correctly segmented by the invention / all audio segmented as standard Mandarin by the invention

Recall = standard Mandarin correctly segmented by the invention / all standard Mandarin contained in the news video

Here, the full set of standard Mandarin in the news video was obtained manually by a person listening to the video, while the standard Mandarin segmented by the invention was produced automatically by computer. On this news video, the invention segments standard Mandarin with precision = 99.05% and recall = 96.69%, a good result.
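The two measures can be computed as in the sketch below, which compares automatic labels against manual reference labels. Counting per frame is an assumption (the source does not state the unit of measurement), and the function name, label strings, and sample data are hypothetical and unrelated to the reported figures.

```python
def precision_recall(predicted, reference, target="standard_mandarin"):
    """Precision and recall of the target class over aligned label lists."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    correct = sum(1 for p, r in zip(predicted, reference)
                  if p == target and r == target)
    segmented = sum(1 for p in predicted if p == target)  # found by the method
    present = sum(1 for r in reference if r == target)    # found by listening
    precision = correct / segmented if segmented else 0.0
    recall = correct / present if present else 0.0
    return precision, recall


# Made-up labels for four frames.
predicted = ["standard_mandarin", "standard_mandarin", "standard_mandarin", "music"]
reference = ["standard_mandarin", "standard_mandarin", "music", "standard_mandarin"]
precision, recall = precision_recall(predicted, reference)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```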

2. Recognition results for standard Mandarin

When existing speech recognition software is used to recognize the standard Mandarin segmented by the invention, the correct recognition rate reaches 61.42%. Without the invention, recognizing the whole news program yields a correct recognition rate of only 47.6%. The consequence is that, because the recognition rate for sounds other than standard Mandarin is extremely low, leaving those parts in means the final output mixes correct and wrong recognition results, and the computer cannot tell which are which. Video retrieval on such output, for example a query for video matching "the war in Iraq", then returns many wrong results. With the invention, by contrast, the standard Mandarin is largely segmented out, it includes the announcer's reports, which essentially reflect the main content of the whole news video, and only this standard Mandarin is recognized. The invention can therefore obtain text reflecting the main content of the news video, and retrieval results improve greatly.

The method of the invention is not limited to the examples described in the embodiment. For instance, besides China's standard Mandarin, the language of any other country or region that has a standard pronunciation can be used to realize news retrieval from text to video on the same principle and method. Other implementations derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the invention.

Claims

1. A news video retrieval method based on speech classification and recognition, comprising the following steps:

2. The news video retrieval method based on speech classification and recognition according to claim 1, characterized in that the standard speech is Mandarin with standard pronunciation.

3. The news video retrieval method based on speech classification and recognition according to claim 1, characterized in that, when the audio classifier is used in step (1) to segment the standard-speech segments from the news video, audio classification uses a classification model based on support vector machines and consists of two parts: classifier model training and classification prediction; the audio feature extracted for classification is a 13-dimensional feature vector composed of log energy and Mel-frequency cepstral coefficients.

4. The news video retrieval method based on speech classification and recognition according to claim 3, characterized in that, in step (1), classifier model training proceeds as follows: first select training samples, then extract from each sample the audio features composed of log energy and Mel-frequency cepstral coefficients, write all of these features into a feature file, and use a support vector machine to generate the classifier model; the training samples cover the following 5 classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech; classification is performed per frame, each audio frame is assigned one class, the training samples are likewise labeled per frame, and the model is trained on the labeled classes.

5. The news video retrieval method based on speech classification and recognition according to claim 3 or 4, characterized in that, in step (1), during classification prediction, for the news video to be classified, the audio features composed of log energy and Mel-frequency cepstral coefficients are extracted from the news audio, and the classes are then labeled automatically using the classifier model trained with the support vector machine.

6. The news video retrieval method based on speech classification and recognition according to claim 1, 3 or 4, characterized in that, in step (1), the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification results, one or M isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, M being a positive integer, these frames are judged to be misclassified, and these orphan frames are relabeled to the class of the surrounding run.

7. The news video retrieval method based on speech classification and recognition according to claim 5, characterized in that, in step (1), the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification results, one or M isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, M being a positive integer, these frames are judged to be misclassified, and these orphan frames are relabeled to the class of the surrounding run; the standard speech is Mandarin with standard pronunciation.

8. The news video retrieval method based on speech classification and recognition according to claim 7, characterized in that, when the preliminarily segmented standard-speech segments of the news video are corrected, M is less than or equal to 10; that is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of one class, these frames are judged to be misclassified.