CN100508587C - News video retrieval method based on speech classifying identification - Google Patents

News video retrieval method based on speech classifying identification

Info

Publication number
CN100508587C
Authority
CN
China
Prior art keywords
classification
voice
news video
identification
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2006100079659A
Other languages
Chinese (zh)
Other versions
CN1825936A (en
Inventor
彭宇新
房翠华
陈晓鸥
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Peking University Founder Research and Development Center
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd filed Critical BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Priority to CNB2006100079659A priority Critical patent/CN100508587C/en
Publication of CN1825936A publication Critical patent/CN1825936A/en
Application granted granted Critical
Publication of CN100508587C publication Critical patent/CN100508587C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

This invention relates to a news video retrieval method based on speech classification and recognition. The method automatically segments out all standard-speech segments in a news video and then recognizes them with a speech recognition system. Because standard speech conveys the main content of the video, this makes news retrieval from text to video straightforward.

Description

A news video retrieval method based on speech classification and recognition
Technical field
The invention belongs to the fields of computer speech recognition and video retrieval, and specifically relates to a news video retrieval method based on speech classification and recognition.
Background art
Speech recognition technology is now widely applied. It is used not only in the audio domain but also has important uses in the video domain, because video also contains audio information. If the speech content of a video can be recognized, speech recognition can give strong support to video retrieval by enabling retrieval from speech text to video content. Existing video retrieval techniques generally extract low-level features such as color and texture from the video and then retrieve video according to these features. This approach has two problems: (1) people retrieve video according to high-level human semantics, such as a football match, the Iraq war, or bird flu, which differs greatly from the low-level features such as color and texture that a computer describes, and the two cannot be reconciled; (2) existing video retrieval methods do not support text-to-video retrieval well, and their query mode does not match what users are accustomed to, which makes them inconvenient to use. In existing methods the user generally submits a query shot or query clip to the system, and the system returns results similar to the query example. This raises the question of how the user obtains a query example in the first place. Moreover, most users are accustomed to entering query text and having the system return video data relevant to that text. For example, a user enters the query text "Iraq war" and expects the system to return video data related to the Iraq war. This resembles current search engines such as Google and Baidu, except that the input is text while the retrieved result is video data.
To realize the retrieval from speech text to video content described above, the text information in the video must be obtained, and a feasible way to obtain it is to use speech recognition to recognize the spoken words in the video. However, to recognize the speech of different people, existing speech recognition systems often require each speaker to train the system first, after which the system recognizes that speaker's speech. This approach is hard to apply to audio containing many speakers, because it is difficult to find every speaker to train the system. Even for audio with only a few speakers, it is often impossible to find them for training; for news video, for example, one cannot find each speaker to perform speech training. In addition, even after speaker training, non-standard speech remains difficult to recognize and its recognition rate is very low. If a speech recognition system is applied directly to news video without speaker training, the recognition results are even worse, because a news program generally contains the following kinds of sound: (1) news program previews with background music; (2) advertisements; (3) weather forecasts; (4) non-standard speech, such as the dialect of an interviewee; (5) standard speech. Among these, the recognition rate of non-standard speech is very low, and the recognition rate of (1)-(3) is lower still, to the point that they are essentially unrecognizable. Therefore, if a speech recognition system is applied to the whole news video without distinction, it attempts to recognize every kind of sound the news video contains. The output then mixes correct recognition results (mainly from the standard speech in (5)) with wrong recognition results (mainly from the other sounds in (1) to (4)), and the computer cannot tell which results are correct and which are wrong. When video retrieval is performed on this output, for example when querying for video corresponding to the text "Iraq war", many wrong results appear.
Summary of the invention
In view of the deficiencies of the prior art, the object of the invention is to provide a news video retrieval method based on speech classification and recognition. The method automatically recognizes standard speech in news video, such as standard Mandarin, without requiring speaker training, and thereby realizes news retrieval from text to video.
To achieve this object, the technical solution adopted by the invention is a news video retrieval method based on speech classification and recognition, comprising the following steps:
(1) use an audio classifier to segment out the standard-speech segments in the news video, where standard speech means speech with standard pronunciation;
(2) use a speech recognition system to recognize the standard-speech segments in the news video and convert them into text content;
(3) retrieve the corresponding video data according to the text content obtained in step (2), realizing retrieval from speech text to news video.
Further, the standard speech of the invention is preferably Mandarin with standard pronunciation.
Further, in step (1), audio classification uses a classification model based on a support vector machine (SVM) and is divided into two parts: classifier model training and classification prediction. The audio feature extracted for classification is a 13-dimensional feature vector composed of the log energy and the Mel-scale Frequency Cepstral Coefficients (MFCC).
Further, to make the invention more effective, in step (1) the classifier model is trained as follows: training samples are selected first; the audio features composed of the log energy and the Mel cepstral coefficients are then extracted from each sample and written to a feature file; and the support vector machine is then used to generate the classifier model. The training samples cover the following 5 classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech. Classification is performed frame by frame, with each audio frame assigned one class. The class labels of the training samples are likewise assigned per frame, and the labeled classes are used for model training.
Further, to make the invention more effective, in step (1) classification prediction proceeds as follows: for the news video to be classified, the audio features composed of the log energy and the Mel cepstral coefficients are extracted from the news audio, and the classifier model trained with the support vector machine is then used to label the classes automatically.
Further, to make the invention more effective, in step (1) the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification result, one or M isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, these frames are judged to be misclassified and are corrected to the class of the surrounding run. The reason is that a handful of frames of another class should not appear sporadically inside a run of consecutive frames of the same class.
Further, to make the invention more effective, when correcting the preliminarily segmented standard Mandarin segments of the news video, M is chosen in practice to be less than or equal to 10. That is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of the same class, these frames are judged to be misclassified.
The effect of the invention is as follows: compared with existing methods, the invention automatically recognizes the standard speech in news video without requiring speaker training and thereby obtains the text content that reflects the main content of the news video. News retrieval from text to video is then realized through this text content, making full use of audio analysis and retrieval technology in information retrieval.
The invention achieves this effect for the following reason:
The announcer's standard pronunciation in a news video reflects the main content of that video, and existing speech recognition systems can recognize standard speech reasonably well without speaker training. The invention therefore first segments out all standard-speech segments in the news video automatically, and then uses a speech recognition system to recognize the segmented standard speech. This yields the main text content of the news video and thereby realizes news retrieval from text to video.
Description of drawings
Fig. 1 is a schematic flowchart of the invention.
Embodiment
The invention is described in further detail below with reference to the drawing and a specific embodiment.
As shown in Fig. 1, a news video retrieval method based on speech classification and recognition comprises the following steps:
(1) use an audio classifier to segment out the standard-speech segments in the news video; in this embodiment, standard Mandarin is used as the example of standard speech;
Audio classification uses a classification model based on a support vector machine (SVM) and is divided into two parts: classifier model training and classification prediction. The audio feature is a 13-dimensional feature vector composed of the log energy and the Mel cepstral coefficients (MFCC).
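The feature vector described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not say how the 13 dimensions are split, so the sketch assumes 1 log-energy value plus 12 MFCCs, assumes non-overlapping frames, and takes the MFCC computation as a caller-supplied function (`mfcc_fn` is a hypothetical placeholder, not something named in the source).

```python
import math

FRAME_SIZE = 512     # samples per frame, as stated in the embodiment
SAMPLE_RATE = 22050  # Hz, as stated in the embodiment


def split_into_frames(samples, frame_size=FRAME_SIZE):
    """Split samples into non-overlapping frames (no overlap is an assumption)."""
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def log_energy(frame, floor=1e-10):
    """Natural log of the sum of squared samples; the floor avoids log(0) on silence."""
    return math.log(max(sum(x * x for x in frame), floor))


def frame_features(frame, mfcc_fn):
    """Build one feature vector: log energy followed by the coefficients from mfcc_fn.

    mfcc_fn is a hypothetical callable returning the MFCCs of a frame
    (assumed here to return 12 values, giving 13 dimensions in total).
    """
    return [log_energy(frame)] + list(mfcc_fn(frame))
```

In practice `mfcc_fn` would be provided by a signal-processing library; it is left abstract here because the source does not name one.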
In this embodiment, the classifier model is trained as follows: training samples are selected first; the audio features composed of the log energy and the Mel cepstral coefficients are then extracted from each sample and written to a feature file; and the support vector machine is then used to generate the classifier model. The training samples cover the following 5 classes: 1) standard Mandarin; 2) music; 3) background noise; 4) silence; 5) non-standard Mandarin. Classification is performed frame by frame, with each audio frame assigned one class, and the class labels of the training samples are likewise assigned per frame. However, each audio frame lasts only 23 milliseconds (at a sampling frequency of 22050 Hz, an audio frame contains 512 samples), so accurate manual labeling is not possible at this time scale. The invention therefore selects a stretch of audio that, judged by ear, belongs to a single class and contains no clearly audible content of other classes, and assigns all frames in that stretch to that one class. The labeled classes are then used for model training.
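The labeling scheme above can be illustrated with a short Python sketch. It first checks the frame length arithmetic (512 samples at 22050 Hz is about 23.2 ms) and then expands hand-labeled stretches of audio into per-frame labels. The function names, the class name strings, and the choice to label only frames that lie entirely inside a stretch are assumptions made for illustration.

```python
SAMPLE_RATE = 22050  # Hz, from the embodiment
FRAME_SIZE = 512     # samples per frame, from the embodiment


def frame_duration_ms(frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Duration of one frame in milliseconds: 1000 * 512 / 22050, about 23.2 ms."""
    return 1000.0 * frame_size / sample_rate


def label_frames(segments, frame_size=FRAME_SIZE, sample_rate=SAMPLE_RATE):
    """Expand (start_sec, end_sec, label) stretches into (frame_index, label) pairs.

    Only frames lying entirely inside a stretch are labeled (an assumption),
    so every labeled frame inherits the class judged by ear for that stretch.
    """
    labeled = []
    for start_sec, end_sec, label in segments:
        start_sample = round(start_sec * sample_rate)
        end_sample = round(end_sec * sample_rate)
        first_frame = -(-start_sample // frame_size)  # ceiling division
        end_frame = end_sample // frame_size          # exclusive upper bound
        labeled.extend((index, label) for index in range(first_frame, end_frame))
    return labeled
```

For example, `label_frames([(0.0, 1.0, "standard_mandarin")])` labels the 43 frames that fit completely within the first second of audio.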
In this embodiment, classification prediction proceeds as follows: for the news video to be classified, the audio features composed of the log energy and the Mel cepstral coefficients are extracted from the news audio, and the classifier model trained with the support vector machine is then used to label the classes automatically.
In this embodiment, the classification result is corrected as follows. The preliminarily segmented standard Mandarin segments of the news video are corrected. Because classes are labeled per frame and each frame is 23 milliseconds long, the classification result may contain one or M isolated frames of a different class (M is a positive integer) within a run of consecutive frames of the same class. Since a handful of frames of another class should not appear sporadically inside such a run, this embodiment judges these frames to be misclassified and corrects them to the class of the surrounding run. In practice, M can be chosen to be less than or equal to 10; that is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of the same class, these frames can be judged to be misclassified.
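The correction step can be sketched as a run-length pass over the per-frame labels. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and the requirement that both neighboring runs be longer than M frames is an added assumption used here to decide what counts as a continuous same-class stretch.

```python
from itertools import groupby


def smooth_labels(labels, max_gap=10):
    """Relabel short isolated runs that sit between two runs of the same class.

    A run of at most max_gap frames (M in the text) is treated as misclassified
    when the runs before and after it share one class. Requiring both neighbors
    to be longer than max_gap is an assumption made for this sketch.
    """
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    smoothed = []
    for i, (label, length) in enumerate(runs):
        if 0 < i < len(runs) - 1 and length <= max_gap:
            prev_label, prev_len = runs[i - 1]
            next_label, next_len = runs[i + 1]
            if prev_label == next_label and prev_len > max_gap and next_len > max_gap:
                label = prev_label  # absorb the isolated frames into the surrounding class
        smoothed.extend([label] * length)
    return smoothed
```

With `max_gap=10`, three "music" frames inside a long stretch of "standard_mandarin" are relabeled, while an 11-frame interruption is left untouched.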
(2) use a speech recognition system to recognize the standard Mandarin segments in the news video and convert them into text content;
Existing speech recognition software is used for recognition: the standard Mandarin speech segmented by the invention is input, and the speech recognition software then recognizes it.
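As a rough sketch of how step (2) could be wired up, the Python below turns per-frame class labels into time ranges of standard Mandarin and hands each range to a recognizer. The source only says that existing speech recognition software is used, so `recognize` is a hypothetical callback standing in for that software, and the label string and function names are likewise assumptions.

```python
from itertools import groupby

FRAME_SECONDS = 512 / 22050  # frame length from the embodiment (about 23 ms)


def standard_speech_segments(labels, target="standard_mandarin",
                             frame_seconds=FRAME_SECONDS):
    """Return (start_sec, end_sec) for each run of consecutive target-class frames."""
    segments = []
    index = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label == target:
            segments.append((index * frame_seconds, (index + length) * frame_seconds))
        index += length
    return segments


def transcribe_segments(segments, recognize):
    """Run a recognizer over each segment and keep the time range with the text.

    recognize is a hypothetical callback (start_sec, end_sec) -> str that
    stands in for the existing speech recognition software.
    """
    return [(start, end, recognize(start, end)) for start, end in segments]
```

Keeping the start and end times next to each transcript is what later allows a text hit to be mapped back to a position in the video.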
(3) retrieve the corresponding video data according to the text content obtained in step (2), realizing retrieval from speech text to news video.
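The source does not describe how the retrieval in step (3) is implemented. One common way to realize text-to-video lookup is an inverted index over the recognized transcripts, sketched below in Python under that assumption. All names are hypothetical, and whitespace tokenization is used only to keep the example short; Mandarin text would need a word segmenter instead.

```python
from collections import defaultdict


def build_index(transcripts):
    """Map each term to the ids of the transcript entries that contain it.

    transcripts is a list of (video_id, start_sec, end_sec, text) tuples.
    """
    index = defaultdict(set)
    for entry_id, (_, _, _, text) in enumerate(transcripts):
        for term in text.lower().split():
            index[term].add(entry_id)
    return index


def search(query, index, transcripts):
    """Return (video_id, start_sec, end_sec) for entries containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return [transcripts[i][:3] for i in sorted(hits)]
```

A query such as `search("Iraq war", index, transcripts)` then returns the video ids and time ranges whose recognized text contains both terms.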
The following experimental results show that, compared with existing methods, the invention not only recognizes the speech of news video without speaker training but also obtains the text content that reflects the main content of the news video, thereby providing strong support for text-to-video news retrieval. This demonstrates that the invention achieves good results in news video retrieval based on speech classification and recognition.
This embodiment uses a news program lasting 1 hour 4 minutes 6 seconds, which mainly contains the following sounds: 1) news program previews with background music; 2) advertisements; 3) weather forecasts; 4) non-standard Mandarin, such as the dialect of an interviewee; 5) standard Mandarin. The recognition rate of 1)-4) is very low. Taking this news program as an example and using existing speech recognition software, the correct recognition rate for 4) non-standard Mandarin is only 6.2%, and the recognition rate for 1)-3) is lower still, to the point that they are essentially unrecognizable.
The goal of the invention is to segment out the standard Mandarin segments from the news video and then recognize the standard Mandarin part with speech recognition software to obtain the main text content of the news video.
1. Segmentation results for standard Mandarin
This embodiment uses the following 2 criteria to evaluate the invention's segmentation of the standard Mandarin in the news video:
Precision = standard Mandarin correctly segmented by the invention / all audio segmented as standard Mandarin by the invention
Recall = standard Mandarin correctly segmented by the invention / all standard Mandarin contained in the news video
Here, all standard Mandarin contained in the news video is determined manually by a person listening to the news video, while the invention's segmentation is performed automatically by computer. For this news video, the invention segments standard Mandarin with a precision of 99.05% and a recall of 96.69%, a good result.
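The two criteria can be computed as in the short Python sketch below. The source does not say whether the quantities are measured in frames or in seconds, so counting frames is an assumption, as are the function name and label string. The toy labels in the usage note are made up purely to exercise the formulas and are unrelated to the 99.05% and 96.69% figures reported above.

```python
def precision_recall(predicted, reference, target="standard_mandarin"):
    """Frame-level precision and recall for one class.

    predicted: per-frame labels produced automatically.
    reference: per-frame labels obtained by manual listening.
    """
    true_pos = sum(1 for p, r in zip(predicted, reference) if p == target and r == target)
    n_predicted = sum(1 for p in predicted if p == target)
    n_reference = sum(1 for r in reference if r == target)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_reference if n_reference else 0.0
    return precision, recall
```

For instance, if 7 frames are predicted as standard Mandarin, 6 of them correctly, and the reference contains 8 such frames, the function returns a precision of 6/7 and a recall of 6/8.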
2. Recognition results for standard Mandarin
Using existing speech recognition software to recognize the standard Mandarin segmented by the invention, the correct recognition rate reaches 61.42%. If the invention is not used and the whole news program is recognized, the correct recognition rate is only 47.6%. The consequence is that, because sounds other than standard Mandarin have an extremely low recognition rate, leaving them in means the final recognition result mixes correct and wrong results, and the computer cannot tell which are correct and which are wrong. When video retrieval is performed on such output, for example when querying "Iraq war" to display the corresponding video, many wrong results appear. With the invention, the standard Mandarin is largely segmented out, and standard Mandarin includes the portions read by the announcer, whose content largely reflects the main content of the whole news video; the invention recognizes only the standard Mandarin. In this way, the invention obtains the main text content of the news video, and retrieval results improve greatly.
The method of the invention is not limited to the embodiment described above. For example, standard speech is not limited to the standard Mandarin of China: for any other country or region whose language has a standard pronunciation, news retrieval from text to video can be realized on the same principle and by the same method. Other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of technical innovation of the invention.

Claims (8)

1. A news video retrieval method based on speech classification and recognition, comprising the following steps:
(1) using an audio classifier to segment out the standard-speech segments in the news video, where standard speech means speech with standard pronunciation;
(2) using a speech recognition system to recognize the standard-speech segments in the news video and convert them into text content;
(3) retrieving the corresponding video data according to the text content obtained in step (2), realizing retrieval from speech text to news video.
2. The news video retrieval method based on speech classification and recognition as claimed in claim 1, characterized in that the standard speech is Mandarin with standard pronunciation.
3. The news video retrieval method based on speech classification and recognition as claimed in claim 1, characterized in that, when the audio classifier is used in step (1) to segment out the standard-speech segments in the news video, audio classification uses a classification model based on a support vector machine and is divided into two parts: classifier model training and classification prediction; and the audio feature extracted for classification is a 13-dimensional feature vector composed of the log energy and the Mel cepstral coefficients.
4. The news video retrieval method based on speech classification and recognition as claimed in claim 3, characterized in that, in step (1), the classifier model is trained as follows: training samples are selected first; the audio features composed of the log energy and the Mel cepstral coefficients are then extracted from each sample and written to a feature file; and the support vector machine is then used to generate the classifier model; the training samples cover the following 5 classes: 1) standard speech; 2) music; 3) background noise; 4) silence; 5) non-standard speech; classification is performed frame by frame, with each audio frame assigned one class; the class labels of the training samples are likewise assigned per frame; and the labeled classes are used for model training.
5. The news video retrieval method based on speech classification and recognition as claimed in claim 3 or 4, characterized in that, in step (1), during classification prediction, for the news video to be classified, the audio features composed of the log energy and the Mel cepstral coefficients are extracted from the news audio, and the classifier model trained with the support vector machine is then used to label the classes automatically.
6. The news video retrieval method based on speech classification and recognition as claimed in claim 1, 3 or 4, characterized in that, in step (1), the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification result, one or M isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, these frames are judged to be misclassified and are corrected to the class of the surrounding run.
7. The news video retrieval method based on speech classification and recognition as claimed in claim 5, characterized in that, in step (1), the preliminarily segmented standard-speech segments of the news video are corrected: if, in the classification result, one or M isolated frames of a different class suddenly appear within a run of consecutive frames of the same class, where M is a positive integer, these frames are judged to be misclassified and are corrected to the class of the surrounding run; and the standard speech is Mandarin with standard pronunciation.
8. The news video retrieval method based on speech classification and recognition as claimed in claim 7, characterized in that, when the preliminarily segmented standard-speech segments of the news video are corrected, M is less than or equal to 10; that is, if 10 or fewer frames of a different class appear in the middle of a continuous stretch of audio of the same class, these frames are judged to be misclassified.
CNB2006100079659A 2006-02-24 2006-02-24 News video retrieval method based on speech classifying identification Expired - Fee Related CN100508587C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2006100079659A CN100508587C (en) 2006-02-24 2006-02-24 News video retrieval method based on speech classifying identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2006100079659A CN100508587C (en) 2006-02-24 2006-02-24 News video retrieval method based on speech classifying identification

Publications (2)

Publication Number Publication Date
CN1825936A CN1825936A (en) 2006-08-30
CN100508587C true CN100508587C (en) 2009-07-01

Family

ID=36936335

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2006100079659A Expired - Fee Related CN100508587C (en) 2006-02-24 2006-02-24 News video retrieval method based on speech classifying identification

Country Status (1)

Country Link
CN (1) CN100508587C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI387341B (en) 2008-12-15 2013-02-21 Wistron Corp Television and operating method thereof
CN101764970B (en) * 2008-12-23 2013-08-07 纬创资通股份有限公司 Television and operating method thereof
KR100999655B1 (en) * 2009-05-18 2010-12-13 윤재민 Digital video recorder system and application method thereof
CN104866477B (en) * 2014-02-21 2021-08-17 联想(北京)有限公司 Information processing method and electronic equipment
CN104361010A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Automatic classification method for correcting news classification
CN106231399A (en) * 2016-08-01 2016-12-14 乐视控股(北京)有限公司 Methods of video segmentation, equipment and system
CN108573712B (en) * 2017-03-13 2020-07-28 北京贝塔科技股份有限公司 Voice activity detection model generation method and system and voice activity detection method and system
CN110324726B (en) * 2019-05-29 2022-02-18 北京奇艺世纪科技有限公司 Model generation method, video processing method, model generation device, video processing device, electronic equipment and storage medium
CN112233667B (en) * 2020-12-17 2021-03-23 成都索贝数码科技股份有限公司 Synchronous voice recognition method based on deep learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Research and implementation of iView, a digital video management system with an open architecture. 王炜, 吕荣聪, 武德峰, 张军, 汤大权, 李志强. Journal of National University of Defense Technology, Vol. 25, No. 5. 2003 *
Research on audio classification and segmentation techniques. 白亮. 2005 *

Also Published As

Publication number Publication date
CN1825936A (en) 2006-08-30

Similar Documents

Publication Publication Date Title
CN100508587C (en) News video retrieval method based on speech classifying identification
CN103700370B (en) A kind of radio and television speech recognition system method and system
CN102122506B (en) Method for recognizing voice
US7487094B1 (en) System and method of call classification with context modeling based on composite words
US8630860B1 (en) Speaker and call characteristic sensitive open voice search
CN101477798B (en) Method for analyzing and extracting audio data of set scene
US7337115B2 (en) Systems and methods for providing acoustic classification
US6434520B1 (en) System and method for indexing and querying audio archives
US7983910B2 (en) Communicating across voice and text channels with emotion preservation
Huang et al. Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora
WO2020211354A1 (en) Speaker identity recognition method and device based on speech content, and storage medium
EP2801091B1 (en) Method, apparatus and computer program product for joint use of speech and text-based features for sentiment detection
EP2252995B1 (en) Method and apparatus for voice searching for stored content using uniterm discovery
Jin et al. Event-based Video Retrieval Using Audio.
US20070112567A1 (en) Techiques for model optimization for statistical pattern recognition
CN103164403A (en) Generation method of video indexing data and system
CN103377651A (en) Device and method for automatic voice synthesis
CN101539925A (en) Audio/video file-abstracting method based on attention-degree analysis
CN111028834B (en) Voice message reminding method and device, server and voice message reminding equipment
Kenai et al. A new architecture based VAD for speaker diarization/detection systems
Jeyalakshmi et al. HMM and K-NN based automatic musical instrument recognition
CN107180629B (en) Voice acquisition and recognition method and system
CN113744742B (en) Role identification method, device and system under dialogue scene
Kos et al. Speaker's gender classification and segmentation using spectral and cepstral feature averaging
Abad et al. Automatic classification and transcription of telephone speech in radio broadcast data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220916

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: PEKING University FOUNDER R & D CENTER

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090701