CN102970618A - Video on demand method based on syllable identification - Google Patents

Video on demand method based on syllable identification

Info

Publication number
CN102970618A
CN102970618A (application CN201210486825XA)
Authority
CN
China
Prior art keywords
syllable
video
English
identification
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210486825XA
Other languages
Chinese (zh)
Inventor
吕勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201210486825XA priority Critical patent/CN102970618A/en
Publication of CN102970618A publication Critical patent/CN102970618A/en
Pending legal-status Critical Current

Abstract

The invention discloses a video-on-demand (VOD) method based on syllable recognition, in which the syllable sequence of the speech to be recognized is matched directly against video names, so that no text matching is needed. In the training stage, each video name is decomposed into a syllable sequence and classified by syllable, and an index relation is established between each video name and each of its syllables. In the recognition stage, the speech carrying the video name to be recognized is segmented and recognized as a syllable sequence; all candidate video names are retrieved from the syllable index according to the recognized syllables; and the syllable sequence to be recognized is matched against the syllable sequences of the candidate names and decoded to obtain the requested video name. The method markedly reduces the complexity of a VOD system, allowing it to be implemented on a stand-alone terminal such as an embedded system.

Description

Video-on-demand method based on syllable recognition
Technical field
The present invention relates to a video-on-demand (VOD) method based on syllable recognition, and in particular to a method that takes the syllable as the basic unit and recognizes the syllable sequence of the speech to be recognized directly as a video name. The invention belongs to the field of speech recognition technology.
Background technology
In recent years, with the development of computer technology, signal processing and pattern recognition, speech recognition technology has made marked progress and is widely applied in fields such as communications, medical treatment, consumer electronics, self-service and office automation.
In the field of voice-controlled video on demand, the prevailing approach is to transcribe the spoken command issued by the user into text by natural language processing, match that text one by one against the pre-stored video-name texts, identify the video name selected by the user, and thereby complete the on-demand process.
A natural language processing system based on large-vocabulary continuous speech recognition has a complex structure and a heavy computational load, occupies considerable processor and memory resources, and is difficult to realize on a stand-alone terminal lacking the support of a network server. Moreover, the anti-interference capability of a large-vocabulary continuous speech recognition system is weak and its robustness is poor, so a good recognition result is hard to obtain in a noisy environment such as a KTV room.
The names of audio and video content such as music, songs, films and television programs are chosen rather arbitrarily and do not necessarily follow normal grammatical rules and usage. When a natural language processing method is used to recognize a video name, the resulting text may therefore be inaccurate, which adversely affects the subsequent text matching.
Summary of the invention
Object of the invention: Chinese has a large number of characters and many homophones, yet comparatively few distinct syllables. Exploiting this property, the invention provides a video-on-demand method based on syllable recognition.
Technical scheme: in a VOD method based on syllable recognition, the training stage first performs syllable extraction on each video name in the video library to obtain a video syllable-sequence library, and then carries out Chinese/English model training, syllable model training, syllable context training and syllable index extraction, yielding respectively a Chinese/English recognition model, syllable models, a syllable-sequence context model and a syllable index database. In the recognition stage, the speech to be recognized input by the user is first segmented; Chinese/English name identification then determines whether the name is Chinese or English; syllable recognition is performed with the Chinese or English syllable models to obtain the syllable sequence of the speech; and finally the syllable index database and the syllable-sequence context model are used to match and decode the syllable sequence, obtaining the video name.
The method specifically comprises the following steps:
(1) Establishing the index database: syllables are extracted from the video-name library to obtain the syllable sequence of each video name, and each video name is filed under each of its syllables in the index; since a video name contains several syllables, it may belong to several different syllable entries.
(2) Establishing the Chinese/English recognition model: to distinguish Chinese video names from English ones, Chinese/English identification is performed before syllable recognition.
(3) Establishing the syllable models: for each syllable, a hidden Markov model (HMM) is trained from the training utterances of that syllable.
(4) Establishing the context model of syllable sequences: the naming of audio and video titles is rather free and does not necessarily follow the grammar of natural language, sometimes differing from it considerably; a context model of syllable sequences is therefore extracted from the existing video library and used in the matching and decoding of the syllable sequence to be recognized.
(5) Chinese/English name identification: in the recognition stage, Chinese/English name identification is carried out first, after which Chinese and English names are processed separately.
(6) Syllable recognition: the speech of the video name to be recognized is segmented and recognized as a syllable sequence with the syllable HMMs.
(7) Syllable-sequence matching and decoding: according to each syllable in the sequence to be recognized, all candidate video-name sequences are retrieved from the index database and matched and decoded with the syllable-sequence context model, obtaining the name of the requested video.
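The recognition-stage steps (5)-(7) can be sketched in outline as follows. Every function and variable name here is an illustrative assumption; the patent prescribes no concrete implementation.

```python
# Hypothetical sketch of the recognition pipeline of steps (5)-(7).
# Models are passed in as callables that score a speech segment.

def recognize_video_name(speech_segments, zh_models, en_models,
                         lang_classifier, index_db, context_model):
    """Map an input utterance (already split into per-syllable segments)
    to the most likely video name."""
    # Step (5): decide whether the spoken name is Chinese or English.
    language = lang_classifier(speech_segments)          # "zh" or "en"
    syllable_models = zh_models if language == "zh" else en_models

    # Step (6): recognize each segment as the syllable whose model scores best.
    syllables = [max(syllable_models, key=lambda s: syllable_models[s](seg))
                 for seg in speech_segments]

    # Step (7): gather candidate names from the syllable index, then decode
    # by scoring each candidate with the syllable-sequence context model.
    candidates = set()
    for syl in syllables:
        candidates |= index_db.get(syl, set())
    return max(candidates, key=lambda name: context_model(syllables, name))
```

The sketch treats each model as a black-box scoring function; the patent realizes them as HMMs, GMMs and a statistical context model in the embodiment below.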
For Chinese syllables, an index entry is established for every Chinese syllable. For English syllables, the English syllables in the video-name library are clustered into a number of classes, and each class of English syllables is then modeled.
A Chinese Gaussian mixture model (GMM) is trained from the training utterances of the Chinese syllables, and an English GMM from the training utterances of the English syllables; the Chinese/English GMM models perform language identification on the speech of the video name to be recognized, distinguishing Chinese video names from English ones.
Each syllable of the sequence to be recognized is assigned a weight: the lower the probability of a syllable occurring in the video-name library, the larger its weight, and vice versa. The weight information is recorded in the syllable-sequence context model and used in the matching and decoding of the sequence to be recognized.
According to the matching and decoding result of the syllable sequence to be recognized, besides the video name with the highest probability being output, several alternative results are presented to the user, sorted by output probability, for the user to choose from.
The user may add video content at will; the added video names are automatically filed under their syllable index entries.
Beneficial effects: compared with the prior art, the VOD method based on syllable recognition provided by the invention establishes an index database of video names for each syllable, selects candidate video names from the index database for the speech to be recognized, and matches the syllable sequence to be recognized directly against the syllable sequences of the candidate names to obtain the requested title. Since no speech-to-text conversion is needed, recognition performance and system robustness are improved.
Description of drawings
Fig. 1 is a schematic diagram of an embodiment of the invention;
Fig. 2 is a block diagram of the syllable index database of an embodiment of the invention.
Embodiment
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments serve only to illustrate the invention, not to limit its scope; after reading this disclosure, modifications of various equivalent forms of the invention by those skilled in the art all fall within the scope defined by the appended claims.
As shown in Fig. 1, the VOD method based on syllable recognition mainly comprises a training stage and a recognition stage. The training stage produces the Chinese/English recognition model, the syllable models, the syllable-sequence context model and the syllable index database. In the recognition stage, the speech input by the user undergoes syllable recognition and syllable-sequence matching and decoding against the models and index database obtained in training; the video name is identified and the on-demand process is completed. The implementation of each model in Fig. 1 is described in detail below.
1. Syllable-sequence extraction:
Each Chinese character of a video name in the library is converted into its pinyin syllable, yielding the syllable sequence of that name. For an English video name, the English words are split into syllables to form the sequence. The "video name" here refers to the keyword sequence of the video: not only the title of the video itself but also other key information such as the leading performers.
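A minimal sketch of the character-to-syllable conversion, assuming a tiny hand-made pinyin table; a real system would use a complete table (e.g. from the pypinyin library) and would also handle the English word-to-syllable split:

```python
# Toy character-to-pinyin table; a full system would cover every character
# appearing in the video names, mapped to its toneless Mandarin syllable.
PINYIN = {"泰": "tai", "坦": "tan", "尼": "ni", "克": "ke", "号": "hao"}

def name_to_syllables(video_name):
    """Convert a Chinese video name to its (toneless) syllable sequence,
    skipping characters outside the toy table."""
    return [PINYIN[ch] for ch in video_name if ch in PINYIN]

# "泰坦尼克号" (Titanic) yields the sequence tai-tan-ni-ke-hao.
```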
2. Chinese/English model training:
Chinese and English are modeled separately with Gaussian mixture models (GMM: Gaussian Mixture Model): a Chinese GMM is trained from all training utterances of the Chinese syllables, and an English GMM from all training utterances of the English syllables. Together, the Chinese GMM and the English GMM constitute the Chinese/English recognition model of the system.
3. Syllable model training:
Each syllable is modeled with a hidden Markov model (HMM: Hidden Markov Model), the HMM of a syllable being trained from the training utterances of that syllable. Chinese, tones disregarded, has only 412 syllables, so all of them can be modeled. English has far more syllables, and videos in languages other than Chinese account for only a small fraction of the library, so only the common English syllables that occur in the library are modeled.
4. Syllable context training:
In syllable context training, the contextual information of the video syllable-sequence library is first extracted by statistical methods and used to improve the accuracy of syllable recognition. Each syllable in the index database is then assigned a weight: the more often a syllable occurs in the syllable-sequence library, the smaller its weight, and vice versa. The syllable weights are used in the matching and decoding of syllable sequences.
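The inverse relation between a syllable's frequency and its weight can be realized, for example, with an IDF-style formula; the patent does not specify the exact function, so the logarithmic form below is an assumption for illustration:

```python
import math
from collections import Counter

def syllable_weights(syllable_sequences):
    """Assign each syllable a weight that shrinks as its occurrence count
    in the video syllable-sequence library grows. The log(total/count)
    form is an illustrative assumption, not the patent's formula."""
    counts = Counter(s for seq in syllable_sequences for s in seq)
    total = sum(counts.values())
    return {syl: math.log(total / c) for syl, c in counts.items()}
```

Rare syllables thus carry more evidence during matching than ubiquitous ones.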
5. Syllable index extraction:
According to the syllable information of each video name, the name is filed under the corresponding syllable entries to form the syllable index database. Since a video name generally comprises several syllables, it may appear in several syllable entries of the index. Fig. 2 shows a schematic index with two videos: the syllable sequence of video 1 consists of syllable 1 and syllable 2, and that of video 2 of syllable 2 and syllable 3. Video 1 therefore belongs to the entries of syllable 1 and syllable 2, and video 2 to the entries of syllable 2 and syllable 3. For Chinese, entries can be established for all 412 syllables. For English the syllables are too numerous to index individually, so, on the basis of vowels and voiced sounds, the English syllables in the syllable-sequence library are divided into a number of classes; an entry is established for each class, and every video containing a syllable of that class is filed under the entry.
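The index of Fig. 2 is an ordinary inverted index from syllable to video names; a minimal sketch (all names hypothetical):

```python
from collections import defaultdict

def build_syllable_index(video_syllables):
    """Build an inverted index: syllable -> set of video names containing it.
    video_syllables maps each video name to its syllable sequence."""
    index = defaultdict(set)
    for name, syllables in video_syllables.items():
        for syl in syllables:
            index[syl].add(name)
    return index

# Fig. 2 example: video 1 = (syllable 1, syllable 2),
#                 video 2 = (syllable 2, syllable 3).
```

A newly added video (as the summary allows) is filed simply by re-running the inner loop for its name.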
6. Chinese/English name identification:
In Chinese/English name identification, the speech to be recognized input by the user is first split into a sequence of syllable segments and features are extracted; the features are then input to the Chinese GMM and the English GMM, the probabilities are computed, and the input speech is judged to be Chinese or English according to which probability is larger.
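The likelihood comparison can be sketched with one-dimensional diagonal Gaussian mixtures and hand-set parameters; a real front end would use multi-dimensional acoustic features and trained mixtures, so everything below is an illustrative assumption:

```python
import math

def gmm_loglik(frames, gmm):
    """Total log-likelihood of 1-D feature frames under a Gaussian mixture
    given as [(weight, mean, variance), ...]."""
    total = 0.0
    for x in frames:
        p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in gmm)
        total += math.log(p)
    return total

def identify_language(frames, zh_gmm, en_gmm):
    """Chinese/English decision by comparing total likelihoods."""
    return "zh" if gmm_loglik(frames, zh_gmm) > gmm_loglik(frames, en_gmm) else "en"
```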
7. Syllable recognition:
Once the input speech has been judged Chinese or English, the Chinese or English syllable-model library is invoked accordingly for syllable recognition. For each speech segment of the input sequence, features are extracted and input to the HMM of every syllable in the library; the probabilities are computed, and the segment is assigned to the syllable with the largest probability. Syllable recognition also exploits the contextual information of syllable sequences to improve accuracy and obtain a more reliable syllable sequence.
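The per-segment decision can be sketched with the standard forward algorithm over discrete observations; the patent's syllable HMMs would in practice use continuous emissions over acoustic features, so the discrete version below is only an illustrative stand-in:

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM:
    pi = initial state probabilities, A = transition matrix,
    B = per-state emission probabilities over symbols."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return math.log(sum(alpha))

def recognize_syllable(obs, hmms):
    """Assign the segment to the syllable whose HMM scores highest."""
    return max(hmms, key=lambda syl: forward_loglik(obs, *hmms[syl]))
```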
8. Syllable-sequence matching and decoding:
Once syllable recognition has produced the syllable sequence of the input speech, the sequence can be matched and decoded into a video name. Candidate video names are retrieved from the syllable index database, and the syllable sequence of the input speech is matched against them directly, rather than being converted into a text of Chinese characters or words that would then be text-matched. Direct matching of syllable sequences improves recognition accuracy on the one hand, since matching is performed only over the candidate video names rather than over the whole library one by one; on the other hand it reduces system complexity, favoring implementation on a stand-alone terminal.
After the candidate video names have been drawn from the index entries corresponding to the syllables of the input sequence, the syllable sequence is matched and decoded in the following order:
(1) Number of index entries to which a video belongs: the user's speech contains several syllables, corresponding to several index entries; the more of these entries a video belongs to, the more likely the input speech refers to that video.
(2) Syllable weights: if several candidate videos belong to the same number of entries, the weights of the corresponding entries of each candidate are summed; the larger the sum, the more likely the input refers to that candidate.
(3) Order of the syllable sequence: if several candidates also have equal weight sums, their syllable sequences are compared in order with that of the input speech, and the candidate whose syllable ordering matches best is preferred.
(4) Multiple recognition results: if the three steps above still do not yield a unique candidate, one of the remaining candidates is chosen at random as the recognition result and the others are displayed on screen as alternatives, allowing the user to correct a wrong result quickly. Even when a unique candidate is obtained, a few of the other most likely candidates are also displayed for quick selection.
Once the matching and decoding of the syllable sequence has produced the video name, the video can be played and the on-demand process is complete.
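The decoding priorities (1)-(3) amount to a lexicographic ranking of the candidates; a sketch under illustrative assumptions (the in-order agreement measure used for criterion (3) is one plausible choice, not specified by the patent):

```python
def rank_candidates(input_syllables, name_syllables, weights):
    """Rank candidate video names by the decoding priorities:
    (1) number of matching index entries, (2) summed syllable weights,
    (3) in-order agreement with the input syllable sequence.
    name_syllables maps each candidate name to its syllable sequence."""
    def subsequence_match(cand):
        # Count input syllables found in the candidate in the same order.
        it = iter(cand)
        return sum(1 for s in input_syllables if s in it)

    def key(name):
        cand = name_syllables[name]
        shared = set(input_syllables) & set(cand)
        entries = len(shared)                              # criterion (1)
        weight = sum(weights.get(s, 0.0) for s in shared)  # criterion (2)
        order = subsequence_match(cand)                    # criterion (3)
        return (entries, weight, order)

    return sorted(name_syllables, key=key, reverse=True)
```

The first element of the ranked list plays the role of the output result; the remainder serve as the on-screen alternatives of step (4).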

Claims (9)

1. A video-on-demand method based on syllable recognition, characterized in that the method comprises a training stage and a recognition stage; wherein in the training stage, syllable extraction is first performed on each video name in the video-name library to obtain a video syllable-sequence library, and Chinese/English model training, syllable model training, syllable context training and syllable index extraction are then carried out, yielding respectively a Chinese/English recognition model, syllable models, a syllable-sequence context model and a syllable index database; and in the recognition stage, the speech to be recognized input by the user is segmented, Chinese/English name identification first determines whether the name is Chinese or English, syllable recognition is then performed with the Chinese or English syllable models to obtain the syllable sequence of the speech, and finally the syllable index database and the syllable-sequence context model are used to match and decode the syllable sequence, obtaining the video name.
2. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that syllables are extracted from the video-name library to obtain the syllable sequence of each video name, and each video name is filed under each of its syllables in the index; a video name containing several syllables is filed under several different syllables.
3. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that the hidden Markov model of each syllable is trained from the training utterances of that syllable, completing the establishment of the syllable models.
4. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that the speech of the video name to be recognized is segmented and then recognized as a syllable sequence with the syllable HMMs.
5. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that for Chinese syllables an index entry is established for every Chinese syllable, while the English syllables are clustered into a number of classes and only each class of English syllables in the video-name library is modeled.
6. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that a Chinese Gaussian mixture model is trained from the training utterances of the Chinese syllables and an English GMM from the training utterances of the English syllables, and the Chinese/English GMM models perform language identification on the speech of the video name to be recognized, distinguishing Chinese video names from English ones.
7. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that each syllable of the sequence to be recognized is assigned a weight, a syllable occurring with lower probability in the video-name library receiving a larger weight and vice versa; the weight information is recorded in the syllable-sequence context model and used in the matching and decoding of the sequence to be recognized.
8. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that, according to the matching and decoding result of the syllable sequence to be recognized, besides the video name with the highest probability being output, several alternative results sorted by output probability are provided for the user to choose from.
9. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that when the user adds video content, the added video names are filed under their corresponding syllable index entries.
CN201210486825XA 2012-11-26 2012-11-26 Video on demand method based on syllable identification Pending CN102970618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210486825XA CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210486825XA CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Publications (1)

Publication Number Publication Date
CN102970618A true CN102970618A (en) 2013-03-13

Family

ID=47800433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210486825XA Pending CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Country Status (1)

Country Link
CN (1) CN102970618A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751856A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Voice sentence recognizing method and device
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN109344221A (en) * 2018-08-01 2019-02-15 阿里巴巴集团控股有限公司 Recording document creation method, device and equipment
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111048068A (en) * 2018-10-11 2020-04-21 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101286317A (en) * 2008-05-30 2008-10-15 同济大学 Speech recognition device, model training method and traffic information service platform
CN101415259A (en) * 2007-10-18 2009-04-22 三星电子株式会社 System and method for searching information of embedded equipment based on double-language voice enquiry
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN101826325A (en) * 2010-03-10 2010-09-08 华为终端有限公司 Method and device for identifying Chinese and English speech signal

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101415259A (en) * 2007-10-18 2009-04-22 三星电子株式会社 System and method for searching information of embedded equipment based on double-language voice enquiry
CN101286317A (en) * 2008-05-30 2008-10-15 同济大学 Speech recognition device, model training method and traffic information service platform
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN101826325A (en) * 2010-03-10 2010-09-08 华为终端有限公司 Method and device for identifying Chinese and English speech signal

Cited By (11)

Publication number Priority date Publication date Assignee Title
CN104751856A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Voice sentence recognizing method and device
CN104751856B (en) * 2013-12-31 2017-12-22 中国移动通信集团公司 A kind of speech sentences recognition methods and device
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN105845139B (en) * 2016-05-20 2020-06-16 北方民族大学 Offline voice control method and device
CN109344221A (en) * 2018-08-01 2019-02-15 阿里巴巴集团控股有限公司 Recording document creation method, device and equipment
CN109344221B (en) * 2018-08-01 2021-11-23 创新先进技术有限公司 Recording text generation method, device and equipment
CN111048068A (en) * 2018-10-11 2020-04-21 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN111048068B (en) * 2018-10-11 2023-04-18 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN110675854B (en) * 2019-08-22 2022-10-28 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system

Similar Documents

Publication Publication Date Title
CN107016994B (en) Voice recognition method and device
CN103578464B (en) Language model establishing method, speech recognition method and electronic device
CN105957518B (en) A kind of method of Mongol large vocabulary continuous speech recognition
CN110473523A (en) A kind of audio recognition method, device, storage medium and terminal
CN101309327B (en) Sound chat system, information processing device, speech recognition and key words detection
CN103578467B (en) Acoustic model building method, voice recognition method and electronic device
CN103578465B (en) Speech identifying method and electronic installation
CN106575502B (en) System and method for providing non-lexical cues in synthesized speech
US9589563B2 (en) Speech recognition of partial proper names by natural language processing
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
US20130282359A1 (en) Method and apparatus of translating language using voice recognition
CN110147451B (en) Dialogue command understanding method based on knowledge graph
US11093110B1 (en) Messaging feedback mechanism
WO2009101837A1 (en) Mark insertion device and mark insertion method
CN105632499A (en) Method and device for optimizing voice recognition result
US20220092276A1 (en) Multimodal translation method, apparatus, electronic device and computer-readable storage medium
CN104166462A (en) Input method and system for characters
CN106935239A (en) The construction method and device of a kind of pronunciation dictionary
CN103578471A (en) Speech recognition method and electronic device thereof
CN102970618A (en) Video on demand method based on syllable identification
CN110517668B (en) Chinese and English mixed speech recognition system and method
CN102439660A (en) Voice-tag method and apparatus based on confidence score
CN110782869A (en) Speech synthesis method, apparatus, system and storage medium
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130313