CN102970618A - Video on demand method based on syllable identification - Google Patents

Video on demand method based on syllable identification

Info

Publication number
CN102970618A
CN102970618A (application CN201210486825XA)
Authority
CN
China
Prior art keywords
syllable
video
English
identification
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210486825XA
Other languages
Chinese (zh)
Inventor
吕勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201210486825XA priority Critical patent/CN102970618A/en
Publication of CN102970618A publication Critical patent/CN102970618A/en
Pending legal-status Critical Current

Abstract

The invention discloses a video-on-demand (VOD) method based on syllable recognition, in which the syllable sequence of the speech to be recognized is matched directly against video names, so that no text matching is needed. In the training stage, each video name is decomposed into a syllable sequence and classified by syllable, and an index relation is established between each video name and each of its syllables. In the recognition stage, the speech carrying the video name to be recognized is segmented and recognized as a syllable sequence; all candidate video names are retrieved from the syllable index according to the recognized syllables; and the syllable sequence to be recognized is matched against the syllable sequences of the candidate names and decoded to obtain the requested video name. The method markedly reduces the complexity of a VOD system, allowing it to be implemented on a stand-alone terminal such as an embedded system.

Description

Video-on-demand method based on syllable recognition
Technical field
The present invention relates to a video-on-demand (VOD) method based on syllable recognition, and in particular to a method that takes the syllable as the basic unit and recognizes the syllable sequence of the speech to be recognized directly as a video name. The invention belongs to the field of speech recognition technology.
Background technology
In recent years, with the development of computer technology, signal processing and pattern recognition, speech recognition technology has made marked progress and is widely applied in fields such as communications, medical treatment, consumer electronics, self-service and office automation.
In the field of voice-controlled video on demand, the prevailing approach is to transcribe the spoken command issued by the user into text by natural language processing, match that text one by one against the pre-stored video-name texts, identify the video name selected by the user, and thereby complete the on-demand process.
A natural language processing system based on large-vocabulary continuous speech recognition has a complex structure and a heavy computational load, occupies considerable processor and memory resources, and is difficult to realize on a stand-alone terminal lacking the support of a network server. Moreover, the anti-interference capability of a large-vocabulary continuous speech recognition system is weak and its robustness is poor, so a good recognition result is hard to obtain in a noisy environment such as a KTV room.
The names of audio and video content such as music, songs, films and television programs are chosen rather arbitrarily and do not necessarily follow normal grammatical rules and usage. When a natural language processing method is used to recognize a video name, the resulting text may therefore be inaccurate, which adversely affects the subsequent text matching.
Summary of the invention
Object of the invention: Chinese has a large number of characters and many homophones, yet comparatively few distinct syllables. Exploiting this property, the invention provides a video-on-demand method based on syllable recognition.
Technical scheme: in a VOD method based on syllable recognition, the training stage first performs syllable extraction on each video name in the video library to obtain a video syllable-sequence library, and then carries out Chinese/English model training, syllable model training, syllable context training and syllable index extraction, yielding respectively a Chinese/English recognition model, syllable models, a syllable-sequence context model and a syllable index database. In the recognition stage, the speech to be recognized input by the user is first segmented; Chinese/English name identification then determines whether the name is Chinese or English; syllable recognition is performed with the Chinese or English syllable models to obtain the syllable sequence of the speech; and finally the syllable index database and the syllable-sequence context model are used to match and decode the syllable sequence, obtaining the video name.
The method specifically comprises the following steps:
(1) Establishing the index database: syllables are extracted from the video-name library to obtain the syllable sequence of each video name, and each video name is filed under each of its syllables in the index; since a video name contains several syllables, it may belong to several different syllable entries.
(2) Establishing the Chinese/English recognition model: to distinguish Chinese video names from English ones, Chinese/English identification is performed before syllable recognition.
(3) Establishing the syllable models: for each syllable, a hidden Markov model (HMM) is trained from the training utterances of that syllable.
(4) Establishing the context model of syllable sequences: the naming of audio and video titles is rather free and does not necessarily follow the grammar of natural language, sometimes differing from it considerably; a context model of syllable sequences is therefore extracted from the existing video library and used in the matching and decoding of the syllable sequence to be recognized.
(5) Chinese/English name identification: in the recognition stage, Chinese/English name identification is carried out first, after which Chinese and English names are processed separately.
(6) Syllable recognition: the speech of the video name to be recognized is segmented and recognized as a syllable sequence with the syllable HMMs.
(7) Syllable-sequence matching and decoding: according to each syllable in the sequence to be recognized, all candidate video-name sequences are retrieved from the index database and matched and decoded with the syllable-sequence context model, obtaining the name of the requested video.
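The recognition-stage steps (5)-(7) can be sketched in outline as follows. Every function and variable name here is an illustrative assumption; the patent prescribes no concrete implementation.

```python
# Hypothetical sketch of the recognition pipeline of steps (5)-(7).
# Models are passed in as callables that score a speech segment.

def recognize_video_name(speech_segments, zh_models, en_models,
                         lang_classifier, index_db, context_model):
    """Map an input utterance (already split into per-syllable segments)
    to the most likely video name."""
    # Step (5): decide whether the spoken name is Chinese or English.
    language = lang_classifier(speech_segments)          # "zh" or "en"
    syllable_models = zh_models if language == "zh" else en_models

    # Step (6): recognize each segment as the syllable whose model scores best.
    syllables = [max(syllable_models, key=lambda s: syllable_models[s](seg))
                 for seg in speech_segments]

    # Step (7): gather candidate names from the syllable index, then decode
    # by scoring each candidate with the syllable-sequence context model.
    candidates = set()
    for syl in syllables:
        candidates |= index_db.get(syl, set())
    return max(candidates, key=lambda name: context_model(syllables, name))
```

The sketch treats each model as a black-box scoring function; the patent realizes them as HMMs, GMMs and a statistical context model in the embodiment below.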
For Chinese syllables, an index entry is established for every Chinese syllable. For English syllables, the English syllables in the video-name library are clustered into a number of classes, and each class of English syllables is then modeled.
A Chinese Gaussian mixture model (GMM) is trained from the training utterances of the Chinese syllables, and an English GMM from the training utterances of the English syllables; the Chinese/English GMM models perform language identification on the speech of the video name to be recognized, distinguishing Chinese video names from English ones.
Each syllable of the sequence to be recognized is assigned a weight: the lower the probability of a syllable occurring in the video-name library, the larger its weight, and vice versa. The weight information is recorded in the syllable-sequence context model and used in the matching and decoding of the sequence to be recognized.
According to the matching and decoding result of the syllable sequence to be recognized, besides the video name with the highest probability being output, several alternative results are presented to the user, sorted by output probability, for the user to choose from.
The user may add video content at will; the added video names are automatically filed under their syllable index entries.
Beneficial effects: compared with the prior art, the VOD method based on syllable recognition provided by the invention establishes an index database of video names for each syllable, selects candidate video names from the index database for the speech to be recognized, and matches the syllable sequence to be recognized directly against the syllable sequences of the candidate names to obtain the requested title. Since no speech-to-text conversion is needed, recognition performance and system robustness are improved.
Description of drawings
Fig. 1 is a schematic diagram of an embodiment of the invention;
Fig. 2 is a block diagram of the syllable index database of an embodiment of the invention.
Embodiment
The present invention is further illustrated below in conjunction with specific embodiments. It should be understood that these embodiments serve only to illustrate the invention, not to limit its scope; after reading this disclosure, modifications of various equivalent forms of the invention by those skilled in the art all fall within the scope defined by the appended claims.
As shown in Fig. 1, the VOD method based on syllable recognition mainly comprises a training stage and a recognition stage. The training stage produces the Chinese/English recognition model, the syllable models, the syllable-sequence context model and the syllable index database. In the recognition stage, the speech input by the user undergoes syllable recognition and syllable-sequence matching and decoding against the models and index database obtained in training; the video name is identified and the on-demand process is completed. The implementation of each model in Fig. 1 is described in detail below.
1. Syllable-sequence extraction:
Each Chinese character of a video name in the library is converted into its pinyin syllable, yielding the syllable sequence of that name. For an English video name, the English words are split into syllables to form the sequence. The "video name" here refers to the keyword sequence of the video: not only the title of the video itself but also other key information such as the leading performers.
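A minimal sketch of the character-to-syllable conversion, assuming a tiny hand-made pinyin table; a real system would use a complete table (e.g. from the pypinyin library) and would also handle the English word-to-syllable split:

```python
# Toy character-to-pinyin table; a full system would cover every character
# appearing in the video names, mapped to its toneless Mandarin syllable.
PINYIN = {"泰": "tai", "坦": "tan", "尼": "ni", "克": "ke", "号": "hao"}

def name_to_syllables(video_name):
    """Convert a Chinese video name to its (toneless) syllable sequence,
    skipping characters outside the toy table."""
    return [PINYIN[ch] for ch in video_name if ch in PINYIN]

# "泰坦尼克号" (Titanic) yields the sequence tai-tan-ni-ke-hao.
```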
2. Chinese/English model training:
Chinese and English are modeled separately with Gaussian mixture models (GMM: Gaussian Mixture Model): a Chinese GMM is trained from all training utterances of the Chinese syllables, and an English GMM from all training utterances of the English syllables. Together, the Chinese GMM and the English GMM constitute the Chinese/English recognition model of the system.
3. Syllable model training:
Each syllable is modeled with a hidden Markov model (HMM: Hidden Markov Model), the HMM of a syllable being trained from the training utterances of that syllable. Chinese, tones disregarded, has only 412 syllables, so all of them can be modeled. English has far more syllables, and videos in languages other than Chinese account for only a small fraction of the library, so only the common English syllables that occur in the library are modeled.
4. Syllable context training:
In syllable context training, the contextual information of the video syllable-sequence library is first extracted by statistical methods and used to improve the accuracy of syllable recognition. Each syllable in the index database is then assigned a weight: the more often a syllable occurs in the syllable-sequence library, the smaller its weight, and vice versa. The syllable weights are used in the matching and decoding of syllable sequences.
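The inverse relation between a syllable's frequency and its weight can be realized, for example, with an IDF-style formula; the patent does not specify the exact function, so the logarithmic form below is an assumption for illustration:

```python
import math
from collections import Counter

def syllable_weights(syllable_sequences):
    """Assign each syllable a weight that shrinks as its occurrence count
    in the video syllable-sequence library grows. The log(total/count)
    form is an illustrative assumption, not the patent's formula."""
    counts = Counter(s for seq in syllable_sequences for s in seq)
    total = sum(counts.values())
    return {syl: math.log(total / c) for syl, c in counts.items()}
```

Rare syllables thus carry more evidence during matching than ubiquitous ones.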
5. Syllable index extraction:
According to the syllable information of each video name, the name is filed under the corresponding syllable entries to form the syllable index database. Since a video name generally comprises several syllables, it may appear in several syllable entries of the index. Fig. 2 shows a schematic index with two videos: the syllable sequence of video 1 consists of syllable 1 and syllable 2, and that of video 2 of syllable 2 and syllable 3. Video 1 therefore belongs to the entries of syllable 1 and syllable 2, and video 2 to the entries of syllable 2 and syllable 3. For Chinese, entries can be established for all 412 syllables. For English the syllables are too numerous to index individually, so, on the basis of vowels and voiced sounds, the English syllables in the syllable-sequence library are divided into a number of classes; an entry is established for each class, and every video containing a syllable of that class is filed under the entry.
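The index of Fig. 2 is an ordinary inverted index from syllable to video names; a minimal sketch (all names hypothetical):

```python
from collections import defaultdict

def build_syllable_index(video_syllables):
    """Build an inverted index: syllable -> set of video names containing it.
    video_syllables maps each video name to its syllable sequence."""
    index = defaultdict(set)
    for name, syllables in video_syllables.items():
        for syl in syllables:
            index[syl].add(name)
    return index

# Fig. 2 example: video 1 = (syllable 1, syllable 2),
#                 video 2 = (syllable 2, syllable 3).
```

A newly added video (as the summary allows) is filed simply by re-running the inner loop for its name.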
6. Chinese/English name identification:
In Chinese/English name identification, the speech to be recognized input by the user is first split into a sequence of syllable segments and features are extracted; the features are then input to the Chinese GMM and the English GMM, the probabilities are computed, and the input speech is judged to be Chinese or English according to which probability is larger.
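The likelihood comparison can be sketched with one-dimensional diagonal Gaussian mixtures and hand-set parameters; a real front end would use multi-dimensional acoustic features and trained mixtures, so everything below is an illustrative assumption:

```python
import math

def gmm_loglik(frames, gmm):
    """Total log-likelihood of 1-D feature frames under a Gaussian mixture
    given as [(weight, mean, variance), ...]."""
    total = 0.0
    for x in frames:
        p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in gmm)
        total += math.log(p)
    return total

def identify_language(frames, zh_gmm, en_gmm):
    """Chinese/English decision by comparing total likelihoods."""
    return "zh" if gmm_loglik(frames, zh_gmm) > gmm_loglik(frames, en_gmm) else "en"
```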
7. Syllable recognition:
Once the input speech has been judged Chinese or English, the Chinese or English syllable-model library is invoked accordingly for syllable recognition. For each speech segment of the input sequence, features are extracted and input to the HMM of every syllable in the library; the probabilities are computed, and the segment is assigned to the syllable with the largest probability. Syllable recognition also exploits the contextual information of syllable sequences to improve accuracy and obtain a more reliable syllable sequence.
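The per-segment decision can be sketched with the standard forward algorithm over discrete observations; the patent's syllable HMMs would in practice use continuous emissions over acoustic features, so the discrete version below is only an illustrative stand-in:

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM:
    pi = initial state probabilities, A = transition matrix,
    B = per-state emission probabilities over symbols."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return math.log(sum(alpha))

def recognize_syllable(obs, hmms):
    """Assign the segment to the syllable whose HMM scores highest."""
    return max(hmms, key=lambda syl: forward_loglik(obs, *hmms[syl]))
```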
8. Syllable-sequence matching and decoding:
Once syllable recognition has produced the syllable sequence of the input speech, the sequence can be matched and decoded into a video name. Candidate video names are retrieved from the syllable index database, and the syllable sequence of the input speech is matched against them directly, rather than being converted into a text of Chinese characters or words that would then be text-matched. Direct matching of syllable sequences improves recognition accuracy on the one hand, since matching is performed only over the candidate video names rather than over the whole library one by one; on the other hand it reduces system complexity, favoring implementation on a stand-alone terminal.
After the candidate video names have been drawn from the index entries corresponding to the syllables of the input sequence, the syllable sequence is matched and decoded in the following order:
(1) Number of index entries to which a video belongs: the user's speech contains several syllables, corresponding to several index entries; the more of these entries a video belongs to, the more likely the input speech refers to that video.
(2) Syllable weights: if several candidate videos belong to the same number of entries, the weights of the corresponding entries of each candidate are summed; the larger the sum, the more likely the input refers to that candidate.
(3) Order of the syllable sequence: if several candidates also have equal weight sums, their syllable sequences are compared in order with that of the input speech, and the candidate whose syllable ordering matches best is preferred.
(4) Multiple recognition results: if the three steps above still do not yield a unique candidate, one of the remaining candidates is chosen at random as the recognition result and the others are displayed on screen as alternatives, allowing the user to correct a wrong result quickly. Even when a unique candidate is obtained, a few of the other most likely candidates are also displayed for quick selection.
Once the matching and decoding of the syllable sequence has produced the video name, the video can be played and the on-demand process is complete.
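The decoding priorities (1)-(3) amount to a lexicographic ranking of the candidates; a sketch under illustrative assumptions (the in-order agreement measure used for criterion (3) is one plausible choice, not specified by the patent):

```python
def rank_candidates(input_syllables, name_syllables, weights):
    """Rank candidate video names by the decoding priorities:
    (1) number of matching index entries, (2) summed syllable weights,
    (3) in-order agreement with the input syllable sequence.
    name_syllables maps each candidate name to its syllable sequence."""
    def subsequence_match(cand):
        # Count input syllables found in the candidate in the same order.
        it = iter(cand)
        return sum(1 for s in input_syllables if s in it)

    def key(name):
        cand = name_syllables[name]
        shared = set(input_syllables) & set(cand)
        entries = len(shared)                              # criterion (1)
        weight = sum(weights.get(s, 0.0) for s in shared)  # criterion (2)
        order = subsequence_match(cand)                    # criterion (3)
        return (entries, weight, order)

    return sorted(name_syllables, key=key, reverse=True)
```

The first element of the ranked list plays the role of the output result; the remainder serve as the on-screen alternatives of step (4).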

Claims (9)

1. A video-on-demand method based on syllable recognition, characterized in that the method comprises a training stage and a recognition stage; wherein in the training stage, syllable extraction is first performed on each video name in the video-name library to obtain a video syllable-sequence library, and Chinese/English model training, syllable model training, syllable context training and syllable index extraction are then carried out, yielding respectively a Chinese/English recognition model, syllable models, a syllable-sequence context model and a syllable index database; and in the recognition stage, the speech to be recognized input by the user is segmented, Chinese/English name identification first determines whether the name is Chinese or English, syllable recognition is then performed with the Chinese or English syllable models to obtain the syllable sequence of the speech, and finally the syllable index database and the syllable-sequence context model are used to match and decode the syllable sequence, obtaining the video name.
2. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that syllables are extracted from the video-name library to obtain the syllable sequence of each video name, and each video name is filed under each of its syllables in the index; a video name containing several syllables is filed under several different syllables.
3. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that the hidden Markov model of each syllable is trained from the training utterances of that syllable, completing the establishment of the syllable models.
4. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that the speech of the video name to be recognized is segmented and then recognized as a syllable sequence with the syllable HMMs.
5. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that for Chinese syllables an index entry is established for every Chinese syllable, while the English syllables are clustered into a number of classes and only each class of English syllables in the video-name library is modeled.
6. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that a Chinese Gaussian mixture model is trained from the training utterances of the Chinese syllables and an English GMM from the training utterances of the English syllables, and the Chinese/English GMM models perform language identification on the speech of the video name to be recognized, distinguishing Chinese video names from English ones.
7. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that each syllable of the sequence to be recognized is assigned a weight, a syllable occurring with lower probability in the video-name library receiving a larger weight and vice versa; the weight information is recorded in the syllable-sequence context model and used in the matching and decoding of the sequence to be recognized.
8. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that, according to the matching and decoding result of the syllable sequence to be recognized, besides the video name with the highest probability being output, several alternative results sorted by output probability are provided for the user to choose from.
9. The video-on-demand method based on syllable recognition as claimed in claim 1, characterized in that when the user adds video content, the added video names are filed under their corresponding syllable index entries.
CN201210486825XA 2012-11-26 2012-11-26 Video on demand method based on syllable identification Pending CN102970618A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210486825XA CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210486825XA CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Publications (1)

Publication Number Publication Date
CN102970618A true CN102970618A (en) 2013-03-13

Family

ID=47800433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210486825XA Pending CN102970618A (en) 2012-11-26 2012-11-26 Video on demand method based on syllable identification

Country Status (1)

Country Link
CN (1) CN102970618A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104751856A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Voice sentence recognizing method and device
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN109344221A (en) * 2018-08-01 2019-02-15 阿里巴巴集团控股有限公司 Recording document creation method, device and equipment
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN111048068A (en) * 2018-10-11 2020-04-21 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101286317A (en) * 2008-05-30 2008-10-15 同济大学 Speech recognition device, model training method and traffic information service platform
CN101415259A (en) * 2007-10-18 2009-04-22 三星电子株式会社 System and method for searching information of embedded equipment based on double-language voice enquiry
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN101826325A (en) * 2010-03-10 2010-09-08 华为终端有限公司 Method and device for identifying Chinese and English speech signal

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN101415259A (en) * 2007-10-18 2009-04-22 三星电子株式会社 System and method for searching information of embedded equipment based on double-language voice enquiry
CN101286317A (en) * 2008-05-30 2008-10-15 同济大学 Speech recognition device, model training method and traffic information service platform
CN101604522A (en) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 The embedded Chinese and English mixing voice recognition methods and the system of unspecified person
CN101826325A (en) * 2010-03-10 2010-09-08 华为终端有限公司 Method and device for identifying Chinese and English speech signal

Cited By (11)

Publication number Priority date Publication date Assignee Title
CN104751856A (en) * 2013-12-31 2015-07-01 中国移动通信集团公司 Voice sentence recognizing method and device
CN104751856B (en) * 2013-12-31 2017-12-22 中国移动通信集团公司 A kind of speech sentences recognition methods and device
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN105845139B (en) * 2016-05-20 2020-06-16 北方民族大学 Offline voice control method and device
CN109344221A (en) * 2018-08-01 2019-02-15 阿里巴巴集团控股有限公司 Recording document creation method, device and equipment
CN109344221B (en) * 2018-08-01 2021-11-23 创新先进技术有限公司 Recording text generation method, device and equipment
CN111048068A (en) * 2018-10-11 2020-04-21 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN111048068B (en) * 2018-10-11 2023-04-18 阿里巴巴集团控股有限公司 Voice wake-up method, device and system and electronic equipment
CN110675854A (en) * 2019-08-22 2020-01-10 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN110675854B (en) * 2019-08-22 2022-10-28 厦门快商通科技股份有限公司 Chinese and English mixed speech recognition method and device
CN114079797A (en) * 2020-08-14 2022-02-22 阿里巴巴集团控股有限公司 Live subtitle generation method and device, server, live client and live system

Similar Documents

Publication Publication Date Title
CN107016994B (en) Voice recognition method and device
CN103578464B (en) Language model establishing method, speech recognition method and electronic device
CN105957518B (en) A kind of method of Mongol large vocabulary continuous speech recognition
CN110473523A (en) A kind of audio recognition method, device, storage medium and terminal
CN101309327B (en) Sound chat system, information processing device, speech recognition and key words detection
CN103578467B (en) Acoustic model building method, voice recognition method and electronic device
CN103578465B (en) Speech identifying method and electronic installation
CN106575502B (en) System and method for providing non-lexical cues in synthesized speech
US9589563B2 (en) Speech recognition of partial proper names by natural language processing
CN113205817B (en) Speech semantic recognition method, system, device and medium
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
US20130282359A1 (en) Method and apparatus of translating language using voice recognition
CN110147451B (en) Dialogue command understanding method based on knowledge graph
US11093110B1 (en) Messaging feedback mechanism
WO2009101837A1 (en) Mark insertion device and mark insertion method
CN105632499A (en) Method and device for optimizing voice recognition result
US20220092276A1 (en) Multimodal translation method, apparatus, electronic device and computer-readable storage medium
CN104166462A (en) Input method and system for characters
CN106935239A (en) The construction method and device of a kind of pronunciation dictionary
CN103578471A (en) Speech recognition method and electronic device thereof
CN102970618A (en) Video on demand method based on syllable identification
CN110517668B (en) Chinese and English mixed speech recognition system and method
CN102439660A (en) Voice-tag method and apparatus based on confidence score
CN110782869A (en) Speech synthesis method, apparatus, system and storage medium
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130313