CN101320561A

CN101320561A - Method and module for improving individual speech recognition rate

Info

Publication number: CN101320561A
Application number: CNA2007101098914A
Authority: CN
Inventors: 徐志文; 高鸿宗; 刘进荣; 何泰轩
Original assignee: Cyberon Corp
Current assignee: Cyberon Corp
Priority date: 2007-06-05
Filing date: 2007-06-05
Publication date: 2008-12-10

Abstract

The present invention relates to a method and a module for improving an individual's speech recognition rate. The module can be used in a portable electronic device. The portable electronic device has a preset recognition model, which is built from a phoneme model and is used to recognize at least one voice command uttered by a user. The method comprises the steps of building a specific text database related to the text corresponding to the voice command, capturing a plurality of speech data uttered by the user according to the text database to build an adjustment parameter, and integrating the phoneme model with the adjustment parameter to adjust the recognition model. Through these steps, the user can effectively adjust the recognition model and thereby improve the individual speech recognition rate.

Description

Method and module for improving individual speech recognition rate

Technical field

The present invention relates to a method and a module for improving an individual's speech recognition rate. More specifically, it relates to a module, and a corresponding method, for improving the individual speech recognition rate of a portable electronic device.

Background technology

With the arrival of the digital age, people interact with portable electronic products more and more often, yet the control interfaces of today's portable electronic products increasingly fail to meet users' needs. Speech is the most natural means of communication in daily life. If people could issue commands to a portable electronic product directly by voice, its control interface would be more readily accepted by users, the product would be more convenient to operate, and its added value would increase significantly.

For example, a mobile phone with a speech recognition function has a preset recognition model that is built from phoneme models. Using this recognition model, the phone can recognize at least one voice command uttered by a user. The preset recognition model is independent of the user, meaning that the user can use speech recognition without recording any speech in advance. However, such a recognition model cannot account for the voice characteristics of a specific user: when the user's voice differs greatly from the preset recognition model, the recognition rate drops.

The hidden Markov model (hereinafter HMM) is a speech model commonly used in the field of speech recognition, and it is used to form phoneme models. An HMM speech model treats each input (for example, a speech signal) as the output of a probabilistic generative model. Each index (for example, a character or a word) has its own probability distribution; to determine which index a given utterance corresponds to, all indexes are queried and the decision is based on how likely each one is to have produced the utterance. To make speech recognition more accurate, speech data must be used to adjust the HMM speech model, so that this adaptation allows it to recognize the speech signals of different users.
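The selection rule described above, scoring an utterance against the model of every index and keeping the most likely one, can be sketched as follows. This is a minimal illustration rather than the patent's implementation: it assumes discrete observation symbols (real recognizers score continuous feature vectors), and the names forward_likelihood and recognize, as well as the toy models, are hypothetical.

```python
def forward_likelihood(init, trans, emit, observations):
    """Return P(observations | model) for a discrete HMM (forward algorithm).

    init[s]     : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s][o]  : probability that state s emits symbol o
    observations: non-empty list of symbol indexes
    """
    n = len(init)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [init[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][obs]
            for s in range(n)
        ]
    return sum(alpha)


def recognize(models, observations):
    """Return the index whose model assigns the highest likelihood."""
    return max(
        models,
        key=lambda name: forward_likelihood(*models[name], observations),
    )


# Hypothetical single-state toy models over the symbols 0 and 1.
TOY_MODELS = {
    "A": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "B": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
```

Adapting a model to a user, as discussed in the rest of this description, changes the emission parameters so that the user's own utterances score higher under the correct index.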

Furthermore, every sound a person utters is composed of phonemes. Taking Chinese as an example, the pronunciation of each character is composed of initials and finals, so each distinct initial or final can be regarded as a distinct phoneme. A phoneme model is a model, based on the HMM speech model, that is built for each distinct phoneme.

To issue commands by voice as described above, existing command recognition methods compose a recognition model for each command from phoneme models. Take "call Wang Xiaoming" as an example, in which "call" can be regarded as a command. Because everyone speaks with a different tone, the user must supply speech data for each command in order to adjust that command's recognition model. This adjustment is gradual, so the user has to repeat the speech data for "call" until the corresponding command recognition model can recognize the user's "call" command.

The above method of improving the individual speech recognition rate requires the user to adjust each command recognition model one by one, and may require many utterances for the same command recognition model, which is both inconvenient and inefficient for the user.

In summary, how to make the adjustment of command recognition models more efficient, so that the user need not adjust each command recognition model one by one, thereby saving time and improving the individual speech recognition rate, is a goal that speech recognition vendors are still working toward.

Summary of the invention

One object of the present invention is to provide a method for improving the individual speech recognition rate in a portable electronic device. The method groups the phoneme models related to the speech data according to a predetermined rule. Thereafter, whenever the user provides speech data, the phoneme models can be adjusted, and the command recognition models composed of those phoneme models are adjusted along with them. The invention thus overcomes the shortcoming of existing command recognition methods, which require the user to supply corresponding speech data for each command recognition model. To achieve this object, the disclosed method captures speech data provided by the user and builds an adjustment parameter from it, then integrates the phoneme model with the adjustment parameter to adjust the recognition model. Through these steps, the recognition model in the portable electronic device can be adjusted.

Another object of the present invention is to provide a module for improving the individual speech recognition rate. The module can be used in a portable electronic device and carries out the aforementioned method, overcoming the shortcoming of existing command recognition methods, which require the user to supply corresponding speech data for each command recognition model. To achieve this object, the disclosed module comprises a recognition model, an adjustment parameter model, and an integration module. The recognition model is composed of phoneme models, and the adjustment parameter model is built from speech data provided by the user. The integration module integrates the phoneme model with the adjustment parameter to adjust the recognition model. In this way, the invention uses a user adaptation technique to improve the recognition rate of the recognition model in the portable electronic device for a specific user.

After reviewing the accompanying drawings and the embodiments described below, a person of ordinary skill in the art will understand the other objects of the present invention, as well as its technical means and implementations.

Description of drawings

Fig. 1 is a flowchart of a method embodiment of the present invention;

Fig. 2 is a more detailed flowchart of the method embodiment of the present invention;

Fig. 3 is a schematic diagram of the phoneme model group structure of the present invention; and

Fig. 4 is a schematic diagram of a module embodiment of the present invention.

Embodiment

A preferred embodiment of the present invention is a method for improving the individual speech recognition rate, applied to a portable electronic device with a speech recognition function, which in this embodiment is a mobile phone. The mobile phone contains a recognition system that includes a preset recognition model built from at least one phoneme model. The method integrates this phoneme model with an adjustment parameter to adjust the recognition model. With the adjusted recognition model, the mobile phone achieves a higher recognition rate for at least one voice command uttered by a user. Specifically, a preset recognition model that has not yet been adjusted applies the same recognition model to all users, and can therefore be regarded as built from non-specific phoneme models.

Referring to Fig. 1, step 100 is executed first to build a specific text database. In this preferred embodiment, the specific text database is related to the text corresponding to the voice commands available to the user, but it need not be identical to the commands. For example, the preset voice commands for operating the mobile phone include commands such as "make a phone call" and "shut down"; the specific text database is built according to the speech features of these commands and is used to improve the phone's recognition rate for a specific user. The specific text database can therefore consist of the commands themselves, or of other text related to the speech features of those commands. Speech features are discussed below.

Next, step 101 is executed: when the user speaks according to the specific text database, features are extracted from the plurality of speech data uttered by the user to build an adjustment parameter. Finally, step 102 is executed to integrate the adjustment parameter with the phoneme model so as to adjust the recognition model.

Referring to Fig. 2, step 101 comprises the following steps. Step 200 extracts feature vectors from the plurality of speech data, where the feature vectors can be Mel cepstral coefficients (Mel-scale Frequency Cepstral Coefficients), linear predictive cepstral coefficients (Linear Predictive Cepstral Coefficients), the cepstrum, or a combination thereof. Step 201 then uses the extracted feature vectors, together with a group structure of the phoneme models, to build an adjustment parameter. This group structure is established from the preset phoneme models and is independent of the user's language tendencies. The group structure is further explained with reference to Fig. 3 below.
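As a rough illustration of step 200, the sketch below computes the plain real cepstrum of one frame, the simplest of the three feature types listed above. It is a sketch under stated assumptions, not the embodiment's front end: the function names are hypothetical, the naive quadratic-time DFT is used only to keep the example self-contained, and practical systems add windowing, a Mel filter bank (for Mel cepstral coefficients), or linear prediction (for linear predictive cepstral coefficients), none of which is shown.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a short sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t in range(n)
        )
        out.append(total / n if inverse else total)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of the frame.

    `floor` is an assumed guard that avoids log(0) for empty spectral bins.
    """
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [c.real for c in dft(log_magnitude, inverse=True)]
```

One property that makes cepstral features convenient is that multiplying the whole frame by a positive gain changes only the first coefficient, so the remaining coefficients describe spectral shape rather than loudness.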

Specifically, in step 201, after the recognition system has captured the speech data and extracted the feature vectors from it (these feature vectors reflect the user's individual pronunciation habits), the recognition system uses the feature vectors, together with the group structure of the phoneme models, to build an adjustment parameter. For example, a combined approach using maximum a posteriori estimation (MAP), maximum likelihood linear regression (MLLR), and vector-field smoothing (VFS) can be adopted to achieve the best adjustment under various amounts of training speech data. The MLLR and VFS algorithms use grouping to overcome the problem of insufficient or missing adjustment data for a probability distribution model: when a given probability distribution model (for example, an HMM speech model) has too little data, it can be adjusted with reference to the other probability distribution models in its group, with which it has a specific association. These specific associations between probability distribution models are exactly what the group structure represents. Because a group may itself still have insufficient or missing data, the groups are organized as a tree structure. If a group has too little data, the method traces upward and merges it with another group; if the data is still insufficient, it traces upward again, until the group used to adjust the recognition model has enough data.
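The two ideas in this step can be sketched separately. The first function below applies the standard MAP update of a Gaussian mean, which interpolates between the preset mean and the user's data; the second and third climb the group tree until the pooled data is sufficient. These are minimal sketches: the prior weight tau, the threshold min_frames, the dictionary-based tree, and all function names are assumptions made for illustration, and the MLLR and VFS parts of the combined approach are not shown.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean.

    new_mean = (tau * prior_mean + sum(frames)) / (tau + number_of_frames)
    With no frames the preset mean is returned; with many frames the result
    approaches the sample mean of the user's data.
    """
    n = len(frames)
    dim = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dim)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


def leaves_under(node, children):
    """Return all leaf groups below `node` (or `node` itself if it is a leaf)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    found = []
    for kid in kids:
        found.extend(leaves_under(kid, children))
    return found


def select_group(leaf, parent, children, frame_counts, min_frames):
    """Climb from `leaf` toward the root until enough frames are pooled.

    parent       : maps each non-root node to its parent
    children     : maps each internal node to its child nodes
    frame_counts : number of adaptation frames observed per leaf group
    """
    node = leaf
    while True:
        pooled = sum(frame_counts.get(l, 0) for l in leaves_under(node, children))
        if pooled >= min_frames or node not in parent:
            return node  # enough data, or the root has been reached
        node = parent[node]
```

The choice of tau controls how quickly the preset model gives way to the user's data, and min_frames controls how readily sparse groups borrow data from their neighbors in the tree.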

Referring to Fig. 3, which is a schematic diagram of group structure 3, the grouping uses the existing k-means algorithm, which is not described in detail here, to divide the phoneme models of the speech data into five subgroups 300, 301, 302, 303, and 304. A bottom-up approach is then used to strengthen the relationships among the subgroups, so that a group has enough data to adjust the recognition model. Based on the similarity between subgroups (that is, distance or maximum likelihood), the subgroups are combined into parent groups 305, 306, 307, and 308, building a tree structure upward and completing the group structure. The above method can be adjusted to suit actual conditions and is not intended to limit the scope of the invention.
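A minimal sketch of how such a group structure could be built is given below, assuming each phoneme model is summarized by a single vector (for example, a mean vector) and that similarity is squared Euclidean distance. The function names, the deterministic initialization from the first k points, and the size-weighted centroid merge are illustrative choices, not details from the patent.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return assignment, centroids


def build_tree(centroids):
    """Bottom-up merge of the two closest groups until one root remains.

    Leaves are numbered 0..len(centroids)-1; each merge creates a new node.
    Returns a list of (new_node_id, child_a, child_b) tuples.
    """
    active = {i: (list(c), 1) for i, c in enumerate(centroids)}
    merges = []
    next_id = len(centroids)
    while len(active) > 1:
        ids = sorted(active)
        pairs = [(a, b) for n, a in enumerate(ids) for b in ids[n + 1:]]
        a, b = min(pairs, key=lambda p: sq_dist(active[p[0]][0], active[p[1]][0]))
        (ca, na), (cb, nb) = active.pop(a), active.pop(b)
        merged = [(x * na + y * nb) / (na + nb) for x, y in zip(ca, cb)]
        active[next_id] = (merged, na + nb)
        merges.append((next_id, a, b))
        next_id += 1
    return merges
```

Merging pairwise, five subgroups produce four parent nodes, the same number as parent groups 305 to 308, although the actual shape of the tree in Fig. 3 may differ.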

In more detail, suppose that because of the user's accent (that is, language tendency), the user's pronunciations of "ㄉ" and "ㄍ" are very close. In this group structure, the models for "ㄉ" and "ㄍ" can then be treated as two phoneme models in the same subgroup 300, and the phoneme models "ㄉ" and "ㄍ" are regarded as sounds with a specific association. As long as the extracted feature vectors include feature vectors related to "ㄉ" and "ㄍ", those feature vectors can also be used to adjust the phoneme models in the same group.

This embodiment can therefore integrate the adjustment parameter with the phoneme models according to the group structure described above in order to adjust the preset recognition model; in effect, the adjustment parameter is grouped according to the user's accent. In this preferred embodiment, as long as the preset recognition model contains the command recognition models for "shut down" and "make a phone call", and the speech uttered by the user contains "ㄉ" or "ㄍ", the phoneme models "ㄉ" and "ㄍ" can be adjusted, and the "shut down" and "make a phone call" command recognition models that contain the phoneme models "ㄉ" and "ㄍ" are adjusted along with them. In other words, all recognition models that contain the same phoneme model are adjusted together, and the adjusted recognition model can be regarded as built from specific phoneme models.
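The related adjustment described above amounts to two lookups: which phoneme models share a group with what the user said, and which commands contain any of those phoneme models. The sketch below illustrates this. The lexicon is a hypothetical Zhuyin initial/final split of the two commands supplied for this example (the patent does not list it), and the function names are likewise assumptions.

```python
# Hypothetical lexicon: each command mapped to the phoneme models it uses.
LEXICON = {
    "shut down": ["ㄍ", "ㄨㄢ", "ㄐ", "ㄧ"],
    "make a phone call": ["ㄉ", "ㄚ", "ㄉ", "ㄧㄢ", "ㄏ", "ㄨㄚ"],
}

# Hypothetical group structure: "ㄉ" and "ㄍ" share one subgroup.
GROUPS = [["ㄉ", "ㄍ"], ["ㄏ", "ㄐ"]]


def phonemes_adjusted(spoken, groups):
    """Phoneme models touched by an utterance, including group neighbors."""
    spoken = set(spoken)
    adjusted = set(spoken)
    for members in groups:
        if spoken & set(members):
            adjusted |= set(members)
    return adjusted


def commands_affected(adjusted_phonemes, lexicon):
    """Commands whose recognition model contains any adjusted phoneme model."""
    adjusted = set(adjusted_phonemes)
    return sorted(
        command for command, phones in lexicon.items() if adjusted & set(phones)
    )
```

With these toy tables, an utterance containing only "ㄉ" also adjusts "ㄍ" through the shared subgroup, so both command recognition models are updated even though the user never said "shut down".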

As the above description shows, the present invention can adjust the recognition models with less speech data. By using the group structure of the phoneme models, when the user reads out a given utterance, the phoneme models related to that utterance are adjusted along with it, and the command recognition models are adjusted in turn, so that the user can adjust all recognition models by supplying less speech data.

Another preferred embodiment of the present invention is a module 4 for improving the individual speech recognition rate, used in a portable electronic device such as a mobile phone. Module 4 comprises a recognition model 400, an adjustment parameter model 401, and an integration module 402, and can use the method of the preceding preferred embodiment to improve the speech recognition rate.

Recognition model 400 is built from phoneme models and is used to recognize the voice commands uttered by a user. These phoneme models are the same as those described in the preceding preferred embodiment and are not described again here. Adjustment parameter model 401 is built from the user's speech data and includes the group structure described in the preceding preferred embodiment, which is formed according to the specific associations between phoneme models and is likewise not described again here. Adjustment parameter model 401 is built by extracting the feature vectors of a plurality of speech data that the user utters according to a specific text database, together with the group structure. The specific text database is designed so that the user utters speech related to the phoneme models that make up the voice commands. For example, the specific text can be a command, such as "make a phone call" or "shut down", or a specific passage, such as "there is a phone in the room" or "the weather is very good". Different users pronounce the same text differently. Integration module 402 integrates the phoneme models with the adjustment parameter model to adjust the recognition model; the adjustment is performed as described in the preceding preferred embodiment and is not described again here.
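The relationship among the three elements of module 4 can be sketched as plain data classes. This is only an illustration under simplifying assumptions: each phoneme model is reduced to a mean vector, the adjustment parameter is reduced to one additive shift per group, and the class and field names are hypothetical rather than taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class RecognitionModel:
    """Stands in for element 400: phoneme name -> mean vector."""
    phoneme_means: dict


@dataclass
class AdjustmentParameterModel:
    """Stands in for element 401: group structure plus per-group shifts."""
    groups: list                                 # list of lists of phoneme names
    shifts: dict = field(default_factory=dict)   # group index -> shift vector


class IntegrationModule:
    """Stands in for element 402: combines 400 and 401 into an adjusted model."""

    def integrate(self, recognition, adjustment):
        adapted = {}
        for phoneme, mean in recognition.phoneme_means.items():
            new_mean = list(mean)
            for index, members in enumerate(adjustment.groups):
                # Every phoneme in a group receives that group's shift,
                # including phonemes the user never uttered.
                if phoneme in members and index in adjustment.shifts:
                    shift = adjustment.shifts[index]
                    new_mean = [m + s for m, s in zip(new_mean, shift)]
            adapted[phoneme] = new_mean
        return RecognitionModel(adapted)
```

Returning a new RecognitionModel rather than modifying the preset one is a design choice of this sketch; it keeps the preset model available as a fallback for other users.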

In addition to the operations and functions depicted in Fig. 4, module 4 can also carry out all the steps of the foregoing method embodiment. A person of ordinary skill in the art can readily understand how module 4 carries out these steps based on the foregoing method embodiment, so this is not described further here.

In summary, the present invention classifies the phoneme models to produce a group structure and, according to this group structure, uses the user-related adjustment parameter to adjust the phoneme models, whereby the recognition models are adjusted along with them. The invention thus overcomes the shortcoming of existing command recognition methods: the recognition models can be adjusted with less speech input, improving the individual speech recognition rate.

The above embodiments merely illustrate implementations of the present invention and explain its technical features; they are not intended to limit its scope. Any change or equivalent arrangement that a person skilled in the art can readily accomplish falls within the scope claimed by the invention, and the scope of rights of the invention shall be determined by the claims of this application.

Claims

1. A method for improving an individual speech recognition rate, used in a portable electronic device, the portable electronic device having a preset recognition model, the recognition model being built from at least one phoneme model to recognize at least one voice command uttered by a user, the method comprising the following steps:

building a specific text database related to the text corresponding to the voice command;

capturing a plurality of speech data uttered by the user according to the text database, to build an adjustment parameter; and

integrating the at least one phoneme model and the adjustment parameter, to adjust the recognition model.

2. The method according to claim 1, characterized in that the step of building an adjustment parameter extracts feature vectors from the plurality of speech data and establishes a group structure for the at least one phoneme model.

3. The method according to claim 2, characterized in that the step of building an adjustment parameter establishes the group structure according to sounds having a specific association.

4. The method according to claim 2, characterized in that the step of adjusting the recognition model integrates the at least one phoneme model and the adjustment parameter according to the group structure.

5. The method according to claim 1, characterized in that the recognition model is built from at least one non-specific phoneme model.

6. A module for improving an individual speech recognition rate, used in a portable electronic device, comprising:

a recognition model preset in the portable electronic device, the recognition model being built from at least one phoneme model and used to recognize at least one voice command uttered by a user;

an adjustment parameter model comprising a group structure, the group structure being independent of a user's language tendencies; and

an integration module that integrates the at least one phoneme model and the adjustment parameter model, to adjust the recognition model.

7. The module according to claim 6, characterized in that the group structure is formed according to a specific association of the at least one phoneme model.

8. The module according to claim 6, characterized in that the recognition model is built from at least one non-specific phoneme model.