CN105096941B - Audio recognition method and device - Google Patents
Speech recognition method and device
- Publication number: CN105096941B
- Application number: CN201510558047.4A
- Authority: CN (China)
- Prior-art keywords: speaker, acoustic model, corpus information, storage
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a speech recognition method and device. The method includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information. Based on speaker adaptation, the method can customize an acoustic model for the characteristics of each speaker, thereby improving recognition accuracy for each speaker and improving the user experience.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a speech recognition method and device.
Background art
In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, recognition performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly common and its user base increasingly broad. How to improve the accuracy of speech recognition has therefore become an urgent problem to be solved.
In the related art, speech recognition mainly obtains an acoustic model and a language model through training on a large amount of speech, and then performs recognition on the speech data input by a speaker using these models. The larger the training sample, the better the trained acoustic model, and thus the higher the accuracy of speech recognition.
The problem with the above speech recognition process is that a single acoustic model, trained on a large number of speech samples, is applied to the recognition of all speakers. For speakers with a heavy dialectal accent or unclear articulation, such a model may fail to recognize the input content well, reducing the model's recognition accuracy and degrading the user experience.
Summary of the invention
The purpose of the present invention is to solve at least one of the above technical problems to some extent.
Accordingly, a first object of the present invention is to propose a speech recognition method. Based on speaker adaptation, the method can customize an acoustic model for the characteristics of each speaker, thereby improving recognition accuracy for each speaker and improving the user experience.
A second object of the present invention is to propose a speech recognition device.
To achieve these objects, the speech recognition method of the first-aspect embodiment of the present invention includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition method of the embodiment of the present invention, the speech information input by a speaker is first obtained, along with the speaker information of the speaker; whether a personal acoustic model corresponding to the speaker exists is then judged according to the speaker information. If it exists, the personal acoustic model is obtained and used to perform speech recognition on the speech information; if not, speech recognition is performed with the base acoustic model, corpus information of the speaker is generated from the speech information and stored, and a personal acoustic model of the speaker is generated from the base acoustic model and the stored corpus information. That is, on the basis of a speaker-independent acoustic model (the base acoustic model above), further training is carried out using the historical speech data of a given speaker, yielding a personal acoustic model carrying the speaker's own characteristics; that personal acoustic model is then used during recognition. This improves each person's speech recognition accuracy and, in effect, provides every speech recognition user with a privately customized recognition service, thereby improving the user experience.
To achieve these objects, the speech recognition device of the second-aspect embodiment of the present invention includes: a first acquisition module for obtaining the speech information input by a speaker and obtaining the speaker information of the speaker; a judging module for judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; a speech recognition module for obtaining the personal acoustic model when the judging module judges that it exists and performing speech recognition on the speech information according to the speaker's personal acoustic model, and for performing speech recognition on the speech information according to a base acoustic model when the judging module judges that the personal acoustic model does not exist; a first generation module for generating and storing corpus information of the speaker according to the speech information; and a second generation module for generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition device of the embodiment of the present invention, the first acquisition module obtains the speech information input by the speaker and the speaker information of the speaker, and the judging module judges, according to the speaker information, whether a corresponding personal acoustic model exists. If so, the speech recognition module obtains the personal acoustic model and uses it to perform speech recognition on the speech information; if not, the speech recognition module performs speech recognition on the speech information with the base acoustic model, the first generation module generates and stores corpus information of the speaker from the speech information, and the second generation module generates the speaker's personal acoustic model from the base acoustic model and the stored corpus information. That is, further training on the speaker-independent acoustic model (the base acoustic model above) with the speaker's historical speech data yields a personal acoustic model carrying the speaker's own characteristics, which is then used for recognition; this improves each person's speech recognition accuracy, in effect providing every speech recognition user with a privately customized recognition service, thereby improving the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or may be learned by practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flow chart of generating a personal acoustic model according to an embodiment of the present invention;
Fig. 3 is a flow chart of a speech recognition method according to another embodiment of the present invention;
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a second generation module according to an embodiment of the present invention; and
Fig. 6 is a structural block diagram of a speech recognition device according to another embodiment of the present invention.
Embodiment
Embodiments of the invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numbers denote throughout the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting the invention.
A speech recognition method and device according to embodiments of the present invention are described below with reference to the accompanying drawings.
It should be noted that speech recognition refers to a machine automatically converting human speech into the corresponding text. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, system performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly popular and its user base increasingly broad. Because each user's pronunciation has its own acoustic characteristics, exploiting these characteristics during recognition will necessarily further improve the recognition system.
Therefore, the present invention proposes a speech recognition method including: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
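The overall flow just described — use a speaker's personal acoustic model when one exists, otherwise fall back to the base model and accumulate corpus information — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; all names (`SpeechRecognizer`, `recognize`, and the lambda models standing in for real acoustic models) are assumptions.

```python
class SpeechRecognizer:
    def __init__(self, base_model):
        self.base_model = base_model      # speaker-independent base acoustic model
        self.personal_models = {}         # speaker_id -> personal acoustic model
        self.corpus_store = {}            # speaker_id -> stored corpus information

    def recognize(self, speaker_id, speech):
        model = self.personal_models.get(speaker_id)
        if model is not None:
            # a personal acoustic model exists: use it for recognition
            return model(speech)
        # no personal model yet: decode with the base model and store the
        # utterance as corpus information for later adaptation training
        self.corpus_store.setdefault(speaker_id, []).append(speech)
        return self.base_model(speech)

recognizer = SpeechRecognizer(base_model=lambda s: "base:" + s)
recognizer.personal_models["alice"] = lambda s: "personal:" + s

assert recognizer.recognize("alice", "hello") == "personal:hello"
assert recognizer.recognize("bob", "hello") == "base:hello"
assert recognizer.corpus_store["bob"] == ["hello"]
```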
Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the speech recognition method may include:
S101: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
It should be noted that, in an embodiment of the present invention, the speaker information may be an ID (identity number) corresponding to the speaker. The speaker ID may be an identifier distributed to the speaker by a server, and the speaker ID has a one-to-one correspondence with the speaker's voiceprint feature.
Specifically, the speech information input by the speaker may be collected through a microphone in a terminal, and a voiceprint feature may be extracted from the speech information; the speaker information (for example, the speaker ID) corresponding to that voiceprint feature may then be obtained according to the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker. That is, in this step, after the speech information input by the speaker is obtained, the ID or MAC address of the terminal used by the speaker may also be obtained.
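The two ways of obtaining speaker information in S101 — voiceprint matching, with the terminal ID or MAC address as an alternative — can be sketched roughly as below. The voiceprint "extraction" here is a toy stand-in (an average of feature values with a nearest-match tolerance); real voiceprint matching is far more involved, and every name and the tolerance value are assumptions.

```python
def extract_voiceprint(speech):
    # stand-in for real voiceprint extraction: average "feature" value
    return sum(speech) / len(speech)

def lookup_speaker(speech, voiceprint_table, device_id=None, tolerance=0.5):
    # try to resolve the speaker ID from the stored voiceprint mapping;
    # fall back to the terminal's device ID / MAC address if no match
    vp = extract_voiceprint(speech)
    for speaker_id, stored_vp in voiceprint_table.items():
        if abs(vp - stored_vp) <= tolerance:
            return speaker_id
    return device_id

table = {"speaker_1": 3.0, "speaker_2": 7.0}
assert lookup_speaker([2.9, 3.2, 2.8], table) == "speaker_1"
assert lookup_speaker([9.0, 9.5], table, device_id="AA:BB:CC") == "AA:BB:CC"
```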
S102: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
In an embodiment of the present invention, the personal acoustic model may be understood as the speaker's own acoustic model; the personal acoustic model may include the speaker's voice characteristics.
S103: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
Specifically, when it is judged that a personal acoustic model corresponding to the speaker exists, that personal acoustic model may be obtained, after which the personal acoustic model and the speech information may be sent to a decoder, which may be located in a server. The decoder may first perform feature extraction on the speech information to obtain the corresponding acoustic features, and may then compare and match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
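The decoder's compare-and-match step can be illustrated with a toy scorer: features are scored against each candidate's model parameters, and the best-scoring candidate is returned as the recognition result. The "certain criterion" here is simply highest likelihood under a negative-squared-distance score; the word models and all names are illustrative assumptions, not the patent's decoder.

```python
def score(features, model_means):
    # toy log-likelihood: negative squared distance to the model's means
    return -sum((f - m) ** 2 for f, m in zip(features, model_means))

def decode(features, word_models):
    # compare the acoustic features against each candidate model and
    # return the best-matching recognition result
    return max(word_models, key=lambda w: score(features, word_models[w]))

word_models = {"yes": [1.0, 2.0], "no": [5.0, 6.0]}
assert decode([1.1, 2.2], word_models) == "yes"
assert decode([4.8, 6.1], word_models) == "no"
```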
S104: if it does not exist, perform speech recognition on the speech information according to the base acoustic model, and generate and store corpus information of the speaker according to the speech information.
In an embodiment of the present invention, the base acoustic model may be understood as a model applicable to speakers in general, i.e., an acoustic model obtained by collecting the speech data of many speakers and performing speech training on that data.
That is, when it is judged that no personal acoustic model corresponding to the speaker exists, the base acoustic model and the speech information may be sent to the decoder to recognize the speech information; the speech information of this input may then be stored as corpus information of the speaker, forming historical corpus information.
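The accumulation of historical corpus information, together with the threshold check that later triggers adaptation training (S201 below), can be sketched as follows. The threshold value and function names are assumptions for illustration only.

```python
PRESET_THRESHOLD = 3  # assumed value; the patent only says "preset threshold"

def store_corpus(corpus_store, speaker_id, utterance):
    # append this utterance to the speaker's historical corpus information
    entries = corpus_store.setdefault(speaker_id, [])
    entries.append(utterance)
    # True once enough corpus has accumulated to train a personal model
    return len(entries) >= PRESET_THRESHOLD

store = {}
assert store_corpus(store, "s1", "utt-1") is False
assert store_corpus(store, "s1", "utt-2") is False
assert store_corpus(store, "s1", "utt-3") is True   # threshold reached
```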
S105: generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
Specifically, in an embodiment of the present invention, as shown in Fig. 2, the process of generating the personal acoustic model may include the following steps:
S201: judge whether the quantity of stored corpus information reaches a preset threshold. Specifically, judge whether the stored corpus information has accumulated to a certain quantity (i.e., the preset threshold).
S202: if the preset threshold is reached, screen the stored corpus information to obtain the corresponding valid corpus information.
Specifically, in an embodiment of the present invention, when it is judged that the stored corpus information has accumulated to the preset threshold, the screening parameters of each stored corpus entry may first be obtained and each entry scored according to those parameters. A score for each stored entry may then be generated according to the evaluation results and the weights of the screening parameters. Finally, the stored corpus information is screened according to the per-entry scores to obtain the corresponding valid corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and speech recognition content.
More specifically, after a speaker's corpus information has accumulated to a certain quantity, the stored historical corpus information may be screened and analyzed: for each stored entry, confidence, speech energy, speech length, speech recognition content, and so on may be calculated respectively; each calculation result may then be scored, and a final score for each stored entry computed from the scores and the weight of each screening parameter. Finally, low-scoring entries may be filtered out, and the remaining high-scoring entries retained as the valid corpus information.
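The weighted scoring and filtering of S202 can be sketched as below: each entry carries per-parameter scores, a final score is formed as a weighted sum, and entries below a cut-off are dropped. The weight values and the cut-off are assumptions; the patent specifies only that scores are combined with per-parameter weights.

```python
# assumed weights for the four screening parameters named in the text
WEIGHTS = {"confidence": 0.4, "energy": 0.2, "length": 0.2, "content": 0.2}

def final_score(entry):
    # weighted sum of the per-parameter scores of one stored corpus entry
    return sum(WEIGHTS[k] * entry[k] for k in WEIGHTS)

def select_valid(corpus, cutoff=0.6):
    # keep only high-scoring entries as valid corpus information
    return [e for e in corpus if final_score(e) >= cutoff]

corpus = [
    {"id": 1, "confidence": 0.9, "energy": 0.8, "length": 0.7, "content": 0.9},
    {"id": 2, "confidence": 0.2, "energy": 0.3, "length": 0.4, "content": 0.1},
]
assert [e["id"] for e in select_valid(corpus)] == [1]
```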
S203: perform a primary/secondary speaker judgement according to each valid corpus entry, to filter out the valid corpus information belonging to the same speaker.
Specifically, all valid corpus entries may be analyzed, for example by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, and performing a primary/secondary user judgement. If the current speaker is found to comprise multiple natural persons, all valid corpus entries may be screened by information such as voice features and speaker gender, to filter out the valid corpus information belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after the primary/secondary user judgement performed by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, if the current speaker is found to comprise multiple natural persons, the current corpus entry may instead be rejected; that is, corpus entries containing multiple natural persons are rejected, and no model training is performed on them.
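The first variant of the primary/secondary judgement in S203 — keep the dominant speaker's entries, discard the rest — can be illustrated with a majority vote over a single voice attribute. A gender label stands in here for the SNR/voice-feature analysis the text describes; the attribute and all names are illustrative assumptions.

```python
from collections import Counter

def keep_primary_speaker(corpus):
    # majority attribute value is taken as the primary speaker; entries
    # from secondary speakers are filtered out
    if not corpus:
        return []
    primary, _ = Counter(e["gender"] for e in corpus).most_common(1)[0]
    return [e for e in corpus if e["gender"] == primary]

corpus = [
    {"id": 1, "gender": "f"},
    {"id": 2, "gender": "f"},
    {"id": 3, "gender": "m"},   # secondary speaker, filtered out
]
assert [e["id"] for e in keep_primary_speaker(corpus)] == [1, 2]
```

The second variant described above (rejecting any entry that itself contains multiple natural persons) would instead drop such entries before this step.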
S204: perform model training on the valid corpus information belonging to the same speaker according to the base acoustic model, to generate the speaker's personal acoustic model.
Specifically, in an embodiment of the present invention, acoustic feature extraction may first be performed on the valid corpus information belonging to the same speaker, and the corresponding speech recognition content analyzed, to obtain high-confidence word-level recognition results and the corresponding acoustic features. The personal acoustic model may then be randomly initialized, gradients computed according to the base acoustic model, the word-level recognition results, and the acoustic features, and finally the computed gradients iterated to generate the personal acoustic model.
That is, the corpus information may be used for acoustic feature extraction while the speech recognition content is analyzed, retaining the high-confidence word-level recognition results and the corresponding acoustic features; the personal acoustic model may then be randomly initialized, gradients computed in combination with the base acoustic model using the acoustic features and word-level recognition results retained in the previous step, and the obtained gradients used to iteratively optimize the personal acoustic model.
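The gradient-based adaptation above can be sketched in miniature: start from the base model's parameters and refine them by gradient descent on the speaker's (feature, recognition-result) pairs. A one-parameter least-squares model stands in for the real acoustic model; the learning rate, step count, and all names are assumptions, and a real system would train a neural acoustic model on frame-level targets.

```python
def adapt(base_weight, pairs, lr=0.1, steps=100):
    w = base_weight                      # initialize from the base model
    for _ in range(steps):
        # gradient of mean squared error between w*x and target y
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad                   # iterate the gradient update
    return w

# speaker's corpus: (acoustic feature, word-level target) pairs, consistent
# with an "ideal" personal weight of 2.0
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
personal_w = adapt(base_weight=0.5, pairs=pairs)
assert abs(personal_w - 2.0) < 1e-3
```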
Thus, model training can be performed on each speaker's corpus information to obtain a corresponding personal acoustic model; the personal acoustic model includes the user's voice characteristics and can be used to improve accuracy during speech recognition.
In summary, the speech recognition method of the embodiment of the present invention performs further training on the speaker-independent acoustic model (the base acoustic model above) using the historical speech data of a given speaker, obtains a personal acoustic model carrying the speaker's own characteristics, and uses that model during recognition. This improves each person's speech recognition accuracy and, in effect, provides every speech recognition user with a privately customized recognition service, thereby improving the user experience.
Fig. 3 is a flow chart of a speech recognition method according to another embodiment of the present invention.
In order to further improve the accuracy of speech recognition, in an embodiment of the present invention, model optimization may also be performed on the personal acoustic model. Specifically, as shown in Fig. 3, the speech recognition method may include:
S301: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
S302: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
S303: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
S304: perform model optimization on the personal acoustic model according to the speech information.
It can be understood that the speaker-adaptive speech recognition process is transparent to the user: the user does not perceive the adaptive training and adaptive recognition flow. As the number of times the user uses speech recognition grows, the speech recognition server continuously optimizes the user's adaptive model, and the user's speech recognition accuracy keeps improving. Speaker-adaptive training is thus not a one-off task; as the speaker's corpus accumulates, the adaptive training process is repeated continually. In an embodiment of the present invention, each adaptive training pass may be optimized on the basis of the previous personal acoustic model, continually improving recognition accuracy.
S305: if it does not exist, perform speech recognition on the speech information according to the base acoustic model, and generate and store corpus information of the speaker according to the speech information.
S306: generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition method of the embodiment of the present invention, after speech recognition is performed on the speech information according to the speaker's personal acoustic model, model optimization may also be performed on the personal acoustic model according to the speech information, which can further improve the accuracy of personal speech recognition.
In order to implement the above embodiments, the invention also provides a speech recognition device.
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition device may include: a first acquisition module 10, a judging module 20, a speech recognition module 30, a first generation module 40, and a second generation module 50.
Specifically, the first acquisition module 10 may be used to obtain the speech information input by a speaker, and to obtain the speaker information of the speaker. It should be noted that, in an embodiment of the present invention, the speaker information may be an ID corresponding to the speaker; the speaker ID may be an identifier distributed to the speaker by a server, and has a one-to-one correspondence with the speaker's voiceprint feature.
More specifically, the first acquisition module 10 may collect the speech information input by the speaker through a microphone in a terminal, and may extract a voiceprint feature from the speech information; the speaker information (for example, the speaker ID) corresponding to that voiceprint feature may then be obtained according to the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker. That is, after obtaining the speech information input by the speaker, the first acquisition module 10 may also obtain the ID or MAC address of the terminal used by the speaker.
The judging module 20 may be used to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists. In an embodiment of the present invention, the personal acoustic model may be understood as the speaker's own acoustic model, and may include the speaker's voice characteristics.
The speech recognition module 30 may be used to obtain the personal acoustic model when the judging module 20 judges that it exists and to perform speech recognition on the speech information according to the speaker's personal acoustic model, and to perform speech recognition on the speech information according to the base acoustic model when the judging module 20 judges that the personal acoustic model does not exist.
More specifically, when the judging module 20 judges that a personal acoustic model corresponding to the speaker exists, the speech recognition module 30 may obtain the personal acoustic model, first perform feature extraction on the speech information to obtain the corresponding acoustic features, and then compare and match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
In an embodiment of the present invention, the base acoustic model may be understood as a model applicable to speakers in general, i.e., an acoustic model obtained by collecting the speech data of many speakers and performing speech training on that data.
The first generation module 40 may be used to generate and store corpus information of the speaker according to the speech information. More specifically, after the speech recognition module 30 performs speech recognition on the speech information according to the base acoustic model, the first generation module 40 may store the speech information of this input as corpus information of the speaker, forming historical corpus information.
The second generation module 50 may be used to generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information. Specifically, in one embodiment of the invention, as shown in Fig. 5, the second generation module 50 may include: a judging unit 51, a first screening unit 52, a second screening unit 53, and a generation unit 54. The judging unit 51 may be used to judge whether the quantity of stored corpus information reaches a preset threshold, i.e., whether the stored corpus information has accumulated to a certain quantity.
The first screening unit 52 may be used to screen the stored corpus information to obtain the corresponding valid corpus information when the judging unit 51 judges that the quantity of stored corpus information reaches the preset threshold. Specifically, in an embodiment of the present invention, the first screening unit 52 may first obtain the screening parameters of each stored corpus entry and score each entry according to those parameters, then generate a score for each stored entry according to the evaluation results and the weights of the screening parameters, and finally screen the stored corpus information according to the per-entry scores to obtain the corresponding valid corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and speech recognition content.
More specifically, after a speaker's corpus information has accumulated to a certain quantity, the first screening unit 52 may screen and analyze the stored historical corpus information: for each stored entry it may calculate confidence, speech energy, speech length, speech recognition content, and so on respectively, then score each calculation result and compute a final score for each stored entry from the scores and the weight of each screening parameter; finally it may filter out the low-scoring entries, and retain the remaining high-scoring entries as the valid corpus information.
The second screening unit 53 may be used to perform a primary/secondary speaker judgement according to each valid corpus entry, to filter out the valid corpus information belonging to the same speaker. More specifically, the second screening unit 53 may analyze all valid corpus entries, for example performing a primary/secondary user judgement by calculating each entry's signal-to-noise ratio, voice features, and speaker gender. If the current speaker is found to comprise multiple natural persons, all valid corpus entries may be screened by information such as voice features and speaker gender, to filter out the valid corpus information belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after the second screening unit 53 performs the primary/secondary user judgement by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, if the current speaker is found to comprise multiple natural persons, the current corpus entry may instead be rejected; that is, corpus entries containing multiple natural persons are rejected, and no model training is performed on them.
The generation unit 54 can be used to perform model training on the effective corpus information belonging to the same speaker, according to the basic acoustic model, so as to generate the personal acoustic model of the speaker. Specifically, in an embodiment of the present invention, the generation unit 54 can first perform acoustic feature extraction on the effective corpus information belonging to the same speaker and analyze the speech recognition content corresponding to that effective corpus information, so as to obtain high-confidence word-level recognition results and the corresponding acoustic features. Afterwards, it can randomly initialize the personal acoustic model and perform gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features. Finally, the calculated gradients can be iterated to generate the personal acoustic model.
That is, the generation unit 54 can use the corpus information to extract acoustic features while analyzing the speech recognition content, retaining the high-confidence word-level recognition results and the corresponding acoustic features. Afterwards, it can randomly initialize the personal acoustic model and perform gradient calculation using the basic acoustic model together with the acoustic features and word-level recognition results retained in the previous step. Finally, it can use the obtained gradients to iteratively optimize the personal acoustic model.
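As a toy illustration of this random-initialization-plus-gradient-iteration idea, using a one-parameter "model" in place of a real acoustic model. The regularization pull toward the base weight is an assumption about how the basic acoustic model enters the gradient; a real system would fine-tune a neural acoustic model on the retained features and word-level labels:

```python
import random

def train_personal_model(base_w, features, labels,
                         lr=0.05, lam=0.1, epochs=200, seed=0):
    """Randomly initialize a personal weight, then iterate gradient steps
    that fit the speaker's (feature, label) pairs while staying close to
    the speaker-independent base weight."""
    w = random.Random(seed).uniform(-1.0, 1.0)  # random initialization
    n = len(features)
    for _ in range(epochs):
        # gradient of mean squared error on the retained data...
        g = sum(2 * (w * x - y) * x for x, y in zip(features, labels)) / n
        # ...plus a pull toward the basic acoustic model
        g += 2 * lam * (w - base_w)
        w -= lr * g                             # one gradient iteration
    return w
```

With features [1, 2, 3] and labels [2, 4, 6] (true weight 2) and a base weight of 1.5, the result converges just below 2, slightly pulled toward the base model.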
Thus, model training can be performed on the corpus information of each speaker to obtain the corresponding personal acoustic model. The personal acoustic model captures the voice characteristics of the user and can be used to improve accuracy in the speech recognition process.
Further, in one embodiment of the invention, as shown in Fig. 6, the speech recognition device may also include an optimization module 60. The optimization module 60 can be used to perform model optimization on the personal acoustic model according to the voice information, after speech recognition has been performed on the voice information according to the personal acoustic model of the speaker. It can be appreciated that the speaker-adaptive speech recognition process is transparent to the user: the user does not perceive the adaptive training and adaptive recognition flow. As the number of times the user uses speech recognition increases, the speech recognition server continuously optimizes the user's adaptive model, and the accuracy of the user's speech recognition improves accordingly. It can thus be seen that speaker-adaptive training is not a one-off task; rather, as the speaker's corpus accumulates, the adaptive training process is repeated continuously. In an embodiment of the present invention, the optimization module 60 can, in each round of adaptive training, optimize on the basis of the previous personal acoustic model, so as to continuously improve speech recognition accuracy. In this way, the accuracy of personal speech recognition can be further improved.
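The "not one-off" point can be sketched as a loop in which each round of adaptation starts from the model the previous round produced (again a toy one-parameter model; the gradient update is an illustrative stand-in for real acoustic-model training):

```python
def adapt_round(w, features, labels, lr=0.05, epochs=100):
    """One round of adaptive training on a new batch of corpus,
    starting from the weight w produced by the previous round."""
    for _ in range(epochs):
        g = sum(2 * (w * x - y) * x for x, y in zip(features, labels)) / len(features)
        w -= lr * g
    return w

def continual_adaptation(base_w, sessions):
    """Each newly accumulated corpus batch re-runs adaptation on top of
    the previous personal model, not from scratch."""
    w = base_w
    for features, labels in sessions:
        w = adapt_round(w, features, labels)
    return w
```

Because every round warm-starts from the previous model, the personal model tracks the speaker ever more closely as usage accumulates.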
With the speech recognition device of the embodiment of the present invention, the first acquisition module can obtain the voice information input by the speaker and obtain the speaker information of the speaker; the judgment module judges, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, the speech recognition module obtains the personal acoustic model and performs speech recognition on the voice information according to the personal acoustic model of the speaker; if it does not exist, the speech recognition module performs speech recognition on the voice information according to the basic acoustic model, the first generation module generates and stores the corpus information of the speaker according to the voice information, and the second generation module generates the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information. That is, on the basis of the speaker-independent acoustic model (i.e., the above-mentioned basic acoustic model), further training is performed using the historical speech data of a given speaker to obtain a personal acoustic model bearing the speaker's own characteristics, and this personal acoustic model is used for recognition in the speech recognition process. Each person's speech recognition accuracy can thereby be improved, which is equivalent to providing every speech recognition user with a privately customized speech recognition service, thus improving the user experience.
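The overall dispatch just summarized — look up a personal model by speaker ID, otherwise fall back to the base model while accumulating corpus until a personal model can be trained — might be sketched as follows (model objects are stand-in strings, and the training threshold of 3 is an arbitrary assumption):

```python
BASE_MODEL = "base-acoustic-model"        # speaker-independent model
TRAIN_THRESHOLD = 3                       # corpus entries needed before training

personal_models: dict[str, str] = {}      # speaker ID -> personal acoustic model
corpus_store: dict[str, list[str]] = {}   # speaker ID -> stored corpus entries

def recognize(speaker_id: str, utterance: str) -> str:
    model = personal_models.get(speaker_id)
    if model is not None:
        # a personal model exists: use it for recognition
        return f"decoded '{utterance}' with {model}"
    # no personal model yet: recognize with the base model and store the corpus
    corpus_store.setdefault(speaker_id, []).append(utterance)
    if len(corpus_store[speaker_id]) >= TRAIN_THRESHOLD:
        # enough corpus accumulated: train a personal model for next time
        personal_models[speaker_id] = f"personal-model-for-{speaker_id}"
    return f"decoded '{utterance}' with {BASE_MODEL}"
```

The switch-over is invisible to the caller, matching the patent's point that adaptation is transparent to the user.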
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance, or as implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, where no conflict arises, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, fragment, or portion of code including one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may be considered, for example, an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, device, or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, device, or apparatus). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in combination with, an instruction execution system, device, or apparatus. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing module, or each unit may physically exist separately, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and cannot be interpreted as limiting the present invention; those of ordinary skill in the art can make changes, modifications, replacements, and variations to the above embodiments within the scope of the present invention.
Claims (12)
1. A speech recognition method, characterized by comprising the following steps:
obtaining the voice information input by a speaker and obtaining the speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker, or an ID or MAC address of the terminal used by the speaker;
judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
if it exists, obtaining the personal acoustic model and performing speech recognition on the voice information according to the personal acoustic model of the speaker;
if it does not exist, performing speech recognition on the voice information according to a basic acoustic model, and generating and storing the corpus information of the speaker according to the voice information; and
generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
2. The speech recognition method according to claim 1, characterized in that generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information specifically comprises:
judging whether the quantity of the stored corpus information reaches a preset threshold;
if the preset threshold is reached, screening the stored corpus information to obtain corresponding effective corpus information;
performing a primary/secondary speaker judgment according to each piece of effective corpus information to single out the effective corpus information belonging to the same speaker; and
performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker.
3. The speech recognition method according to claim 2, characterized in that screening the stored corpus information to obtain the corresponding effective corpus information specifically comprises:
obtaining the screening parameters of each piece of stored corpus information, and scoring each piece of stored corpus information according to the screening parameters;
generating a score for each piece of stored corpus information according to the scoring results and the weights of the screening parameters; and
screening the stored corpus information according to the score of each piece of stored corpus information to obtain the corresponding effective corpus information.
4. The speech recognition method according to claim 3, characterized in that the screening parameters include confidence, speech energy, speech length, and speech recognition content.
5. The speech recognition method according to claim 2, characterized in that performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker specifically comprises:
performing acoustic feature extraction on the effective corpus information belonging to the same speaker, and analyzing the speech recognition content corresponding to the effective corpus information of the same speaker, to obtain high-confidence word-level recognition results and the corresponding acoustic features;
randomly initializing the personal acoustic model, and performing gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterating on the calculated gradients to generate the personal acoustic model.
6. The speech recognition method according to any one of claims 1 to 5, characterized in that, after speech recognition is performed on the voice information according to the personal acoustic model of the speaker, the method further comprises:
performing model optimization on the personal acoustic model according to the voice information.
7. A speech recognition device, characterized by comprising:
a first acquisition module, configured to obtain the voice information input by a speaker and to obtain the speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker, or an ID or MAC address of the terminal used by the speaker;
a judgment module, configured to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
a speech recognition module, configured to obtain the personal acoustic model when the judgment module judges that the personal acoustic model exists and to perform speech recognition on the voice information according to the personal acoustic model of the speaker, and to perform speech recognition on the voice information according to a basic acoustic model when the judgment module judges that the personal acoustic model does not exist;
a first generation module, configured to generate and store the corpus information of the speaker according to the voice information; and
a second generation module, configured to generate the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
8. The speech recognition device according to claim 7, characterized in that the second generation module comprises:
a judgment unit, configured to judge whether the quantity of the stored corpus information reaches a preset threshold;
a first screening unit, configured to screen the stored corpus information to obtain corresponding effective corpus information when the judgment unit judges that the quantity of the stored corpus information reaches the preset threshold;
a second screening unit, configured to perform a primary/secondary speaker judgment according to each piece of effective corpus information to single out the effective corpus information belonging to the same speaker; and
a generation unit, configured to perform model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker.
9. The speech recognition device according to claim 8, characterized in that the first screening unit is specifically configured to:
obtain the screening parameters of each piece of stored corpus information, and score each piece of stored corpus information according to the screening parameters;
generate a score for each piece of stored corpus information according to the scoring results and the weights of the screening parameters; and
screen the stored corpus information according to the score of each piece of stored corpus information to obtain the corresponding effective corpus information.
10. The speech recognition device according to claim 9, characterized in that the screening parameters include confidence, speech energy, speech length, and speech recognition content.
11. The speech recognition device according to claim 8, characterized in that the generation unit is specifically configured to:
perform acoustic feature extraction on the effective corpus information belonging to the same speaker, and analyze the speech recognition content corresponding to the effective corpus information of the same speaker, to obtain high-confidence word-level recognition results and the corresponding acoustic features;
randomly initialize the personal acoustic model, and perform gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterate on the calculated gradients to generate the personal acoustic model.
12. The speech recognition device according to any one of claims 7 to 11, characterized by further comprising:
an optimization module, configured to perform model optimization on the personal acoustic model according to the voice information, after speech recognition has been performed on the voice information according to the personal acoustic model of the speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510558047.4A CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105096941A CN105096941A (en) | 2015-11-25 |
CN105096941B true CN105096941B (en) | 2017-10-31 |
Family
ID=54577227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510558047.4A Active CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105096941B (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105632489A (en) * | 2016-01-20 | 2016-06-01 | 曾戟 | Voice playing method and voice playing device |
CN107545889B (en) * | 2016-06-23 | 2020-10-23 | 华为终端有限公司 | Model optimization method and device suitable for pattern recognition and terminal equipment |
CN107767880B (en) * | 2016-08-16 | 2021-04-16 | 杭州萤石网络有限公司 | Voice detection method, camera and intelligent home nursing system |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
CN108737324B (en) * | 2017-04-13 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Method and device for generating artificial intelligence service assembly and related equipment and system |
CN110310623B (en) * | 2017-09-20 | 2021-12-28 | Oppo广东移动通信有限公司 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN107702706B (en) * | 2017-09-20 | 2020-08-21 | Oppo广东移动通信有限公司 | Path determining method and device, storage medium and mobile terminal |
CN107910008B (en) * | 2017-11-13 | 2021-06-11 | 河海大学 | Voice recognition method based on multiple acoustic models for personal equipment |
CN108538293B (en) * | 2018-04-27 | 2021-05-28 | 海信视像科技股份有限公司 | Voice awakening method and device and intelligent device |
CN108717854A (en) * | 2018-05-08 | 2018-10-30 | 哈尔滨理工大学 | Method for distinguishing speek person based on optimization GFCC characteristic parameters |
CN110634472B (en) * | 2018-06-21 | 2024-06-04 | 中兴通讯股份有限公司 | Speech recognition method, server and computer readable storage medium |
CN110728976B (en) * | 2018-06-30 | 2022-05-06 | 华为技术有限公司 | Method, device and system for voice recognition |
CN109215638B (en) * | 2018-10-19 | 2021-07-13 | 珠海格力电器股份有限公司 | Voice learning method and device, voice equipment and storage medium |
CN109243468B (en) * | 2018-11-14 | 2022-07-12 | 出门问问创新科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN110164431B (en) * | 2018-11-15 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and storage medium |
CN109635209B (en) * | 2018-12-12 | 2021-03-12 | 广东小天才科技有限公司 | Learning content recommendation method and family education equipment |
CN110096479B (en) * | 2019-03-15 | 2023-07-25 | 平安科技(深圳)有限公司 | Batch renaming method and device for voice information, computer equipment and storage medium |
CN110059059B (en) * | 2019-03-15 | 2024-04-16 | 平安科技(深圳)有限公司 | Batch screening method and device for voice information, computer equipment and storage medium |
CN110264997A (en) * | 2019-05-30 | 2019-09-20 | 北京百度网讯科技有限公司 | The method, apparatus and storage medium of voice punctuate |
CN110120221A (en) * | 2019-06-06 | 2019-08-13 | 上海蔚来汽车有限公司 | The offline audio recognition method of user individual and its system for vehicle system |
CN110570843B (en) * | 2019-06-28 | 2021-03-05 | 北京蓦然认知科技有限公司 | User voice recognition method and device |
CN110288995B (en) * | 2019-07-19 | 2021-07-16 | 出门问问(苏州)信息科技有限公司 | Interaction method and device based on voice recognition, storage medium and electronic equipment |
CN113096646B (en) * | 2019-12-20 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Audio recognition method and device, electronic equipment and storage medium |
CN111081262A (en) * | 2019-12-30 | 2020-04-28 | 杭州中科先进技术研究院有限公司 | Lightweight speech recognition system and method based on customized model |
CN111261168A (en) * | 2020-01-21 | 2020-06-09 | 杭州中科先进技术研究院有限公司 | Speech recognition engine and method supporting multi-task and multi-model |
CN111292746A (en) * | 2020-02-07 | 2020-06-16 | 普强时代(珠海横琴)信息技术有限公司 | Voice input conversion system based on human-computer interaction |
CN113823263A (en) * | 2020-06-19 | 2021-12-21 | 深圳Tcl新技术有限公司 | Voice recognition method and system |
CN111816174B (en) * | 2020-06-24 | 2024-09-03 | 北京小米松果电子有限公司 | Speech recognition method, device and computer readable storage medium |
CN111785275A (en) * | 2020-06-30 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice recognition method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1655235A (en) * | 2004-02-12 | 2005-08-17 | 微软公司 | Automatic identification of telephone callers based on voice characteristics |
CN103187053A (en) * | 2011-12-31 | 2013-07-03 | 联想(北京)有限公司 | Input method and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7539616B2 (en) * | 2006-02-20 | 2009-05-26 | Microsoft Corporation | Speaker authentication using adapted background models |
-
2015
- 2015-09-02 CN CN201510558047.4A patent/CN105096941B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105096941B (en) | Audio recognition method and device | |
CN108320733B (en) | Voice data processing method and device, storage medium and electronic equipment | |
CN102982811B (en) | Voice endpoint detection method based on real-time decoding | |
CN107437415B (en) | Intelligent voice interaction method and system | |
CN105336322B (en) | Polyphone model training method, and speech synthesis method and device | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN102270450B (en) | System and method of multi model adaptation and voice recognition | |
CN107945790B (en) | Emotion recognition method and emotion recognition system | |
CN107195295A (en) | Audio recognition method and device based on Chinese and English mixing dictionary | |
CN108417201B (en) | Single-channel multi-speaker identity recognition method and system | |
CN106503805A (en) | A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method | |
CN107492382A (en) | Voiceprint extracting method and device based on neutral net | |
CN106683661A (en) | Role separation method and device based on voice | |
CN104765996B (en) | Voiceprint password authentication method and system | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
JP6440967B2 (en) | End-of-sentence estimation apparatus, method and program thereof | |
CN109448704A (en) | Construction method, device, server and the storage medium of tone decoding figure | |
CN113255362B (en) | Method and device for filtering and identifying human voice, electronic device and storage medium | |
CN109410956A (en) | A kind of object identifying method of audio data, device, equipment and storage medium | |
CN110853621B (en) | Voice smoothing method and device, electronic equipment and computer storage medium | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN109741734A (en) | A kind of speech evaluating method, device and readable medium | |
CN107910004A (en) | Voiced translation processing method and processing device | |
CN106843523A (en) | Character input method and device based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |