CN105096941A - Voice recognition method and device - Google Patents
- Publication number
- CN105096941A (application number CN201510558047.4A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- acoustic model
- material information
- language material
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Telephonic Communication Services (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The present invention discloses a voice recognition method and device. The method comprises: first, acquiring the voice information input by a speaker together with the speaker information; second, judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; third, if it exists, acquiring the personal acoustic model and performing speech recognition on the voice information with it; fourth, if it does not exist, performing speech recognition on the voice information with a basic acoustic model, and generating and storing corpus information of the speaker from the voice information; and fifth, generating a personal acoustic model for the speaker from the basic acoustic model and the stored corpus information. With this method, an acoustic model can be customized for each speaker according to the speaker's own characteristics through a speaker-adaptive recognition process, improving the recognition accuracy for every speaker and the user experience.
Description
Technical field
The present invention relates to the technical field of voice recognition, and in particular to a voice recognition method and device.
Background technology
In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, recognition performance has improved substantially. With the growth of the mobile Internet, voice input has become increasingly common and its user base increasingly broad. How to improve the accuracy of speech recognition has therefore become an urgent problem.
In the related art, speech recognition is mainly performed by training on a large amount of speech to obtain an acoustic model and a language model, and then using these models to recognize the speech data input by a speaker. Evidently, the larger the training sample, the higher the accuracy and the better the trained acoustic model, which in turn improves recognition accuracy.
The problem, however, is that the acoustic model trained on this large sample is applied to the recognition of all speakers. For a speaker with a heavy dialectal accent or unclear articulation, such a model may fail to recognize the input well, reducing recognition accuracy and degrading the user experience.
Summary of the invention
The present invention aims to solve at least one of the above technical problems, at least to some extent.
To this end, a first object of the present invention is to propose a voice recognition method. The method can customize an acoustic model for the characteristics of each speaker through a speaker-adaptive recognition process, thereby improving the recognition accuracy for each speaker and the user experience.
A second object of the present invention is to propose a speech recognition device.
To achieve these objects, a voice recognition method according to an embodiment of the first aspect of the present invention comprises: acquiring the voice information input by a speaker, and acquiring the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, acquiring the personal acoustic model and performing speech recognition on the voice information with it; if it does not exist, performing speech recognition on the voice information with a basic acoustic model, and generating and storing corpus information of the speaker from the voice information; and generating the speaker's personal acoustic model from the basic acoustic model and the stored corpus information.
In the voice recognition method of this embodiment, the voice information input by a speaker and the speaker information are first acquired. Whether a personal acoustic model corresponding to the speaker exists is then judged from the speaker information. If it exists, the personal acoustic model is acquired and used to recognize the voice information; if not, the voice information is recognized with a basic acoustic model, corpus information of the speaker is generated from the voice information and stored, and the personal acoustic model is then generated from the basic acoustic model and the stored corpus. In other words, on the basis of a speaker-independent acoustic model (the basic acoustic model above), the historical speech data of a given speaker is used for further training to obtain a personal acoustic model capturing that speaker's own characteristics, which is then used during recognition. The recognition accuracy for each individual can thus be improved, which amounts to offering every speech recognition user a privately customized service and improves the user experience.
To achieve these objects, a speech recognition device according to an embodiment of the second aspect of the present invention comprises: a first acquisition module for acquiring the voice information input by a speaker and the speaker information of the speaker; a judge module for judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; a speech recognition module for, when the judge module judges that the personal acoustic model exists, acquiring it and performing speech recognition on the voice information with it, and, when the judge module judges that it does not exist, performing speech recognition on the voice information with a basic acoustic model; a first generation module for generating and storing corpus information of the speaker from the voice information; and a second generation module for generating the speaker's personal acoustic model from the basic acoustic model and the stored corpus information.
In the speech recognition device of this embodiment, the first acquisition module acquires the voice information input by a speaker and the speaker information; the judge module judges from the speaker information whether a corresponding personal acoustic model exists; if it exists, the speech recognition module obtains the personal acoustic model and recognizes the voice information with it; if not, the speech recognition module recognizes the voice information with the basic acoustic model, the first generation module generates and stores corpus information of the speaker from the voice information, and the second generation module generates the speaker's personal acoustic model from the basic acoustic model and the stored corpus. In other words, on the basis of a speaker-independent acoustic model (the basic acoustic model above), the historical speech data of a given speaker is used for further training to obtain a personal acoustic model capturing that speaker's own characteristics, which is then used during recognition. The recognition accuracy for each individual can thus be improved, which amounts to offering every speech recognition user a privately customized service and improves the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from it, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of generating a personal acoustic model according to an embodiment of the present invention;
Fig. 3 is a flowchart of a voice recognition method according to another embodiment of the present invention;
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a second generation module according to an embodiment of the present invention; and
Fig. 6 is a structural block diagram of a speech recognition device according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the drawings, in which identical or similar labels denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting it.
The voice recognition method and device according to embodiments of the present invention are described below with reference to the drawings.
It should be noted that speech recognition refers to the automatic conversion of human speech into the corresponding text by machine. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, system performance has improved substantially. With the growth of the mobile Internet, voice input has become increasingly common and its user base increasingly broad. Since each user's pronunciation has its own acoustic characteristics, exploiting this in the recognition process is bound to improve the recognition system further.
To this end, the present invention proposes a voice recognition method comprising: acquiring the voice information input by a speaker, and acquiring the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, acquiring the personal acoustic model and performing speech recognition on the voice information with it; if it does not exist, performing speech recognition on the voice information with a basic acoustic model, and generating and storing corpus information of the speaker from the voice information; and generating the speaker's personal acoustic model from the basic acoustic model and the stored corpus information.
Fig. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. As shown in Figure 1, the method may comprise:
S101: acquire the voice information input by a speaker, and acquire the speaker information of the speaker.
It should be noted that, in an embodiment of the present invention, the speaker information may be an ID (identity number) corresponding to the speaker. The speaker ID may be an identifier allocated to the speaker by a server, and has a one-to-one correspondence with the speaker's voiceprint features.
Specifically, the voice information input by the speaker is collected through the microphone of a terminal, voiceprint features are extracted from this voice information, and the speaker information corresponding to those features (the speaker ID, etc.) is then obtained from the stored correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker. That is, in this step, after acquiring the voice information input by the speaker, the ID or MAC address of the speaker's terminal may also be obtained.
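As an illustration of how speaker information might be resolved from extracted voiceprint features, the following sketch matches a voiceprint vector against an enrolled registry by cosine similarity. The registry contents, the vector dimensionality, and the similarity threshold are illustrative assumptions; the patent does not specify the matching criterion.

```python
import math

# Hypothetical registry mapping enrolled voiceprint vectors to speaker IDs.
ENROLLED = {
    "spk_001": [0.9, 0.1, 0.3],
    "spk_002": [0.1, 0.8, 0.5],
}

def cosine(a, b):
    # Cosine similarity between two voiceprint vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def lookup_speaker(voiceprint, threshold=0.85):
    """Return the enrolled speaker ID whose voiceprint is most similar,
    or None if no enrolled print passes the threshold."""
    best_id, best_sim = None, threshold
    for spk_id, ref in ENROLLED.items():
        sim = cosine(voiceprint, ref)
        if sim > best_sim:
            best_id, best_sim = spk_id, sim
    return best_id
```

A `None` result would correspond to the "no personal acoustic model" branch of step S102 for a first-time speaker.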
S102: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
In an embodiment of the present invention, the personal acoustic model can be regarded as the speaker's own acoustic model, and may contain the speaker's voice characteristics.
S103: if it exists, obtain the personal acoustic model and perform speech recognition on the voice information with it.
Specifically, when it is judged that a personal acoustic model corresponding to the speaker exists, that model is obtained and sent, together with the voice information, to a decoder, which may reside on a server. The decoder first performs feature extraction on the voice information to obtain the corresponding acoustic features, then matches those features against the personal acoustic model according to a given criterion to obtain the recognition result.
S104: if it does not exist, perform speech recognition on the voice information with the basic acoustic model, and generate and store corpus information of the speaker from the voice information.
In an embodiment of the present invention, the basic acoustic model can be regarded as a model suitable for the general population of speakers, i.e., an acoustic model obtained by collecting the speech data of many speakers and training on it.
That is, when it is judged that no personal acoustic model corresponding to the speaker exists, the basic acoustic model and the voice information are sent to the decoder to recognize the voice information, and the input voice information is then stored as corpus information of the speaker, forming historical corpus information.
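The branching of steps S102 through S104 can be sketched as the following control flow. The model objects, the in-memory corpus store, and the `decode` placeholder are stand-ins introduced for illustration, not the patent's actual implementation.

```python
personal_models = {}   # speaker ID -> personal acoustic model
corpus_store = {}      # speaker ID -> stored utterances (historical corpus)

def decode(model, voice_info):
    # Placeholder for the decoder's feature extraction and matching.
    return f"{model}:{voice_info}"

def recognize(speaker_id, voice_info, base_model):
    model = personal_models.get(speaker_id)
    if model is not None:
        # S103: a personal model exists; decode with it.
        return decode(model, voice_info)
    # S104: fall back to the base model and keep the utterance as corpus.
    result = decode(base_model, voice_info)
    corpus_store.setdefault(speaker_id, []).append(voice_info)
    return result
```

Note that only base-model recognitions feed the corpus store here; once a personal model exists, step S304 below covers further optimization.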
S105: generate the speaker's personal acoustic model from the basic acoustic model and the stored corpus information.
Specifically, in an embodiment of the present invention, as shown in Figure 2, the process of generating the personal acoustic model may comprise the following steps:
S201: judge whether the quantity of stored corpus information reaches a predetermined threshold.
Specifically, it is judged whether the accumulated amount of stored corpus information has reached a certain quantity (the predetermined threshold).
S202: if the threshold is reached, screen the stored corpus information to obtain the corresponding valid corpus information.
Specifically, in an embodiment of the present invention, when the stored corpus information is judged to have accumulated to the predetermined threshold, the screening parameters of each stored corpus entry are first obtained and each entry is evaluated against them. A score for each stored entry is then generated from the evaluation results and the weights of the screening parameters. Finally, the stored corpus information is screened according to each entry's score to obtain the valid corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and recognized content.
More specifically, once the speaker's corpus information has accumulated to a certain quantity, the stored historical corpus is screened: confidence, speech energy, speech length, recognized content, and so on are computed for each stored entry; each result is scored; the final score of each entry is then computed from those scores and the weight of each screening parameter; and finally, low-scoring entries are filtered out, the remaining high-scoring entries being retained as valid corpus information.
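The weighted screening just described might look like the following sketch, in which each stored utterance carries precomputed values for the four screening parameters. The weights, the cutoff, and the field names are illustrative assumptions; the patent does not fix them.

```python
# Illustrative weights for the four screening parameters.
WEIGHTS = {"confidence": 0.4, "energy": 0.2, "length": 0.2, "content": 0.2}

def score(utterance):
    # Weighted sum of the per-parameter marks of one corpus entry.
    return sum(WEIGHTS[k] * utterance[k] for k in WEIGHTS)

def screen(corpus, cutoff=0.6):
    """Keep corpus entries whose weighted score passes the cutoff."""
    return [u for u in corpus if score(u) >= cutoff]
```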
S203: perform primary/secondary speaker judgment on each valid corpus entry to filter out the valid corpus information belonging to the same speaker.
Specifically, all valid corpus information can be analyzed: primary/secondary user judgment is performed, for example, by computing the signal-to-noise ratio, phonetic features, and speaker gender of each entry. If the current corpus is found to involve multiple natural persons, all valid corpus entries are screened by phonetic features, speaker gender, and similar information to retain only those belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, when the primary/secondary judgment based on signal-to-noise ratio, phonetic features, and speaker gender finds that an entry involves multiple natural persons, that entry may instead simply be rejected; that is, corpus entries containing multiple natural persons are discarded and not used for model training.
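A minimal sketch of this filtering step, assuming each valid corpus entry has already been annotated with a detected gender and a multiple-natural-person flag (both field names are hypothetical, and the gender check stands in for the fuller phonetic-feature comparison):

```python
def filter_same_speaker(utterances, primary_gender):
    """Keep entries matching the primary speaker's detected gender and
    reject any entry flagged as containing multiple natural persons."""
    return [
        u for u in utterances
        if u["gender"] == primary_gender and not u["multi_speaker"]
    ]
```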
S204: perform model training on the valid corpus information belonging to the same speaker, according to the basic acoustic model, to generate the speaker's personal acoustic model.
Specifically, in an embodiment of the present invention, acoustic features are first extracted from the valid corpus information of the same speaker, and the recognized content corresponding to that corpus is analyzed to obtain high-confidence word-level recognition results and the corresponding acoustic features. The personal acoustic model is then randomly initialized, gradients are computed from the basic acoustic model, the word-level recognition results, and the acoustic features, and finally the computed gradients are applied iteratively to generate the personal acoustic model.
That is, the corpus information is used for acoustic feature extraction while the recognized content is analyzed, retaining the high-confidence word-level results and the corresponding acoustic features. The personal acoustic model is then randomly initialized; the retained acoustic features and word-level results, combined with the basic acoustic model, are used to compute gradients; and the resulting gradients are used to iteratively optimize the personal acoustic model.
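The initialize/compute-gradients/iterate structure described above can be mirrored by a toy gradient-descent routine. A real system would fine-tune a neural acoustic model on (acoustic feature, word-level label) pairs; this linear least-squares stand-in, which also simplifies by initializing from the base model's weights rather than randomly, only illustrates the iterative structure, not the patent's actual training procedure.

```python
def adapt(base_weights, features, labels, lr=0.1, steps=100):
    """Toy adaptation: start from the base model's weights and run
    per-sample gradient descent on the speaker's retained data."""
    w = list(base_weights)  # initialize from the base model (simplification)
    for _ in range(steps):
        for x, y in zip(features, labels):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            # Gradient step on the squared error for this sample.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w
```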
In this way, model training on each speaker's corpus information yields a corresponding personal acoustic model that contains the user's voice characteristics and can be used to improve accuracy during speech recognition.
In the voice recognition method of this embodiment, the voice information input by a speaker and the speaker information are first acquired. Whether a personal acoustic model corresponding to the speaker exists is then judged from the speaker information. If it exists, the personal acoustic model is acquired and used to recognize the voice information; if not, the voice information is recognized with the basic acoustic model, corpus information of the speaker is generated from the voice information and stored, and the personal acoustic model is then generated from the basic acoustic model and the stored corpus. In other words, on the basis of a speaker-independent acoustic model (the basic acoustic model above), the historical speech data of a given speaker is used for further training to obtain a personal acoustic model capturing that speaker's own characteristics, which is then used during recognition. The recognition accuracy for each individual can thus be improved, which amounts to offering every speech recognition user a privately customized service and improves the user experience.
Fig. 3 is a flowchart of a voice recognition method according to another embodiment of the present invention.
To further improve the accuracy of speech recognition, in an embodiment of the present invention, the personal acoustic model may also be optimized. Specifically, as shown in Figure 3, the voice recognition method may comprise:
S301: acquire the voice information input by a speaker, and acquire the speaker information of the speaker.
S302: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
S303: if it exists, obtain the personal acoustic model and perform speech recognition on the voice information with it.
S304: optimize the personal acoustic model according to the voice information.
It can be understood that the speaker-adaptive recognition process is transparent to the user, who does not perceive the adaptive training and adaptive recognition flow. As the user's number of recognitions grows, the speech recognition server continually optimizes the user's adaptive model, and the user's recognition accuracy keeps improving. Speaker-adaptive training is therefore not a one-off task: as the speaker's corpus accumulates, the adaptive training process is repeated continually. In an embodiment of the present invention, each adaptive training pass can start from the previous personal acoustic model, so that recognition accuracy improves continually.
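The repeated, corpus-driven adaptation loop might be sketched as follows, with re-training stood in for by a version bump on the personal model; the threshold and the state layout are illustrative assumptions.

```python
def on_utterance(state, utterance, threshold=3):
    """Append an utterance to the speaker's corpus buffer; once a
    (hypothetical) threshold accumulates, re-adapt the personal model
    and clear the buffer. Re-training is stood in for by a version bump."""
    state["corpus"].append(utterance)
    if len(state["corpus"]) >= threshold:
        state["model_version"] += 1   # re-run adaptation from current model
        state["corpus"].clear()
    return state
```

Each version bump corresponds to one pass of the S201 through S204 pipeline starting from the previous personal acoustic model.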
S305: if it does not exist, perform speech recognition on the voice information with the basic acoustic model, and generate and store corpus information of the speaker from the voice information.
S306: generate the speaker's personal acoustic model from the basic acoustic model and the stored corpus information.
In the voice recognition method of this embodiment, after speech recognition is performed on the voice information with the speaker's personal acoustic model, the personal acoustic model can also be optimized with that voice information, further improving the accuracy of individual speech recognition.
To realize the above embodiments, the present invention also proposes a speech recognition device.
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Figure 4, the device may comprise: a first acquisition module 10, a judge module 20, a speech recognition module 30, a first generation module 40, and a second generation module 50.
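As a rough sketch of how the modules of Fig. 4 could fit together, the following class skeleton wires a lookup table for the judge module's decision into the recognition path. The method bodies are placeholders and all names are illustrative, not an official API.

```python
class SpeechRecognitionDevice:
    """Skeleton mirroring the modules of Fig. 4 (illustrative only)."""

    def __init__(self, base_model):
        self.base_model = base_model
        self.personal_models = {}   # consulted by the judge step
        self.corpus = {}            # filled by the first generation module

    def acquire(self, audio):
        # First acquisition module: voice information plus speaker info.
        return audio, self.identify(audio)

    def identify(self, audio):
        # Placeholder for voiceprint-based speaker identification.
        return "spk_default"

    def has_personal_model(self, speaker_id):
        # Judge module: does a personal acoustic model exist?
        return speaker_id in self.personal_models

    def recognize(self, audio):
        # Speech recognition module: pick the personal or basic model.
        voice, spk = self.acquire(audio)
        model = self.personal_models.get(spk, self.base_model)
        if model is self.base_model:
            # First generation module: store the utterance as corpus.
            self.corpus.setdefault(spk, []).append(voice)
        return f"{model}:{voice}"
```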
Specifically, the first acquisition module 10 acquires the voice information input by a speaker and the speaker information of the speaker. It should be noted that, in an embodiment of the present invention, the speaker information may be an ID corresponding to the speaker; the speaker ID may be an identifier allocated by a server, and has a one-to-one correspondence with the speaker's voiceprint features.
More specifically, the first acquisition module 10 collects the voice information input by the speaker through the microphone of a terminal, extracts voiceprint features from it, and then obtains the corresponding speaker information (the speaker ID, etc.) from the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker; that is, after acquiring the voice information, the first acquisition module 10 may also obtain the ID or MAC address of the speaker's terminal.
The judge module 20 judges, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists. In an embodiment of the present invention, the personal acoustic model can be regarded as the speaker's own acoustic model and may contain the speaker's voice characteristics.
The speech recognition module 30, when the judge module 20 judges that the personal acoustic model exists, obtains it and performs speech recognition on the voice information with it; when the judge module 20 judges that no personal acoustic model exists, it performs speech recognition on the voice information with the basic acoustic model.
More specifically, when the judge module 20 judges that a personal acoustic model corresponding to the speaker exists, the speech recognition module 30 obtains the model, first performs feature extraction on the voice information to obtain the corresponding acoustic features, and then matches those features against the personal acoustic model according to a given criterion to obtain the recognition result.
In an embodiment of the present invention, the basic acoustic model can be regarded as a model suitable for the general population of speakers, i.e., an acoustic model obtained by collecting the speech data of many speakers and training on it.
The first generation module 40 generates corpus information of the speaker from the voice information and stores it. More specifically, after the speech recognition module 30 performs recognition with the basic acoustic model, the first generation module 40 stores the input voice information as corpus information of the speaker, forming historical corpus information.
Second generation module 50 can be used for the individual acoustic model generating speaker according to the language material information of basic acoustic model and storage.Specifically, in one embodiment of the invention, as shown in Figure 5, this second generation module 50 can comprise: judging unit 51, first screens unit 52, second and screens unit 53 and generation unit 54.Particularly, judging unit 51 can be used for judging whether the quantity of the language material information stored reaches predetermined threshold value.More specifically, judging unit 51 can judge whether the quantity of the language material information stored saves bit by bit some (as predetermined threshold value).
First screening unit 52 is used in judging unit 51 when judging that the quantity of language material information stored reaches predetermined threshold value, screens to obtain corresponding effective language material information to the language material information stored.Specifically, in an embodiment of the present invention, first screening unit 52 first can obtain the screening parameter of the language material information that every bar stores, and according to screening parameter, the language material information that every bar stores is marked, afterwards, generate the mark of the language material information that every bar stores according to the weight of appraisal result and screening parameter, and the mark of the language material information stored according to every bar screens to obtain corresponding effective language material information to the language material information stored.Wherein, in an embodiment of the present invention, screening parameter can include but not limited to degree of confidence, speech energy, voice length and speech recognition content.
More specifically, when the language material information accumulation of speaker is to after some, first screening unit 52 can carry out screening strength to the history language material information stored, the language material information that can store every bar carries out confidence calculations respectively, speech energy calculates, voice length computation, speech recognition content calculating etc., can give a mark respectively to result of calculation afterwards, then the final mark of the language material information that every bar stores is gone out according to the weight calculation of give a mark result and each screening parameter, finally, the language material information that the low every bar of mark stores can be filtered out, and using language material information high for remaining mark as effective language material information.
The second screening unit 53 may be configured to perform primary/secondary speaker judgment according to each piece of effective corpus information, so as to filter out the effective corpus information belonging to the same speaker. More specifically, the second screening unit 53 may analyze all the effective corpus information, for example by computing the signal-to-noise ratio, voice features, and speaker gender of each piece; if the current speech is found to involve multiple natural persons, all the effective corpus information is screened by voice features, speaker gender, and similar information so as to retain only the pieces belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after performing primary/secondary speaker judgment by computing the signal-to-noise ratio, voice features, and speaker gender of each piece of corpus information, the second screening unit 53 may instead reject the current corpus information if it is found to involve multiple natural persons; that is, any piece of corpus information containing multiple natural persons is discarded and excluded from model training.
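A minimal sketch of this primary/secondary filtering step, assuming each utterance has already been tagged with a speaker label derived from its signal-to-noise ratio, voice features, and estimated gender (the tagging itself is out of scope here). Utterances tagged as mixing several natural persons are rejected outright, matching the alternative embodiment above.

```python
from collections import Counter

def dominant_speaker(tagged_utts):
    """tagged_utts: list of (speaker_tag, utterance) pairs, where the tag
    was derived from SNR, voice features, and estimated gender.
    Keep only utterances of the most frequent (primary) speaker; any
    utterance tagged 'mixed' (multiple natural persons) is discarded."""
    tags = Counter(tag for tag, _ in tagged_utts if tag != "mixed")
    if not tags:
        return []
    primary, _ = tags.most_common(1)[0]
    return [utt for tag, utt in tagged_utts if tag == primary]
```

Only the primary speaker's utterances survive, so the subsequent model training sees corpus information from a single natural person.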
The generation unit 54 may be configured to perform model training on the effective corpus information belonging to the same speaker according to the basic acoustic model, so as to generate the speaker's personal acoustic model. Specifically, in an embodiment of the present invention, the generation unit 54 may first extract acoustic features from the effective corpus information of the same speaker and analyze the corresponding speech recognition content, so as to obtain high-confidence word-level recognition results and the corresponding acoustic features; it may then randomly initialize the personal acoustic model and compute gradients according to the basic acoustic model, the word-level recognition results, and the acoustic features; finally, it may iterate on the computed gradients to generate the personal acoustic model.
That is, the generation unit 54 may use the corpus information to extract acoustic features while analyzing the speech recognition content, retaining the high-confidence word-level recognition results and the corresponding acoustic features; it may then randomly initialize the personal acoustic model, compute gradients from the retained acoustic features and word-level recognition results in combination with the basic acoustic model, and finally use the resulting gradients to iteratively optimize the personal acoustic model.
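The adaptation loop can be illustrated with a toy model. The sketch below stands in a simple linear least-squares model for the real acoustic model: the random initialization near the base weights, the gradient computed from retained features and word-level labels, and the regularization pulling the personal model back toward the basic model are illustrative analogues of the steps described, not the actual training procedure.

```python
import numpy as np

def adapt(base_weights, features, labels, lr=0.05, steps=200):
    """Toy speaker adaptation: start from a randomly perturbed copy of the
    base model's weights, then iterate gradient updates computed from the
    retained acoustic features and high-confidence labels."""
    rng = np.random.default_rng(0)
    personal = base_weights + 0.01 * rng.standard_normal(base_weights.shape)
    for _ in range(steps):
        pred = features @ personal
        grad = features.T @ (pred - labels) / len(labels)  # squared-error gradient
        # Regularize toward the base model so the personal model stays close to it.
        grad += 0.1 * (personal - base_weights)
        personal -= lr * grad
    return personal
```

Starting adaptation from (a perturbation of) the speaker-independent weights, rather than from scratch, is what lets a modest amount of per-speaker corpus produce a usable personal model.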
Thus, by performing model training on each speaker's corpus information, a corresponding personal acoustic model is obtained. This personal acoustic model captures the user's voice characteristics and can be used to improve recognition accuracy during speech recognition.
Further, in an embodiment of the present invention, as shown in Figure 6, the speech recognition apparatus may also include an optimization module 60, configured to perform model optimization on the personal acoustic model according to the voice information after speech recognition has been performed on that voice information with the speaker's personal acoustic model. It should be appreciated that the speaker-adaptive speech recognition process is transparent to the user, who does not perceive the adaptive training and adaptive recognition flow. As the number of the user's speech recognitions grows, the speech recognition server continuously optimizes the user's adaptive model, and the user's recognition accuracy keeps improving. Speaker-adaptive training is therefore not a one-off task but an adaptive training process repeated continuously as the speaker's corpus accumulates. In an embodiment of the present invention, the optimization module 60 optimizes on the basis of the previous personal acoustic model in each adaptive training round, so that recognition accuracy improves continuously. The accuracy of individual speech recognition can thereby be further improved.
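This continuous, user-transparent optimization flow can be sketched as follows; `StubModel`, its `decode` method, and the accumulation threshold are placeholders for the real recognizer and re-training trigger, not an API from the patent.

```python
class StubModel:
    """Stand-in for a personal acoustic model; `version` counts how many
    adaptation rounds have been applied."""
    def __init__(self, version=0):
        self.version = version

    def decode(self, audio):
        return f"text-v{self.version}"

def recognize_and_adapt(model, audio, corpus, threshold=3):
    """After each recognition the utterance is stored as new corpus; once
    enough accumulates, adaptation re-runs starting from the *current*
    personal model, transparently to the user."""
    result = model.decode(audio)
    corpus.append((audio, result))
    if len(corpus) >= threshold:
        model = StubModel(model.version + 1)  # stands in for re-training
        corpus.clear()
    return result, model
```

Each adaptation round starts from the current personal model rather than the original base model, which is what makes accuracy improve monotonically as corpus accumulates.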
In the speech recognition apparatus of the embodiment of the present invention, the first acquisition module obtains the voice information input by the speaker and the speaker information of the speaker; the judging module judges, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, the speech recognition module obtains the personal acoustic model and performs speech recognition on the voice information according to it; if it does not exist, the speech recognition module performs speech recognition on the voice information according to the basic acoustic model, the first generation module generates and stores corpus information of the speaker according to the voice information, and the second generation module generates the speaker's personal acoustic model according to the basic acoustic model and the stored corpus information. That is, on the basis of the speaker-independent acoustic model (namely the basic acoustic model above), the historical speech data of a given speaker is used for further training to obtain a personal acoustic model carrying that speaker's own characteristics, and that personal acoustic model is then used during speech recognition. Recognition accuracy is thereby improved for every individual, which amounts to providing a privately customized speech recognition service to all speech recognition users, and user experience is improved.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically limited.
In the description of this specification, reference to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example, and the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, unless contradictory, those skilled in the art may combine different embodiments or examples described in this specification and the features of different embodiments or examples.
Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present invention includes alternative implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as should be understood by those skilled in the art to which embodiments of the present invention pertain.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by or in connection with such an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium could even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the embodiments above, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any of the following technologies known in the art, or a combination thereof: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those skilled in the art will appreciate that all or part of the steps carried by the methods of the above embodiments may be completed by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically as a separate unit, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been illustrated and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.
Claims (12)
1. A speech recognition method, characterized by comprising the following steps:
obtaining voice information input by a speaker, and obtaining speaker information of said speaker;
judging, according to said speaker information, whether a personal acoustic model corresponding to said speaker exists;
if it exists, obtaining said personal acoustic model, and performing speech recognition on said voice information according to the personal acoustic model of said speaker;
if it does not exist, performing speech recognition on said voice information according to a basic acoustic model, and generating and storing corpus information of said speaker according to said voice information; and
generating the personal acoustic model of said speaker according to said basic acoustic model and the stored corpus information.
2. The speech recognition method of claim 1, characterized in that generating the personal acoustic model of said speaker according to said basic acoustic model and the stored corpus information specifically comprises:
judging whether the quantity of the stored corpus information reaches a predetermined threshold;
if it reaches the predetermined threshold, screening said stored corpus information to obtain corresponding effective corpus information;
performing primary/secondary speaker judgment according to each piece of effective corpus information to filter out the effective corpus information belonging to the same speaker; and
performing model training on said effective corpus information belonging to the same speaker according to said basic acoustic model to generate the personal acoustic model of said speaker.
3. The speech recognition method of claim 2, characterized in that screening said stored corpus information to obtain corresponding effective corpus information specifically comprises:
obtaining screening parameters of each piece of stored corpus information, and scoring each said piece according to said screening parameters;
generating a mark for each said piece of stored corpus information according to the scoring results and the weights of said screening parameters; and
screening said stored corpus information according to the mark of each said piece to obtain the corresponding effective corpus information.
4. The speech recognition method of claim 3, characterized in that said screening parameters comprise confidence, speech energy, speech length, and speech recognition content.
5. The speech recognition method of claim 2, characterized in that performing model training on said effective corpus information belonging to the same speaker according to said basic acoustic model to generate the personal acoustic model of said speaker specifically comprises:
extracting acoustic features from said effective corpus information belonging to the same speaker, and analyzing the speech recognition content corresponding to the effective corpus information of said same speaker, so as to obtain high-confidence word-level recognition results and corresponding acoustic features;
randomly initializing said personal acoustic model, and computing gradients according to said basic acoustic model, said word-level recognition results, and said acoustic features; and
iterating on the computed gradients to generate said personal acoustic model.
6. The speech recognition method of any one of claims 1 to 5, characterized in that, after speech recognition is performed on said voice information according to the personal acoustic model of said speaker, the method further comprises:
performing model optimization on said personal acoustic model according to said voice information.
7. A speech recognition apparatus, characterized by comprising:
a first acquisition module, configured to obtain voice information input by a speaker and to obtain speaker information of said speaker;
a judging module, configured to judge, according to said speaker information, whether a personal acoustic model corresponding to said speaker exists;
a speech recognition module, configured to obtain said personal acoustic model and perform speech recognition on said voice information according to the personal acoustic model of said speaker when said judging module determines that said personal acoustic model exists, and to perform speech recognition on said voice information according to a basic acoustic model when said judging module determines that said personal acoustic model does not exist;
a first generation module, configured to generate and store corpus information of said speaker according to said voice information; and
a second generation module, configured to generate the personal acoustic model of said speaker according to said basic acoustic model and the stored corpus information.
8. The speech recognition apparatus of claim 7, characterized in that said second generation module comprises:
a judging unit, configured to judge whether the quantity of the stored corpus information reaches a predetermined threshold;
a first screening unit, configured to screen said stored corpus information to obtain corresponding effective corpus information when said judging unit determines that the quantity of said stored corpus information reaches said predetermined threshold;
a second screening unit, configured to perform primary/secondary speaker judgment according to each piece of effective corpus information to filter out the effective corpus information belonging to the same speaker; and
a generation unit, configured to perform model training on said effective corpus information belonging to the same speaker according to said basic acoustic model to generate the personal acoustic model of said speaker.
9. The speech recognition apparatus of claim 8, characterized in that said first screening unit is specifically configured to:
obtain screening parameters of each piece of stored corpus information, and score each said piece according to said screening parameters;
generate a mark for each said piece of stored corpus information according to the scoring results and the weights of said screening parameters; and
screen said stored corpus information according to the mark of each said piece to obtain the corresponding effective corpus information.
10. The speech recognition apparatus of claim 9, characterized in that said screening parameters comprise confidence, speech energy, speech length, and speech recognition content.
11. The speech recognition apparatus of claim 8, characterized in that said generation unit is specifically configured to:
extract acoustic features from said effective corpus information belonging to the same speaker, and analyze the speech recognition content corresponding to the effective corpus information of said same speaker, so as to obtain high-confidence word-level recognition results and corresponding acoustic features;
randomly initialize said personal acoustic model, and compute gradients according to said basic acoustic model, said word-level recognition results, and said acoustic features; and
iterate on the computed gradients to generate said personal acoustic model.
12. The speech recognition apparatus of any one of claims 7 to 11, characterized by further comprising:
an optimization module, configured to perform model optimization on said personal acoustic model according to said voice information after speech recognition has been performed on said voice information according to the personal acoustic model of said speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510558047.4A CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105096941A true CN105096941A (en) | 2015-11-25 |
CN105096941B CN105096941B (en) | 2017-10-31 |
Family
ID=54577227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510558047.4A Active CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105096941B (en) |
Cited By (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105632489A (en) * | 2016-01-20 | 2016-06-01 | 曾戟 | Voice playing method and voice playing device |
WO2017219991A1 (en) * | 2016-06-23 | 2017-12-28 | 华为技术有限公司 | Optimization method and apparatus suitable for model of pattern recognition, and terminal device |
US10825447B2 (en) | 2016-06-23 | 2020-11-03 | Huawei Technologies Co., Ltd. | Method and apparatus for optimizing model applicable to pattern recognition, and terminal device |
CN107767880B (en) * | 2016-08-16 | 2021-04-16 | 杭州萤石网络有限公司 | Voice detection method, camera and intelligent home nursing system |
CN107767880A (en) * | 2016-08-16 | 2018-03-06 | 杭州萤石网络有限公司 | A kind of speech detection method, video camera and smart home nursing system |
CN106683680A (en) * | 2017-03-10 | 2017-05-17 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device and computer equipment and computer readable media |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
CN108737324B (en) * | 2017-04-13 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Method and device for generating artificial intelligence service assembly and related equipment and system |
CN108737324A (en) * | 2017-04-13 | 2018-11-02 | 腾讯科技(深圳)有限公司 | Generate the method, apparatus and relevant device, system of artificial intelligence serviced component |
CN107481718A (en) * | 2017-09-20 | 2017-12-15 | 广东欧珀移动通信有限公司 | Audio recognition method, device, storage medium and electronic equipment |
CN107702706A (en) * | 2017-09-20 | 2018-02-16 | 广东欧珀移动通信有限公司 | Determining method of path, device, storage medium and mobile terminal |
CN110310623B (en) * | 2017-09-20 | 2021-12-28 | Oppo广东移动通信有限公司 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN110310623A (en) * | 2017-09-20 | 2019-10-08 | Oppo广东移动通信有限公司 | Sample generating method, model training method, device, medium and electronic equipment |
CN107481718B (en) * | 2017-09-20 | 2019-07-05 | Oppo广东移动通信有限公司 | Audio recognition method, device, storage medium and electronic equipment |
CN107910008B (en) * | 2017-11-13 | 2021-06-11 | 河海大学 | Voice recognition method based on multiple acoustic models for personal equipment |
CN107910008A (en) * | 2017-11-13 | 2018-04-13 | 河海大学 | A kind of audio recognition method based on more acoustic models for personal device |
CN108538293A (en) * | 2018-04-27 | 2018-09-14 | 青岛海信电器股份有限公司 | Voice awakening method, device and smart machine |
CN108717854A (en) * | 2018-05-08 | 2018-10-30 | 哈尔滨理工大学 | Method for distinguishing speek person based on optimization GFCC characteristic parameters |
CN110634472A (en) * | 2018-06-21 | 2019-12-31 | 中兴通讯股份有限公司 | Voice recognition method, server and computer readable storage medium |
CN110634472B (en) * | 2018-06-21 | 2024-06-04 | 中兴通讯股份有限公司 | Speech recognition method, server and computer readable storage medium |
WO2020001546A1 (en) * | 2018-06-30 | 2020-01-02 | 华为技术有限公司 | Method, device, and system for speech recognition |
CN110728976A (en) * | 2018-06-30 | 2020-01-24 | 华为技术有限公司 | Method, device and system for voice recognition |
CN109215638A (en) * | 2018-10-19 | 2019-01-15 | 珠海格力电器股份有限公司 | A kind of phonetic study method, apparatus, speech ciphering equipment and storage medium |
CN109243468A (en) * | 2018-11-14 | 2019-01-18 | 北京羽扇智信息科技有限公司 | Audio recognition method, device, electronic equipment and storage medium |
CN110364162A (en) * | 2018-11-15 | 2019-10-22 | 腾讯科技(深圳)有限公司 | A kind of remapping method and device, storage medium of artificial intelligence |
CN110364162B (en) * | 2018-11-15 | 2022-03-15 | 腾讯科技(深圳)有限公司 | Artificial intelligence resetting method and device and storage medium |
CN109635209A (en) * | 2018-12-12 | 2019-04-16 | 广东小天才科技有限公司 | A kind of learning Content recommended method and private tutor's equipment |
CN110059059B (en) * | 2019-03-15 | 2024-04-16 | 平安科技(深圳)有限公司 | Batch screening method and device for voice information, computer equipment and storage medium |
CN110096479B (en) * | 2019-03-15 | 2023-07-25 | 平安科技(深圳)有限公司 | Batch renaming method and device for voice information, computer equipment and storage medium |
CN110096479A (en) * | 2019-03-15 | 2019-08-06 | 平安科技(深圳)有限公司 | Batch renaming method, apparatus, computer equipment and the storage medium of voice messaging |
CN110059059A (en) * | 2019-03-15 | 2019-07-26 | 平安科技(深圳)有限公司 | Batch screening technique, device, computer equipment and the storage medium of voice messaging |
CN110264997A (en) * | 2019-05-30 | 2019-09-20 | 北京百度网讯科技有限公司 | The method, apparatus and storage medium of voice punctuate |
CN110120221A (en) * | 2019-06-06 | 2019-08-13 | 上海蔚来汽车有限公司 | The offline audio recognition method of user individual and its system for vehicle system |
CN110570843B (en) * | 2019-06-28 | 2021-03-05 | 北京蓦然认知科技有限公司 | User voice recognition method and device |
CN110570843A (en) * | 2019-06-28 | 2019-12-13 | 北京蓦然认知科技有限公司 | user voice recognition method and device |
CN110288995A (en) * | 2019-07-19 | 2019-09-27 | 出门问问(苏州)信息科技有限公司 | Exchange method, device, storage medium and electronic equipment based on speech recognition |
CN110288995B (en) * | 2019-07-19 | 2021-07-16 | 出门问问(苏州)信息科技有限公司 | Interaction method and device based on voice recognition, storage medium and electronic equipment |
CN113096646B (en) * | 2019-12-20 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Audio recognition method and device, electronic equipment and storage medium |
CN113096646A (en) * | 2019-12-20 | 2021-07-09 | 北京世纪好未来教育科技有限公司 | Audio recognition method and device, electronic equipment and storage medium |
CN111081262A (en) * | 2019-12-30 | 2020-04-28 | 杭州中科先进技术研究院有限公司 | Lightweight speech recognition system and method based on customized model |
CN111261168A (en) * | 2020-01-21 | 2020-06-09 | 杭州中科先进技术研究院有限公司 | Speech recognition engine and method supporting multi-task and multi-model |
CN111292746A (en) * | 2020-02-07 | 2020-06-16 | 普强时代(珠海横琴)信息技术有限公司 | Voice input conversion system based on human-computer interaction |
WO2021253779A1 (en) * | 2020-06-19 | 2021-12-23 | 深圳Tcl新技术有限公司 | Speech recognition method and system |
CN113823263A (en) * | 2020-06-19 | 2021-12-21 | 深圳Tcl新技术有限公司 | Voice recognition method and system |
CN111816174A (en) * | 2020-06-24 | 2020-10-23 | 北京小米松果电子有限公司 | Speech recognition method, device and computer readable storage medium |
CN111785275A (en) * | 2020-06-30 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN105096941B (en) | 2017-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105096941A (en) | | Voice recognition method and device |
KR101922776B1 (en) | | Method and device for voice wake-up |
CN108320733B (en) | | Voice data processing method and device, storage medium and electronic equipment |
CN108305641B (en) | | Method and device for determining emotion information |
CN105261357B (en) | | Voice endpoint detection method and device based on statistical model |
CN108305643B (en) | | Method and device for determining emotion information |
CN104575490B (en) | | Spoken pronunciation evaluation method based on deep neural network posterior probability algorithm |
CN102982811B (en) | | Voice endpoint detection method based on real-time decoding |
KR102128926B1 (en) | | Method and device for processing audio information |
CN108417201B (en) | | Single-channel multi-speaker identity recognition method and system |
CN108899047B (en) | | Masking threshold estimation method, apparatus and storage medium for audio signals |
CN110136749A (en) | | Speaker-dependent end-to-end voice endpoint detection method and device |
CN106098059A (en) | | Customizable voice wake-up method and system |
CN105529028A (en) | | Voice analysis method and apparatus |
CN105118502A (en) | | Endpoint detection method and system for voice recognition system |
CN104903954A (en) | | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination |
CN104036774A (en) | | Method and system for recognizing Tibetan dialects |
CN108269567A (en) | | Method, apparatus, computing device and computer readable storage medium for generating far-field voice data |
CN108364650B (en) | | Device and method for adjusting voice recognition result |
CN107731233A (en) | | Voiceprint recognition method based on RNN |
CN104978963A (en) | | Speech recognition apparatus, method and electronic equipment |
CN108986798B (en) | | Voice data processing method, device and equipment |
CN104934028A (en) | | Deep neural network model training method and device for speech synthesis |
CN106504768A (en) | | Telephone test audio classification method and device based on artificial intelligence |
CN108922521A (en) | | Voice keyword retrieval method, apparatus, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | C06 | Publication | |
| | PB01 | Publication | |
| | C10 | Entry into substantive examination | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |