CN105096941B - Audio recognition method and device - Google Patents
Speech recognition method and device
- Publication number: CN105096941B
- Application number: CN201510558047.4A
- Authority: CN (China)
- Prior-art keywords: speaker, acoustic model, corpus information, storage
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a speech recognition method and device. The method includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information. Based on speaker adaptation, the method can customize an acoustic model for the characteristics of each speaker, thereby improving recognition accuracy for each speaker and improving the user experience.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a speech recognition method and device.
Background art
In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, recognition performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly common and its user base increasingly broad. How to improve the accuracy of speech recognition has therefore become an urgent problem to be solved.
In the related art, speech recognition mainly obtains an acoustic model and a language model through training on a large amount of speech, and then performs recognition on the speech data input by a speaker using these models. The larger the training sample, the better the trained acoustic model, and thus the higher the accuracy of speech recognition.
The problem with the above speech recognition process is that a single acoustic model, trained on a large number of speech samples, is applied to the recognition of all speakers. For speakers with a heavy dialectal accent or unclear articulation, such a model may fail to recognize the input content well, reducing the model's recognition accuracy and degrading the user experience.
Summary of the invention
The purpose of the present invention is to solve at least one of the above technical problems to some extent.
Accordingly, a first object of the present invention is to propose a speech recognition method. Based on speaker adaptation, the method can customize an acoustic model for the characteristics of each speaker, thereby improving recognition accuracy for each speaker and improving the user experience.
A second object of the present invention is to propose a speech recognition device.
To achieve these objects, the speech recognition method of the first-aspect embodiment of the present invention includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition method of the embodiment of the present invention, the speech information input by a speaker is first obtained, along with the speaker information of the speaker; whether a personal acoustic model corresponding to the speaker exists is then judged according to the speaker information. If it exists, the personal acoustic model is obtained and used to perform speech recognition on the speech information; if not, speech recognition is performed with the base acoustic model, corpus information of the speaker is generated from the speech information and stored, and a personal acoustic model of the speaker is generated from the base acoustic model and the stored corpus information. That is, on the basis of a speaker-independent acoustic model (the base acoustic model above), further training is carried out using the historical speech data of a given speaker, yielding a personal acoustic model carrying the speaker's own characteristics; that personal acoustic model is then used during recognition. This improves each person's speech recognition accuracy and, in effect, provides every speech recognition user with a privately customized recognition service, thereby improving the user experience.
To achieve these objects, the speech recognition device of the second-aspect embodiment of the present invention includes: a first acquisition module for obtaining the speech information input by a speaker and obtaining the speaker information of the speaker; a judging module for judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; a speech recognition module for obtaining the personal acoustic model when the judging module judges that it exists and performing speech recognition on the speech information according to the speaker's personal acoustic model, and for performing speech recognition on the speech information according to a base acoustic model when the judging module judges that the personal acoustic model does not exist; a first generation module for generating and storing corpus information of the speaker according to the speech information; and a second generation module for generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition device of the embodiment of the present invention, the first acquisition module obtains the speech information input by the speaker and the speaker information of the speaker, and the judging module judges, according to the speaker information, whether a corresponding personal acoustic model exists. If so, the speech recognition module obtains the personal acoustic model and uses it to perform speech recognition on the speech information; if not, the speech recognition module performs speech recognition on the speech information with the base acoustic model, the first generation module generates and stores corpus information of the speaker from the speech information, and the second generation module generates the speaker's personal acoustic model from the base acoustic model and the stored corpus information. That is, further training on the speaker-independent acoustic model (the base acoustic model above) with the speaker's historical speech data yields a personal acoustic model carrying the speaker's own characteristics, which is then used for recognition; this improves each person's speech recognition accuracy, in effect providing every speech recognition user with a privately customized recognition service, thereby improving the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or may be learned by practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flow chart of generating a personal acoustic model according to an embodiment of the present invention;
Fig. 3 is a flow chart of a speech recognition method according to another embodiment of the present invention;
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a second generation module according to an embodiment of the present invention; and
Fig. 6 is a structural block diagram of a speech recognition device according to another embodiment of the present invention.
Embodiment
Embodiments of the invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numbers denote throughout the same or similar elements, or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting the invention.
A speech recognition method and device according to embodiments of the present invention are described below with reference to the accompanying drawings.
It should be noted that speech recognition refers to a machine automatically converting human speech into the corresponding text. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, system performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly popular and its user base increasingly broad. Because each user's pronunciation has its own acoustic characteristics, exploiting these characteristics during recognition will necessarily further improve the recognition system.
Therefore, the present invention proposes a speech recognition method including: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a base acoustic model, and generating and storing corpus information of the speaker according to the speech information; and generating the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
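The overall flow just described — use a speaker's personal acoustic model when one exists, otherwise fall back to the base model and accumulate corpus information — can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; all names (`SpeechRecognizer`, `recognize`, and the lambda models standing in for real acoustic models) are assumptions.

```python
class SpeechRecognizer:
    def __init__(self, base_model):
        self.base_model = base_model      # speaker-independent base acoustic model
        self.personal_models = {}         # speaker_id -> personal acoustic model
        self.corpus_store = {}            # speaker_id -> stored corpus information

    def recognize(self, speaker_id, speech):
        model = self.personal_models.get(speaker_id)
        if model is not None:
            # a personal acoustic model exists: use it for recognition
            return model(speech)
        # no personal model yet: decode with the base model and store the
        # utterance as corpus information for later adaptation training
        self.corpus_store.setdefault(speaker_id, []).append(speech)
        return self.base_model(speech)

recognizer = SpeechRecognizer(base_model=lambda s: "base:" + s)
recognizer.personal_models["alice"] = lambda s: "personal:" + s

assert recognizer.recognize("alice", "hello") == "personal:hello"
assert recognizer.recognize("bob", "hello") == "base:hello"
assert recognizer.corpus_store["bob"] == ["hello"]
```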
Fig. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the speech recognition method may include:
S101: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
It should be noted that, in an embodiment of the present invention, the speaker information may be an ID (identity number) corresponding to the speaker. The speaker ID may be an identifier distributed to the speaker by a server, and the speaker ID has a one-to-one correspondence with the speaker's voiceprint feature.
Specifically, the speech information input by the speaker may be collected through a microphone in a terminal, and a voiceprint feature may be extracted from the speech information; the speaker information (for example, the speaker ID) corresponding to that voiceprint feature may then be obtained according to the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker. That is, in this step, after the speech information input by the speaker is obtained, the ID or MAC address of the terminal used by the speaker may also be obtained.
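The two ways of obtaining speaker information in S101 — voiceprint matching, with the terminal ID or MAC address as an alternative — can be sketched roughly as below. The voiceprint "extraction" here is a toy stand-in (an average of feature values with a nearest-match tolerance); real voiceprint matching is far more involved, and every name and the tolerance value are assumptions.

```python
def extract_voiceprint(speech):
    # stand-in for real voiceprint extraction: average "feature" value
    return sum(speech) / len(speech)

def lookup_speaker(speech, voiceprint_table, device_id=None, tolerance=0.5):
    # try to resolve the speaker ID from the stored voiceprint mapping;
    # fall back to the terminal's device ID / MAC address if no match
    vp = extract_voiceprint(speech)
    for speaker_id, stored_vp in voiceprint_table.items():
        if abs(vp - stored_vp) <= tolerance:
            return speaker_id
    return device_id

table = {"speaker_1": 3.0, "speaker_2": 7.0}
assert lookup_speaker([2.9, 3.2, 2.8], table) == "speaker_1"
assert lookup_speaker([9.0, 9.5], table, device_id="AA:BB:CC") == "AA:BB:CC"
```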
S102: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
In an embodiment of the present invention, the personal acoustic model may be understood as the speaker's own acoustic model; the personal acoustic model may include the speaker's voice characteristics.
S103: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
Specifically, when it is judged that a personal acoustic model corresponding to the speaker exists, that personal acoustic model may be obtained, after which the personal acoustic model and the speech information may be sent to a decoder, which may be located in a server. The decoder may first perform feature extraction on the speech information to obtain the corresponding acoustic features, and may then compare and match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
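The decoder's compare-and-match step can be illustrated with a toy scorer: features are scored against each candidate's model parameters, and the best-scoring candidate is returned as the recognition result. The "certain criterion" here is simply highest likelihood under a negative-squared-distance score; the word models and all names are illustrative assumptions, not the patent's decoder.

```python
def score(features, model_means):
    # toy log-likelihood: negative squared distance to the model's means
    return -sum((f - m) ** 2 for f, m in zip(features, model_means))

def decode(features, word_models):
    # compare the acoustic features against each candidate model and
    # return the best-matching recognition result
    return max(word_models, key=lambda w: score(features, word_models[w]))

word_models = {"yes": [1.0, 2.0], "no": [5.0, 6.0]}
assert decode([1.1, 2.2], word_models) == "yes"
assert decode([4.8, 6.1], word_models) == "no"
```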
S104: if it does not exist, perform speech recognition on the speech information according to the base acoustic model, and generate and store corpus information of the speaker according to the speech information.
In an embodiment of the present invention, the base acoustic model may be understood as a model applicable to speakers in general, i.e., an acoustic model obtained by collecting the speech data of many speakers and performing speech training on that data.
That is, when it is judged that no personal acoustic model corresponding to the speaker exists, the base acoustic model and the speech information may be sent to the decoder to recognize the speech information; the speech information of this input may then be stored as corpus information of the speaker, forming historical corpus information.
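The accumulation of historical corpus information, together with the threshold check that later triggers adaptation training (S201 below), can be sketched as follows. The threshold value and function names are assumptions for illustration only.

```python
PRESET_THRESHOLD = 3  # assumed value; the patent only says "preset threshold"

def store_corpus(corpus_store, speaker_id, utterance):
    # append this utterance to the speaker's historical corpus information
    entries = corpus_store.setdefault(speaker_id, [])
    entries.append(utterance)
    # True once enough corpus has accumulated to train a personal model
    return len(entries) >= PRESET_THRESHOLD

store = {}
assert store_corpus(store, "s1", "utt-1") is False
assert store_corpus(store, "s1", "utt-2") is False
assert store_corpus(store, "s1", "utt-3") is True   # threshold reached
```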
S105: generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
Specifically, in an embodiment of the present invention, as shown in Fig. 2, the process of generating the personal acoustic model may include the following steps:
S201: judge whether the quantity of stored corpus information reaches a preset threshold. Specifically, judge whether the stored corpus information has accumulated to a certain quantity (i.e., the preset threshold).
S202: if the preset threshold is reached, screen the stored corpus information to obtain the corresponding valid corpus information.
Specifically, in an embodiment of the present invention, when it is judged that the stored corpus information has accumulated to the preset threshold, the screening parameters of each stored corpus entry may first be obtained and each entry scored according to those parameters. A score for each stored entry may then be generated according to the evaluation results and the weights of the screening parameters. Finally, the stored corpus information is screened according to the per-entry scores to obtain the corresponding valid corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and speech recognition content.
More specifically, after a speaker's corpus information has accumulated to a certain quantity, the stored historical corpus information may be screened and analyzed: for each stored entry, confidence, speech energy, speech length, speech recognition content, and so on may be calculated respectively; each calculation result may then be scored, and a final score for each stored entry computed from the scores and the weight of each screening parameter. Finally, low-scoring entries may be filtered out, and the remaining high-scoring entries retained as the valid corpus information.
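The weighted scoring and filtering of S202 can be sketched as below: each entry carries per-parameter scores, a final score is formed as a weighted sum, and entries below a cut-off are dropped. The weight values and the cut-off are assumptions; the patent specifies only that scores are combined with per-parameter weights.

```python
# assumed weights for the four screening parameters named in the text
WEIGHTS = {"confidence": 0.4, "energy": 0.2, "length": 0.2, "content": 0.2}

def final_score(entry):
    # weighted sum of the per-parameter scores of one stored corpus entry
    return sum(WEIGHTS[k] * entry[k] for k in WEIGHTS)

def select_valid(corpus, cutoff=0.6):
    # keep only high-scoring entries as valid corpus information
    return [e for e in corpus if final_score(e) >= cutoff]

corpus = [
    {"id": 1, "confidence": 0.9, "energy": 0.8, "length": 0.7, "content": 0.9},
    {"id": 2, "confidence": 0.2, "energy": 0.3, "length": 0.4, "content": 0.1},
]
assert [e["id"] for e in select_valid(corpus)] == [1]
```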
S203: perform a primary/secondary speaker judgement according to each valid corpus entry, to filter out the valid corpus information belonging to the same speaker.
Specifically, all valid corpus entries may be analyzed, for example by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, and performing a primary/secondary user judgement. If the current speaker is found to comprise multiple natural persons, all valid corpus entries may be screened by information such as voice features and speaker gender, to filter out the valid corpus information belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after the primary/secondary user judgement performed by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, if the current speaker is found to comprise multiple natural persons, the current corpus entry may instead be rejected; that is, corpus entries containing multiple natural persons are rejected, and no model training is performed on them.
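The first variant of the primary/secondary judgement in S203 — keep the dominant speaker's entries, discard the rest — can be illustrated with a majority vote over a single voice attribute. A gender label stands in here for the SNR/voice-feature analysis the text describes; the attribute and all names are illustrative assumptions.

```python
from collections import Counter

def keep_primary_speaker(corpus):
    # majority attribute value is taken as the primary speaker; entries
    # from secondary speakers are filtered out
    if not corpus:
        return []
    primary, _ = Counter(e["gender"] for e in corpus).most_common(1)[0]
    return [e for e in corpus if e["gender"] == primary]

corpus = [
    {"id": 1, "gender": "f"},
    {"id": 2, "gender": "f"},
    {"id": 3, "gender": "m"},   # secondary speaker, filtered out
]
assert [e["id"] for e in keep_primary_speaker(corpus)] == [1, 2]
```

The second variant described above (rejecting any entry that itself contains multiple natural persons) would instead drop such entries before this step.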
S204: perform model training on the valid corpus information belonging to the same speaker according to the base acoustic model, to generate the speaker's personal acoustic model.
Specifically, in an embodiment of the present invention, acoustic feature extraction may first be performed on the valid corpus information belonging to the same speaker, and the corresponding speech recognition content analyzed, to obtain high-confidence word-level recognition results and the corresponding acoustic features. The personal acoustic model may then be randomly initialized, gradients computed according to the base acoustic model, the word-level recognition results, and the acoustic features, and finally the computed gradients iterated to generate the personal acoustic model.
That is, the corpus information may be used for acoustic feature extraction while the speech recognition content is analyzed, retaining the high-confidence word-level recognition results and the corresponding acoustic features; the personal acoustic model may then be randomly initialized, gradients computed in combination with the base acoustic model using the acoustic features and word-level recognition results retained in the previous step, and the obtained gradients used to iteratively optimize the personal acoustic model.
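The gradient-based adaptation above can be sketched in miniature: start from the base model's parameters and refine them by gradient descent on the speaker's (feature, recognition-result) pairs. A one-parameter least-squares model stands in for the real acoustic model; the learning rate, step count, and all names are assumptions, and a real system would train a neural acoustic model on frame-level targets.

```python
def adapt(base_weight, pairs, lr=0.1, steps=100):
    w = base_weight                      # initialize from the base model
    for _ in range(steps):
        # gradient of mean squared error between w*x and target y
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad                   # iterate the gradient update
    return w

# speaker's corpus: (acoustic feature, word-level target) pairs, consistent
# with an "ideal" personal weight of 2.0
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
personal_w = adapt(base_weight=0.5, pairs=pairs)
assert abs(personal_w - 2.0) < 1e-3
```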
Thus, model training can be performed on each speaker's corpus information to obtain a corresponding personal acoustic model; the personal acoustic model includes the user's voice characteristics and can be used to improve accuracy during speech recognition.
In summary, the speech recognition method of the embodiment of the present invention performs further training on the speaker-independent acoustic model (the base acoustic model above) using the historical speech data of a given speaker, obtains a personal acoustic model carrying the speaker's own characteristics, and uses that model during recognition. This improves each person's speech recognition accuracy and, in effect, provides every speech recognition user with a privately customized recognition service, thereby improving the user experience.
Fig. 3 is a flow chart of a speech recognition method according to another embodiment of the present invention.
In order to further improve the accuracy of speech recognition, in an embodiment of the present invention, model optimization may also be performed on the personal acoustic model. Specifically, as shown in Fig. 3, the speech recognition method may include:
S301: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
S302: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
S303: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
S304: perform model optimization on the personal acoustic model according to the speech information.
It can be understood that the speaker-adaptive speech recognition process is transparent to the user: the user does not perceive the adaptive training and adaptive recognition flow. As the number of times the user uses speech recognition grows, the speech recognition server continuously optimizes the user's adaptive model, and the user's speech recognition accuracy keeps improving. Speaker-adaptive training is thus not a one-off task; as the speaker's corpus accumulates, the adaptive training process is repeated continually. In an embodiment of the present invention, each adaptive training pass may be optimized on the basis of the previous personal acoustic model, continually improving recognition accuracy.
S305: if it does not exist, perform speech recognition on the speech information according to the base acoustic model, and generate and store corpus information of the speaker according to the speech information.
S306: generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information.
In the speech recognition method of the embodiment of the present invention, after speech recognition is performed on the speech information according to the speaker's personal acoustic model, model optimization may also be performed on the personal acoustic model according to the speech information, which can further improve the accuracy of personal speech recognition.
In order to implement the above embodiments, the invention also provides a speech recognition device.
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the speech recognition device may include: a first acquisition module 10, a judging module 20, a speech recognition module 30, a first generation module 40, and a second generation module 50.
Specifically, the first acquisition module 10 may be used to obtain the speech information input by a speaker, and to obtain the speaker information of the speaker. It should be noted that, in an embodiment of the present invention, the speaker information may be an ID corresponding to the speaker; the speaker ID may be an identifier distributed to the speaker by a server, and has a one-to-one correspondence with the speaker's voiceprint feature.
More specifically, the first acquisition module 10 may collect the speech information input by the speaker through a microphone in a terminal, and may extract a voiceprint feature from the speech information; the speaker information (for example, the speaker ID) corresponding to that voiceprint feature may then be obtained according to the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may also be the ID or MAC address of the terminal used by the speaker. That is, after obtaining the speech information input by the speaker, the first acquisition module 10 may also obtain the ID or MAC address of the terminal used by the speaker.
The judging module 20 may be used to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists. In an embodiment of the present invention, the personal acoustic model may be understood as the speaker's own acoustic model, and may include the speaker's voice characteristics.
The speech recognition module 30 may be used to obtain the personal acoustic model when the judging module 20 judges that it exists and to perform speech recognition on the speech information according to the speaker's personal acoustic model, and to perform speech recognition on the speech information according to the base acoustic model when the judging module 20 judges that the personal acoustic model does not exist.
More specifically, when the judging module 20 judges that a personal acoustic model corresponding to the speaker exists, the speech recognition module 30 may obtain the personal acoustic model, first perform feature extraction on the speech information to obtain the corresponding acoustic features, and then compare and match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
In an embodiment of the present invention, the base acoustic model may be understood as a model applicable to speakers in general, i.e., an acoustic model obtained by collecting the speech data of many speakers and performing speech training on that data.
The first generation module 40 may be used to generate and store corpus information of the speaker according to the speech information. More specifically, after the speech recognition module 30 performs speech recognition on the speech information according to the base acoustic model, the first generation module 40 may store the speech information of this input as corpus information of the speaker, forming historical corpus information.
The second generation module 50 may be used to generate the speaker's personal acoustic model according to the base acoustic model and the stored corpus information. Specifically, in one embodiment of the invention, as shown in Fig. 5, the second generation module 50 may include: a judging unit 51, a first screening unit 52, a second screening unit 53, and a generation unit 54. The judging unit 51 may be used to judge whether the quantity of stored corpus information reaches a preset threshold, i.e., whether the stored corpus information has accumulated to a certain quantity.
The first screening unit 52 may be used to screen the stored corpus information to obtain the corresponding valid corpus information when the judging unit 51 judges that the quantity of stored corpus information reaches the preset threshold. Specifically, in an embodiment of the present invention, the first screening unit 52 may first obtain the screening parameters of each stored corpus entry and score each entry according to those parameters, then generate a score for each stored entry according to the evaluation results and the weights of the screening parameters, and finally screen the stored corpus information according to the per-entry scores to obtain the corresponding valid corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and speech recognition content.
More specifically, after a speaker's corpus information has accumulated to a certain quantity, the first screening unit 52 may screen and analyze the stored historical corpus information: for each stored entry it may calculate confidence, speech energy, speech length, speech recognition content, and so on respectively, then score each calculation result and compute a final score for each stored entry from the scores and the weight of each screening parameter; finally it may filter out the low-scoring entries, and retain the remaining high-scoring entries as the valid corpus information.
The second screening unit 53 may be used to perform a primary/secondary speaker judgement according to each valid corpus entry, to filter out the valid corpus information belonging to the same speaker. More specifically, the second screening unit 53 may analyze all valid corpus entries, for example performing a primary/secondary user judgement by calculating each entry's signal-to-noise ratio, voice features, and speaker gender. If the current speaker is found to comprise multiple natural persons, all valid corpus entries may be screened by information such as voice features and speaker gender, to filter out the valid corpus information belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after the second screening unit 53 performs the primary/secondary user judgement by calculating each entry's signal-to-noise ratio, voice features, and speaker gender, if the current speaker is found to comprise multiple natural persons, the current corpus entry may instead be rejected; that is, corpus entries containing multiple natural persons are rejected, and no model training is performed on them.
The generation unit 54 can be used to perform model training on the effective corpus information belonging to the same speaker, according to the basic acoustic model, so as to generate the personal acoustic model of the speaker. Specifically, in an embodiment of the present invention, the generation unit 54 can first perform acoustic feature extraction on the effective corpus information belonging to the same speaker and analyze the speech recognition content corresponding to that effective corpus information, so as to obtain high-confidence word-level recognition results and the corresponding acoustic features. Afterwards, it can randomly initialize the personal acoustic model and perform gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features. Finally, the calculated gradients can be iterated to generate the personal acoustic model.
That is, the generation unit 54 can use the corpus information to extract acoustic features while analyzing the speech recognition content, retaining the high-confidence word-level recognition results and the corresponding acoustic features. Afterwards, it can randomly initialize the personal acoustic model and perform gradient calculation using the basic acoustic model together with the acoustic features and word-level recognition results retained in the previous step. Finally, it can use the obtained gradients to iteratively optimize the personal acoustic model.
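As a toy illustration of this random-initialization-plus-gradient-iteration idea, using a one-parameter "model" in place of a real acoustic model. The regularization pull toward the base weight is an assumption about how the basic acoustic model enters the gradient; a real system would fine-tune a neural acoustic model on the retained features and word-level labels:

```python
import random

def train_personal_model(base_w, features, labels,
                         lr=0.05, lam=0.1, epochs=200, seed=0):
    """Randomly initialize a personal weight, then iterate gradient steps
    that fit the speaker's (feature, label) pairs while staying close to
    the speaker-independent base weight."""
    w = random.Random(seed).uniform(-1.0, 1.0)  # random initialization
    n = len(features)
    for _ in range(epochs):
        # gradient of mean squared error on the retained data...
        g = sum(2 * (w * x - y) * x for x, y in zip(features, labels)) / n
        # ...plus a pull toward the basic acoustic model
        g += 2 * lam * (w - base_w)
        w -= lr * g                             # one gradient iteration
    return w
```

With features [1, 2, 3] and labels [2, 4, 6] (true weight 2) and a base weight of 1.5, the result converges just below 2, slightly pulled toward the base model.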
Thus, model training can be performed on the corpus information of each speaker to obtain the corresponding personal acoustic model. The personal acoustic model captures the voice characteristics of the user and can be used to improve accuracy in the speech recognition process.
Further, in one embodiment of the invention, as shown in Fig. 6, the speech recognition device may also include an optimization module 60. The optimization module 60 can be used to perform model optimization on the personal acoustic model according to the voice information, after speech recognition has been performed on the voice information according to the personal acoustic model of the speaker. It can be appreciated that the speaker-adaptive speech recognition process is transparent to the user: the user does not perceive the adaptive training and adaptive recognition flow. As the number of times the user uses speech recognition increases, the speech recognition server continuously optimizes the user's adaptive model, and the accuracy of the user's speech recognition improves accordingly. It can thus be seen that speaker-adaptive training is not a one-off task; rather, as the speaker's corpus accumulates, the adaptive training process is repeated continuously. In an embodiment of the present invention, the optimization module 60 can, in each round of adaptive training, optimize on the basis of the previous personal acoustic model, so as to continuously improve speech recognition accuracy. In this way, the accuracy of personal speech recognition can be further improved.
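The "not one-off" point can be sketched as a loop in which each round of adaptation starts from the model the previous round produced (again a toy one-parameter model; the gradient update is an illustrative stand-in for real acoustic-model training):

```python
def adapt_round(w, features, labels, lr=0.05, epochs=100):
    """One round of adaptive training on a new batch of corpus,
    starting from the weight w produced by the previous round."""
    for _ in range(epochs):
        g = sum(2 * (w * x - y) * x for x, y in zip(features, labels)) / len(features)
        w -= lr * g
    return w

def continual_adaptation(base_w, sessions):
    """Each newly accumulated corpus batch re-runs adaptation on top of
    the previous personal model, not from scratch."""
    w = base_w
    for features, labels in sessions:
        w = adapt_round(w, features, labels)
    return w
```

Because every round warm-starts from the previous model, the personal model tracks the speaker ever more closely as usage accumulates.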
With the speech recognition device of the embodiment of the present invention, the first acquisition module can obtain the voice information input by the speaker and obtain the speaker information of the speaker; the judgment module judges, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, the speech recognition module obtains the personal acoustic model and performs speech recognition on the voice information according to the personal acoustic model of the speaker; if it does not exist, the speech recognition module performs speech recognition on the voice information according to the basic acoustic model, the first generation module generates and stores the corpus information of the speaker according to the voice information, and the second generation module generates the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information. That is, on the basis of the speaker-independent acoustic model (i.e., the above-mentioned basic acoustic model), further training is performed using the historical speech data of a given speaker to obtain a personal acoustic model bearing the speaker's own characteristics, and this personal acoustic model is used for recognition in the speech recognition process. Each person's speech recognition accuracy can thereby be improved, which is equivalent to providing every speech recognition user with a privately customized speech recognition service, thus improving the user experience.
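The overall dispatch just summarized — look up a personal model by speaker ID, otherwise fall back to the base model while accumulating corpus until a personal model can be trained — might be sketched as follows (model objects are stand-in strings, and the training threshold of 3 is an arbitrary assumption):

```python
BASE_MODEL = "base-acoustic-model"        # speaker-independent model
TRAIN_THRESHOLD = 3                       # corpus entries needed before training

personal_models: dict[str, str] = {}      # speaker ID -> personal acoustic model
corpus_store: dict[str, list[str]] = {}   # speaker ID -> stored corpus entries

def recognize(speaker_id: str, utterance: str) -> str:
    model = personal_models.get(speaker_id)
    if model is not None:
        # a personal model exists: use it for recognition
        return f"decoded '{utterance}' with {model}"
    # no personal model yet: recognize with the base model and store the corpus
    corpus_store.setdefault(speaker_id, []).append(utterance)
    if len(corpus_store[speaker_id]) >= TRAIN_THRESHOLD:
        # enough corpus accumulated: train a personal model for next time
        personal_models[speaker_id] = f"personal-model-for-{speaker_id}"
    return f"decoded '{utterance}' with {BASE_MODEL}"
```

The switch-over is invisible to the caller, matching the patent's point that adaptation is transparent to the user.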
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and cannot be interpreted as indicating or implying relative importance, or as implicitly indicating the number of the technical features referred to. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, where no conflict arises, those skilled in the art may combine different embodiments or examples, and features of different embodiments or examples, described in this specification.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, fragment, or portion of code including one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order according to the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may be considered, for example, an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, device, or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, device, or apparatus). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in combination with, an instruction execution system, device, or apparatus. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing module, or each unit may physically exist separately, or two or more units may be integrated in one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and cannot be interpreted as limiting the present invention; those of ordinary skill in the art can make changes, modifications, replacements, and variations to the above embodiments within the scope of the present invention.
Claims (12)
1. A speech recognition method, characterized by comprising the following steps:
obtaining the voice information input by a speaker and obtaining the speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker, or an ID or MAC address of the terminal used by the speaker;
judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
if it exists, obtaining the personal acoustic model and performing speech recognition on the voice information according to the personal acoustic model of the speaker;
if it does not exist, performing speech recognition on the voice information according to a basic acoustic model, and generating and storing the corpus information of the speaker according to the voice information; and
generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
2. The speech recognition method according to claim 1, characterized in that generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information specifically comprises:
judging whether the quantity of the stored corpus information reaches a preset threshold;
if the preset threshold is reached, screening the stored corpus information to obtain corresponding effective corpus information;
performing a primary/secondary speaker judgment according to each piece of effective corpus information to single out the effective corpus information belonging to the same speaker; and
performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker.
3. The speech recognition method according to claim 2, characterized in that screening the stored corpus information to obtain the corresponding effective corpus information specifically comprises:
obtaining the screening parameters of each piece of stored corpus information, and scoring each piece of stored corpus information according to the screening parameters;
generating a score for each piece of stored corpus information according to the scoring results and the weights of the screening parameters; and
screening the stored corpus information according to the score of each piece of stored corpus information to obtain the corresponding effective corpus information.
4. The speech recognition method according to claim 3, characterized in that the screening parameters include confidence, speech energy, speech length, and speech recognition content.
5. The speech recognition method according to claim 2, characterized in that performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker specifically comprises:
performing acoustic feature extraction on the effective corpus information belonging to the same speaker, and analyzing the speech recognition content corresponding to the effective corpus information of the same speaker, to obtain high-confidence word-level recognition results and the corresponding acoustic features;
randomly initializing the personal acoustic model, and performing gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterating on the calculated gradients to generate the personal acoustic model.
6. The speech recognition method according to any one of claims 1 to 5, characterized in that, after speech recognition is performed on the voice information according to the personal acoustic model of the speaker, the method further comprises:
performing model optimization on the personal acoustic model according to the voice information.
7. A speech recognition device, characterized by comprising:
a first acquisition module, configured to obtain the voice information input by a speaker and to obtain the speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker, or an ID or MAC address of the terminal used by the speaker;
a judgment module, configured to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
a speech recognition module, configured to obtain the personal acoustic model when the judgment module judges that the personal acoustic model exists and to perform speech recognition on the voice information according to the personal acoustic model of the speaker, and to perform speech recognition on the voice information according to a basic acoustic model when the judgment module judges that the personal acoustic model does not exist;
a first generation module, configured to generate and store the corpus information of the speaker according to the voice information; and
a second generation module, configured to generate the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
8. The speech recognition device according to claim 7, characterized in that the second generation module comprises:
a judgment unit, configured to judge whether the quantity of the stored corpus information reaches a preset threshold;
a first screening unit, configured to screen the stored corpus information to obtain corresponding effective corpus information when the judgment unit judges that the quantity of the stored corpus information reaches the preset threshold;
a second screening unit, configured to perform a primary/secondary speaker judgment according to each piece of effective corpus information to single out the effective corpus information belonging to the same speaker; and
a generation unit, configured to perform model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker.
9. The speech recognition device according to claim 8, characterized in that the first screening unit is specifically configured to:
obtain the screening parameters of each piece of stored corpus information, and score each piece of stored corpus information according to the screening parameters;
generate a score for each piece of stored corpus information according to the scoring results and the weights of the screening parameters; and
screen the stored corpus information according to the score of each piece of stored corpus information to obtain the corresponding effective corpus information.
10. The speech recognition device according to claim 9, characterized in that the screening parameters include confidence, speech energy, speech length, and speech recognition content.
11. The speech recognition device according to claim 8, characterized in that the generation unit is specifically configured to:
perform acoustic feature extraction on the effective corpus information belonging to the same speaker, and analyze the speech recognition content corresponding to the effective corpus information of the same speaker, to obtain high-confidence word-level recognition results and the corresponding acoustic features;
randomly initialize the personal acoustic model, and perform gradient calculation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterate on the calculated gradients to generate the personal acoustic model.
12. The speech recognition device according to any one of claims 7 to 11, characterized by further comprising:
an optimization module, configured to perform model optimization on the personal acoustic model according to the voice information, after speech recognition has been performed on the voice information according to the personal acoustic model of the speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510558047.4A CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105096941A CN105096941A (en) | 2015-11-25 |
CN105096941B true CN105096941B (en) | 2017-10-31 |
Family
ID=54577227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510558047.4A Active CN105096941B (en) | 2015-09-02 | 2015-09-02 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105096941B (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105632489A (en) * | 2016-01-20 | 2016-06-01 | 曾戟 | Voice playing method and voice playing device |
CN107545889B (en) * | 2016-06-23 | 2020-10-23 | 华为终端有限公司 | Model optimization method and device suitable for pattern recognition and terminal equipment |
CN107767880B (en) * | 2016-08-16 | 2021-04-16 | 杭州萤石网络有限公司 | Voice detection method, camera and intelligent home nursing system |
CN106683680B (en) * | 2017-03-10 | 2022-03-25 | 百度在线网络技术(北京)有限公司 | Speaker recognition method and device, computer equipment and computer readable medium |
CN108737324B (en) * | 2017-04-13 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Method and device for generating artificial intelligence service assembly and related equipment and system |
CN110310623B (en) * | 2017-09-20 | 2021-12-28 | Oppo广东移动通信有限公司 | Sample generation method, model training method, device, medium, and electronic apparatus |
CN107702706B (en) * | 2017-09-20 | 2020-08-21 | Oppo广东移动通信有限公司 | Path determining method and device, storage medium and mobile terminal |
CN107910008B (en) * | 2017-11-13 | 2021-06-11 | 河海大学 | Voice recognition method based on multiple acoustic models for personal equipment |
CN108538293B (en) * | 2018-04-27 | 2021-05-28 | 海信视像科技股份有限公司 | Voice awakening method and device and intelligent device |
CN108717854A (en) * | 2018-05-08 | 2018-10-30 | 哈尔滨理工大学 | Method for distinguishing speek person based on optimization GFCC characteristic parameters |
CN110634472B (en) * | 2018-06-21 | 2024-06-04 | 中兴通讯股份有限公司 | Speech recognition method, server and computer readable storage medium |
CN110728976B (en) * | 2018-06-30 | 2022-05-06 | 华为技术有限公司 | Method, device and system for voice recognition |
CN109215638B (en) * | 2018-10-19 | 2021-07-13 | 珠海格力电器股份有限公司 | Voice learning method and device, voice equipment and storage medium |
CN109243468B (en) * | 2018-11-14 | 2022-07-12 | 出门问问创新科技有限公司 | Voice recognition method and device, electronic equipment and storage medium |
CN110164431B (en) * | 2018-11-15 | 2023-01-06 | 腾讯科技(深圳)有限公司 | Audio data processing method and device and storage medium |
CN109635209B (en) * | 2018-12-12 | 2021-03-12 | 广东小天才科技有限公司 | Learning content recommendation method and family education equipment |
CN110096479B (en) * | 2019-03-15 | 2023-07-25 | 平安科技(深圳)有限公司 | Batch renaming method and device for voice information, computer equipment and storage medium |
CN110059059B (en) * | 2019-03-15 | 2024-04-16 | 平安科技(深圳)有限公司 | Batch screening method and device for voice information, computer equipment and storage medium |
CN110264997A (en) * | 2019-05-30 | 2019-09-20 | 北京百度网讯科技有限公司 | The method, apparatus and storage medium of voice punctuate |
CN110120221A (en) * | 2019-06-06 | 2019-08-13 | 上海蔚来汽车有限公司 | The offline audio recognition method of user individual and its system for vehicle system |
CN110570843B (en) * | 2019-06-28 | 2021-03-05 | 北京蓦然认知科技有限公司 | User voice recognition method and device |
CN110288995B (en) * | 2019-07-19 | 2021-07-16 | 出门问问(苏州)信息科技有限公司 | Interaction method and device based on voice recognition, storage medium and electronic equipment |
CN113096646B (en) * | 2019-12-20 | 2022-06-07 | 北京世纪好未来教育科技有限公司 | Audio recognition method and device, electronic equipment and storage medium |
CN111081262A (en) * | 2019-12-30 | 2020-04-28 | 杭州中科先进技术研究院有限公司 | Lightweight speech recognition system and method based on customized model |
CN111261168A (en) * | 2020-01-21 | 2020-06-09 | 杭州中科先进技术研究院有限公司 | Speech recognition engine and method supporting multi-task and multi-model |
CN111292746A (en) * | 2020-02-07 | 2020-06-16 | 普强时代(珠海横琴)信息技术有限公司 | Voice input conversion system based on human-computer interaction |
CN113823263A (en) * | 2020-06-19 | 2021-12-21 | 深圳Tcl新技术有限公司 | Voice recognition method and system |
CN111816174B (en) * | 2020-06-24 | 2024-09-03 | 北京小米松果电子有限公司 | Speech recognition method, device and computer readable storage medium |
CN111785275A (en) * | 2020-06-30 | 2020-10-16 | 北京捷通华声科技股份有限公司 | Voice recognition method and device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1655235A (en) * | 2004-02-12 | 2005-08-17 | 微软公司 | Automatic identification of telephone callers based on voice characteristics |
CN103187053A (en) * | 2011-12-31 | 2013-07-03 | 联想(北京)有限公司 | Input method and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7539616B2 (en) * | 2006-02-20 | 2009-05-26 | Microsoft Corporation | Speaker authentication using adapted background models |
-
2015
- 2015-09-02 CN CN201510558047.4A patent/CN105096941B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105096941B (en) | Audio recognition method and device | |
CN108320733B (en) | Voice data processing method and device, storage medium and electronic equipment | |
CN102982811B (en) | Voice endpoint detection method based on real-time decoding | |
CN107437415B (en) | Intelligent voice interaction method and system | |
CN105336322B (en) | Polyphone model training method, and speech synthesis method and device | |
CN110473566A (en) | Audio separation method, device, electronic equipment and computer readable storage medium | |
CN102270450B (en) | System and method of multi model adaptation and voice recognition | |
CN107945790B (en) | Emotion recognition method and emotion recognition system | |
CN107195295A (en) | Audio recognition method and device based on Chinese and English mixing dictionary | |
CN108417201B (en) | Single-channel multi-speaker identity recognition method and system | |
CN106503805A (en) | A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method | |
CN107492382A (en) | Voiceprint extracting method and device based on neutral net | |
CN106683661A (en) | Role separation method and device based on voice | |
CN104765996B (en) | Voiceprint password authentication method and system | |
CN104903954A (en) | Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination | |
CN108447471A (en) | Audio recognition method and speech recognition equipment | |
JP6440967B2 (en) | End-of-sentence estimation apparatus, method and program thereof | |
CN109448704A (en) | Construction method, device, server and the storage medium of tone decoding figure | |
CN113255362B (en) | Method and device for filtering and identifying human voice, electronic device and storage medium | |
CN109410956A (en) | A kind of object identifying method of audio data, device, equipment and storage medium | |
CN110853621B (en) | Voice smoothing method and device, electronic equipment and computer storage medium | |
CN108986798A (en) | Processing method, device and the equipment of voice data | |
CN109741734A (en) | A kind of speech evaluating method, device and readable medium | |
CN107910004A (en) | Voiced translation processing method and processing device | |
CN106843523A (en) | Character input method and device based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |