CN105096941B - Speech recognition method and device - Google Patents


Info

Publication number
CN105096941B
CN105096941B (application CN201510558047.4A)
Authority
CN
China
Prior art keywords
speaker
acoustic model
corpus information
information
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510558047.4A
Other languages
Chinese (zh)
Other versions
CN105096941A (en)
Inventor
杜念冬
邹赛赛
谢延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510558047.4A priority Critical patent/CN105096941B/en
Publication of CN105096941A publication Critical patent/CN105096941A/en
Application granted granted Critical
Publication of CN105096941B publication Critical patent/CN105096941B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a speech recognition method and device. The method includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a basic acoustic model, and generating and storing corpus information for the speaker from the speech information; and generating a personal acoustic model for the speaker according to the basic acoustic model and the stored corpus information. Through speaker-adaptive recognition, the method can customize an acoustic model to the characteristics of each speaker, thereby improving recognition accuracy for every speaker and improving the user experience.

Description

Speech recognition method and device
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a speech recognition method and device.
Background art
In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, recognition performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly common and the population using speech services increasingly broad. How to improve the accuracy of speech recognition has therefore become an urgent problem.
In the related art, speech recognition mainly relies on training with a large amount of speech data to obtain an acoustic model and a language model, which are then used to recognize the speech data input by a speaker. The larger the training sample, the better the trained acoustic model, and thus the higher the recognition accuracy.
The problem with this process is that a single acoustic model, trained on a large pool of speech samples, is applied to the recognition of all speakers. For a speaker with a heavy dialectal accent or unclear articulation, such recognition may fail to identify the input content well, lowering the model's recognition accuracy and degrading the user experience.
Summary of the invention
The purpose of the present invention is to solve at least one of the above technical problems to at least some extent.
Accordingly, a first object of the present invention is to propose a speech recognition method. Based on speaker-adaptive recognition, the method can customize an acoustic model for the characteristics of each speaker, thereby improving recognition accuracy for every speaker and improving the user experience.
A second object of the present invention is to propose a speech recognition device.
To achieve these objects, the speech recognition method of the first-aspect embodiments of the present invention includes: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model, and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a basic acoustic model, and generating and storing corpus information for the speaker from the speech information; and generating a personal acoustic model for the speaker according to the basic acoustic model and the stored corpus information.
With the speech recognition method of the embodiments of the present invention, the speech information input by a speaker and the speaker information of the speaker are first obtained; it is then judged from the speaker information whether a personal acoustic model corresponding to the speaker exists. If it exists, the personal acoustic model is obtained and used to recognize the speech information; if not, the speech information is recognized with the basic acoustic model, corpus information for the speaker is generated from the speech information and stored, and the speaker's personal acoustic model is generated from the basic acoustic model and the stored corpus. In other words, the speaker-independent acoustic model (the basic acoustic model) is further trained with the historical speech data of a given speaker to obtain a personal acoustic model reflecting that speaker's own characteristics, and this personal model is used during recognition. Each person's recognition accuracy is thereby improved, which amounts to providing every speech recognition user with a privately customized recognition service and thus improves the user experience.
To achieve these objects, the speech recognition device of the second-aspect embodiments of the present invention includes: a first acquisition module, for obtaining the speech information input by a speaker and obtaining the speaker information of the speaker; a judgment module, for judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; a speech recognition module, for obtaining the personal acoustic model when the judgment module judges that it exists and performing speech recognition on the speech information according to the speaker's personal acoustic model, and for performing speech recognition on the speech information according to a basic acoustic model when the judgment module judges that no personal acoustic model exists; a first generation module, for generating and storing corpus information for the speaker from the speech information; and a second generation module, for generating a personal acoustic model for the speaker according to the basic acoustic model and the stored corpus information.
With the speech recognition device of the embodiments of the present invention, the first acquisition module obtains the speech information input by a speaker and the speaker information of the speaker; the judgment module judges from the speaker information whether a personal acoustic model corresponding to the speaker exists; if it exists, the speech recognition module obtains the personal acoustic model and recognizes the speech information with it; if not, the speech recognition module recognizes the speech information with the basic acoustic model, the first generation module generates and stores corpus information for the speaker from the speech information, and the second generation module generates the speaker's personal acoustic model from the basic acoustic model and the stored corpus. That is, the speaker-independent acoustic model (the basic acoustic model) is further trained with the historical speech data of a given speaker to obtain a personal acoustic model reflecting that speaker's own characteristics, and this personal model is used during recognition, improving each person's recognition accuracy, which amounts to providing every speech recognition user with a privately customized recognition service and thus improves the user experience.
Additional aspects and advantages of the present invention will be set forth in part in the following description, will become apparent in part from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of generating a personal acoustic model according to an embodiment of the present invention;
Fig. 3 is a flowchart of a speech recognition method according to another embodiment of the present invention;
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 5 is a structural block diagram of a second generation module according to an embodiment of the present invention; and
Fig. 6 is a structural block diagram of a speech recognition device according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, in which identical or similar reference numerals denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The speech recognition method and device according to embodiments of the present invention are described below with reference to the accompanying drawings.
It should be noted that speech recognition refers to a machine automatically converting human speech into the corresponding text. In recent years, speech recognition technology has developed rapidly; in particular, since deep neural networks were applied to speech recognition, system performance has improved substantially. With the development of the mobile Internet, voice input has become increasingly popular and the population using speech services increasingly broad. Because each user's pronunciation has its own acoustic characteristics, exploiting those characteristics during recognition can further improve the recognition system.
To this end, the present invention proposes a speech recognition method, including: obtaining the speech information input by a speaker, and obtaining the speaker information of the speaker; judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, obtaining the personal acoustic model and performing speech recognition on the speech information according to the speaker's personal acoustic model; if it does not exist, performing speech recognition on the speech information according to a basic acoustic model, and generating and storing corpus information for the speaker from the speech information; and generating a personal acoustic model for the speaker according to the basic acoustic model and the stored corpus information.
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the method may include:
S101: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
It should be noted that, in an embodiment of the present invention, the speaker information may be an ID (identity number) corresponding to the speaker. The speaker ID may be an identifier assigned to the speaker by a server, and the speaker ID has a one-to-one correspondence with the speaker's voiceprint features.
Specifically, the speech information input by the speaker may be collected through a microphone in a terminal, and voiceprint features may be extracted from it; the speaker information (such as the speaker ID) corresponding to those voiceprint features may then be obtained from the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may instead be the ID or MAC address of the terminal used by the speaker. That is, in this step, after the speech information input by the speaker is obtained, the ID or MAC address of the terminal used by the speaker may also be obtained.
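The voiceprint-to-ID lookup described above can be sketched as follows. This is a minimal illustration, not the patented method: the cosine-similarity matching, the 0.85 threshold, and all names are assumptions, and real voiceprint features would come from an acoustic front end rather than being raw vectors.

```python
import math
import uuid

class SpeakerRegistry:
    """Maps voiceprint feature vectors to server-assigned speaker IDs.

    The patent only states that the server assigns each speaker an ID in
    one-to-one correspondence with their voiceprint; the similarity metric
    and threshold here are illustrative assumptions.
    """

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self._registry = {}  # speaker_id -> enrolled voiceprint vector

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def lookup_or_register(self, voiceprint):
        # Return the ID of the closest enrolled voiceprint if it is similar
        # enough; otherwise enroll this voiceprint under a fresh ID.
        best_id, best_sim = None, 0.0
        for sid, vp in self._registry.items():
            sim = self._cosine(voiceprint, vp)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        if best_id is not None and best_sim >= self.threshold:
            return best_id
        new_id = uuid.uuid4().hex[:8]
        self._registry[new_id] = voiceprint
        return new_id
```

A second utterance whose voiceprint closely matches an enrolled one resolves to the same ID, while a dissimilar voiceprint is enrolled as a new speaker.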
S102: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
In an embodiment of the present invention, the personal acoustic model can be understood as the speaker's own acoustic model, which may include the speaker's voice characteristics.
S103: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
Specifically, when it is judged that a personal acoustic model corresponding to the speaker exists, the personal acoustic model may be obtained, after which the personal acoustic model and the speech information may be sent to a decoder, which may be located in a server. The decoder may first perform feature extraction on the speech information to obtain the corresponding acoustic features, and may then match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
S104: if it does not exist, perform speech recognition on the speech information according to a basic acoustic model, and generate and store corpus information for the speaker from the speech information.
In an embodiment of the present invention, the basic acoustic model can be understood as a model applicable to the general population of speakers, i.e., an acoustic model obtained by collecting the speech data of multiple speakers and training on that data.
That is, when it is judged that no personal acoustic model corresponding to the speaker exists, the basic acoustic model and the speech information may be sent to the decoder to recognize the speech information; the speech information of this input may then be stored as corpus information for the speaker, forming a history corpus.
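Steps S101 through S104 amount to a dispatch between the personal and basic models, plus corpus banking on the fallback path. A minimal sketch, under the assumption that the models are simple callables and that all names are illustrative:

```python
def recognize(speech, speaker_id, personal_models, base_model, corpus_store):
    """Dispatch sketch of the method's branch (names are illustrative).

    personal_models: dict mapping speaker IDs to adapted model callables.
    base_model: the speaker-independent model callable.
    corpus_store: dict accumulating (speech, transcript) history per speaker.
    """
    model = personal_models.get(speaker_id)
    if model is not None:
        # S103: a personal model exists, so decode with it.
        return model(speech)
    # S104: fall back to the basic model and bank the utterance as corpus.
    text = base_model(speech)
    corpus_store.setdefault(speaker_id, []).append((speech, text))
    return text
```

Only utterances decoded with the basic model are banked here; once enough corpus accumulates, S105 would train a personal model and move that speaker onto the first branch.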
S105: generate a personal acoustic model for the speaker according to the basic acoustic model and the stored corpus information.
Specifically, in an embodiment of the present invention, as shown in Fig. 2, generating the personal acoustic model may include the following steps:
S201: judge whether the quantity of stored corpus information reaches a preset threshold.
Specifically, judge whether the stored corpus information has accumulated to a certain quantity (the preset threshold).
S202: if the preset threshold is reached, screen the stored corpus information to obtain the corresponding effective corpus information.
Specifically, in an embodiment of the present invention, when the stored corpus information is judged to have accumulated to the preset threshold, the screening parameters of each stored corpus entry may first be obtained, and each stored entry scored against them. A score for each stored entry may then be generated from the assessment results and the weights of the screening parameters. Finally, the stored corpus information is screened by the score of each entry to obtain the corresponding effective corpus information. In an embodiment of the present invention, the screening parameters may include, but are not limited to, confidence, speech energy, speech length, and recognized content.
More specifically, after the speaker's corpus information accumulates to a certain quantity, the stored history corpus may be screened and analyzed: for each stored entry, the confidence, speech energy, speech length, recognized content, and so on may be computed separately; each result may then be scored, and the final score of each stored entry computed from the scores and the weight of each screening parameter. Finally, low-scoring entries may be filtered out, and the remaining high-scoring entries kept as effective corpus information.
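The weighted screening of S201 and S202 can be illustrated as follows. The patent does not fix the weight values or how many entries survive, so the weights and the keep ratio below are assumptions:

```python
def score_corpus(entries, weights, keep_ratio=0.5):
    """Weighted screening of stored corpus entries (illustrative sketch).

    Each entry is a dict carrying per-parameter scores (e.g. confidence,
    speech energy, speech length, recognized content). The final score is
    the weighted sum over the screening parameters; low-scoring entries
    are dropped. The keep_ratio is an assumption, since the patent only
    says that low-scoring entries are filtered out.
    """
    def total(entry):
        return sum(weights[k] * entry[k] for k in weights)
    ranked = sorted(entries, key=total, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```

With equal corpus sizes, raising the confidence weight makes the screening favor cleanly recognized utterances over merely loud or long ones.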
S203: perform primary/secondary speaker judgment on each effective corpus entry, to filter out the effective corpus information belonging to the same speaker.
Specifically, all effective corpus information may be analyzed, for example by computing the signal-to-noise ratio, voice features, and speaker gender of each entry to perform primary/secondary user judgment. If the current corpus is found to contain multiple natural persons, all effective corpus information may be screened by voice features, speaker gender, and similar information, to filter out the effective corpus entries that belong to the same speaker.
It should be noted that, in another embodiment of the present invention, after primary/secondary user judgment by the signal-to-noise ratio, voice features, and speaker gender of each corpus entry, if the current corpus is found to contain multiple natural persons, the corresponding corpus entries may instead be rejected; that is, entries containing multiple natural persons are discarded and not used for model training.
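The primary/secondary judgment of S203 can be approximated with a majority vote, as a deliberate simplification: the patent judges by signal-to-noise ratio, voice features, and speaker gender, whereas this sketch votes on a single per-entry gender label.

```python
from collections import Counter

def keep_primary_speaker(entries):
    """Primary/secondary speaker screening (illustrative sketch).

    Approximates the patent's multi-cue judgment with a majority vote
    over a per-entry 'gender' label: entries matching the majority label
    are treated as the primary speaker's, the rest are dropped.
    """
    if not entries:
        return []
    majority, _ = Counter(e["gender"] for e in entries).most_common(1)[0]
    return [e for e in entries if e["gender"] == majority]
```

Under the alternative embodiment described above, mixed-speaker entries would simply be discarded instead of being attributed to the majority speaker.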
S204: perform model training on the effective corpus information belonging to the same speaker, according to the basic acoustic model, to generate the speaker's personal acoustic model.
Specifically, in an embodiment of the present invention, acoustic features may first be extracted from the effective corpus information belonging to the same speaker, and the corresponding recognized content of that corpus analyzed, to obtain high-confidence word-level recognition results and the corresponding acoustic features. A personal acoustic model may then be randomly initialized, and gradients computed according to the basic acoustic model, the word-level recognition results, and the acoustic features; finally, the computed gradients may be iterated to generate the personal acoustic model.
That is, the corpus information may be used for acoustic feature extraction while the recognized content is analyzed, retaining the high-confidence word-level recognition results and the corresponding acoustic features; a personal acoustic model may then be randomly initialized, gradients computed by combining the basic acoustic model with the retained acoustic features and word-level results, and the obtained gradients used to iteratively optimize the personal acoustic model.
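The gradient iteration of S204 can be illustrated with a toy stand-in. Adapting a real neural acoustic model is far beyond a few lines, so the sketch below fits a single scalar weight by gradient descent on a least-squares objective, starting from the "basic" parameter; it mirrors only the shape of the procedure (start from the speaker-independent model, compute gradients against the retained targets, iterate), and every name and number in it is an assumption.

```python
def adapt_scalar(base_w, pairs, lr=0.05, steps=500):
    """Toy gradient-descent adaptation (illustrative only).

    base_w: the parameter of the speaker-independent 'model' y = w * x.
    pairs:  the speaker's high-confidence (feature, target) pairs, standing
            in for retained acoustic features and word-level results.
    Iterates the mean-squared-error gradient to produce the adapted weight.
    """
    w = base_w
    for _ in range(steps):
        # d/dw of mean((w * x - y)^2) over the adaptation corpus
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= lr * grad
    return w
```

Starting from the base parameter, the iteration converges to the weight that best fits the speaker's own data, which is the intent of the adaptation step, if not its scale.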
Thus, model training can be performed on each speaker's corpus information to obtain a corresponding personal acoustic model. The personal acoustic model includes the user's voice characteristics and can be used to improve accuracy during speech recognition.
With the speech recognition method of the embodiments of the present invention, the speech information input by a speaker and the speaker information of the speaker are first obtained; it is then judged from the speaker information whether a personal acoustic model corresponding to the speaker exists. If it exists, the personal acoustic model is obtained and used to recognize the speech information; if not, the speech information is recognized with the basic acoustic model, corpus information for the speaker is generated from the speech information and stored, and the speaker's personal acoustic model is generated from the basic acoustic model and the stored corpus. In other words, the speaker-independent acoustic model (the basic acoustic model) is further trained with the historical speech data of a given speaker to obtain a personal acoustic model reflecting that speaker's own characteristics, and this personal model is used during recognition. Each person's recognition accuracy is thereby improved, which amounts to providing every speech recognition user with a privately customized recognition service and thus improves the user experience.
Fig. 3 is a flowchart of a speech recognition method according to another embodiment of the present invention.
To further improve recognition accuracy, in an embodiment of the present invention the personal acoustic model may also be optimized. Specifically, as shown in Fig. 3, the speech recognition method may include:
S301: obtain the speech information input by a speaker, and obtain the speaker information of the speaker.
S302: judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists.
S303: if it exists, obtain the personal acoustic model, and perform speech recognition on the speech information according to the speaker's personal acoustic model.
S304: perform model optimization on the personal acoustic model according to the speech information.
It can be understood that the speaker-adaptive recognition process is transparent to the user, who does not perceive the adaptive training and adaptive recognition flow. As the user uses speech recognition more and more, the speech recognition server continually optimizes the user's adaptive model, and the user's recognition accuracy keeps improving. Speaker-adaptive training is therefore not a one-off task; rather, the adaptive training process is repeated continually as the speaker's corpus accumulates. In an embodiment of the present invention, each round of adaptive training may be optimized on the basis of the previous personal acoustic model, continually improving recognition accuracy.
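The continual, user-transparent optimization described here can be sketched as an outer loop that refreshes the personal model whenever enough new corpus has accumulated. The "model" below is just a running average standing in for an acoustic model, and the threshold and all names are illustrative assumptions:

```python
class AdaptiveRecognizer:
    """Transparent continual-adaptation loop (illustrative sketch).

    Each utterance is banked into the history; once every `threshold`
    utterances, the personal 'model' (a simple average, a toy stand-in
    for retraining an acoustic model) is refreshed from the full history,
    so it keeps improving as the corpus grows.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.history = []
        self.personal = None  # None until the first adaptation round

    def observe(self, value):
        self.history.append(value)
        if len(self.history) % self.threshold == 0:
            # Each adaptation round uses the whole accumulated corpus.
            self.personal = sum(self.history) / len(self.history)
        return self.personal
```

The user only ever calls `observe`; the retraining trigger is internal, which mirrors the point that adaptive training is invisible to the user.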
S305: if it does not exist, perform speech recognition on the speech information according to the basic acoustic model, and generate and store corpus information for the speaker from the speech information.
S306: generate the speaker's personal acoustic model according to the basic acoustic model and the stored corpus information.
With the speech recognition method of the embodiments of the present invention, after speech recognition is performed on the speech information according to the speaker's personal acoustic model, the personal acoustic model may also be optimized according to the speech information, further improving personal recognition accuracy.
To implement the above embodiments, the present invention further provides a speech recognition device.
Fig. 4 is a structural block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 4, the device may include: a first acquisition module 10, a judgment module 20, a speech recognition module 30, a first generation module 40, and a second generation module 50.
Specifically, the first acquisition module 10 may be used to obtain the speech information input by a speaker and obtain the speaker information of the speaker. It should be noted that, in an embodiment of the present invention, the speaker information may be an ID corresponding to the speaker. The speaker ID may be an identifier assigned to the speaker by a server, and the speaker ID has a one-to-one correspondence with the speaker's voiceprint features.
More specifically, the first acquisition module 10 may collect the speech information input by the speaker through a microphone in a terminal and extract voiceprint features from it; the speaker information (such as the speaker ID) corresponding to those voiceprint features may then be obtained from the correspondence between voiceprint features and speaker information.
It can be understood that, in another embodiment of the present invention, the speaker information may instead be the ID or MAC address of the terminal used by the speaker. That is, after obtaining the speech information input by the speaker, the first acquisition module 10 may also obtain the ID or MAC address of the terminal used by the speaker.
The judgment module 20 may be used to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists. In an embodiment of the present invention, the personal acoustic model can be understood as the speaker's own acoustic model, which may include the speaker's voice characteristics.
The speech recognition module 30 may be used to obtain the personal acoustic model when the judgment module 20 judges that it exists, and to perform speech recognition on the speech information according to the speaker's personal acoustic model; it may also be used to perform speech recognition on the speech information according to the basic acoustic model when the judgment module 20 judges that no personal acoustic model exists.
More specifically, when the judgment module 20 judges that a personal acoustic model corresponding to the speaker exists, the speech recognition module 30 may obtain the personal acoustic model, then first perform feature extraction on the speech information to obtain the corresponding acoustic features, and then match the acoustic features against the personal acoustic model according to a certain criterion to obtain the recognition result.
In an embodiment of the present invention, the basic acoustic model can be understood as a model applicable to the general population of speakers, i.e., an acoustic model obtained by collecting the speech data of multiple speakers and training on that data.
The first generation module 40 may be used to generate and store corpus information for the speaker from the speech information. More specifically, after the speech recognition module 30 performs speech recognition on the speech information according to the basic acoustic model, the first generation module 40 may store the speech information of this input as corpus information for the speaker, forming a history corpus.
Second generation module 50 can be used for the individual that speaker is generated according to basic acoustic model and the corpus information of storage Acoustic model.Specifically, in one embodiment of the invention, as shown in figure 5, second generation module 50 may include:Sentence Disconnected unit 51, the first screening unit 52, the second screening unit 53 and generation unit 54.Specifically, judging unit 51 can be used for sentencing Whether the quantity of the corpus information of disconnected storage reaches predetermined threshold value.More specifically, judging unit 51 can determine whether the language material letter of storage Whether the quantity of breath saves bit by bit certain amount (such as predetermined threshold value).
First screening unit 52 can be used for judging that the quantity of the corpus information of storage reaches predetermined threshold value in judging unit 51 When, the corpus information to storage is screened to obtain corresponding effective corpus information.Specifically, in embodiments of the invention In, the first screening unit 52 can first obtain the screening parameter of the corpus information of every storage, and every is deposited according to screening parameter The corpus information of storage is scored, afterwards, and the corpus information of every storage is generated according to the weight of appraisal result and screening parameter Fraction, and screened corresponding effective to obtain to the corpus information of storage according to the fraction of the corpus information of every storage Corpus information.Wherein, in an embodiment of the present invention, to may include but be not limited to confidence level, speech energy, voice long for screening parameter Degree and speech recognition content.
More specifically, after the corpus information of speaker runs up to certain amount, the first screening unit 52 can be to storage History corpus information carry out screening analysis, you can to every storage corpus information carry out confidence calculations, voice energy respectively Calculating, the calculating of voice length computation, speech recognition content etc. are measured, result of calculation can respectively be given a mark afterwards, then basis The weight calculation of marking result and each screening parameter goes out the final score of the corpus information of every storage, finally, may filter that The corpus information of every low storage of fraction, and it regard the high corpus information of remaining fraction as effective corpus information.
The second screening unit 53 is configured to perform a primary/secondary speaker judgment on each effective corpus entry so as to filter out the effective corpus information belonging to the same speaker. More specifically, the second screening unit 53 can analyze all the effective corpus entries, for example performing the primary/secondary user judgment by computing the signal-to-noise ratio, voice features, and speaker gender of each entry. If it finds that the current speaker actually comprises multiple natural persons, it can screen all the effective corpus entries by voice features, speaker gender, and similar information to single out the entries belonging to the same speaker.
It should be noted that, in another embodiment of the present invention, after the second screening unit 53 performs the primary/secondary user judgment by computing the signal-to-noise ratio, voice features, and speaker gender of each corpus entry, if it finds that the current speaker comprises multiple natural persons, it may instead reject the current corpus entry; that is, any entry containing multiple natural persons is discarded and excluded from model training.
The generation unit 54 is configured to perform model training on the effective corpus information belonging to the same speaker, based on the basic acoustic model, so as to generate the speaker's personal acoustic model. Specifically, in an embodiment of the present invention, the generation unit 54 can first extract acoustic features from the effective corpus information belonging to the same speaker and analyze the corresponding speech recognition content, so as to obtain the high-confidence word-level recognition results and the corresponding acoustic features; it can then randomly initialize the personal acoustic model and compute gradients from the basic acoustic model, the word-level recognition results, and the acoustic features; finally, it can iterate on the computed gradients to generate the personal acoustic model.
That is, the generation unit 54 can use the corpus information to extract acoustic features while analyzing the speech recognition content, retaining the high-confidence word-level recognition results and the corresponding acoustic features. It can then randomly initialize the personal acoustic model and, combining the basic acoustic model with the acoustic features and word-level recognition results retained in the previous step, compute gradients, and finally optimize the personal acoustic model iteratively using the obtained gradients.
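The initialize-then-iterate-on-gradients pattern can be illustrated with a toy model. The real system trains a deep acoustic model against word-level targets; here a one-dimensional least-squares objective stands in for that criterion, which the patent does not detail, and the learning rate and iteration count are arbitrary.

```python
def adapt_model(base_weight, samples, lr=0.1, iters=200):
    """Toy sketch of speaker adaptation by gradient iteration.

    Start from the speaker-independent (base) model's weight and refine it
    on the speaker's retained (feature, word-level target) pairs. The
    gradient is that of a mean-squared-error loss for the linear
    prediction w * x, a stand-in for the real acoustic training criterion.
    """
    w = base_weight                      # initialize from the base model
    for _ in range(iters):
        # d(MSE)/dw averaged over all retained samples
        grad = sum((w * x - t) * x for x, t in samples) / len(samples)
        w -= lr * grad                   # iterate on the computed gradient
    return w
```

With enough iterations the adapted weight converges to the value that best fits the speaker's own data, which is the intended effect of the personal model.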
In this way, model training can be performed on each speaker's corpus information to obtain a corresponding personal acoustic model. The personal acoustic model captures the user's voice characteristics and can be used to improve accuracy in the speech recognition process.
Further, in one embodiment of the present invention, as shown in Fig. 6, the speech recognition apparatus may also include an optimization module 60. The optimization module 60 is configured to optimize the personal acoustic model according to the voice information after speech recognition has been performed on that voice information using the speaker's personal acoustic model. It should be appreciated that the speaker-adaptive speech recognition process is transparent to the user: the user does not perceive the adaptive-training and adaptive-recognition flow. As the user's number of speech recognition requests grows, the speech recognition server continuously optimizes the user's adaptive model, and the user's recognition accuracy keeps improving accordingly. Speaker-adaptive training is therefore not a one-off task; as the speaker's corpus accumulates, the adaptive training process is repeated continually. In an embodiment of the present invention, each adaptive training round performed by the optimization module 60 starts from, and optimizes, the previous personal acoustic model, steadily improving recognition accuracy. This further improves the accuracy of personalized speech recognition.
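The continual-optimization behavior, where each adaptation round resumes from the previously stored personal model rather than from the base model, might be organized as in the following sketch. The class name and the `train_step` hook are hypothetical; `train_step` stands in for whatever adaptation procedure the system uses.

```python
class PersonalModelStore:
    """Sketch of the optimization loop: each new batch of a speaker's
    speech refines that speaker's previously stored personal model, so
    accuracy improves as usage accumulates."""

    def __init__(self, base_model):
        self.base = base_model   # speaker-independent model
        self.models = {}         # speaker id -> current personal model

    def update(self, speaker_id, new_corpus, train_step):
        # Resume from the existing personal model if one exists,
        # otherwise start from the speaker-independent base model.
        start = self.models.get(speaker_id, self.base)
        self.models[speaker_id] = train_step(start, new_corpus)
        return self.models[speaker_id]
```

Each call to `update` thus compounds on the previous round, mirroring the "repeat adaptive training as corpus accumulates" flow described above.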
With the speech recognition apparatus of the embodiments of the present invention, the first acquisition module obtains the voice information input by a speaker together with the speaker information of that speaker; the judging module judges, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists; if it exists, the speech recognition module obtains the personal acoustic model and performs speech recognition on the voice information according to the speaker's personal acoustic model; if it does not exist, the speech recognition module performs speech recognition on the voice information according to the basic acoustic model, the first generation module generates and stores the speaker's corpus information according to the voice information, and the second generation module generates the speaker's personal acoustic model according to the basic acoustic model and the stored corpus information. In other words, starting from the speaker-independent acoustic model (i.e., the basic acoustic model above), further training is carried out with the given speaker's historical speech data to obtain a personal acoustic model bearing that speaker's own characteristics, and this personal acoustic model is then used for recognition in the speech recognition process. This improves each individual's recognition accuracy and amounts to providing every speech recognition user with a personally customized speech recognition service, thereby improving the user experience.
In the description of the present invention, it should be understood that the terms "first" and "second" are used for descriptive purposes only and shall not be construed as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless otherwise specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine different embodiments or examples, and the features of different embodiments or examples, described in this specification.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing the steps of a specific logical function or process, and the scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functionality involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, apparatus, or device). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CD-ROM). The computer-readable medium could even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be appreciated that parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the embodiments above, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following technologies known in the art, or a combination thereof: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will appreciate that all or part of the steps carried by the methods of the embodiments above may be completed by a program instructing relevant hardware, the program being storable in a computer-readable storage medium and, when executed, including one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, or each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the embodiments above are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the embodiments above within the scope of the present invention.

Claims (12)

1. A speech recognition method, characterized by comprising the following steps:
obtaining voice information input by a speaker, and obtaining speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker or an ID or MAC address of a terminal used by the speaker;
judging, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
if it exists, obtaining the personal acoustic model, and performing speech recognition on the voice information according to the personal acoustic model of the speaker;
if it does not exist, performing speech recognition on the voice information according to a basic acoustic model, and generating and storing corpus information of the speaker according to the voice information; and
generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
2. The speech recognition method according to claim 1, characterized in that generating the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information specifically comprises:
judging whether the amount of stored corpus information reaches a preset threshold;
if the preset threshold is reached, screening the stored corpus information to obtain corresponding effective corpus information;
performing a primary/secondary speaker judgment on each effective corpus entry to filter out the effective corpus information belonging to the same speaker; and
performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model, so as to generate the personal acoustic model of the speaker.
3. The speech recognition method according to claim 2, characterized in that screening the stored corpus information to obtain the corresponding effective corpus information specifically comprises:
obtaining screening parameters of each stored corpus entry, and scoring each stored corpus entry according to the screening parameters;
generating a score for each stored corpus entry according to the scoring results and the weights of the screening parameters; and
screening the stored corpus information according to the score of each stored corpus entry to obtain the corresponding effective corpus information.
4. The speech recognition method according to claim 3, characterized in that the screening parameters comprise confidence, speech energy, speech length, and speech recognition content.
5. The speech recognition method according to claim 2, characterized in that performing model training on the effective corpus information belonging to the same speaker according to the basic acoustic model to generate the personal acoustic model of the speaker specifically comprises:
extracting acoustic features from the effective corpus information belonging to the same speaker, and analyzing the speech recognition content corresponding to the effective corpus information of the same speaker, so as to obtain high-confidence word-level recognition results and corresponding acoustic features;
randomly initializing the personal acoustic model, and performing gradient computation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterating on the computed gradients to generate the personal acoustic model.
6. The speech recognition method according to any one of claims 1 to 5, characterized in that, after speech recognition is performed on the voice information according to the personal acoustic model of the speaker, the method further comprises:
performing model optimization on the personal acoustic model according to the voice information.
7. A speech recognition apparatus, characterized by comprising:
a first acquisition module, configured to obtain voice information input by a speaker and to obtain speaker information of the speaker, wherein the speaker information is an ID corresponding to the speaker or an ID or MAC address of a terminal used by the speaker;
a judging module, configured to judge, according to the speaker information, whether a personal acoustic model corresponding to the speaker exists;
a speech recognition module, configured to obtain the personal acoustic model and perform speech recognition on the voice information according to the personal acoustic model of the speaker when the judging module judges that the personal acoustic model exists, and to perform speech recognition on the voice information according to a basic acoustic model when the judging module judges that the personal acoustic model does not exist;
a first generation module, configured to generate and store corpus information of the speaker according to the voice information; and
a second generation module, configured to generate the personal acoustic model of the speaker according to the basic acoustic model and the stored corpus information.
8. The speech recognition apparatus according to claim 7, characterized in that the second generation module comprises:
a judging unit, configured to judge whether the amount of stored corpus information reaches a preset threshold;
a first screening unit, configured to screen the stored corpus information to obtain corresponding effective corpus information when the judging unit judges that the amount of stored corpus information reaches the preset threshold;
a second screening unit, configured to perform a primary/secondary speaker judgment on each effective corpus entry to filter out the effective corpus information belonging to the same speaker; and
a generation unit, configured to perform model training on the effective corpus information belonging to the same speaker according to the basic acoustic model, so as to generate the personal acoustic model of the speaker.
9. The speech recognition apparatus according to claim 8, characterized in that the first screening unit is specifically configured to:
obtain screening parameters of each stored corpus entry, and score each stored corpus entry according to the screening parameters;
generate a score for each stored corpus entry according to the scoring results and the weights of the screening parameters; and
screen the stored corpus information according to the score of each stored corpus entry to obtain the corresponding effective corpus information.
10. The speech recognition apparatus according to claim 9, characterized in that the screening parameters comprise confidence, speech energy, speech length, and speech recognition content.
11. The speech recognition apparatus according to claim 8, characterized in that the generation unit is specifically configured to:
extract acoustic features from the effective corpus information belonging to the same speaker, and analyze the speech recognition content corresponding to the effective corpus information of the same speaker, so as to obtain high-confidence word-level recognition results and corresponding acoustic features;
randomly initialize the personal acoustic model, and perform gradient computation according to the basic acoustic model, the word-level recognition results, and the acoustic features; and
iterate on the computed gradients to generate the personal acoustic model.
12. The speech recognition apparatus according to any one of claims 7 to 11, characterized by further comprising:
an optimization module, configured to perform model optimization on the personal acoustic model according to the voice information after speech recognition is performed on the voice information according to the personal acoustic model of the speaker.
CN201510558047.4A 2015-09-02 2015-09-02 Audio recognition method and device Active CN105096941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510558047.4A CN105096941B (en) 2015-09-02 2015-09-02 Audio recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510558047.4A CN105096941B (en) 2015-09-02 2015-09-02 Audio recognition method and device

Publications (2)

Publication Number Publication Date
CN105096941A CN105096941A (en) 2015-11-25
CN105096941B true CN105096941B (en) 2017-10-31

Family

ID=54577227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510558047.4A Active CN105096941B (en) 2015-09-02 2015-09-02 Audio recognition method and device

Country Status (1)

Country Link
CN (1) CN105096941B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105632489A (en) * 2016-01-20 2016-06-01 曾戟 Voice playing method and voice playing device
CN107545889B (en) * 2016-06-23 2020-10-23 华为终端有限公司 Model optimization method and device suitable for pattern recognition and terminal equipment
CN107767880B (en) * 2016-08-16 2021-04-16 杭州萤石网络有限公司 Voice detection method, camera and intelligent home nursing system
CN106683680B (en) * 2017-03-10 2022-03-25 百度在线网络技术(北京)有限公司 Speaker recognition method and device, computer equipment and computer readable medium
CN108737324B (en) * 2017-04-13 2021-03-02 腾讯科技(深圳)有限公司 Method and device for generating artificial intelligence service assembly and related equipment and system
CN107481718B (en) * 2017-09-20 2019-07-05 Oppo广东移动通信有限公司 Audio recognition method, device, storage medium and electronic equipment
CN107702706B (en) * 2017-09-20 2020-08-21 Oppo广东移动通信有限公司 Path determining method and device, storage medium and mobile terminal
CN107910008B (en) * 2017-11-13 2021-06-11 河海大学 Voice recognition method based on multiple acoustic models for personal equipment
CN108538293B (en) * 2018-04-27 2021-05-28 海信视像科技股份有限公司 Voice awakening method and device and intelligent device
CN108717854A (en) * 2018-05-08 2018-10-30 哈尔滨理工大学 Method for distinguishing speek person based on optimization GFCC characteristic parameters
CN110634472A (en) * 2018-06-21 2019-12-31 中兴通讯股份有限公司 Voice recognition method, server and computer readable storage medium
CN110728976B (en) * 2018-06-30 2022-05-06 华为技术有限公司 Method, device and system for voice recognition
CN109215638B (en) * 2018-10-19 2021-07-13 珠海格力电器股份有限公司 Voice learning method and device, voice equipment and storage medium
CN109243468B (en) * 2018-11-14 2022-07-12 出门问问创新科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110164431B (en) * 2018-11-15 2023-01-06 腾讯科技(深圳)有限公司 Audio data processing method and device and storage medium
CN109635209B (en) * 2018-12-12 2021-03-12 广东小天才科技有限公司 Learning content recommendation method and family education equipment
CN110096479B (en) * 2019-03-15 2023-07-25 平安科技(深圳)有限公司 Batch renaming method and device for voice information, computer equipment and storage medium
CN110264997A (en) * 2019-05-30 2019-09-20 北京百度网讯科技有限公司 The method, apparatus and storage medium of voice punctuate
CN110120221A (en) * 2019-06-06 2019-08-13 上海蔚来汽车有限公司 The offline audio recognition method of user individual and its system for vehicle system
CN110570843B (en) * 2019-06-28 2021-03-05 北京蓦然认知科技有限公司 User voice recognition method and device
CN110288995B (en) * 2019-07-19 2021-07-16 出门问问(苏州)信息科技有限公司 Interaction method and device based on voice recognition, storage medium and electronic equipment
CN113096646B (en) * 2019-12-20 2022-06-07 北京世纪好未来教育科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN111081262A (en) * 2019-12-30 2020-04-28 杭州中科先进技术研究院有限公司 Lightweight speech recognition system and method based on customized model
CN111261168A (en) * 2020-01-21 2020-06-09 杭州中科先进技术研究院有限公司 Speech recognition engine and method supporting multi-task and multi-model
CN111292746A (en) * 2020-02-07 2020-06-16 普强时代(珠海横琴)信息技术有限公司 Voice input conversion system based on human-computer interaction
CN113823263A (en) * 2020-06-19 2021-12-21 深圳Tcl新技术有限公司 Voice recognition method and system
CN111816174A (en) * 2020-06-24 2020-10-23 北京小米松果电子有限公司 Speech recognition method, device and computer readable storage medium
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Voice recognition method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1655235A * 2004-02-12 2005-08-17 Microsoft Corporation Automatic identification of telephone callers based on voice characteristics
CN103187053A * 2011-12-31 2013-07-03 Lenovo (Beijing) Co., Ltd. Input method and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7539616B2 (en) * 2006-02-20 2009-05-26 Microsoft Corporation Speaker authentication using adapted background models


Also Published As

Publication number Publication date
CN105096941A (en) 2015-11-25

Similar Documents

Publication Publication Date Title
CN105096941B (en) Audio recognition method and device
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN102982811B (en) Voice endpoint detection method based on real-time decoding
CN107437415B (en) Intelligent voice interaction method and system
CN105336322B (en) Polyphone model training method, and speech synthesis method and device
CN104036774B (en) Tibetan dialect recognition methods and system
CN102270450B (en) System and method of multi model adaptation and voice recognition
CN107945790B (en) Emotion recognition method and emotion recognition system
CN107301860A (en) Audio recognition method and device based on Chinese and English mixing dictionary
CN102142253B (en) Voice emotion identification equipment and method
CN108417201B (en) Single-channel multi-speaker identity recognition method and system
CN106503805A (en) A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN107492382A (en) Voiceprint extracting method and device based on neutral net
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
CN104765996B (en) Voiceprint password authentication method and system
CN104903954A (en) Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination
JP6440967B2 (en) End-of-sentence estimation apparatus, method and program thereof
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN111862942B (en) Method and system for training mixed speech recognition model of Mandarin and Sichuan
CN106782615A (en) Speech data emotion detection method and apparatus and system
CN101548313A (en) Voice activity detection system and method
CN110853621B (en) Voice smoothing method and device, electronic equipment and computer storage medium
CN109448704A (en) Construction method, device, server and the storage medium of tone decoding figure
CN108320732A (en) The method and apparatus for generating target speaker's speech recognition computation model
CN108986798A (en) Processing method, device and the equipment of voice data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant