Speech recognition method, apparatus, terminal and computer-readable storage medium
Technical field
Embodiments of the present invention relate to speech recognition technology, and in particular to a speech recognition method, apparatus, terminal and computer-readable storage medium.
Background art
In voice command word recognition, misrecognition has always been a difficult problem. The misrecognition rate is high because prior-art command word recognition methods are generally implemented by constructing a decoding network that contains multiple groups of phoneme sequences corresponding to preset command words. For any input speech, the phoneme sequence that best matches that speech is searched out of the decoding network, so even speech that is not a command word is recognized as one, which causes misrecognition.
The current approach to the problem of noise being recognized as a command word is to calculate a confidence score for the recognition result: a confidence above a preset threshold indicates that the recognition is correct, and a confidence below the threshold indicates that no command word was recognized. However, the confidence calculation depends on many factors, and environmental influences in particular can make the confidence value vary over a wide range. In noisy environments it often happens that a correct recognition result has very low confidence while a wrong recognition result has very high confidence, so the misrecognition rate remains high.
Summary of the invention
The present invention provides a speech recognition method, apparatus, terminal and computer-readable storage medium for voice commands, so as to avoid recognizing noise as a command word without calculating a confidence score after recognition, thereby reducing the misrecognition rate.
In a first aspect, an embodiment of the present invention provides a speech recognition method, including:
calculating, according to acoustic features of collected speech, an acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;
obtaining, according to the acoustic similarity probability, a matching probability between the speech and the phoneme sequence;
recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a second aspect, the present invention further provides a speech recognition apparatus, including:
a calculation module, configured to calculate, according to acoustic features of collected speech, an acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;
a matching module, configured to obtain, according to the acoustic similarity probability, a matching probability between the speech and the phoneme sequence;
a recognition module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
In a third aspect, the present invention further provides a terminal, the terminal including:
one or more processors;
a memory, configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method provided by any embodiment of the present invention.
In a fourth aspect, the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the speech recognition method provided by any embodiment of the present invention.
By adding phoneme sequences corresponding to noise content to the decoding network, the present invention allows the collected speech to be recognized as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, with no need to calculate a confidence score for the search result after searching the decoding network. This solves the prior-art problem that a confidence calculation affected by environmental factors leads to a high misrecognition rate, avoids recognizing noise as a command word, and reduces the misrecognition rate.
Brief description of the drawings
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention;
Fig. 3 is a schematic structural diagram of the speech recognition apparatus provided by Embodiment three of the present invention;
Fig. 4 is a schematic structural diagram of the terminal provided by Embodiment four of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention. This embodiment is applicable to command word recognition. The method may be performed by a speech recognition apparatus and specifically includes the following steps:
Step 110: according to acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
The decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content. Because the embodiments of the present invention are applied to recognizing voice commands, any non-command-word speech interferes with command word recognition and is therefore noise; accordingly, noise in the embodiments of the present invention refers to any non-command-word speech. Specifically, the decoding network may consist of multiple interconnected phoneme nodes, and the connected phoneme nodes in the network form phoneme sequences. In the field of speech recognition, the acoustic similarity probability between a phoneme of the speech and a phoneme in the decoding network is generally obtained by building an acoustic model for each phoneme in the decoding network; the acoustic similarity probability is the probability output by the corresponding acoustic model when the acoustic features of the speech are taken as input.
Step 120: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence.
To simplify the data processing in recognition, the acoustic similarity probability may be used directly as the matching probability. For application scenarios that demand high recognition accuracy, however, the matching probability may include other information in addition to the acoustic similarity probability. For example, for a decoding network built as a weighted finite-state transducer, the matching probability also includes weight information for the phoneme sequence, and this weight information may reflect the probability that the phoneme sequence occurs in practice, that is, the language model probability. In a command word recognition scenario, for instance, some command words such as "volume up" and "shut down" occur with higher probability in practice, while others occur with lower probability; when the acoustic features of the two are similar, the weight of the phoneme sequence corresponding to the former can be set higher than that of the latter. In addition, the weight information may be adjusted according to the recognition rate observed while the speech recognition method is in use.
Step 130: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
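Steps 110 to 130 can be sketched as follows. This is a minimal illustration only, under stated assumptions: the phoneme symbols, the `DECODING_NETWORK` contents, the `<nz>` noise phoneme and the probabilities inside `toy_acoustic_log_prob` are all hypothetical stand-ins, and a real system would use trained acoustic models rather than symbol comparison. The sketch takes the simplified case in which the acoustic similarity probability is used directly as the matching probability.

```python
import math

NOISE_PHONEME = "<nz>"  # hypothetical symbol for the noise phoneme

# Hypothetical decoding network: content label -> phoneme sequence.
# One entry corresponds to noise content, the others to preset command words.
DECODING_NETWORK = {
    "shut down": ["sh", "ah", "t", "d", "aw", "n"],
    "volume up": ["v", "aa", "l", "y", "uw", "m", "ah", "p"],
    "<noise>": [NOISE_PHONEME],
}

def toy_acoustic_log_prob(frames, phonemes):
    """Toy stand-in for a trained acoustic model (assumed values)."""
    total = 0.0
    for i, frame in enumerate(frames):
        # Sequences shorter than the input reuse their last phoneme.
        phoneme = phonemes[min(i, len(phonemes) - 1)]
        if phoneme == NOISE_PHONEME:
            p = 0.4   # noise model: moderately likely for any sound
        elif phoneme == frame:
            p = 0.9   # frame agrees with the expected phoneme
        else:
            p = 0.05  # frame disagrees with the expected phoneme
        total += math.log(p)
    return total

def recognize(frames, network=DECODING_NETWORK):
    # Step 110: acoustic similarity (log) probability for every phoneme sequence.
    acoustic = {c: toy_acoustic_log_prob(frames, p) for c, p in network.items()}
    # Step 120: simplified case, matching probability = acoustic similarity probability.
    matching = acoustic
    # Step 130: content of the phoneme sequence with the highest matching probability.
    return max(matching, key=matching.get)
```

With these toy values, an input that follows a command word's phonemes is recognized as that command word, while an unrelated input scores highest on the noise sequence, so no confidence calculation is needed afterwards.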
The above steps work as follows: phoneme sequences corresponding to noise content are added to the decoding network, so that input noise can, according to its acoustic features, be matched to the noise phoneme sequence corresponding to the noise content in the decoding network. The input noise is thus recognized as such on the basis of its acoustic features, which prevents non-command words from being recognized as command words. Compared with the prior-art method of calculating a confidence score after recognition, the scheme of this embodiment for avoiding recognizing noise as a command word is not affected by environmental factors and greatly reduces the misrecognition rate.
To reduce the misrecognition rate by making it more likely that noise is matched to the noise phoneme sequence corresponding to the noise content in the decoding network, this embodiment provides a preferred implementation. Specifically, step 110, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network according to the acoustic features of the collected speech, specifically includes:
obtaining pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise differences in acoustic features are greater than a preset threshold;
calculating, according to the acoustic features of the collected speech and using the acoustic models, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
In the above preferred implementation, the noise samples used to train the noise acoustic model include multiple speech samples whose pairwise differences in acoustic features are greater than a preset threshold; that is, the noise acoustic model is trained on many widely differing speech samples, such as noisy ambient sound and a large number of mutually different non-command-word phrases. An acoustic model trained on a large number of widely differing speech samples tends towards a natural sound whose difference from each of the various sounds is minimized, so its phoneme sequence matches various non-command-word speech more easily. By contrast, the command word samples used to train a command word acoustic model are usually recordings of the command word read aloud in different accents; the acoustic features of these samples differ little from one another, so the model gives a high acoustic similarity probability only to sounds close to the command word. The above preferred implementation therefore makes it more likely that noise is matched to the phoneme sequence corresponding to the noise content in the decoding network, reducing the misrecognition rate.
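One way to assemble a set of noise samples whose pairwise acoustic feature differences all exceed a preset threshold is a greedy filter, sketched below. The source does not specify how the difference is measured; the sketch assumes each sample has already been reduced to a fixed-length feature vector and uses Euclidean distance as the difference. The function names and the example vectors are hypothetical.

```python
import math

def feature_difference(a, b):
    # Assumed difference measure: Euclidean distance between feature vectors.
    return math.dist(a, b)

def select_diverse_samples(samples, threshold):
    """Greedily keep samples whose difference from every kept sample exceeds threshold."""
    selected = []
    for sample in samples:
        if all(feature_difference(sample, kept) > threshold for kept in selected):
            selected.append(sample)
    return selected

# Hypothetical 2-dimensional feature vectors for candidate noise recordings.
candidates = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0), (3.0, 4.2), (10.0, 0.0)]
noise_training_set = select_diverse_samples(candidates, threshold=1.0)
```

Near-duplicate recordings are dropped, so every pair in `noise_training_set` differs by more than the threshold, which is the property the preferred implementation requires of the noise training samples.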
Further, the decoding network is built as a weighted finite-state transducer. In that case step 120, obtaining the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability, specifically includes: calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence. Alternatively, the product of the acoustic similarity probability and the weight may be calculated as the matching probability.
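The two combinations can be sketched as below. The text states only that a sum or a product may be used; treating the sum as operating on log-domain scores, so that it ranks candidates the same way as the product of plain probabilities, is an assumption of this sketch, as are the function names and the example numbers.

```python
import math

def matching_score_sum(acoustic_log_prob, log_weight):
    # Sum variant, assumed here to operate on log-domain scores.
    return acoustic_log_prob + log_weight

def matching_score_product(acoustic_prob, weight):
    # Product variant on plain probabilities.
    return acoustic_prob * weight

# Hypothetical candidates with similar acoustic scores: (acoustic probability, weight).
candidates = {
    "volume up": (0.30, 0.6),            # frequently used command, higher weight
    "rarely used command": (0.32, 0.2),  # rarely used command, lower weight
}
best = max(candidates, key=lambda c: matching_score_product(*candidates[c]))
```

Here the frequently used command wins even though its acoustic score is slightly lower, which is the effect of the weight information described for step 120.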
Further, the decoding network also includes a phoneme sequence corresponding to silence content. Adding a phoneme sequence for silence can improve the user experience, because noise and silence can then be distinguished and different signals fed back to the user. For example, noise may be caused by the user misspeaking, so a message prompting the user to repeat the command can be output; silence may be caused by the user accidentally touching the recognition device and triggering voice input, so the recognition output can be set to empty and no operation performed, leaving the user undisturbed and thereby improving the user experience.
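The different feedback for noise, silence and command words might be dispatched as in the following sketch. The labels `<noise>` and `<silence>`, the function name `respond` and the message texts are hypothetical; the source only states that noise can trigger a prompt to repeat and that silence can produce empty output.

```python
NOISE = "<noise>"      # hypothetical label for noise content
SILENCE = "<silence>"  # hypothetical label for silence content

def respond(recognized_content):
    """Return the feedback for a recognition result, or None for no action."""
    if recognized_content == NOISE:
        # Possibly a misspoken command: ask the user to repeat it.
        return "Please repeat the command."
    if recognized_content == SILENCE:
        # Possibly an accidental touch: output nothing and do not disturb the user.
        return None
    # Otherwise the content is a preset command word.
    return f"Executing: {recognized_content}"
```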
It should be noted that calculating the acoustic similarity probability, obtaining the matching probability and then finding the phoneme sequence with the highest matching probability may be done by first calculating the matching probability between each phoneme sequence and the collected speech and then comparing the matching probabilities to obtain the phoneme sequence with the highest one. Alternatively, the phonemes in the decoding network whose acoustic similarity probability with the initial phoneme of the collected speech is high may be searched first; then, according to the acoustic similarity probability, the weight (including language model probability information) and so on, it is determined which of the groups of phoneme sequences containing those phonemes has a next phoneme that matches the next phoneme of the collected speech with the highest probability, and the next phoneme node of that group is thereby determined to match the next phoneme of the collected speech. This search is repeated, and the phoneme sequence finally obtained is the one with the highest matching probability.
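The second, phoneme-by-phoneme search strategy can be sketched with a prefix tree over the phoneme sequences. This is a simplified greedy illustration under stated assumptions: each step of the input is represented as a hypothetical dictionary of per-phoneme acoustic probabilities, weights are omitted, and only the single best branch is followed at each step, whereas practical decoders typically keep several hypotheses alive.

```python
def build_phoneme_tree(network):
    """Build a prefix tree of phoneme nodes from content -> phoneme sequence."""
    root = {"children": {}, "content": None}
    for content, phonemes in network.items():
        node = root
        for phoneme in phonemes:
            node = node["children"].setdefault(
                phoneme, {"children": {}, "content": None}
            )
        node["content"] = content  # the last node of a sequence carries its content
    return root

def greedy_search(root, step_probs):
    """Follow, at each step, the child phoneme with the highest acoustic probability."""
    node = root
    for probs in step_probs:
        if not node["children"]:
            break  # reached the end of a phoneme sequence
        best = max(node["children"], key=lambda p: probs.get(p, 0.0))
        node = node["children"][best]
    return node["content"]

# Hypothetical network and per-step acoustic probabilities.
tree = build_phoneme_tree({
    "shut down": ["sh", "ah", "t"],
    "show": ["sh", "ow"],
    "<noise>": ["nz"],
})
command_result = greedy_search(tree, [{"sh": 0.8, "nz": 0.1}, {"ah": 0.7, "ow": 0.2}, {"t": 0.9}])
noise_result = greedy_search(tree, [{"sh": 0.2, "nz": 0.7}])
```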
In summary, in the technical solution of this embodiment, phoneme sequences corresponding to noise content are added to the decoding network, so that the collected speech is recognized as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, with no need to calculate a confidence score for the search result after the phoneme sequence search. This solves the prior-art problem that a confidence calculation affected by environmental factors leads to a high misrecognition rate, avoids recognizing noise as a command word, and reduces the misrecognition rate.
Embodiment two
Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention. This embodiment is applicable to command word recognition, and the method may be performed by a speech recognition apparatus. On the basis of the speech recognition method of Embodiment one, this embodiment adds a step of automatically adjusting a decoding network parameter, so that the speech recognition method can modify the parameter dynamically and keep reducing the misrecognition rate. The speech recognition method provided by this embodiment includes:
Step 210: according to acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content.
Step 220: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence.
Step 230: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Step 240: if it is confirmed that the collected speech is noise but the speech was recognized as a preset command word, increase the weight of the phoneme sequence corresponding to the noise content in the decoding network.
In this embodiment, confirmation information may also be collected after the speech is recognized (the confirmation information may be provided by the user) to confirm whether the recognition result is correct. If it is confirmed that the collected speech is noise but the speech was recognized as a command word, the misrecognition rate is still somewhat high, so the weight of the phoneme sequence corresponding to the noise content in the decoding network is increased. This raises the matching probability between the noise phoneme sequence and collected speech, making non-command-word speech more likely to be recognized as noise. Further, the weight of the noise phoneme sequence may be increased only once the number of times that collected speech is confirmed to be noise but was recognized as a command word reaches a preset threshold, so that isolated recognition errors do not unbalance the adjustment.
Preferably, the method further includes: if it is confirmed that the collected speech is a command word but the speech was recognized as noise, decreasing the weight of the phoneme sequence corresponding to the noise content in the decoding network.
Further, the weight of the noise phoneme sequence may be decreased only once the number of times that collected speech is confirmed to be a command word but was recognized as noise reaches a preset threshold. Reducing the misrecognition rate inevitably causes a small number of command words to be recognized as noise, and the above preferred scheme can improve the recognition rate for command words.
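The threshold-gated weight adjustment described above could be organised as in the following sketch. The class name, the fixed adjustment step, the default threshold and the choice to reset a counter after each adjustment are all assumptions made for illustration; the source specifies only the direction of the adjustment and that it may be gated by a preset count.

```python
class NoiseWeightAdjuster:
    """Adjusts the noise phoneme sequence weight from confirmed recognition errors."""

    def __init__(self, noise_weight, step=0.05, threshold=3):
        self.noise_weight = noise_weight
        self.step = step            # assumed fixed adjustment step
        self.threshold = threshold  # preset number of confirmed errors
        self.noise_as_command = 0   # noise recognized as a command word
        self.command_as_noise = 0   # command word recognized as noise

    def confirm(self, actual_is_noise, recognized_as_noise):
        """Record one confirmed result and return the (possibly updated) weight."""
        if actual_is_noise and not recognized_as_noise:
            self.noise_as_command += 1
            if self.noise_as_command >= self.threshold:
                self.noise_weight += self.step  # make noise more likely to match
                self.noise_as_command = 0
        elif not actual_is_noise and recognized_as_noise:
            self.command_as_noise += 1
            if self.command_as_noise >= self.threshold:
                self.noise_weight -= self.step  # make command words more likely to match
                self.command_as_noise = 0
        return self.noise_weight
```

A single confirmed error leaves the weight unchanged; only when the count reaches the threshold is the weight moved, which keeps isolated errors from unbalancing the adjustment.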
Further, the weight of the phoneme sequence corresponding to the noise content in the decoding network may also be adjusted according to an instruction triggered by the user, so as to reduce the misrecognition rate or improve the recognition rate.
In the technical solution of this embodiment, phoneme sequences corresponding to noise content are added to the decoding network, so that the collected speech is recognized as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, which avoids recognizing noise as a command word and reduces the misrecognition rate. In addition, the weight of the noise phoneme sequence in the decoding network is adjusted according to the recognition results, so that the parameter is modified dynamically and the misrecognition rate keeps decreasing.
Embodiment three
Fig. 3 is a schematic structural diagram of the speech recognition apparatus provided by Embodiment three of the present invention. The speech recognition apparatus includes:
a calculation module 310, configured to calculate, according to acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;
a matching module 320, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;
a recognition module 330, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Preferably, the decoding network is built as a weighted finite-state transducer. The speech recognition apparatus further includes:
a weight adjustment module 340, configured to increase the weight of the phoneme sequence corresponding to the noise content in the decoding network if it is confirmed that the collected speech is noise but the speech was recognized as a preset command word.
Preferably, the matching module 320 includes:
a sum calculation unit, configured to calculate the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence.
Preferably, the decoding network further includes a phoneme sequence corresponding to silence content.
Preferably, the calculation module includes:
a model acquisition unit, configured to obtain pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise differences in acoustic features are greater than a preset threshold;
a model operation unit, configured to calculate, according to the acoustic features of the collected speech and using the acoustic models, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.
The speech recognition apparatus provided by the embodiment of the present invention can perform the speech recognition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.
Embodiment four
Fig. 4 is a schematic structural diagram of a terminal provided by Embodiment four of the present invention. As shown in Fig. 4, the terminal includes a processor 410, a memory 420, an input device 430 and an output device 440. The terminal may have one or more processors 410; one processor 410 is taken as an example in Fig. 4. The processor 410, memory 420, input device 430 and output device 440 in the terminal may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.
As a computer-readable storage medium, the memory 420 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present invention (for example, the calculation module 310, matching module 320, recognition module 330 and weight adjustment module 340 in the speech recognition apparatus). By running the software programs, instructions and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the terminal, that is, implements the speech recognition method described above.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application program required by at least one function, and the data storage area may store data created through use of the terminal, and the like. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 420 may further include memory located remotely from the processor 410, and such remote memory may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input device 430 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. The output device 440 may include a display device such as a display screen.
Embodiment five
Embodiment five of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a computer processor, implements a speech recognition method, the method including:
calculating, according to acoustic features of collected speech, an acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;
obtaining, according to the acoustic similarity probability, a matching probability between the speech and the phoneme sequence;
recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.
Of course, in the computer-readable storage medium storing a computer program provided by the embodiment of the present invention, the program is not limited to the method operations described above and can also perform the related operations in the speech recognition method provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the speech recognition apparatus, the units and modules are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be realized. In addition, the specific names of the functional units serve only to distinguish them from one another and are not intended to limit the scope of protection of the present invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.