CN107644638A

CN107644638A - Speech recognition method, device, terminal and computer-readable storage medium

Info

Publication number: CN107644638A
Application number: CN201710964474.1A
Authority: CN
Inventors: 何金来; 雷宇
Original assignee: Beijing Intelligent Housekeeper Technology Co Ltd
Current assignee: Beijing Rubu Technology Co.,Ltd.
Priority date: 2017-10-17
Filing date: 2017-10-17
Publication date: 2018-01-30
Anticipated expiration: 2037-10-17
Also published as: CN107644638B

Abstract

The invention discloses a speech recognition method, comprising: calculating, according to the acoustic features of collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content; obtaining, according to the acoustic similarity probability, the matching probability between the speech and each phoneme sequence; and recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability. Correspondingly, the invention also discloses a speech recognition device, a terminal and a computer-readable storage medium. The invention avoids recognizing noise as a command word without calculating a confidence score after speech recognition, thereby reducing the false recognition rate.

Description

Speech recognition method, device, terminal and computer-readable storage medium

Technical field

Embodiments of the present invention relate to speech recognition technology, and in particular to a speech recognition method, device, terminal and computer-readable storage medium.

Background

In voice command word recognition, false recognition has always been a difficult problem. The false recognition rate of command word recognition is high because prior-art command word recognition methods are generally implemented by constructing a decoding network that contains multiple groups of phoneme sequences corresponding to preset command words. For any input speech, the best-matching phoneme sequence is searched out of the decoding network, so even non-command speech is matched to some command word, which causes false recognition.

The current solution to noise being recognized as a command word is to calculate a confidence score for the recognition result: a confidence above a preset threshold indicates that the recognition is correct, and a confidence below the threshold indicates that no command word was recognized. However, the confidence calculation depends on many factors and is particularly affected by the environment, so the confidence value can vary over a wide range. In noisy environments it often happens that a correct recognition result has very low confidence while a wrong recognition result has very high confidence, so the false recognition rate remains high.

Summary of the invention

The present invention provides a voice command recognition method, device, terminal and computer-readable storage medium, so as to avoid recognizing noise as a command word without calculating a confidence score after speech recognition, thereby reducing the false recognition rate.

In a first aspect, an embodiment of the present invention provides a speech recognition method, including:

According to the acoustic features of the collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

According to the acoustic similarity probability, obtaining the matching probability between the speech and the phoneme sequence;

Recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.

In a second aspect, the present invention also provides a speech recognition device, including:

A calculation module, configured to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

A matching module, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;

A recognition module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.

In a third aspect, the present invention also provides a terminal, the terminal including:

One or more processors;

A memory for storing one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method provided by any embodiment of the present invention.

In a fourth aspect, the present invention also provides a computer-readable storage medium on which a computer program is stored, the program implementing the speech recognition method provided by any embodiment of the present invention when executed by a processor.

By adding phoneme sequences corresponding to noise content to the decoding network, the present invention recognizes collected speech as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, with no need to calculate a confidence score for the search result after the search. This solves the prior-art problem of a high false recognition rate caused by confidence calculation methods that are affected by environmental factors, avoids recognizing noise as a command word, and reduces the false recognition rate.

Brief description of the drawings

Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention;

Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention;

Fig. 3 is a schematic structural diagram of the speech recognition device provided by Embodiment three of the present invention;

Fig. 4 is a schematic structural diagram of the terminal provided by Embodiment four of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.

Embodiment one

Fig. 1 is a flowchart of the speech recognition method provided by Embodiment one of the present invention. This embodiment is applicable to command word recognition. The method can be performed by a speech recognition device and specifically includes the following steps:

Step 110: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network.

The decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content. Because the embodiments of the present invention are applied to voice command recognition, any non-command-word speech interferes with command word recognition and is therefore treated as noise; in the embodiments of the present invention, noise refers to any non-command-word speech. Specifically, the decoding network can be a network of interconnected phoneme nodes, in which connected phoneme nodes form phoneme sequences. In the field of speech recognition, the acoustic similarity probability between a speech segment and a phoneme in the decoding network is generally obtained by building an acoustic model for the phonemes in the decoding network: the acoustic similarity probability is the probability output by the corresponding acoustic model when the acoustic features of the speech are taken as input.

Step 120: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence.

To simplify data processing during recognition, the acoustic similarity probability can be used directly as the matching probability. For scenarios with high recognition requirements, however, the matching probability can include other information in addition to the acoustic similarity probability. For example, for a decoding network constructed with weighted finite-state transducers (WFST), the matching probability also includes the weight information of the phoneme sequence; this weight information can reflect the probability that the phoneme sequence occurs in practical use, i.e. the language model probability. For example, in a command word recognition scenario, some command words, such as "volume up" and "shut down", occur frequently in practice, while others occur rarely; when the acoustic features of two such command words are similar, the weight of the phoneme sequence corresponding to the former can be set higher than that of the latter. In addition, the weight information can also be adjusted according to the recognition rate observed while the speech recognition method is in use.

Step 130: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.

The working principle of the above steps is as follows: phoneme sequences corresponding to noise content are added to the decoding network, so that input noise can be matched, on the basis of its acoustic features, to the noise phoneme sequences corresponding to the noise content in the decoding network. The input noise is thus recognized from its acoustic features, and non-command words are not recognized as command words. Compared with the prior-art approach of calculating a confidence score after recognition, the scheme of this embodiment is not affected by environmental factors and greatly reduces the false recognition rate.
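The principle above can be illustrated with a minimal Python sketch. It is not the patent's implementation: each entry of a toy decoding network is scored by a hypothetical diagonal Gaussian over a fixed-length feature vector (the `Entry` class, its `mean` and `std` fields, the placeholder phoneme labels and all numeric values are assumptions made for illustration), and the acoustic log-likelihood is used directly as the matching score. The noise entry is given a deliberately broad distribution, so it wins whenever the input is far from every command word.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One group of phoneme sequences in a toy decoding network (hypothetical)."""
    content: str      # what the recognizer outputs for this entry
    phonemes: tuple   # placeholder phoneme labels, for illustration only
    is_noise: bool    # True for the noise entry, False for a command word
    mean: tuple       # assumed mean acoustic feature vector
    std: float        # assumed standard deviation (broad for noise)


def log_likelihood(entry, features):
    """Log-probability of the features under the entry's toy Gaussian model."""
    total = 0.0
    for x, mu in zip(features, entry.mean):
        total += -0.5 * math.log(2 * math.pi * entry.std ** 2)
        total += -((x - mu) ** 2) / (2 * entry.std ** 2)
    return total


def recognize(network, features):
    """Return the entry with the highest matching score (no confidence step)."""
    return max(network, key=lambda entry: log_likelihood(entry, features))


# Toy network: two command words plus one catch-all noise entry.
NETWORK = [
    Entry("volume up", ("v", "o", "l"), False, (1.0, 1.0), 0.3),
    Entry("power off", ("p", "o", "f"), False, (-1.0, -1.0), 0.3),
    Entry("<noise>", ("n",), True, (0.0, 0.0), 2.0),
]

if __name__ == "__main__":
    print(recognize(NETWORK, (1.05, 0.95)).content)  # input close to a command
    print(recognize(NETWORK, (3.0, -2.5)).content)   # input far from all commands
```

In this sketch the decision between command word and noise falls out of the same search that picks the best command word, which is the point the paragraph above makes about avoiding a separate confidence calculation.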

To reduce the false recognition rate by making it more likely that noise is matched to the noise phoneme sequences corresponding to the noise content in the decoding network, this embodiment provides a preferred implementation. Specifically, step 110, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network according to the acoustic features of the collected speech, specifically includes:

Obtaining pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;

According to the acoustic features of the collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network using the acoustic models.

In the above preferred implementation, the noise samples used to train the noise acoustic model include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold; that is, the noise acoustic model is trained on many speech samples that differ greatly from one another, such as noisy ambient sound and a large number of mutually different non-command-word phrases. An acoustic model trained on a large number of widely differing speech samples tends toward a generic sound that minimizes its difference from all kinds of sounds, and therefore matches various non-command-word speech more easily. By contrast, the command word samples used to train a command word acoustic model are usually recordings of the command word read aloud in different accents; the acoustic features of these samples differ little, so the acoustic similarity probability is high only for sounds close to the command word. The above preferred implementation therefore makes it more likely that noise is matched to the phoneme sequences corresponding to the noise content in the decoding network, reducing the false recognition rate.
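One way to obtain a set of noise samples whose pairwise acoustic feature differences all exceed a threshold is a greedy filter, sketched below in Python. The text does not specify how the difference is measured or how the samples are chosen; the Euclidean distance between fixed-length feature vectors and the function names `feature_distance` and `select_diverse_samples` are assumptions made for this illustration.

```python
import math


def feature_distance(a, b):
    """Assumed difference measure: Euclidean distance between feature vectors."""
    return math.dist(a, b)


def select_diverse_samples(samples, threshold):
    """Greedily keep samples that differ from every kept sample by more than threshold."""
    kept = []
    for sample in samples:
        if all(feature_distance(sample, other) > threshold for other in kept):
            kept.append(sample)
    return kept


if __name__ == "__main__":
    candidates = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (3.0, 0.2), (0.0, 4.0)]
    # Near-duplicates of an already kept sample are dropped.
    print(select_diverse_samples(candidates, threshold=1.0))
```

The result depends on the order of the candidates, which is acceptable for a sketch; the only property relied on here is that every pair of kept samples differs by more than the threshold.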

Further, the decoding network is constructed with weighted finite-state transducers. In that case, step 120, obtaining the matching probability between the speech and the phoneme sequence according to the acoustic similarity probability, specifically includes: calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence, and taking it as the matching probability between the speech and the phoneme sequence. Alternatively, the product of the acoustic similarity probability and the weight can be calculated as the matching probability.
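The two ways of combining the acoustic score with the weight can be sketched as follows. The text does not say in which domain the sum is taken; this sketch assumes the sum operates on log-probabilities and the product on plain probabilities, in which case the two rank candidates identically. The function names and the example numbers are hypothetical.

```python
import math


def match_score_sum(acoustic_log_prob, log_weight):
    """Sum form, assuming both terms are log-domain values."""
    return acoustic_log_prob + log_weight


def match_score_product(acoustic_prob, weight):
    """Product form, assuming both terms are plain probabilities."""
    return acoustic_prob * weight


def best_candidate(candidates):
    """candidates: list of (content, acoustic_prob, weight) tuples."""
    return max(candidates, key=lambda c: match_score_product(c[1], c[2]))[0]


if __name__ == "__main__":
    # Acoustically similar candidates: the more frequent command word wins
    # because its phoneme sequence carries the higher weight.
    print(best_candidate([("volume up", 0.30, 0.6), ("rare command", 0.32, 0.2)]))
```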

Further, the decoding network also includes phoneme sequences corresponding to silence content. Adding phoneme sequences corresponding to silence content can improve the user experience, because noise and silence can then be distinguished and different signals fed back to the user. For example, noise may be caused by the user saying the wrong thing, so a prompt asking the user to speak again can be output; silence may be caused by the user accidentally touching the recognition device and triggering voice input, so the recognition output can be set to empty and no operation performed, leaving the user undisturbed and thereby improving the user experience.
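The different feedback for command words, noise and silence could be dispatched as in the short Python sketch below. The `Kind` enum, the `respond` function and the returned strings are hypothetical names chosen for illustration; the text only states that noise can trigger a prompt to speak again and that silence can produce an empty output.

```python
from enum import Enum


class Kind(Enum):
    COMMAND = "command"
    NOISE = "noise"
    SILENCE = "silence"


def respond(kind, content=""):
    """Map a recognition result to the feedback given to the user."""
    if kind is Kind.COMMAND:
        return f"execute:{content}"
    if kind is Kind.NOISE:
        # Possibly a mis-spoken command: ask the user to speak again.
        return "prompt:please repeat"
    # Silence, e.g. an accidental touch: empty output, no operation.
    return None


if __name__ == "__main__":
    print(respond(Kind.COMMAND, "volume up"))
    print(respond(Kind.NOISE))
    print(respond(Kind.SILENCE))
```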

It should be noted that calculating the acoustic similarity probability, obtaining the matching probability and then searching for the phoneme sequence with the highest matching probability can be done by first calculating the matching probability between each phoneme sequence and the collected speech and then comparing the matching probabilities to find the phoneme sequence with the highest one. Alternatively, the phonemes in the decoding network whose acoustic similarity probability to the initial phoneme of the collected speech is high can be found first; then, according to the acoustic similarity probability, the weights (including language model probability information) and so on, it is determined which of the groups of phoneme sequences containing those phonemes has the next phoneme that best matches the next phoneme of the collected speech, and the next phoneme node of that group is taken as the match for the next phoneme of the collected speech. This decision-and-search step is repeated, and the phoneme sequence finally obtained is the one with the highest matching probability.
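The two search strategies described above can be contrasted in a small Python sketch. It rests on strong simplifying assumptions that are not in the text: the speech is already split into one segment per phoneme, each segment comes with a table of per-phoneme log-scores (`FRAME_SCORES`), and the phoneme labels in `SEQUENCES` are placeholders. The incremental variant below follows only the single best phoneme at each step; practical decoders usually keep several hypotheses alive, so this is an illustration rather than a complete decoder.

```python
NEG_INF = float("-inf")

# Hypothetical decoding network: content -> placeholder phoneme sequence.
SEQUENCES = {
    "volume up": ["v", "o", "l"],
    "power off": ["p", "o", "f"],
    "<noise>": ["n", "n", "n"],
}

# Hypothetical per-segment log-scores for each candidate phoneme.
FRAME_SCORES = [
    {"v": -0.2, "p": -2.0, "n": -1.5},
    {"o": -0.3, "n": -1.0},
    {"l": -0.4, "f": -1.8, "n": -1.2},
]


def exhaustive_search(sequences, frame_scores):
    """Score every phoneme sequence in full, then pick the best one."""
    def total(phonemes):
        if len(phonemes) != len(frame_scores):
            return NEG_INF
        return sum(frame_scores[i].get(p, NEG_INF) for i, p in enumerate(phonemes))
    return max(sequences, key=lambda content: total(sequences[content]))


def build_trie(sequences):
    """Arrange the sequences as a tree of phoneme nodes sharing common prefixes."""
    root = {}
    for content, phonemes in sequences.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})
        node["<end>"] = content
    return root


def greedy_search(trie, frame_scores):
    """Walk the tree one phoneme at a time, following the best-scoring child."""
    node = trie
    for scores in frame_scores:
        children = [p for p in node if p != "<end>"]
        if not children:
            return None
        best = max(children, key=lambda p: scores.get(p, NEG_INF))
        node = node[best]
    return node.get("<end>")


if __name__ == "__main__":
    print(exhaustive_search(SEQUENCES, FRAME_SCORES))
    print(greedy_search(build_trie(SEQUENCES), FRAME_SCORES))
```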

In summary, in the technical solution of this embodiment, phoneme sequences corresponding to noise content are added to the decoding network, so that collected speech is recognized as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, with no need to calculate a confidence score for the search result after the search. This solves the prior-art problem of a high false recognition rate caused by confidence calculation methods that are affected by environmental factors, avoids recognizing noise as a command word, and reduces the false recognition rate.

Embodiment two

Fig. 2 is a flowchart of the speech recognition method provided by Embodiment two of the present invention. This embodiment is applicable to command word recognition, and the method can be performed by a speech recognition device. On the basis of the speech recognition method of Embodiment one, this embodiment adds a step of automatically adjusting the decoding network parameters, so that the speech recognition method can modify its parameters dynamically and keep reducing the false recognition rate. The speech recognition method provided by this embodiment includes:

Step 210: according to the acoustic features of the collected speech, calculate the acoustic similarity probability between the speech and the phoneme sequences in the decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

Step 220: according to the acoustic similarity probability, obtain the matching probability between the speech and the phoneme sequence;

Step 230: recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability;

Step 240: if the collected speech is confirmed to be noise but was recognized as a preset command word, increase the weight of the phoneme sequence corresponding to the noise content in the decoding network.

In this embodiment, after the speech is recognized, confirmation information can also be collected (it can be provided by the user) to confirm whether the recognition result is correct. If the collected speech is confirmed to be noise but was recognized as a command word, the false recognition rate is still somewhat high, so the weight of the phoneme sequence corresponding to the noise content in the decoding network is increased; this raises the matching probability between the noise phoneme sequence and collected speech, so that non-command-word speech is more likely to be recognized as noise. Further, it can be arranged that the weight of the noise phoneme sequence is increased only when the number of times collected speech is confirmed to be noise but recognized as a command word reaches a preset threshold, so that isolated recognition errors do not unbalance the adjustment.

Preferably, the method further includes: if the collected speech is confirmed to be a command word but was recognized as noise, reducing the weight of the phoneme sequence corresponding to the noise content in the decoding network.

Further, it can be arranged that the weight of the noise phoneme sequence is reduced only when the number of times collected speech is confirmed to be a command word but recognized as noise reaches a preset threshold. Reducing the false recognition rate inevitably causes a small number of command words to be recognized as noise, and the above preferred scheme can improve the recognition rate for command words.

Further, the weight of the phoneme sequence corresponding to the noise content in the decoding network can also be adjusted according to an instruction triggered by the user, so as to reduce the false recognition rate or improve the recognition rate.
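The threshold-gated weight adjustment described above can be sketched in Python as follows. The class name `NoiseWeightAdjuster`, the fixed `step` size and the choice to reset a counter after each adjustment are assumptions made for this illustration; the text only states that the weight is raised or lowered once the corresponding error count reaches a preset threshold.

```python
class NoiseWeightAdjuster:
    """Adjusts the weight of the noise phoneme sequence from confirmed results."""

    def __init__(self, weight=0.0, step=0.1, threshold=3):
        self.weight = weight        # current weight of the noise phoneme sequence
        self.step = step            # assumed fixed adjustment size
        self.threshold = threshold  # errors required before adjusting
        self.noise_as_command = 0   # noise wrongly recognized as a command word
        self.command_as_noise = 0   # command word wrongly recognized as noise

    def feedback(self, truth_is_noise, recognized_as_noise):
        """Record one confirmed result and return the (possibly updated) weight."""
        if truth_is_noise and not recognized_as_noise:
            self.noise_as_command += 1
            if self.noise_as_command >= self.threshold:
                self.weight += self.step   # make noise more likely to match
                self.noise_as_command = 0  # assumed: restart counting
        elif not truth_is_noise and recognized_as_noise:
            self.command_as_noise += 1
            if self.command_as_noise >= self.threshold:
                self.weight -= self.step   # make command words more likely to match
                self.command_as_noise = 0  # assumed: restart counting
        return self.weight


if __name__ == "__main__":
    adjuster = NoiseWeightAdjuster(weight=0.0, step=0.5, threshold=2)
    for truth_is_noise, recognized_as_noise in [(True, False), (True, False), (False, True)]:
        print(adjuster.feedback(truth_is_noise, recognized_as_noise))
```

A user-triggered adjustment, as mentioned in the last paragraph above, could simply set `weight` directly on such an object.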

In the technical solution of this embodiment, phoneme sequences corresponding to noise content are added to the decoding network, so that collected speech is recognized as noise or as a command word as soon as the best-matching phoneme sequence is found in the decoding network, which avoids recognizing noise as a command word and reduces the false recognition rate. In addition, the weight of the noise phoneme sequence in the decoding network is adjusted according to the recognition results, so that the parameters are modified dynamically and the false recognition rate keeps decreasing.

Embodiment three

Fig. 3 is a schematic structural diagram of the speech recognition device provided by Embodiment three of the present invention. The speech recognition device includes:

A calculation module 310, configured to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

A matching module 320, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;

A recognition module 330, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.

Preferably, the decoding network is constructed with weighted finite-state transducers. The speech recognition device further includes:

A weight adjustment module 340, configured to increase the weight of the phoneme sequence corresponding to the noise content in the decoding network if the collected speech is confirmed to be noise but was recognized as a preset command word.

Preferably, the matching module 320 includes:

A sum calculation unit, configured to calculate the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence.

Preferably, the decoding network also includes phoneme sequences corresponding to silence content.

Preferably, the calculation module includes:

A model acquisition unit, configured to obtain pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;

A model operation unit, configured to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network using the acoustic models.

The speech recognition device provided by the embodiment of the present invention can perform the speech recognition method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.

Embodiment four

Fig. 4 is a schematic structural diagram of a terminal provided by Embodiment four of the present invention. As shown in Fig. 4, the terminal includes a processor 410, a memory 420, an input device 430 and an output device 440. The terminal can have one or more processors 410; one processor 410 is taken as an example in Fig. 4. The processor 410, memory 420, input device 430 and output device 440 in the terminal can be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 4.

As a computer-readable storage medium, the memory 420 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present invention (for example, the calculation module 310, matching module 320, recognition module 330 and weight adjustment module 340 in the speech recognition device). The processor 410 runs the software programs, instructions and modules stored in the memory 420 to execute the various functional applications and data processing of the terminal, i.e. to implement the above speech recognition method.

The memory 420 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system and the application program required by at least one function, and the data storage area can store data created through use of the terminal, and so on. In addition, the memory 420 can include high-speed random access memory and can also include non-volatile memory, for example at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 420 can further include memory located remotely from the processor 410, and such remote memory can be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

The input device 430 can be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the terminal. The output device 440 can include a display device such as a display screen.

Embodiment five

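Embodiment five of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a computer processor, implements a speech recognition method, the method including:

According to the acoustic features of the collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

According to the acoustic similarity probability, obtaining the matching probability between the speech and the phoneme sequence;

Recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.

Of course, in the computer-readable storage medium storing a computer program provided by the embodiment of the present invention, the program is not limited to the method operations described above and can also perform related operations in the speech recognition method provided by any embodiment of the present invention.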

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes a number of instructions for causing a computer device (which can be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention.

It is worth noting that, in the above embodiment of the speech recognition device, the units and modules included are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it can include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims

1. A speech recognition method, characterized by including:

According to the acoustic features of the collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

According to the acoustic similarity probability, obtaining the matching probability between the speech and the phoneme sequence;

Recognizing the speech as the content corresponding to the phoneme sequence with the highest matching probability.

2. The speech recognition method according to claim 1, characterized in that the decoding network is constructed with weighted finite-state transducers;

The obtaining, according to the acoustic similarity probability, of the matching probability between the speech and the phoneme sequence specifically includes:

Calculating the sum of the acoustic similarity probability and the weight of the phoneme sequence as the matching probability between the speech and the phoneme sequence.

3. The speech recognition method according to claim 2, characterized by further including:

If the collected speech is confirmed to be noise but was recognized as a preset command word, increasing the weight of the phoneme sequence corresponding to the noise content in the decoding network.

4. The speech recognition method according to any one of claims 1-3, characterized in that the decoding network also includes phoneme sequences corresponding to silence content.

5. The speech recognition method according to any one of claims 1-3, characterized in that the calculating, according to the acoustic features of the collected speech, of the acoustic similarity probability between the speech and the phoneme sequences in the decoding network specifically includes:

Obtaining pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;

According to the acoustic features of the collected speech, calculating the acoustic similarity probability between the speech and the phoneme sequences in the decoding network using the acoustic models.

6. A speech recognition device, characterized by including:

A calculation module, configured to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in a decoding network, wherein the decoding network includes multiple groups of phoneme sequences, and each group of phoneme sequences corresponds either to the content of a preset command word or to noise content;

A matching module, configured to obtain, according to the acoustic similarity probability, the matching probability between the speech and the phoneme sequence;

A recognition module, configured to recognize the speech as the content corresponding to the phoneme sequence with the highest matching probability.

7. The speech recognition device according to claim 6, characterized in that the decoding network is constructed with weighted finite-state transducers;

The speech recognition device further includes:

A weight adjustment module, configured to increase the weight of the phoneme sequence corresponding to the noise content in the decoding network if the collected speech is confirmed to be noise but was recognized as a preset command word.

8. The speech recognition device according to claim 6 or 7, characterized in that the calculation module includes:

A model acquisition unit, configured to obtain pre-trained acoustic models of the phoneme sequences in the decoding network, wherein the noise samples used to train the acoustic model corresponding to the noise content include multiple speech samples whose pairwise acoustic feature differences are greater than a preset threshold;

A model operation unit, configured to calculate, according to the acoustic features of the collected speech, the acoustic similarity probability between the speech and the phoneme sequences in the decoding network using the acoustic models.

9. A terminal, characterized in that the terminal includes:

One or more processors;

A memory for storing one or more programs,

When the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method according to any one of claims 1-5.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the speech recognition method according to any one of claims 1-5 when executed by a processor.