CN113823276A - Voice recognition threshold setting method - Google Patents

Info

Publication number
CN113823276A
Authority
CN
China
Prior art keywords
recognition
confidence
function
loss
gain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111147823.3A
Other languages
Chinese (zh)
Other versions
CN113823276B (en)
Inventor
陈思应
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd
Priority to CN202111147823.3A
Publication of CN113823276A
Application granted
Publication of CN113823276B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

A speech recognition threshold setting method comprises the following steps: S1, determining a recognition function and a misrecognition function; S2, calculating the gain and the loss of the recognition function and of the misrecognition function, and calculating the total gain, gains; S3, differentiating the total gain with respect to the confidence, the confidence value at which the derivative is zero being the confidence threshold. By analyzing recognition and misrecognition, the invention determines the optimal confidence threshold of each command word by a maximum-gain method; although the recognition rate under news noise decreases slightly, the misrecognition rate decreases greatly, and the overall recognition experience improves.

Description

Voice recognition threshold setting method
Technical Field
The invention belongs to the technical field of voice recognition, relates to voice recognition threshold setting, and particularly relates to a voice recognition threshold setting method.
Background
As the technology has iterated, speech recognition has matured and is now widely used in products such as audio equipment, toys, and home control. Mainstream speech recognition is currently implemented mainly with deep neural networks, which involves two steps, training and recognition. Training produces an acoustic model by computing speech-to-syllable probabilities; recognition uses the acoustic model and a language model to compute the probability that the current speech maps from syllables to text. In practical applications, speech recognition needs only two states, recognized and not recognized, so the probability has to be converted into a binary quantity. The common practice is to set a probability (confidence) threshold: when the obtained confidence reaches or exceeds the threshold, recognition succeeds; otherwise it fails.
Setting the threshold is usually difficult: if it is set too high, the recognition rate drops; if it is set too low, the recognition rate is unaffected, but the probability of misrecognizing out-of-set speech rises and the recognition experience degrades.
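The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `is_recognized` and the sample scores are hypothetical and do not come from the patent.

```python
def is_recognized(confidence: float, threshold: float) -> bool:
    """Return True when the confidence reaches or exceeds the threshold."""
    return confidence >= threshold


# Hypothetical confidence scores for four utterances.
scores = [0.10, 0.25, 0.40, 0.90]

# A low threshold accepts more utterances (including possible out-of-set speech);
# a high threshold rejects more (including possible valid commands).
accepted_low = [s for s in scores if is_recognized(s, 0.25)]
accepted_high = [s for s in scores if is_recognized(s, 0.50)]
```

The two lists make the trade-off concrete: raising the threshold shrinks the accepted set, which lowers both misrecognition and the recognition rate.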
Disclosure of Invention
To overcome the technical defects of the prior art and to balance the recognition rate against misrecognition, the invention provides a speech recognition threshold setting method.
The speech recognition threshold setting method comprises the following steps:
S1, determining a recognition function and a misrecognition function;
S2, calculating the gain and the loss of the recognition function and of the misrecognition function, and calculating the total gain, gains:
$gains = gain_{err} - loss_{err} + gain_{rec} - loss_{rec}$
where $gain_{rec}$, $loss_{rec}$, $gain_{err}$, and $loss_{err}$ are the recognition function gain, the recognition function loss, the misrecognition function gain, and the misrecognition function loss, respectively;
S3, differentiating the total gain with respect to the confidence; the confidence value at which the derivative is zero is the confidence threshold.
Preferably, the recognition function and the misrecognition function are, respectively:
$rec(x) = -ax^2 + bx + c$
$err(x) = -ax^2 + mx + n$
where x is the confidence, rec(x) is the number of recognitions, and err(x) is the number of misrecognitions;
a, b, c, m, and n are constants greater than zero, with b ≠ m and c ≠ n;
the confidence threshold is $t = (n-c)/(b-m)$.
By analyzing recognition and misrecognition, the invention determines the optimal confidence threshold of each command word by a maximum-gain method; although the recognition rate under news noise decreases slightly, the misrecognition rate decreases greatly, and the overall recognition experience improves.
Drawings
FIG. 1 is a schematic diagram of the two curves of the recognition function and the misrecognition function in an embodiment of the invention. In FIG. 1, the solid curve is the recognition function and the dashed curve is the misrecognition function; the abscissa is the confidence threshold in units of 1%, the ordinate is the number of occurrences, and min and max are the left and right endpoints of the confidence threshold interval.
Detailed Description
The following provides a more detailed description of the present invention.
The speech recognition threshold setting method comprises the following steps:
S1, determining a recognition function and a misrecognition function;
S2, calculating the gain and the loss of the recognition function and of the misrecognition function, and calculating the total gain, gains:
$gains = gain_{err} - loss_{err} + gain_{rec} - loss_{rec}$
where $gain_{rec}$, $loss_{rec}$, $gain_{err}$, and $loss_{err}$ are the recognition function gain, the recognition function loss, the misrecognition function gain, and the misrecognition function loss, respectively;
S3, differentiating the total gain with respect to the confidence; the confidence value at which the derivative is zero is the confidence threshold.
According to statistics, when a clean audio signal of the target word is fed into the neural network during recognition, the resulting probability scores, i.e. confidences, are mostly distributed in the high-score range. A misrecognition occurs when the speech signal contains syllables similar to one or more syllables of the target word, which raises the overall confidence; the confidences of misrecognitions, however, are mostly distributed in the low-score range.
Data from the recognition and misrecognition of a large number of command words were counted and fitted. The confidence distributions of recognition and misrecognition generally follow the pattern shown by the two curves rec and err in FIG. 1, where the solid line is the confidence distribution curve rec of the recognition function and the dashed line is the confidence distribution curve err of the misrecognition function. The two functions can be approximated as follows:
$rec(x) = -ax^2 + bx + c$    (1)
$err(x) = -ax^2 + mx + n$    (2)
where x is the confidence, rec(x) is the number of recognitions, and err(x) is the number of misrecognitions;
a, b, c, m, and n are constants greater than zero, with b ≠ m and c ≠ n.
For every confidence x in the confidence interval, rec and err are both greater than or equal to zero. In practice, the opening width and the axis of symmetry of rec and err vary with the combination of command words, so the values of a, b, c, m, and n vary as well, but they remain constants greater than zero.
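Before curves such as rec and err can be fitted, the raw scores have to be counted per confidence range. The sketch below shows one simple way to do that counting; the helper `score_histogram`, the bin width, and both score lists are hypothetical illustrations, not data or code from the patent.

```python
from collections import Counter


def score_histogram(scores, bin_width=10):
    """Count how many scores fall into each bin; keys are the bins' lower edges."""
    counts = Counter((int(s) // bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


# Hypothetical confidence scores in units of 1%.
rec_scores = [62, 71, 74, 78, 80, 83, 85, 88, 91, 95]  # correct recognitions
err_scores = [26, 27, 29, 31, 33, 38, 41]              # misrecognitions

rec_hist = score_histogram(rec_scores)
err_hist = score_histogram(err_scores)
```

In this made-up sample the recognition counts cluster in the high bins and the misrecognition counts in the low bins, which is the pattern the description attributes to FIG. 1.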
Setting formula (1) equal to formula (2) gives the intersection point $x_o$ of the two curves:
$x_o = (n-c)/(b-m)$    (3)
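Formula (3) can be checked directly in code. In the sketch below the function names and the coefficient values are hypothetical; the coefficients are chosen only so that the result is a round number.

```python
def rec(x, a, b, c):
    """Recognition curve of formula (1)."""
    return -a * x**2 + b * x + c


def err(x, a, m, n):
    """Misrecognition curve of formula (2)."""
    return -a * x**2 + m * x + n


def intersection_confidence(b, c, m, n):
    """Intersection of the two curves, formula (3); requires b != m."""
    if b == m:
        raise ValueError("b and m must differ, otherwise the curves do not cross once")
    return (n - c) / (b - m)


# Hypothetical coefficients (all greater than zero, b != m, c != n).
a, b, c, m, n = 0.01, 3.0, 10.0, 1.0, 60.0
x_o = intersection_confidence(b, c, m, n)
```

Because both curves share the same quadratic coefficient a, the $-ax^2$ terms cancel when the curves are equated, which is why the intersection does not depend on a.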
To balance recognition against misrecognition, an optimal confidence threshold t must be determined; the problem reduces to finding the confidence threshold t that yields the maximum gain.
First, compute the recognition function gain $gain_{rec}$ and the recognition function loss $loss_{rec}$ over the confidence interval [min, max].
Gain refers to the rate of correct decisions, and the corresponding loss is the error rate. For example, if the recognition accuracy at a confidence threshold of 0.25 is found to be 97%, the gain is 97% and the loss is 3%.
The confidence interval is the range of possible values of the confidence, and the confidence threshold t lies within the confidence interval:
[Equation (4): formula image not reproduced in the source text]
[Equation (5): formula image not reproduced in the source text]
Second, compute the misrecognition function gain $gain_{err}$ and the misrecognition function loss $loss_{err}$:
[Equation (6): formula image not reproduced in the source text]
[Equation (7): formula image not reproduced in the source text]
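The formula images for equations (4) to (7) are not available in this text. One reading that is consistent with formulas (8) and (9) is sketched below. It is an assumption, not the patent's confirmed wording: recognitions at or above the threshold t and misrecognitions below t are counted as gain, and the remainder as loss.

```latex
\begin{aligned}
\mathrm{gain}_{rec} &= \int_{t}^{\mathrm{max}} rec(x)\,dx, &
\mathrm{loss}_{rec} &= \int_{\mathrm{min}}^{t} rec(x)\,dx, \\
\mathrm{gain}_{err} &= \int_{\mathrm{min}}^{t} err(x)\,dx, &
\mathrm{loss}_{err} &= \int_{t}^{\mathrm{max}} err(x)\,dx.
\end{aligned}
```

Under this reading, differentiating the total gain with respect to t gives $2\,err(t) - 2\,rec(t)$; the $-at^2$ terms cancel, leaving $-2(b-m)t + 2(n-c)$.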
Finally, the total gain gains is calculated from formulas (1)-(2) and (4)-(7):
$gains = -(-at^3/3 + bt^2/2 + ct) + (-at^3/3 + mt^2/2 + nt) - (-at^3/3 + bt^2/2 + ct) + (-at^3/3 + mt^2/2 + nt) + const$    (8)
where $gain_{rec}$, $loss_{rec}$, $gain_{err}$, and $loss_{err}$ are the recognition function gain, the recognition function loss, the misrecognition function gain, and the misrecognition function loss, respectively, and const is a constant.
Simplifying formula (8) and differentiating with respect to t gives:
$gains' = -2(b-m)t + 2(n-c)$    (9)
When the derivative $gains'$ is zero, the total gain gains attains its maximum; the threshold at that point follows from formula (9):
$t = (n-c)/(b-m)$    (10)
Comparing formula (3) with formula (10) shows the physical meaning of formula (10): when the confidence threshold equals the intersection point of the recognition distribution curve and the misrecognition distribution curve, the recognition gain is at its maximum, and the best recognition experience is obtained.
As can be seen from the above, the confidence threshold t at which the total gain is maximal coincides with the confidence value at the intersection of the recognition and misrecognition distribution curves in FIG. 1.
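This result can be checked numerically. The sketch below assumes that gain and loss are the areas under rec and err on either side of the threshold t (recognitions at or above t and misrecognitions below t count as gain), which is one reading consistent with formulas (8) and (9). The coefficient values, the interval, and the helper names are hypothetical.

```python
def antiderivative(a, p, q, x):
    """Antiderivative of -a*x**2 + p*x + q, evaluated at x."""
    return -a * x**3 / 3 + p * x**2 / 2 + q * x


def total_gain(t, a, b, c, m, n, lo, hi):
    """gains = gain_err - loss_err + gain_rec - loss_rec for threshold t."""
    gain_rec = antiderivative(a, b, c, hi) - antiderivative(a, b, c, t)
    loss_rec = antiderivative(a, b, c, t) - antiderivative(a, b, c, lo)
    gain_err = antiderivative(a, m, n, t) - antiderivative(a, m, n, lo)
    loss_err = antiderivative(a, m, n, hi) - antiderivative(a, m, n, t)
    return gain_err - loss_err + gain_rec - loss_rec


# Hypothetical coefficients and confidence interval (units of 1%).
a, b, c, m, n = 0.01, 3.0, 10.0, 1.0, 60.0
lo, hi = 0.0, 100.0

# Grid search over candidate thresholds in steps of 0.5.
grid = [lo + i * 0.5 for i in range(201)]
best_t = max(grid, key=lambda t: total_gain(t, a, b, c, m, n, lo, hi))

closed_form_t = (n - c) / (b - m)  # formula (10)
```

With these values the grid search and formula (10) agree. Note that the stationary point is a maximum only when b > m, since the second derivative of gains is $-2(b-m)$.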
The specific embodiment is as follows:
Typically, when an acoustic model is trained, a preliminary confidence threshold, for example 25, is determined by statistical testing on a large number of test sets. At this threshold the recognition performance meets users' general requirements for speech recognition, but it does not give the best experience. To obtain the best experience, the threshold of each command word must be determined separately, balancing recognition against misrecognition.
First, with the preliminary confidence threshold set to 25, 10 different speakers (5 male and 5 female) are selected, recognition-rate tests are run in quiet and under news noise, and the recognition score distribution is compiled from the test results. Then 12 hours of variety-show audio is used for a misrecognition test, and the misrecognition score distribution is compiled from the results. With a threshold of 25, the recognition and misrecognition confidence distributions of some command words in quiet are shown in Table 1.
Each number in Table 1 is a confidence. The misrecognition test in Table 1 uses 12 hours of variety-show audio; the number of times each command word is misrecognized in those 12 hours is not fixed, so the number of scores each command word obtains in the misrecognition test is not fixed either. What matters is the distribution of the scores, not their number.
In the recognition test, 10 speakers (5 male and 5 female) each read every command word, so the test result of each command word comprises 10 scores; the specific scores are shown in Table 1.
TABLE 1
[Table 1 image not reproduced in the source text]
Following the maximum-gain principle, the confidence threshold of each command word is adjusted by the method of the invention according to its recognition and misrecognition distributions; that is, the confidence threshold is obtained from formula (10).
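The per-command-word adjustment can also be illustrated on raw scores. The sketch below is not the patent's formula (10): it swaps in a count-based analog of the same maximum-gain idea, scoring each candidate threshold by counts instead of by fitted curves. The function names and the score lists are hypothetical, since the Table 1 data are not available in this text.

```python
def empirical_gain(t, rec_scores, err_scores):
    """Count-based analog of gains = gain_err - loss_err + gain_rec - loss_rec."""
    gain_rec = sum(s >= t for s in rec_scores)  # valid commands accepted
    loss_rec = sum(s < t for s in rec_scores)   # valid commands rejected
    gain_err = sum(s < t for s in err_scores)   # misrecognitions rejected
    loss_err = sum(s >= t for s in err_scores)  # misrecognitions accepted
    return gain_err - loss_err + gain_rec - loss_rec


def best_threshold(rec_scores, err_scores, candidates):
    """Return the first candidate threshold with the highest empirical gain."""
    return max(candidates, key=lambda t: empirical_gain(t, rec_scores, err_scores))


# Hypothetical scores for one command word, in units of 1%.
rec_scores = [62, 71, 74, 78, 80, 83, 85, 88, 91, 95]
err_scores = [26, 27, 29, 31, 33, 38, 41]

threshold = best_threshold(rec_scores, err_scores, range(25, 101))
gain_at_default = empirical_gain(25, rec_scores, err_scores)
```

For this made-up sample, every misrecognition score passes the preliminary threshold of 25, while a threshold just above the highest misrecognition score rejects all of them without rejecting any valid command.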
The results before and after adjustment are compared in Table 2.
TABLE 2
[Table 2 image not reproduced in the source text]
After the threshold of each command word has been confirmed independently, misrecognition and recognition are tested again. The misrecognition data are shown in Table 3; the recognition rates in quiet and under noise are shown in Table 4.
Table 3 compares the number of misrecognitions before and after the confidence threshold adjustment of Tables 1 and 2. As Table 3 shows, the number of misrecognitions is 51.05% lower than before the thresholds were adjusted individually; that is, adjusting the confidence threshold of each command word reduces the overall number of misrecognitions by more than 50%, and the correct recognition rate is improved.
Table 4 compares the number of recognitions before and after the confidence threshold adjustment of Tables 1 and 2; it shows that the overall recognition rate drops only slightly after the confidence threshold of each command word is adjusted.
Tables 3 and 4 together show that, after the confidence thresholds are adjusted, the number of misrecognitions falls significantly while the number of recognitions is essentially maintained, so the overall recognition performance improves.
TABLE 3
[Table 3 image not reproduced in the source text]
TABLE 4
[Table 4 image not reproduced in the source text]
After the threshold of each command word is adjusted individually, a slight decrease in the recognition rate under news noise is traded for a large reduction in misrecognition, which improves the user's recognition experience.
The foregoing describes preferred embodiments of the invention. Provided they do not obviously contradict one another and none presupposes another, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to illustrate the inventor's verification process clearly and are not intended to limit the scope of patent protection of the invention, which remains defined by the claims. Equivalent structural changes made on the basis of the description and drawings of the invention likewise fall within the scope of protection of the invention.

Claims (2)

1. A speech recognition threshold setting method, characterized by comprising the following steps:
S1, determining a recognition function and a misrecognition function;
S2, calculating the gain and the loss of the recognition function and of the misrecognition function, and calculating the total gain, gains:
$gains = gain_{err} - loss_{err} + gain_{rec} - loss_{rec}$
where $gain_{rec}$, $loss_{rec}$, $gain_{err}$, and $loss_{err}$ are the recognition function gain, the recognition function loss, the misrecognition function gain, and the misrecognition function loss, respectively;
S3, differentiating the total gain with respect to the confidence; the confidence value at which the derivative is zero is the confidence threshold.
2. The speech recognition threshold setting method of claim 1, wherein
the recognition function rec and the misrecognition function err are, respectively:
$rec(x) = -ax^2 + bx + c$
$err(x) = -ax^2 + mx + n$
where x is the confidence, rec(x) is the number of recognitions, and err(x) is the number of misrecognitions;
a, b, c, m, and n are constants greater than zero, with b ≠ m and c ≠ n; the confidence threshold is $t = (n-c)/(b-m)$.
CN202111147823.3A 2021-09-29 2021-09-29 Voice recognition threshold setting method Active CN113823276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111147823.3A CN113823276B (en) 2021-09-29 2021-09-29 Voice recognition threshold setting method

Publications (2)

Publication Number Publication Date
CN113823276A true CN113823276A (en) 2021-12-21
CN113823276B CN113823276B (en) 2023-06-02

Family

ID=78915819

Country Status (1)

Country Link
CN (1) CN113823276B (en)

Also Published As

Publication number Publication date
CN113823276B (en) 2023-06-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant