CN113823276A - Voice recognition threshold setting method

Info

Publication number: CN113823276A
Application number: CN202111147823.3A
Authority: CN
Inventors: 陈思应
Original assignee: Chipintelli Technology Co Ltd
Current assignee: Chipintelli Technology Co Ltd
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2021-12-21
Anticipated expiration: 2041-09-29
Also published as: CN113823276B

Abstract

A speech recognition threshold setting method comprises the following steps: S1, determining a recognition function and a misrecognition function; S2, calculating the gain and the loss of the recognition function and of the misrecognition function respectively, and calculating the total gain, gains; S3, differentiating the total gain with respect to the confidence, the confidence value at which the derivative is zero being the confidence threshold. By analyzing recognition and misrecognition, the invention determines the optimal confidence threshold of each command word by a maximum-gain method. Although the recognition rate under news noise is slightly reduced, the misrecognition rate is greatly reduced, and the overall recognition experience is improved.

Description

Voice recognition threshold setting method

Technical Field

The invention belongs to the technical field of voice recognition, relates to voice recognition threshold setting, and particularly relates to a voice recognition threshold setting method.

Background

As the technology iterates, speech recognition is maturing and is widely used in products such as audio equipment, toys and home control. Current mainstream speech recognition is implemented mainly with deep neural networks, which involves two steps: training and recognition. Training obtains an acoustic model by computing the probability from speech to syllables; recognition uses the acoustic model and a language model to compute the probability that the syllables of the current speech correspond to a text. In practical applications, speech recognition needs only two states, recognized and not recognized, so the probability must be converted into a binary quantity. The common practice is to set a probability (confidence) threshold: when the obtained confidence reaches or exceeds the threshold, recognition is successful; otherwise it is unsuccessful.
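The thresholding rule described above can be sketched in Python. This is a minimal illustration and not the patent's implementation; the function name `is_recognized`, the 0-100 score scale and the sample scores are hypothetical.

```python
def is_recognized(confidence: float, threshold: float) -> bool:
    """Return True when the confidence reaches or exceeds the threshold."""
    return confidence >= threshold


# Hypothetical confidence scores, checked against an assumed threshold of 25.
scores = [12.0, 25.0, 63.5]
decisions = [is_recognized(s, 25.0) for s in scores]
```

A score exactly equal to the threshold counts as recognized, matching the "reaches or exceeds" wording.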

Choosing the threshold is usually difficult. If the threshold is set too high, the recognition rate drops; if it is set too low, the recognition rate is unaffected, but the probability of misrecognizing out-of-set speech increases and the recognition experience deteriorates.

Disclosure of Invention

In order to overcome the technical defects of the prior art and to balance the recognition rate against misrecognition, the invention provides a speech recognition threshold setting method.

The method for setting the speech recognition threshold comprises the following steps:

S1, determining a recognition function and a misrecognition function;

S2, calculating the gain and the loss of the recognition function and of the misrecognition function respectively, and calculating the total gain, gains:

gains = gain_err - loss_err + gain_rec - loss_rec

wherein gain_rec, loss_rec, gain_err and loss_err are respectively the recognition function gain, the recognition function loss, the misrecognition function gain and the misrecognition function loss;

S3, differentiating the total gain gains with respect to the confidence, the confidence value at which the derivative is zero being the confidence threshold.

Preferably, the recognition function and the misrecognition function are respectively:

rec(x)=-ax²+bx+c

err(x)=-ax²+mx+n

wherein x is the confidence, rec(x) is the number of recognitions, and err(x) is the number of misrecognitions;

a, b, c, m, n are constants greater than zero, and b is not equal to m, c is not equal to n;

the confidence threshold t = (n-c)/(b-m).
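The closed-form threshold above can be evaluated directly. The following is a minimal sketch; the function name `confidence_threshold` and the coefficient values are hypothetical and chosen only so that the constraints stated above (positive constants, b ≠ m, c ≠ n) hold.

```python
def confidence_threshold(b: float, c: float, m: float, n: float) -> float:
    """Return t = (n - c) / (b - m) for rec(x) = -a*x**2 + b*x + c
    and err(x) = -a*x**2 + m*x + n (the shared coefficient a cancels)."""
    if b == m:
        raise ValueError("b must differ from m")
    return (n - c) / (b - m)


# Hypothetical coefficients on a 0-1 confidence scale.
t = confidence_threshold(b=3.0, c=1.0, m=1.0, n=2.0)
```

The coefficient a does not appear in the result because both functions share the same quadratic term.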

By analyzing recognition and misrecognition, the invention determines the optimal confidence threshold of each command word by a maximum-gain method. Although the recognition rate under news noise is slightly reduced, the misrecognition rate is greatly reduced, and the overall recognition experience is improved.

Drawings

FIG. 1 is a diagram illustrating an exemplary embodiment of two function curves of the recognition function and the misrecognition function according to the present invention; in fig. 1, a solid line curve is a recognition function, a dashed line curve is a misrecognition function, an abscissa is a confidence threshold with a unit of 1%, an ordinate is a frequency, and min and max are left and right end points of an interval of the confidence threshold respectively.

Detailed Description

The following provides a more detailed description of the present invention.

S1, determining a recognition function and a misrecognition function;

the total gain, gains, is calculated:

gains = gain_err - loss_err + gain_rec - loss_rec

Statistics show that, during recognition, the probability scores (confidences) obtained when clean target-word speech is fed into the neural network are mostly distributed in the high-score range. A misrecognition of the target word occurs when the speech signal contains syllables similar to one or more syllables of the target word, which raises the overall confidence; the confidences of such misrecognitions, however, are mostly distributed in the low-score range.

Recognition and misrecognition data for a large number of command words are collected and fitted. The confidence distributions of recognition and of misrecognition generally follow the pattern shown by the two curves rec and err in FIG. 1, where the solid line is the confidence distribution curve rec of the recognition function and the dashed line is the confidence distribution curve err of the misrecognition function. The two functions can be approximated as follows:

rec(x)=-ax²+bx+c （1）

err(x)=-ax²+mx+n （2）

wherein a, b, c, m, n are constants greater than zero, and b ≠ m, c ≠ n. x is confidence;

For every confidence x in the confidence interval, rec and err are both greater than or equal to zero. In practice, the opening widths and symmetry axes of rec and err differ with the combination of command words, so the values of a, b, c, m and n differ, but they are always constants greater than zero.
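The non-negativity condition can be checked cheaply. Because a > 0, each curve is a downward-opening parabola, so its minimum over a closed interval lies at one of the two end points. The sketch below relies on that fact; the function name `nonnegative_on` and the coefficient values are hypothetical.

```python
def nonnegative_on(lo: float, hi: float, a: float, lin: float, const: float) -> bool:
    """Check that f(x) = -a*x**2 + lin*x + const is >= 0 on [lo, hi].

    For a > 0 the parabola is concave, so checking the end points suffices.
    """
    if a <= 0:
        raise ValueError("a must be greater than zero")

    def f(x: float) -> float:
        return -a * x * x + lin * x + const

    return f(lo) >= 0 and f(hi) >= 0


# Hypothetical coefficients on a 0-1 confidence scale.
rec_ok = nonnegative_on(0.0, 1.0, a=1.0, lin=3.0, const=1.0)  # rec(x)
err_ok = nonnegative_on(0.0, 1.0, a=1.0, lin=1.0, const=2.0)  # err(x)
```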

By combining formula (1) and formula (2), the intersection point x_o of the two curves is obtained as

x_o =(n-c)/(b-m) (3)
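As a worked step, setting rec(x_o) equal to err(x_o) makes the shared quadratic term cancel, which is why formula (3) does not depend on a. The derivation below uses only the symbols of formulas (1) and (2):

```latex
\begin{aligned}
-a x_o^{2} + b x_o + c &= -a x_o^{2} + m x_o + n \\
(b - m)\,x_o &= n - c \\
x_o &= \frac{n - c}{b - m}
\end{aligned}
```

The division is valid because b ≠ m.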

To balance recognition against misrecognition, an optimal confidence threshold t must be determined. The problem reduces to finding the confidence threshold t that yields the maximum gain.

First, over the confidence interval [min, max], calculate the recognition function gain gain_rec and the recognition function loss loss_rec.

Gain refers to the correct rate, and the corresponding loss is the error rate. For example, if the recognition accuracy at a confidence threshold of 0.25 is found to be 97%, the gain is 97% and the loss is 3%.
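The 97% / 3% example can be reproduced with a short sketch. It assumes that the correct rate is the fraction of target-word utterances whose confidence reaches the threshold; that reading, the function name `gain_and_loss` and the sample scores are assumptions made for illustration.

```python
def gain_and_loss(scores: list[float], threshold: float) -> tuple[float, float]:
    """Return (gain, loss) as the accepted and rejected fractions of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    accepted = sum(1 for s in scores if s >= threshold)
    gain = accepted / len(scores)
    return gain, 1.0 - gain


# Hypothetical scores: 97 utterances above the threshold, 3 below it.
scores = [0.9] * 97 + [0.1] * 3
gain, loss = gain_and_loss(scores, threshold=0.25)
```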

The confidence interval is the range of possible values of the confidence, and the confidence threshold t lies within this interval;

（4）

（5）

Second, calculate the misrecognition function gain gain_err and the misrecognition function loss loss_err:

（6）

（7）
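The bodies of formulas (4) to (7) are not reproduced in this text. One reading that is consistent with the derivative in formula (9) and the threshold in formula (10) treats each gain and loss as the area under rec or err on one side of the threshold t. This is an assumption, not a quotation of the original formulas:

```latex
\begin{aligned}
\mathit{gain}_{rec} &= \int_{t}^{max} rec(x)\,dx, &
\mathit{loss}_{rec} &= \int_{min}^{t} rec(x)\,dx, \\
\mathit{gain}_{err} &= \int_{min}^{t} err(x)\,dx, &
\mathit{loss}_{err} &= \int_{t}^{max} err(x)\,dx .
\end{aligned}
```

Under this reading, recognitions scoring at or above t and misrecognitions scoring below t count as gain, while rejected recognitions and accepted misrecognitions count as loss.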

Finally, the total gain gains is calculated from formulas (1)-(2) and (4)-(7):

gains = -(-at³/3+bt²/2+ct) + (-at³/3+mt²/2+nt) - (-at³/3+bt²/2+ct) + (-at³/3+mt²/2+nt) + const （8）

wherein gain_rec, loss_rec, gain_err and loss_err are respectively the recognition function gain, the recognition function loss, the misrecognition function gain and the misrecognition function loss; const is a constant.
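The maximum of the total gain can be checked numerically. The sketch below assumes that each gain and loss is the area under rec or err on one side of t (formulas (4) to (7) are not reproduced here, so this is an assumption), and it uses hypothetical coefficients on a 0-1 confidence scale. It searches a grid of candidate thresholds and compares the best one with (n-c)/(b-m).

```python
def antiderivative(x: float, a: float, lin: float, const: float) -> float:
    """Antiderivative of -a*x**2 + lin*x + const (integration constant omitted)."""
    return -a * x**3 / 3 + lin * x**2 / 2 + const * x


def total_gain(t, lo, hi, a, b, c, m, n):
    """gains = gain_err - loss_err + gain_rec - loss_rec for threshold t."""
    gain_rec = antiderivative(hi, a, b, c) - antiderivative(t, a, b, c)
    loss_rec = antiderivative(t, a, b, c) - antiderivative(lo, a, b, c)
    gain_err = antiderivative(t, a, m, n) - antiderivative(lo, a, m, n)
    loss_err = antiderivative(hi, a, m, n) - antiderivative(t, a, m, n)
    return gain_err - loss_err + gain_rec - loss_rec


# Hypothetical coefficients; b > m so the stationary point is a maximum.
params = dict(a=1.0, b=3.0, c=1.0, m=1.0, n=2.0)
grid = [i / 1000 for i in range(1001)]
best_t = max(grid, key=lambda t: total_gain(t, 0.0, 1.0, **params))
closed_form = (params["n"] - params["c"]) / (params["b"] - params["m"])
```

With these values the grid search and the closed form agree, which illustrates the result for one set of coefficients rather than proving it in general.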

Simplifying formula (8) and differentiating with respect to t gives

gains' = -2(b-m)t+2(n-c) （9）

When the derivative gains' is zero, the total gain gains reaches its maximum. The threshold at that point is obtained from formula (9):

t=(n-c)/(b-m) (10)
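The step from formula (9) to formula (10) can be written out as follows. The second-derivative line is an added observation rather than part of the original text: the stationary point is a maximum when b > m.

```latex
\begin{aligned}
gains'(t) &= -2(b - m)\,t + 2(n - c) = 0 \\
(b - m)\,t &= n - c \\
t &= \frac{n - c}{b - m} \\
gains''(t) &= -2(b - m) < 0 \quad \text{when } b > m
\end{aligned}
```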

Comparing formula (3) with formula (10) shows the physical meaning of formula (10): when the confidence threshold equals the intersection point of the recognition distribution curve and the misrecognition distribution curve, the total gain is at its maximum and the best recognition experience is obtained.

As can be seen from the above, the confidence threshold t at which the total gain is maximal coincides with the confidence value at the intersection of the recognition and misrecognition distribution curves in FIG. 1.

The specific embodiment is as follows:

Typically, when an acoustic model is trained, a preliminary confidence threshold, such as 25, is determined by statistical testing on a large number of test sets. At this threshold the recognition performance meets users' general requirements for speech recognition, but it does not give the best experience. To obtain the best experience, the threshold of each command word must be determined separately, balancing recognition against misrecognition.

First, with the preliminary confidence threshold of 25, 10 different speakers (5 male and 5 female) are selected, recognition-rate tests are run in quiet conditions and under news noise, and the recognition score distribution is compiled from the results. Then 12 hours of variety-program audio are used for a misrecognition test, and the misrecognition score distribution is compiled from the results. With a threshold of 25, the recognition/misrecognition confidence distributions of some command words in quiet conditions are shown in Table 1.

Each number in Table 1 is a confidence. The misrecognition test in Table 1 uses 12 hours of variety-program audio; the number of times each command word is misrecognized within those 12 hours is not fixed, so the number of scores obtained for each command word in the misrecognition test is not fixed either. What matters is the distribution of the scores, not their number.

In the recognition test, 10 voices (5 male and 5 female) read each command word, so the test result for each command word comprises 10 scores; the specific scores are shown in Table 1.

TABLE 1

Following the maximum-gain principle, the confidence threshold of each command word is adjusted by the method of the invention according to its recognition and misrecognition distributions; that is, the confidence threshold is obtained from formula (10).
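A per-command-word adjustment can also be sketched empirically. The example below does not fit the quadratic curves of formulas (1) and (2). Instead it applies the same gains expression directly to score counts and picks the candidate threshold with the largest value, which is a simpler stand-in for the method. The function name `empirical_threshold` and all scores are hypothetical and are not the values of Table 1.

```python
def empirical_threshold(rec_scores, err_scores, candidates):
    """Pick the candidate threshold that maximizes
    gains = gain_err - loss_err + gain_rec - loss_rec on score counts."""

    def net_gain(t):
        gain_rec = sum(1 for s in rec_scores if s >= t)  # accepted recognitions
        loss_rec = len(rec_scores) - gain_rec            # rejected recognitions
        loss_err = sum(1 for s in err_scores if s >= t)  # accepted misrecognitions
        gain_err = len(err_scores) - loss_err            # rejected misrecognitions
        return gain_err - loss_err + gain_rec - loss_rec

    # max() returns the first candidate that reaches the best value,
    # so ties resolve to the lowest threshold.
    return max(candidates, key=net_gain)


# Hypothetical scores for one command word on a 0-100 scale.
rec_scores = [62, 71, 55, 80, 47, 66, 74, 58, 69, 77]  # 10 readings
err_scores = [26, 31, 28, 40, 35, 27]                  # misrecognitions
threshold = empirical_threshold(rec_scores, err_scores, range(25, 101))
```

Resolving ties toward the lowest threshold keeps the recognition rate as high as possible among equally good candidates; other tie-breaking rules are equally defensible.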

The results of the comparison before and after are shown in Table 2.

TABLE 2

After the threshold of each command word has been determined individually, misrecognition and recognition are tested again. The misrecognition data are shown in Table 3; the recognition rates in quiet and noisy conditions are shown in Table 4.

Table 3 compares the number of misrecognitions before and after the confidence threshold adjustment of Tables 1 and 2. As Table 3 shows, the number of misrecognitions is reduced by 51.05% compared with that before the thresholds were adjusted individually; that is, after the confidence threshold is adjusted for each command word, the overall number of misrecognitions falls by more than 50%, and the correct recognition rate is improved.

Table 4 compares the number of recognitions before and after the confidence threshold adjustment of Tables 1 and 2. It shows that the overall recognition rate decreases only slightly after the confidence threshold is adjusted for each command word.

Tables 3 and 4 together show that, after the confidence threshold is adjusted, the number of misrecognitions decreases significantly while the number of recognitions is essentially maintained, so the overall recognition performance is improved.

TABLE 3

TABLE 4

After the threshold of each command word is adjusted individually, the recognition rate under news noise is slightly reduced, but misrecognition is greatly reduced, which improves the user's recognition experience.

The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory and none is premised on another, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to illustrate clearly the inventor's verification process and are not intended to limit the scope of patent protection of the invention, which remains defined by the claims. Equivalent structural changes made using the description and drawings of the present invention likewise fall within the scope of protection of the invention.

Claims

1. A speech recognition threshold setting method is characterized by comprising the following steps:

S1, determining a recognition function and a misrecognition function;

the total gain, gains, is calculated,

gains= gain_err-loss_err+gain_rec-loss_rec

2. The speech recognition threshold setting method of claim 1,

the recognition function rec and the misrecognition function err are respectively:

rec(x)=-ax²+bx+c

err(x)=-ax²+mx+n

a, b, c, m, n are constants greater than zero, and b is not equal to m, c is not equal to n; the confidence threshold t = (n-c)/(b-m).