CN113555005B - Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Publication number: CN113555005B (application number CN202111107722.3A; earlier publication CN113555005A)
Authority: CN (China)
Inventors: 罗海霞, 王莎, 白锦峰
Assignee: Beijing Century TAL Education Technology Co., Ltd. (applicant and current assignee)
Legal status: Active (granted)

Classifications

    • G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 — Training of speech recognition systems (creation of reference templates, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/26 — Speech-to-text systems


Abstract

The application relates to a model training method, a confidence determining method, a model training device, a confidence determining device, an electronic device and a storage medium, applied to the technical field of speech recognition. The model training method comprises the following steps: acquiring a plurality of first voice data and first text information corresponding to each first voice data; inputting the acoustic features extracted from the first voice data into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data; constructing label data according to the first text information; inputting the depth features corresponding to the first voice data into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value; inputting the temperature coefficient prediction value, the label data and the logits output result into a first loss function, and determining a loss function value; and adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model. The reliability of the output confidence can thereby be improved.

Description

Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for model training and confidence determination, an electronic device, and a storage medium.
Background
Automatic speech recognition technology has been widely used in industry; its basic principle is to convert speech signals into corresponding text information by machine. Since the correctness of the recognition result directly affects the user experience and downstream tasks, the reliability of the output result is generally evaluated using a confidence measure.
Speech recognition systems generally build a recognition model based on a deep neural network and directly adopt the output posterior probability as the confidence. However, during the iterative training of an actual model, the output probability of the predicted result becomes far greater than that of the non-predicted results, so the model exhibits an "overconfidence" phenomenon: it assigns a high confidence even when the predicted result is not the correct result. The output probability therefore cannot directly and accurately reflect the true reliability of the model's prediction, i.e., the reliability of the confidence is low.
Disclosure of Invention
To solve, or at least partially solve, the above technical problem, the present application provides a model training method, a confidence determining method, an apparatus, an electronic device, and a storage medium.
According to a first aspect of the present application, there is provided a temperature coefficient prediction model training method, including:
acquiring a plurality of first voice data and first text information corresponding to each first voice data;
inputting the acoustic features extracted from the first voice data into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data;
constructing label data according to the first text information;
inputting the depth features corresponding to the first voice data into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value;
inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value;
and adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
According to a second aspect of the present application, there is provided a confidence determination method, the method comprising:
acquiring voice data to be recognized, and extracting acoustic features of the voice data to be recognized;
inputting the acoustic features into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the voice data to be recognized;
inputting the depth characteristics into a target temperature coefficient prediction model which is trained in advance to obtain a temperature coefficient; wherein the target temperature coefficient prediction model is trained based on the method of the first aspect;
and determining the confidence coefficient of the text recognition result of the voice data to be recognized according to the temperature coefficient and the output result of the logits.
According to a third aspect of the present application, there is provided a temperature coefficient prediction model training apparatus, comprising:
the device comprises a first sample data acquisition module, a second sample data acquisition module and a processing module, wherein the first sample data acquisition module is used for acquiring a plurality of first voice data and first text information corresponding to each first voice data;
the data processing module is used for inputting the acoustic features extracted from the first voice data into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data;
the tag data construction module is used for constructing tag data according to the first text information;
the temperature coefficient value prediction module is used for inputting the depth characteristics corresponding to the first voice data into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value;
a first loss function value determining module, configured to input the temperature coefficient prediction value, the tag data, and the logits output result into a first loss function, and determine a loss function value;
and the target temperature coefficient prediction model training module is used for adjusting the parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
According to a fourth aspect of the present application, there is provided a confidence determination apparatus, the apparatus comprising:
the acoustic feature extraction module is used for acquiring voice data to be recognized and extracting acoustic features of the voice data to be recognized;
the data processing module is used for inputting the acoustic features into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the voice data to be recognized;
the temperature coefficient determining module is used for inputting the depth characteristics into a pre-trained target temperature coefficient prediction model to obtain a temperature coefficient; wherein the target temperature coefficient prediction model is trained based on the method of the first aspect;
and the confidence coefficient determining module is used for determining the confidence coefficient of the text recognition result of the voice data to be recognized according to the temperature coefficient and the output result of the logits.
According to a fifth aspect of the present application, there is provided an electronic device comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of the first or second aspect.
According to a sixth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first or second aspect.
According to a seventh aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first or second aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
On the basis of the pre-trained coding and decoding model, the depth feature of each first voice data is extracted through the coding and decoding model, and the logits output result of each first voice data is obtained. The depth features are input into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value; the temperature coefficient prediction value, the label data and the logits output result are input into a first loss function to determine a loss function value; and the parameters of the initial temperature coefficient prediction model are adjusted according to the loss function value to obtain a target temperature coefficient prediction model. In other words, an independent temperature coefficient prediction model is trained on top of the coding and decoding model to predict the temperature coefficient. Because the temperature coefficient is a hyper-parameter in the neural network used to adjust the smoothness of the final output of a classification model, correcting the confidence through the temperature coefficient allows the corrected confidence to describe the output result more accurately, improving the reliability of the output confidence.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; other drawings can be obtained by those skilled in the art from these drawings without inventive effort.
FIG. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a temperature coefficient prediction model training method and a confidence determination method of an embodiment of the present application may be applied;
FIG. 2 is a flow chart of a method for training a temperature coefficient prediction model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a temperature coefficient prediction model training method according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating a method for training a codec model according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating a method for training a temperature coefficient prediction model according to an embodiment of the present disclosure;
FIG. 6 is a flowchart illustrating a method for training a temperature coefficient prediction model according to an embodiment of the present disclosure;
FIG. 7 is a diagram of a Transformer model;
FIG. 8 is a flow chart of a confidence determination method in an embodiment of the present application;
FIG. 9 is a schematic diagram of a confidence determination method in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a training apparatus for a temperature coefficient prediction model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a confidence level determination apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present application. It should be understood that the drawings and embodiments of the present application are for illustration purposes only and are not intended to limit the scope of the present application.
It should be understood that the various steps recited in the method embodiments of the present application may be performed in a different order and/or in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present application is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present application are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this application are intended to be illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between a plurality of devices in the embodiments of the present application are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Automatic speech recognition systems have enabled numerous applications, such as voice assistants, intent detection, keyword extraction, and sentiment analysis. These applications are sensitive to the correctness of the recognition results produced by automatic speech recognition systems; therefore, the confidence associated with the predicted output can be used to evaluate the recognition results for post-processing of the output. In addition, the confidence has further practical uses, such as rejection of out-of-vocabulary words, model adaptation during the training of large-vocabulary automatic speech recognition systems, and screening of model training data.
The attention-based encoding-decoding modeling method is a speech recognition approach widely used in industry at present; compared with the traditional hybrid model, it is simpler to build and achieves better recognition results. The confidence problem of attention-based encoding-decoding models has seen some related research in handwriting recognition and machine translation, but it has not been studied in depth for automatic speech recognition models. On this basis, the present application provides a model training method, a model training device, a confidence determining method, a confidence determining device, an electronic device and a storage medium, so as to improve the accuracy of the confidence of the recognition result in automatic speech recognition.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a temperature coefficient prediction model training method and a confidence determination method according to an embodiment of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include one or both of a smart device 101 and a smart device 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the smart device 101, the smart device 102, and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables. The smart devices 101 and 102 may be various electronic devices capable of recognizing voice data, including but not limited to smart speakers, smart phones, tablet computers, and the like. It should be understood that the numbers of smart devices, networks, and servers in fig. 1 are merely illustrative; there may be any number of smart devices, networks, and servers as required by the implementation. For example, the server 104 may be a server cluster composed of multiple servers.
The temperature coefficient prediction model training method and the confidence determining method provided by the embodiment of the present application are generally executed by the server 104, and accordingly, the temperature coefficient prediction model training device and the confidence determining device may be disposed in the server 104. However, it is easily understood by those skilled in the art that the temperature coefficient prediction model training method and the confidence determination method provided in the embodiments of the present application may also be executed by the smart device 101 and the smart device 102. For example, the server 104 may pre-train to generate a coding and decoding model, and obtain the speech data and the text information corresponding to the speech data, and the server 104 processes the speech data through the coding and decoding model, extracts the depth features of the speech data, and obtains the logits output result and the text prediction result. Constructing label data according to text information corresponding to the voice data, and inputting the depth characteristics into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value; inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value; and adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
After the target temperature coefficient prediction model is trained, under the condition that the voice data to be recognized sent by the intelligent device 101 and the intelligent device 102 are received, the voice data to be recognized are processed through the coding and decoding model, the depth features in the voice data are extracted, and the corresponding logits output results are obtained. And further inputting the depth characteristics into a target temperature coefficient prediction model to obtain a temperature coefficient, and determining the confidence coefficient of the text recognition result of the voice data to be recognized according to the temperature coefficient and the logits output result so as to improve the reliability of the confidence coefficient.
The temperature coefficient is a hyper-parameter in the neural network used to adjust the smoothness of the final output of a classification model. For example, assume that the pre-trained coding and decoding model is a three-class model, the true result of the input speech data to be recognized is the first class, and the decoded logits output result is [3, 2, 1]. Without a temperature coefficient, the normalized exponential function yields [0.665, 0.245, 0.090], i.e., the confidence is 0.665; although the model prediction is correct, the given confidence is not high and does not match the actual reliability of the model. Assuming the temperature coefficient prediction model predicts T = 0.5, dividing the logits output result by the temperature coefficient gives [3/0.5, 2/0.5, 1/0.5], and the normalized exponential function yields [0.867, 0.117, 0.016], i.e., the confidence is now 0.867. Therefore, the temperature coefficient does not change the final output result (the category of the maximum value in the logits), but the confidence adjusted by the temperature coefficient reflects the reliability of the model prediction more accurately.
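This example can be reproduced with a short numerical sketch (NumPy; the logits [3, 2, 1] and the temperature coefficient T = 0.5 are exactly the values from the paragraph above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([3.0, 2.0, 1.0])

# Without temperature scaling: the predicted class keeps confidence ~0.665.
print(np.round(softmax(logits), 3))      # [0.665 0.245 0.09 ]

# With temperature coefficient T = 0.5: the argmax (the recognition result)
# is unchanged, but the confidence of the correct class rises to ~0.867.
T = 0.5
print(np.round(softmax(logits / T), 3))  # [0.867 0.117 0.016]
```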
First, the temperature coefficient prediction model training method according to the embodiment of the present application will be described in detail below.
Referring to fig. 2, fig. 2 is a flowchart of a method for training a temperature coefficient prediction model in an embodiment of the present application, which may include the following steps:
step S210, a plurality of first voice data and first text information corresponding to each first voice data are obtained.
In an embodiment of the present application, the first speech data and the first text information are sample data used for training a temperature coefficient prediction model. The first voice data may be audio data collected when the user speaks, and the first text information is text information obtained by performing voice recognition on the first voice data. That is, the first text information and the first voice data are in one-to-one correspondence.
Step S220, inputting the acoustic features extracted from the first voice data into the pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data.
The coding and decoding model is a neural network model comprising a coding module and a decoding module and can be obtained by training in advance. The following describes the training method of the codec model in detail, and is not described herein again. The coding and decoding model can identify voice data to obtain corresponding text information. In the embodiment of the application, the acoustic features extracted from each piece of first voice data are input into the coding and decoding model, so that the depth features of the first voice data can be extracted, and the output results of logits can be obtained. The logits output result refers to an original output value of the neural network model, and a final classification result, namely a text prediction result, can be output after the logits output result is input into the softmax layer for normalization processing. Therefore, the text prediction result is the final output result of the coding and decoding model, and the depth feature and the logits output result are both intermediate results of the coding and decoding model.
Each first voice data may be divided into voice data of a plurality of time steps, where each time step may correspond to one character/word. Acoustic features are extracted from the voice data of each time step, and the extracted acoustic features are input into the coding and decoding model to obtain the corresponding depth features and logits output results. The depth features can be used as input data for training the temperature coefficient prediction model, and the logits output results are used to calculate the loss function values.
Step S230, constructing tag data according to the first text information.
In the embodiment of the application, the first text information is real text information corresponding to the first voice data, and label data for training the temperature coefficient prediction model can be directly constructed according to the first text information. For each character in the first text information, corresponding tag data may be constructed. Optionally, the tag data corresponding to each character is a vector of a preset dimension, and the preset dimension is the total number of text characters in the text character sequence; and if the character is the Nth text character in the text character sequence, the value of the Nth element in the label data is a first numerical value, the values of other elements in the label data are second numerical values, and N is a positive integer not greater than the preset dimension.
The text character sequence may include all text characters, or only frequently used text characters, and the like. For example, for Chinese, the text character sequence may be a sequence of all Chinese characters; for other languages, it may be a sequence of all words in that language. If a certain character in the first text information is the hundredth character in the text character sequence, the value of the hundredth element in the label data corresponding to that character is the first numerical value, and the values of the other elements are the second numerical value. The first numerical value and the second numerical value are used to distinguish the currently recognized character from the other characters in the first text information. For example, the first value may be 1 and the second value 0; the first value and the second value may also be other values, which is not limited in this application.
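A minimal sketch of this label construction, taking the first numerical value as 1 and the second as 0 (the seven-character vocabulary is purely illustrative):

```python
import numpy as np

def build_label(char: str, vocab: list[str]) -> np.ndarray:
    """One-hot tag data: first value (1) at the character's index, second value (0) elsewhere."""
    label = np.zeros(len(vocab), dtype=np.float32)
    label[vocab.index(char)] = 1.0
    return label

# Illustrative text character sequence; a real system would use the full
# character set or the frequently used characters, as described above.
vocab = ["我", "爱", "北", "京", "天", "安", "门"]
print(build_label("北", vocab))  # [0. 0. 1. 0. 0. 0. 0.]
```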
Step S240, inputting the depth feature corresponding to the first voice data into the initial temperature coefficient prediction model to obtain a temperature coefficient prediction value.
It should be noted that the depth feature of the first speech data extracted by the codec model may be input to the initial temperature coefficient prediction model as input data. The initial temperature coefficient prediction model is the original temperature coefficient prediction model whose parameters have not yet been adjusted; the depth features are processed through this model to obtain a temperature coefficient prediction value. It will be appreciated that this prediction is typically not yet accurate; therefore, the parameter values of the initial temperature coefficient prediction model are continually adjusted through the training process to optimize them.
And step S250, inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value.
In the embodiment of the application, the logits output result represents the real logits output result of the first voice data, the temperature coefficient prediction value represents the prediction result, and the temperature coefficient prediction value, the label data and the logits output result are input into the first loss function, so that the loss function value can be determined. Wherein the first loss function includes, but is not limited to: a negative log-likelihood function. Specifically, the confidence prediction result may be obtained according to the logits output result and the temperature coefficient prediction value. And obtaining a loss function value according to the confidence coefficient prediction result and the label data.
And step S260, adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
It can be understood that, in the model training process, the smaller the loss function value is, the higher the accuracy of the temperature coefficient prediction model obtained by training is represented, and therefore, in the case that the loss function value converges to a certain threshold, the training process may be ended, so as to obtain the final target temperature coefficient prediction model.
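Steps S240 through S260 can be summarized in a minimal self-contained training sketch (PyTorch; the two-hidden-layer, 512-dimensional-input, single-output network shape follows the design described later in this document, while the sigmoid output, toy data shapes, learning rate, and convergence threshold are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the quantities described above (shapes are illustrative):
# 10 time steps, 512-dimensional depth features, a 5000-character vocabulary.
depth_feats = torch.randn(10, 512)   # depth features from the codec model
logits = torch.randn(10, 5000)       # logits output results from the codec model
labels = torch.zeros(10, 5000).scatter_(1, torch.randint(0, 5000, (10, 1)), 1.0)  # one-hot tags

# Initial temperature coefficient prediction model; the sigmoid makes the
# output the reciprocal 1/T of the temperature coefficient, in (0, 1).
temp_model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Sigmoid(),
)
optimizer = torch.optim.Adam(temp_model.parameters(), lr=1e-4)

for step in range(1000):
    inv_T = temp_model(depth_feats)                        # step S240: predicted 1/T per time step
    log_probs = torch.log_softmax(logits * inv_T, dim=-1)  # scaling by 1/T == dividing by T
    loss = -(labels * log_probs).sum(dim=-1).mean()        # step S250: negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # step S260: adjust parameters
    if loss.item() < 0.05:                                 # illustrative convergence threshold
        break
```

Note that only the temperature model's parameters are updated; the pre-trained coding and decoding model that produced the depth features and logits stays frozen.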
Fig. 3 is a schematic diagram of a temperature coefficient prediction model training method in the embodiment of the present application, that is, a schematic diagram corresponding to the embodiment of fig. 2, and it can be seen that the temperature coefficient prediction model is obtained by training on the basis of an encoding and decoding model, and training data of the temperature coefficient prediction model is constructed according to output of the encoding and decoding model, so that accuracy of the encoding and decoding model will affect accuracy of the temperature coefficient prediction model. In order to improve the accuracy of the temperature coefficient prediction model, the stability of the coding and decoding model can be verified when the coding and decoding model is trained in advance, so that the stable coding and decoding model is obtained finally.
According to the temperature coefficient prediction model training method described above, on the basis of the pre-trained coding and decoding model, the depth feature of each first voice data is extracted through the coding and decoding model, and the logits output result of each first voice data is obtained. The depth features are input into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value; the temperature coefficient prediction value, the label data and the logits output result are input into a first loss function to determine a loss function value; and the parameters of the initial temperature coefficient prediction model are adjusted according to the loss function value to obtain a target temperature coefficient prediction model. That is, an independent temperature coefficient prediction model is trained on top of the coding and decoding model to predict the temperature coefficient. Because the temperature coefficient is a hyper-parameter in the neural network used to adjust the smoothness of the final output of a classification model, correcting the confidence through the temperature coefficient allows the corrected confidence to describe the output result more accurately, improving the reliability of the output confidence.
The following describes a training method of the codec model in the embodiment of the present application.
In the embodiment of fig. 2, the generated codec model may be trained before the temperature coefficient prediction model is trained. Referring to fig. 4, fig. 4 is a flowchart of a method for training a coding/decoding model in an embodiment of the present application, which may include the following steps:
step S410, a plurality of second voice data and second text information corresponding to each second voice data are acquired.
The training data used for training the codec model and the data used for training the temperature coefficient prediction model may be different data or the same data. To improve the stability of the temperature coefficient prediction model, the first speech data and the first text information used for training the temperature coefficient prediction model may be included in the second speech data and the second text information; that is, the first speech data and first text information may be a subset of, or all of, the second speech data and second text information.
Step S420, extracting the acoustic features in the second speech data and the text features in the second text information.
The extracted acoustic features may be FBank (filter bank) features or MFCCs (Mel-frequency cepstral coefficients), and the extracted text features may include time-series features and the like.
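As an illustration, FBank features could be extracted with torchaudio as below (a sketch assuming a mono 16 kHz recording; the 80 Mel bins and the 25 ms / 10 ms framing are common choices in speech recognition, not parameters specified by this application):

```python
import torchaudio

# Load an utterance (the path is illustrative); waveform has shape (channels, samples).
waveform, sample_rate = torchaudio.load("utterance.wav")

# 80-dimensional log-Mel filter bank (FBank) features, one row per frame.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,
    frame_length=25.0,   # window length in milliseconds
    frame_shift=10.0,    # frame shift in milliseconds
    sample_frequency=sample_rate,
)
print(fbank.shape)  # (num_frames, 80)
```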
And step S430, inputting the acoustic characteristics and the text characteristics into the initial model to obtain a second text prediction result.
In the embodiment of the application, the acoustic feature and the text feature can be simultaneously used as input, and the prediction is performed through the initial model to obtain the second text prediction result. The initial model refers to a coding and decoding model when the network parameter value is not adjusted.
Step S440, determining a loss function value according to the second text prediction result and the second text information by using a preset second loss function.
In this embodiment of the application, the second loss function may be a negative log-likelihood function, and a corresponding loss function value may be obtained according to the second text prediction result, the second text information, and the second loss function. And the loss function value is used for evaluating the difference degree between the predicted value and the true value of the model, and the smaller the loss function value is, the better the robustness of the model is.
And S450, training the initial model based on the loss function value to generate an encoding and decoding model.
During training, the network parameter values are adjusted until the value of the loss function is smaller than a preset threshold; the training of the model is then complete, and the coding and decoding model is generated. To ensure the stability of the coding and decoding model, and hence the stability of the temperature coefficient prediction model, the coding and decoding model can be tested, i.e., its stability verified on test data. For example, when the recognition accuracy reaches 96%, the codec model may be considered relatively stable.
Referring to fig. 5, fig. 5 is a flowchart of another method for training a temperature coefficient prediction model in an embodiment of the present application, which may include the following steps:
step S510, a plurality of first voice data and first text information corresponding to each first voice data are obtained.
This step is the same as step S210 in the embodiment of fig. 2, and specific reference may be made to the description in the embodiment of fig. 2, which is not repeated herein.
Step S520, inputting the acoustic features extracted from the first speech data into the codec model, and obtaining the output features of the coding unit, the input features of the decoding unit, and the logits output results.
In the embodiment of the present application, a method for acquiring a logits output result may refer to the description in the embodiment of fig. 2. For the depth feature of the first speech data, since the coding/decoding model includes the coding unit and the decoding unit, for different types of coding/decoding models, there is usually a difference between the output feature of the coding unit and the input feature of the decoding unit, and therefore, the output feature of the coding unit and the input feature of the decoding unit can be extracted by inputting the acoustic feature into the coding/decoding model.
Step S530, according to the output characteristic and the input characteristic, determining the depth characteristic of the first voice data.
After obtaining the output features of the encoding unit and the input features of the decoding unit, the depth features may be determined based on the output features and the input features. It is understood that the output features and the input features of different types of codec models may also be different, and the method for determining the depth features will be described in detail below for a specific type of codec model, and will not be described in detail here.
And step S540, constructing label data according to the first text information.
Step S550, inputting the depth feature corresponding to the first voice data into the initial temperature coefficient prediction model to obtain a temperature coefficient prediction value.
And step S560, inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value.
And step S570, adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
The steps S540 to S570 are the same as the steps S230 to S260 in the embodiment of fig. 2, and specific reference may be made to the description in the embodiment of fig. 2, which is not repeated herein.
According to the temperature coefficient prediction model training method, the depth characteristics are determined by combining the output characteristics of the coding unit and the input characteristics of the decoding unit, more comprehensive depth characteristics can be extracted, the temperature coefficient prediction model is trained by using the depth characteristics as the input of the temperature coefficient prediction model, and the accuracy of the trained temperature coefficient prediction model can be improved.
The Transformer model is a typical representative of the attention-based encoding-decoding modeling method. It is a neural network model for modeling time series and is currently widely used in natural language processing, machine translation, speech recognition, and other fields. In the field of speech recognition, the acoustic features of speech data are input into a Transformer model, and the corresponding text information can be output. In addition, the Transformer model is widely favored by industry for its parallel computation capability.
In the embodiment of the present application, the encoding/decoding model may be a Transformer model, and the following describes a training method of the temperature coefficient prediction model by taking the Transformer model as an example.
After the Transformer model is generated according to the method shown in fig. 4, a temperature coefficient prediction model may be further trained on the basis of the Transformer model. Referring to fig. 6, fig. 6 is a flowchart of another method for training a temperature coefficient prediction model in an embodiment of the present application, which may include the following steps:
step S610, a plurality of first voice data and first text information corresponding to each first voice data are obtained.
This step is the same as step S210 in the embodiment of fig. 2, and specific reference may be made to the description in the embodiment of fig. 2, which is not repeated herein.
In step S620, the acoustic features extracted from the first speech data are input into the Transformer model, and the output features of the coding unit and the input features of the encoding-decoding attention layer are obtained.
Referring to fig. 7, fig. 7 is a schematic diagram of a Transformer model, which may include: n encoding units and N decoding units; each coding unit may include a self-attention layer, a summation normalization layer, a feed-forward neural network, and a summation normalization layer. The decoding unit includes: a self-attention layer, a summation normalization layer, an encoding-decoding attention layer, a summation normalization layer, a feed-forward neural network, and a summation normalization layer, where N is a positive integer. The number of the coding units and the decoding units shown in fig. 7 is 1, and there may be more coding units and decoding units in general, for example, the number of the coding units and the decoding units may be 5, 6, etc., which is not limited herein.
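For orientation, one such encoding unit coincides with the standard Transformer encoder layer available in PyTorch (a sketch; d_model = 512, 8 attention heads, and N = 6 stacked units are illustrative values, not fixed by this application):

```python
import torch
import torch.nn as nn

# One encoding unit: self-attention -> add & norm -> feed-forward -> add & norm.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
# N stacked encoding units.
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

frames = torch.randn(100, 1, 512)  # (time, batch, feature): an acoustic feature sequence
print(encoder(frames).shape)       # torch.Size([100, 1, 512])
```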
In the embodiment of the present application, the output characteristic of the encoding unit may be the output characteristic of any encoding unit, and the input characteristic of the encoding-decoding attention layer may be that of the encoding-decoding attention layer in any decoding unit. When the output characteristic of the coding unit is that of the last coding unit and the input characteristic of the encoding-decoding attention layer is that of the encoding-decoding attention layer in the first decoding unit, the obtained depth characteristic is more complete and more accurate. The output characteristics of the last encoding unit and the input characteristics of the encoding-decoding attention layer in the first decoding unit are shown by the dotted arrows in fig. 7. The input to the encoding section of the Transformer model is the acoustic features extracted from the first speech data, and the input to the decoding section is the text features of the historical text information that has been output up to the present moment. For example, if the first voice data is "I love Beijing Tiananmen" and the character "天" (the first character of "Tiananmen") is currently being recognized, the historical text information already output at the present moment is "I love Beijing"; after feature extraction, this historical text information is input into the decoding unit.
Similarly, in the training phase, step S430, the acoustic features and the text features are input into the initial model, i.e., the acoustic features are input into the encoding unit, and the text features are input into the decoding unit.
Step S630, determining a depth feature of the first speech data according to the output feature and the input feature.
For the Transformer model, the output features of the last coding unit may include a key (K) matrix and a value (V) matrix, and the input features of the encoding-decoding attention layer in the first decoding unit include a query (Q) matrix; the depth feature of the first voice data is determined according to the K matrix, the V matrix and the Q matrix.
Alternatively, the depth feature f of the first voice data may be determined according to the following formula:

$$f = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where softmax represents the normalized exponential function, $d_k$ represents the number of columns (i.e., the vector dimension) of the Q matrix and the K matrix, and $K^{\top}$ represents the transpose of the K matrix. The formula computes the inner product of each row vector of the Q matrix with the K matrix; the result is divided by the square root of $d_k$ to prevent the inner products from becoming too large.
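A minimal sketch of this depth-feature computation (single-head scaled dot-product attention; the names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def depth_feature(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """f = softmax(Q K^T / sqrt(d_k)) V, as in the formula above."""
    d_k = Q.size(-1)                               # number of columns of Q and K
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # divide by sqrt(d_k) to limit inner products
    return F.softmax(scores, dim=-1) @ V

# Example: one decoding step attending over 50 encoder frames, d_k = 512.
Q = torch.randn(1, 512)    # query matrix from the first decoding unit's attention layer
K = torch.randn(50, 512)   # key matrix from the last encoding unit
V = torch.randn(50, 512)   # value matrix from the last encoding unit
print(depth_feature(Q, K, V).shape)  # torch.Size([1, 512])
```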
Because the K matrix and the V matrix integrate audio context information, and the Q matrix considers historical text information which is output at the current moment, the depth characteristics obtained by the method provide rich acoustic information and text information for the training of the temperature coefficient prediction model.
Step S640, constructing tag data according to the first text information.
In the embodiment of the application, when the tag data are constructed with the first value 1 and the second value 0, the output of the trained target temperature coefficient prediction model is, for stability, the reciprocal of the temperature coefficient. The tag data can be constructed in this manner because there is a strong positive correlation between the tag data and the reciprocal of the temperature coefficient. Specifically, when the element corresponding to the character in its tag data has value 1, the output result is correct, and the greater the output confidence of the model, the more reliable the model, i.e., the reciprocal of the temperature coefficient is close to 1; when the element corresponding to the character has value 0, the output result is wrong, and the smaller the output confidence of the model, the more reliable the model, i.e., the reciprocal of the temperature coefficient is close to 0.
In the embodiment of the present application, the first value and the second value may also be other values, for example, the first value is 0, the second value is 1, and the like. It will be appreciated that where the first and second values are other values, the output of the temperature coefficient prediction model may be the temperature coefficient, or the output of the temperature coefficient prediction model and the temperature coefficient may satisfy a formula that may be determined based on the particular values of the first and second values.
Step S650, inputting the depth feature corresponding to the first voice data into the initial temperature coefficient prediction model to obtain a temperature coefficient prediction value.
And step S660, inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value.
In the embodiment of the application, the temperature coefficient prediction model can adopt a feedforward neural network with two hidden layers, the input is 512-dimensional, and the last layer is single output. The first loss function may employ a negative log-likelihood function, which may be expressed as:
$$\mathrm{Loss} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{n=1}^{L}\log\!\left(y_{i,n}^{\top}\,\mathrm{softmax}\!\left(\frac{z_n}{T(f_n)}\right)\right)$$

where Loss represents the loss function value, B represents the batch size, $z_n$ represents the logits output result of the nth time step, $f_n$ represents the depth feature of the nth time step, $T(f_n)$ represents the temperature coefficient predicted value of the nth time step, and $y_{i,n}$ represents the tag data of the character corresponding to the nth time step of the ith batch. For example, in a multi-class Transformer model, if the character at the current time corresponds to the 2nd text character in the text character sequence, then $y_{i,n} = [0, 1, 0, 0, \ldots]$. If the first voice data corresponds to L time steps, n is an integer from 1 to L, and L is a positive integer.
And step S670, adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
According to the temperature coefficient prediction model training method of the embodiment of the application, a stable Transformer model is generated through pre-training; depth features containing the audio information (namely the output features of the coding unit) and the historical text information already output at the current moment (namely the input features of the decoding unit) are extracted through the Transformer model and used to train an independent temperature coefficient prediction model, so that the temperature coefficient is predicted by the temperature coefficient prediction model. Further, the confidence can be corrected by the temperature coefficient, improving the reliability of the output confidence.
Referring to fig. 8, fig. 8 is a flowchart of a confidence determination method in an embodiment of the present application, which may include the following steps:
step S810, acquiring voice data to be recognized, and extracting acoustic features of the voice data to be recognized.
And S820, inputting the acoustic features into the pre-trained coding and decoding model to obtain the depth features and the logits output results of the voice data to be recognized.
Similar to the training process, after the acoustic features of the speech data to be recognized are extracted, the extracted acoustic features are input into the coding and decoding model for decoding, so that the depth features of each time step can be obtained, and simultaneously, the logits output result of the time step is obtained.
Step S830, inputting the depth characteristics into a target temperature coefficient prediction model which is trained in advance to obtain a temperature coefficient; and the target temperature coefficient prediction model is obtained by training based on the temperature coefficient prediction model training method.
The method for training the temperature coefficient prediction model may refer to the foregoing embodiment of fig. 2 and embodiment of fig. 5, and is not repeated here. The depth features are input into a temperature coefficient prediction model, and the temperature coefficient of the decoding result of the time step can be predicted.
And step S840, determining the confidence of the text recognition result of the voice data to be recognized according to the temperature coefficient and the logits output result.
In the embodiment of the present application, after the temperature coefficient and the logits output result are obtained, the confidence $C_n$ of the text recognition result of the nth time step of the voice data to be recognized may be determined according to the following formula:

$$C_n = \max\;\mathrm{softmax}\!\left(\frac{z_n}{T(f_n)}\right)$$

where $z_n$ denotes the logits output result of the nth time step, $f_n$ denotes the depth feature of the nth time step, and $T(f_n)$ denotes the temperature coefficient predicted value of the nth time step. If the voice data to be recognized corresponds to S time steps, n is an integer from 1 to S, and S is a positive integer.
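A sketch of this per-time-step confidence computation, reusing the worked example from earlier in the description (logits [3, 2, 1] and T = 0.5; the function name is illustrative):

```python
import torch

def confidence(logits_n: torch.Tensor, T_n: float) -> float:
    """C_n = max softmax(z_n / T_n) for one decoding time step."""
    return torch.softmax(logits_n / T_n, dim=-1).max().item()

z = torch.tensor([3.0, 2.0, 1.0])
print(round(confidence(z, 1.0), 3))  # 0.665 (no temperature scaling)
print(round(confidence(z, 0.5), 3))  # 0.867 (with the predicted T = 0.5)
```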
Referring to fig. 9, fig. 9 is a schematic diagram of a confidence determining method in an embodiment of the present application, when acoustic features of speech data to be recognized are processed through a coding/decoding model, similar to a training process, text features of historical text information may be extracted with reference to historical text information that has been output at a current time, and depth features are obtained based on the acoustic features and the text features, so that the depth features include audio information and the historical text information that has been output at the current time, and accuracy of the depth features is improved. Then, when predicting the temperature coefficient from the depth feature, the accuracy of the temperature coefficient prediction can be improved. Further, the accuracy of the confidence determined based on the temperature coefficient may be improved.
According to the above confidence determining method, the coding and decoding model and the temperature coefficient prediction model are trained in advance; the depth features are extracted through the coding and decoding model and processed through the temperature coefficient prediction model to obtain the temperature coefficient of each time step, and the logits output result of the current time step is divided by the temperature coefficient. The distribution of the output probability can thus be adjusted without changing the recognition result, so that the confidence is corrected; the corrected confidence describes the model's output result more accurately, improving the reliability of the output confidence.
Corresponding to the above method embodiment, an embodiment of the present application further provides a temperature coefficient prediction model training apparatus, and referring to fig. 10, the temperature coefficient prediction model training apparatus 1000 includes:
a first sample data obtaining module 1010, configured to obtain a plurality of first voice data and first text information corresponding to each first voice data;
a data processing module 1020, configured to input the acoustic features extracted from the first speech data into a pre-trained codec model, so as to obtain depth features and logits output results of the first speech data;
a tag data construction module 1030, configured to construct tag data according to the first text information;
the temperature coefficient value prediction module 1040 is configured to input the depth feature corresponding to the first speech data into the initial temperature coefficient prediction model to obtain a temperature coefficient prediction value;
a first loss function value determining module 1050, configured to input the temperature coefficient prediction value, the tag data, and the logits output result into a first loss function, and determine a loss function value;
and the target temperature coefficient prediction model training module 1060 is used for adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
In an alternative embodiment, the codec model includes: an encoding unit and a decoding unit;
the data processing module is specifically used for determining the depth feature of the first voice data through the following steps:
inputting the acoustic features extracted from the first voice data into a coding and decoding model to obtain the output features of a coding unit and the input features of a decoding unit; based on the output features and the input features, depth features of the first speech data are determined.
In an alternative embodiment, the coding/decoding model is a Transformer model, and the decoding unit comprises a coding-decoding attention layer;
the data processing module is specifically used for inputting the acoustic features extracted from the first voice data into the coding and decoding model to obtain the output features of the coding unit and the input features of the decoding unit through the following steps:
the acoustic features extracted from the first speech data are input into the Transformer model, resulting in the output features of the coding unit and the input features of the encoding-decoding attention layer.
In an optional implementation manner, the data processing module is specifically configured to determine the depth feature of the first speech data according to the output feature and the input feature by:
the output features include: a key value K matrix and a value V matrix, and the input features include: a query Q matrix;
and determining the depth characteristic of the first voice data according to the K matrix, the V matrix and the Q matrix.
In an optional implementation manner, the data processing module is specifically configured to determine the depth feature of the first speech data according to the K matrix, the V matrix, and the Q matrix by:
according to the following formula:

f = softmax(QK^T / √d_k) V

determining a depth feature f of the first voice data; wherein softmax represents a normalized exponential function, d_k represents the number of columns of the Q matrix and the K matrix, and K^T represents the transpose of the K matrix.
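As a sketch, this formula corresponds to standard scaled dot-product attention; the shapes and names below are illustrative assumptions rather than the patent's implementation:

```python
import math
import torch

def depth_feature(Q, K, V):
    """Scaled dot-product attention: f = softmax(QK^T / sqrt(d_k)) V.

    Q: (steps, d_k) query matrix from the encoding-decoding attention layer;
    K, V: (frames, d_k) and (frames, d_v) output features of the coding unit.
    """
    d_k = Q.size(-1)  # number of columns of the Q and K matrices
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(torch.softmax(scores, dim=-1), V)
```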
In an optional implementation manner, the tag data construction module is specifically configured to construct, for each character in the first text information, tag data corresponding to the character, where the tag data is a vector of a preset dimension, and the preset dimension is a total number of text characters in a text character sequence; and if the character is the Nth text character in the text character sequence, the value of the Nth element in the label data is a first numerical value, the values of other elements in the label data are second numerical values, and N is a positive integer not greater than the preset dimension.
In an alternative embodiment, if the first value is 1 and the second value is 0, the output of the target temperature coefficient prediction model generated by training is the inverse of the temperature coefficient.
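A short sketch of this one-hot label construction, assuming a hypothetical `char_to_index` mapping from each character to its position in the text character sequence:

```python
import torch

def build_label(char: str, char_to_index: dict, vocab_size: int) -> torch.Tensor:
    """Tag data for one character: the Nth element is the first value (1)
    when the character is the Nth entry of the text character sequence,
    and all other elements are the second value (0)."""
    label = torch.zeros(vocab_size)  # vector of the preset dimension
    label[char_to_index[char]] = 1.0
    return label
```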
In an optional implementation manner, the training apparatus for a temperature coefficient prediction model further includes:
the second sample data acquisition module is used for acquiring a plurality of second voice data and second text information corresponding to each second voice data;
the feature extraction module is used for extracting acoustic features in the second voice data and text features in the second text information;
the prediction module is used for inputting the acoustic characteristics and the text characteristics into the initial model to obtain a second text prediction result;
the second loss function value determining module is used for determining a loss function value according to a second text prediction result and second text information by using a preset second loss function;
and the coding and decoding model training module is used for training the initial model based on the loss function value to generate a coding and decoding model.
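These modules describe a standard encoder-decoder pre-training flow, sketched below under assumptions: `initial_model` consumes acoustic features and teacher-forced text features, `pretrain_loader` is a hypothetical data loader, and cross-entropy is one concrete choice for the preset second loss function (the patent does not fix its exact form):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(initial_model.parameters(), lr=1e-4)

for acoustic_feats, text_feats, target_ids in pretrain_loader:
    # Second text prediction result: logits over characters, (batch, steps, vocab).
    logits = initial_model(acoustic_feats, text_feats)
    # Second loss function, sketched here as cross-entropy between the
    # prediction and the second text information (character indices).
    loss = F.cross_entropy(logits.transpose(1, 2), target_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```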
In an alternative embodiment, the first loss function comprises a negative log-likelihood function expressed as:
Loss = -(1/B) · Σ_{i=1..B} Σ_{n=1..L} y_{i,n} · log softmax(z_n / T(f_n))

wherein Loss represents the loss function value, B represents the size of the batch, z_n represents the logits output result of the nth time step, f_n represents the depth feature of the nth time step, T(f_n) represents the predicted value of the temperature coefficient of the nth time step, and y_{i,n} represents the tag data of the character corresponding to the nth time step of the ith batch; if the first voice data corresponds to L time steps, n is an integer from 1 to L, and L is a positive integer.
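A direct transcription of this loss into code might look as follows (a sketch; the tensor shapes are assumptions):

```python
import torch

def first_loss(logits, temps, labels):
    """Negative log-likelihood of the temperature-scaled softmax.

    logits: (B, L, V) logits output z_n per time step;
    temps:  (B, L, 1) temperature coefficient prediction values T(f_n);
    labels: (B, L, V) one-hot tag data y_{i,n}.
    """
    log_probs = torch.log_softmax(logits / temps, dim=-1)
    per_step = (labels * log_probs).sum(dim=-1)  # log-prob of the target character
    return -per_step.sum(dim=-1).mean()          # sum over time steps, mean over batch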
Referring to fig. 11, fig. 11 is a schematic structural diagram of a confidence determination apparatus 1100 according to an embodiment of the present application. The apparatus includes:
the acoustic feature extraction module 1110 is configured to acquire voice data to be recognized and extract an acoustic feature of the voice data to be recognized;
the data processing module 1120 is configured to input the acoustic features into the pre-trained codec model to obtain depth features and logits output results of the speech data to be recognized;
a temperature coefficient determination module 1130, configured to input the depth feature into a pre-trained target temperature coefficient prediction model to obtain a temperature coefficient; the target temperature coefficient prediction model is obtained by training based on the temperature coefficient prediction model training method;
and the confidence level determining module 1140 is used for determining the confidence level of the text recognition result of the voice data to be recognized according to the temperature coefficient and the output result of the logits.
In an optional implementation manner, the confidence determining module is specifically configured to determine the confidence according to the following formula:

C_n = max softmax(z_n / T(f_n))

wherein C_n is the confidence of the text recognition result at the nth time step of the speech data to be recognized, z_n represents the logits output result of the nth time step, f_n represents the depth feature of the nth time step, and T(f_n) represents the predicted value of the temperature coefficient of the nth time step; if the voice data to be recognized corresponds to S time steps, n is an integer from 1 to S, and S is a positive integer.
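In code, the per-time-step confidence computation can be sketched as follows (shapes are assumptions):

```python
import torch

def confidence(logits, temps):
    """Confidence C_n per time step: the maximum temperature-scaled probability.

    logits: (S, V) logits output per time step;
    temps:  (S, 1) predicted temperature coefficients.
    """
    probs = torch.softmax(logits / temps, dim=-1)
    return probs.max(dim=-1).values  # (S,) confidence per time step
```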
The details of each module or unit in the above device have been described in detail in the corresponding method, and therefore are not described herein again.
An embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor; when executed by the at least one processor, the computer program causes the electronic device to perform the methods of the embodiments of the present application.
Embodiments of the present application also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform the method of the embodiments of the present application.
Embodiments of the present application further provide a computer program product, including a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform the method of the embodiments of the present application.
Referring to fig. 12, a block diagram of the structure of an electronic device 1200 is now described; the electronic device may be a server or a client of the present application and is an example of a hardware device to which aspects of the present application may be applied. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 12, the electronic device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the electronic device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the electronic device 1200 are connected to the I/O interface 1205, including: an input unit 1206, an output unit 1207, a storage unit 1208, and a communication unit 1209. The input unit 1206 may be any type of device capable of inputting information to the electronic device 1200; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 1207 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 1208 may include, but is not limited to, magnetic or optical disks. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1201 performs the respective methods and processes described above. For example, in some embodiments, the temperature coefficient prediction model training method and the confidence determination method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. In some embodiments, the computing unit 1201 may be configured to perform the temperature coefficient prediction model training method or the confidence determination method in any other suitable manner (e.g., by means of firmware).
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (15)

1. A method for training a temperature coefficient prediction model, the method comprising:
acquiring a plurality of first voice data and first text information corresponding to each first voice data;
inputting the acoustic features extracted from the first voice data into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data;
constructing label data according to the first text information;
inputting the depth features corresponding to the first voice data into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value;
inputting the temperature coefficient predicted value, the label data and the logits output result into a first loss function, and determining a loss function value;
and adjusting parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
2. The method of claim 1, wherein the coding and decoding model comprises: an encoding unit and a decoding unit;
inputting the acoustic features extracted from the first voice data into the coding and decoding model to obtain the depth features of the first voice data, wherein the depth features comprise:
inputting the acoustic features extracted from the first voice data into the coding and decoding model to obtain the output features of the coding unit and the input features of the decoding unit;
determining a depth feature of the first speech data based on the output feature and the input feature.
3. The method of claim 2, wherein the codec model is a Transformer model, and wherein the decoding unit comprises an encoding-decoding attention layer;
inputting the acoustic features extracted from the first speech data into the Transformer model to obtain the output features of the encoding unit and the input features of the decoding unit, including:
inputting the acoustic features extracted from the first speech data into the Transformer model, and obtaining the output features of the encoding unit and the input features of the encoding-decoding attention layer.
4. The method of claim 3, wherein determining the depth feature of the first speech data based on the output feature and the input feature comprises:
the output features include: a key value K matrix and a value V matrix, and the input features include: a query Q matrix;
and determining the depth characteristic of the first voice data according to the K matrix, the V matrix and the Q matrix.
5. The method of claim 4, wherein determining the depth feature of the first speech data according to the K matrix, the V matrix, and the Q matrix comprises:
according to the following formula:
f = softmax(QK^T / √d_k) V

determining a depth feature f of the first speech data; wherein softmax represents a normalized exponential function, d_k represents the number of columns of the Q matrix and the K matrix, and K^T represents the transpose of the K matrix.
6. The method of claim 1, wherein constructing tag data from the first textual information comprises:
for each character in the first text information, constructing label data corresponding to the character, wherein the label data are vectors of preset dimensions, and the preset dimensions are the total number of text characters in a text character sequence; and if the character is the Nth text character in the text character sequence, the value of the Nth element in the label data is a first numerical value, the values of other elements in the label data are second numerical values, and N is a positive integer not greater than the preset dimension.
7. The method of claim 6, wherein if the first value is 1 and the second value is 0, the output of the trained target temperature coefficient prediction model is the inverse of the temperature coefficient.
8. The method of claim 1, wherein the method for training the codec model comprises:
acquiring a plurality of second voice data and second text information corresponding to each second voice data;
extracting acoustic features in the second voice data and text features in the second text information;
inputting the acoustic features and the text features into an initial model to obtain a second text prediction result;
determining a loss function value according to the second text prediction result and the second text information by using a preset second loss function;
and training the initial model based on the loss function value to generate the coding and decoding model.
9. The method of claim 6, wherein the first loss function comprises a negative log-likelihood function represented as:
Loss = -(1/B) · Σ_{i=1..B} Σ_{n=1..L} y_{i,n} · log softmax(z_n / T(f_n))

wherein Loss represents the loss function value, B represents the size of the batch, z_n represents the logits output result of the nth time step, f_n represents the depth feature of the nth time step, T(f_n) represents the predicted value of the temperature coefficient of the nth time step, and y_{i,n} represents the tag data of the character corresponding to the nth time step of the ith batch; if the first voice data corresponds to L time steps, n is an integer from 1 to L, and L is a positive integer.
10. A confidence determination method, the method comprising:
acquiring voice data to be recognized, and extracting acoustic features of the voice data to be recognized;
inputting the acoustic features into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the voice data to be recognized;
inputting the depth characteristics into a target temperature coefficient prediction model which is trained in advance to obtain a temperature coefficient; wherein the target temperature coefficient prediction model is trained based on the method of any one of claims 1-9;
and determining the confidence coefficient of the text recognition result of the voice data to be recognized according to the temperature coefficient and the output result of the logits.
11. The method according to claim 10, wherein the determining the confidence level of the text recognition result of the speech data to be recognized according to the temperature coefficient and the logits output result comprises:
according to the following formula:
C_n = max softmax(z_n / T(f_n))

determining the confidence C_n of the text recognition result of the nth time step of the voice data to be recognized; wherein z_n represents the logits output result of the nth time step, f_n represents the depth feature of the nth time step, and T(f_n) represents the predicted value of the temperature coefficient of the nth time step; if the voice data to be recognized corresponds to S time steps, n is an integer from 1 to S, and S is a positive integer.
12. A temperature coefficient prediction model training apparatus, the apparatus comprising:
the device comprises a first sample data acquisition module, a second sample data acquisition module and a processing module, wherein the first sample data acquisition module is used for acquiring a plurality of first voice data and first text information corresponding to each first voice data;
the data processing module is used for inputting the acoustic features extracted from the first voice data into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the first voice data;
the tag data construction module is used for constructing tag data according to the first text information;
the temperature coefficient value prediction module is used for inputting the depth characteristics corresponding to the first voice data into an initial temperature coefficient prediction model to obtain a temperature coefficient prediction value;
a first loss function value determining module, configured to input the temperature coefficient prediction value, the tag data, and the logits output result into a first loss function, and determine a loss function value;
and the target temperature coefficient prediction model training module is used for adjusting the parameters of the initial temperature coefficient prediction model according to the loss function value to obtain a target temperature coefficient prediction model.
13. A confidence determination apparatus, the apparatus comprising:
the acoustic feature extraction module is used for acquiring voice data to be recognized and extracting acoustic features of the voice data to be recognized;
the data processing module is used for inputting the acoustic features into a pre-trained coding and decoding model to obtain the depth features and the logits output results of the voice data to be recognized;
the temperature coefficient determining module is used for inputting the depth characteristics into a pre-trained target temperature coefficient prediction model to obtain a temperature coefficient; wherein the target temperature coefficient prediction model is trained based on the method of any one of claims 1-9;
and the confidence coefficient determining module is used for determining the confidence coefficient of the text recognition result of the voice data to be recognized according to the temperature coefficient and the output result of the logits.
14. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-9 or to perform the method of claim 10 or 11.
15. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9 or perform the method of claim 10 or 11.
CN202111107722.3A 2021-09-22 2021-09-22 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium Active CN113555005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111107722.3A CN113555005B (en) 2021-09-22 2021-09-22 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111107722.3A CN113555005B (en) 2021-09-22 2021-09-22 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113555005A (en) 2021-10-26
CN113555005B (en) 2021-12-28

Family

ID=78106572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111107722.3A Active CN113555005B (en) 2021-09-22 2021-09-22 Model training method, model training device, confidence determining method, confidence determining device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113555005B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116253B (en) * 2023-10-23 2024-01-12 摩尔线程智能科技(北京)有限责任公司 Training method and device of initial model, and voice recognition method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021137754A1 (en) * 2019-12-31 2021-07-08 National University Of Singapore Feedback-controlled voice conversion
CN111461353A (en) * 2020-04-17 2020-07-28 支付宝(杭州)信息技术有限公司 Model training method and system
CN111883101B (en) * 2020-07-13 2024-02-23 北京百度网讯科技有限公司 Model training and speech synthesis method, device, equipment and medium
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN112599117B (en) * 2021-03-03 2021-05-07 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium
CN113223502B (en) * 2021-04-28 2024-01-30 平安科技(深圳)有限公司 Speech recognition system optimization method, device, equipment and readable storage medium
CN112951240B (en) * 2021-05-14 2021-10-29 北京世纪好未来教育科技有限公司 Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113555005A (en) 2021-10-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant