CN114067786A - Voice recognition method and device, electronic equipment and storage medium


Info

Publication number
CN114067786A
CN114067786A
Authority
CN
China
Prior art keywords
sample
recognition
training
loss function
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010739778.XA
Other languages
Chinese (zh)
Inventor
孙思宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010739778.XA
Publication of CN114067786A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of speech recognition, and discloses a speech recognition method and apparatus, an electronic device, and a storage medium. The speech recognition method includes: acquiring speech to be recognized; and recognizing the speech to be recognized through a speech recognition model to obtain a speech recognition result. The speech recognition model is trained as follows: train an initial speech recognition model based on each initial training sample to obtain a preliminarily trained recognition model; obtain the recognition error-prone rate characterization information corresponding to each initial training sample when the sample is recognized by the preliminarily trained recognition model; select target samples from the initial training samples according to the characterization information; and train the preliminarily trained recognition model based on the target samples to obtain the speech recognition model. The scheme provided by the application can improve the recognition accuracy of the speech recognition model.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a storage medium.
Background
In speech recognition scenarios, trained models are increasingly used to perform speech recognition. However, because a language contains many characters with identical or similar pronunciations, a trained model often produces recognition errors such as ungrammatical sentences or inappropriate word choices.
To improve the recognition accuracy of a speech recognition model, the prior art often performs minimum word error rate training, in which all sample data from the model training process participate. However, for training sentences that can already be recognized correctly and are highly distinguishable, minimum word error rate training brings no benefit: such data has already achieved a small average word error rate, so including these sentences wastes training resources. As a result, minimum word error rate training is inefficient, and the speech recognition performance of the trained model still needs improvement.
Disclosure of Invention
The purpose of the present application is to solve at least one of the above technical drawbacks, and to provide the following solutions:
In one aspect of the present application, a speech recognition method is provided, including:
acquiring speech to be recognized;
recognizing the speech to be recognized through a speech recognition model to obtain a speech recognition result; wherein the speech recognition model is trained as follows:
training an initial speech recognition model based on each initial training sample to obtain a preliminarily trained recognition model;
acquiring the recognition error-prone rate characterization information corresponding to each initial training sample when the sample is recognized by the preliminarily trained recognition model;
selecting target samples from the initial training samples according to the characterization information;
and training the preliminarily trained recognition model based on the target samples to obtain the speech recognition model.
In another aspect of the present application, a speech recognition apparatus is provided, including:
a to-be-recognized speech acquisition module, configured to acquire speech to be recognized;
a recognition result obtaining module, configured to recognize the speech to be recognized through a speech recognition model to obtain a speech recognition result;
wherein the speech recognition model is trained by a training apparatus, and the training apparatus includes:
a preliminary training module, configured to train an initial speech recognition model based on each initial training sample to obtain a preliminarily trained recognition model;
a target sample screening module, configured to obtain the recognition error-prone rate characterization information corresponding to each initial training sample when the sample is recognized by the preliminarily trained recognition model, and to select target samples from the initial training samples according to the characterization information;
and a model retraining module, configured to train the preliminarily trained recognition model based on the target samples to obtain the speech recognition model.
In a further aspect of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the speech recognition method of the first aspect of the present application.
In yet another aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the speech recognition method of the first aspect of the present application.
The technical solutions provided by the present application have the following beneficial effects:
In the speech recognition method of the present application, the initial training samples are screened using the recognition error-prone rate characterization information, so the initial training samples most likely to be misrecognized can be selected as target samples, and the speech recognition model is trained based on those target samples. Because the target samples are selected from the initial training samples according to the recognition error-prone rate characterization information, the training samples with a high recognition error-prone rate receive targeted training, which improves the performance of the trained speech recognition model and thus the accuracy of speech recognition.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a speech recognition method provided by an embodiment of the present application;
fig. 2 is a flowchart of obtaining the recognition error-prone rate characterization information corresponding to each initial training sample when the sample is recognized by the preliminarily trained recognition model, according to an embodiment of the present application;
FIG. 3 is a flow chart of screening target samples and training a speech recognition model using the target samples according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating iterative training of the preliminarily trained recognition model based on each of the target samples according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating calculation of the value of the first loss function from the probabilities and the numbers of error characters corresponding to the candidate recognition results of each target sample according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Table 1 is a recognition probability distribution table of the top-5 candidates for each character in the recognition result of the initial training sample "导航到故宫" ("navigate to the Forbidden City") provided in an embodiment of the present application;
Table 2 is a comparison table of the information entropies corresponding to the characters in the initial training sample "导航到故宫" ("navigate to the Forbidden City") provided in an embodiment of the present application;
Table 3 is a comparison table of experimental results provided in an example of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Minimum word error rate training: a way to reduce the recognition error rate of a model at the character level.
Cross entropy: used to compute the difference between the distribution learned by the model and the training distribution. The cross entropy loss function can measure the similarity between the distribution of the true labels and the predicted label distribution of the trained model.
Beam search: a heuristic method for solving optimization problems; it heuristically predicts the K most promising paths and only searches downward along those K paths, i.e., only a limited number K of nodes are retained at each layer.
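As a rough illustration only (not code from the patent), the following Python sketch shows this kind of beam search; `step_log_probs` is a hypothetical scoring function standing in for the recognition model's per-step output.

```python
from typing import Callable, List, Tuple

def beam_search(step_log_probs: Callable[[List[int]], List[float]],
                vocab_size: int, max_len: int, k: int) -> List[Tuple[List[int], float]]:
    """Keep only the K most promising partial paths at each step."""
    beams: List[Tuple[List[int], float]] = [([], 0.0)]  # (path, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for path, score in beams:
            log_probs = step_log_probs(path)  # log P(next token | path), length vocab_size
            for token in range(vocab_size):
                candidates.append((path + [token], score + log_probs[token]))
        # retain only the K best nodes in this layer
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```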
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied across all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning. In the present application, the preliminarily trained recognition model and the speech recognition model are obtained through machine learning.
During research, the inventor found that in minimum word error rate training, if consistency between the predicted text with the maximum prediction probability and the sample text is used as the selection criterion for target samples, only a limited number of target samples can be obtained, and the insufficient target samples make the training results insufficiently accurate.
In view of the technical problems in the prior art, the present application provides a speech recognition method, apparatus, electronic device, and storage medium, aiming to solve at least one of the above technical problems in the prior art.
the following describes the technical solutions of the present application and how to solve the above technical problems in detail with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
An embodiment of the present application provides a possible implementation. As shown in fig. 1, a flowchart of a speech recognition method is provided; the scheme may be executed by any electronic device, optionally at a server side, and includes the following steps:
Step S110: acquiring speech to be recognized;
Step S120: recognizing the speech to be recognized through a speech recognition model to obtain a speech recognition result; the speech recognition model is obtained by training in the manner provided in the following steps S130 to S160:
Step S130: training the initial speech recognition model based on each initial training sample to obtain a preliminarily trained recognition model;
Step S140: obtaining the recognition error-prone rate characterization information corresponding to each initial training sample when the sample is recognized by the preliminarily trained recognition model;
Step S150: selecting target samples from the initial training samples according to the characterization information;
Step S160: training the preliminarily trained recognition model based on the target samples to obtain the speech recognition model.
The scheme provided by the application can be applied to, but is not limited to, the following scenario: an electronic device (such as a server or a user terminal) receives a speech recognition request for speech to be recognized; the server responds to the request, recognizes the speech through the trained speech recognition model, and obtains a speech recognition result, i.e., the text information corresponding to the speech, which can be provided to the user or to the requesting end. For example, the scheme can be applied to an instant messaging application through which two or more users exchange information by text or speech. If a user receives voice information sent by another user, the user can learn the content directly by playing the speech: when the user terminal or server receives a speech playing request (for example, the user clicks the received voice message), it recognizes the speech corresponding to the request and plays it to the user. Alternatively, when the user initiates a request to convert received speech into text, the user terminal or server may directly display the corresponding text to the user.
Optionally, the speech recognition model is trained as follows:
A plurality of initial training samples for model training are obtained, and the initial speech recognition model is trained based on each initial training sample to obtain a preliminarily trained recognition model. The preliminarily trained recognition model is the model that meets the convergence condition during this training stage; that is, it can basically recognize the speech samples correctly, but its recognition accuracy can be improved further.
The recognition error-prone rate characterization information corresponding to each initial training sample, when recognized by the preliminarily trained recognition model, is then acquired. This information characterizes the error-prone rate of the recognition result when the initial training sample is recognized by the preliminarily trained recognition model: the higher the likelihood that the recognition result contains an error, the higher the probability that the sample is selected as a target sample.
Target samples are selected from the initial training samples according to the recognition error-prone rate characterization information, and the preliminarily trained recognition model is trained based on the target samples to obtain the speech recognition model.
The method screens the initial training samples based on their recognition error-prone rate characterization information to obtain target samples satisfying a screening condition. The screening condition may be: the N initial training samples whose recognition error-prone rates rank highest, or whose recognition error-prone rates exceed a preset threshold, where N is a positive integer greater than 1. Further training the recognition model based on these target samples helps optimize its recognition error-prone rate; that is, compared with the preliminarily trained recognition model, the speech recognition model obtained by training on the target samples has a lower recognition error-prone rate.
In the present application, target samples are screened according to the characterization information of the recognition model's error-prone rate on the initial training samples, so the initial training samples with a high recognition error-prone rate can be selected as target samples and the speech recognition model trained on them. Because the target samples are selected from the initial training samples according to the recognition error-prone rate characterization information, training can specifically target the recognition error-prone rate, improving the performance of the trained speech recognition model and the accuracy of speech recognition.
In order to make the speech recognition scheme and its technical effects provided by the present application clearer, the following examples are provided to illustrate specific embodiments thereof in detail.
In an alternative embodiment, each initial training sample includes sample speech and the sample text corresponding to the sample speech. When obtaining the recognition error-prone rate characterization information corresponding to each initial training sample during recognition by the preliminarily trained recognition model, the following method may be used for each initial training sample; its flowchart is shown in fig. 2 and includes:
Step S210: recognizing the sample speech of the sample with the preliminarily trained recognition model, and obtaining the top-ranked at least two recognition probabilities, in descending order, corresponding to each character in the sample text of the sample;
Step S220: determining the information entropy corresponding to each character from the at least two recognition probabilities corresponding to each character in the sample text;
Step S230: determining the recognition error-prone rate characterization information corresponding to the initial training sample based on the information entropy corresponding to each character in the sample text.
For each initial training sample, the corresponding recognition error-prone rate characterization information can be obtained by the above method. An initial training sample is a model training sample comprising sample speech and the corresponding sample text, the sample text being the standard, i.e., correct, text of the sample speech. The initial training sample is input into the preliminarily trained recognition model, which recognizes the sample speech and produces several recognition probabilities for each character in the sample text; the information entropy of each character is determined from these recognition probabilities. The entropy of a character indicates the probability that the character is recognized incorrectly: a larger entropy means a greater probability of misrecognition, while a smaller entropy means the model makes a correct prediction with higher confidence.
The recognition probability refers to the likelihood that the sample speech is recognized as a particular character. For example, if the sample speech is the speech signal for "导航到故宫" ("navigate to the Forbidden City"), the corresponding sample text is "导航到故宫". When this sample speech is recognized by the preliminarily trained recognition model, the prediction probabilities of each character of the sample text can be obtained, which can be understood as the probability that each character is predicted as each word in the lexicon (dictionary): for example, the probability that the first character is predicted as "导", the probability that it is predicted as some other character, and so on; the higher the probability, the more likely that character is the first character of the final predicted text. When speech recognition is performed by the recognition model, the recognition probabilities can be understood as the larger output values of the model's last hidden layer (the hidden layer cascaded with the output layer), i.e., probability values before normalization; the several top-ranked recognition probabilities are obtained by normalizing these larger output values.
Optionally, the top-ranked at least two probability values for each character are obtained according to the magnitude of the pre-normalization recognition probabilities, and the information entropy of each character is determined from its at least two recognition probabilities.
Recognition errors usually involve characters with identical or similar pronunciations, which the recognition model cannot distinguish because their acoustic realizations are close. The information entropy of each character is therefore obtained from its top-ranked at least two recognition probabilities: if identically or similarly pronounced characters exist, their recognition probabilities are close, and an entropy computed from such close probabilities is large.
An example: one initial training sample is (x, y), where x is the sample speech and y is the sample text; in this example the sample text y is ('导', '航', '到', '故', '宫'). The sample speech and corresponding sample text are input into the preliminarily trained recognition model, and the top-5 recognition probabilities, in descending order, for each character of the training sample are obtained, as shown in Table 1. For the character "到" ("to"), the pre-normalization recognition scores of the top-5 predictions are 到 (dào) 9.111, 道 (dào, "way") 9.011, 导 (dǎo, "lead") 8.444, 涛 (tāo, "billow") 7.345, and 票 (piào, "ticket") 5.232. Because homophones and similarly pronounced characters appear among the predictions, the pre-normalization scores are relatively close, and the information entropy of "到", computed from these close pre-normalization scores, is 1.2716. By contrast, the pre-normalization scores of the top-5 predictions for "宫" ("palace") are 11.123, 3.4, 1.2, 0.991, and 0.7; the differences are large, indicating that no identically or similarly pronounced characters appear in the output, and the model recognizes this character more accurately.
TABLE 1

Sample character    O1            O2           O3           O4           O5
导 ("guide")        10.032 (导)   2.22         2.100        0.14         0.111
航 ("navigate")     9.881 (航)    2.112        2.020        2.001        0.444
到 ("to")           9.111 (到)    9.011 (道)   8.444 (导)   7.345 (涛)   5.232 (票)
故 ("therefore")    12.222 (故)   1.01         0.99         0.88         0.542
宫 ("palace")       11.123 (宫)   3.4          1.2          0.991        0.7
Using the method provided by this embodiment, the pre-normalization recognition probabilities of each character of the sample text "导航到故宫" in the initial training sample are obtained in turn, and the information entropy of each character is computed from its recognition probabilities. As shown in Table 2, the entropies are 0.0079, 0.0113, 1.2716, 0.0006, and 0.0052 respectively; comparing the values shows that the information entropy of the character "到" is the largest.
TABLE 2

Sample character    Entropy(b, i)
导 ("guide")        0.0079
航 ("navigate")     0.0113
到 ("to")           1.2716
故 ("therefore")    0.0006
宫 ("palace")       0.0052
Then, according to the information entropy corresponding to each character in the initial training sample, the recognition error-prone rate characterization information corresponding to that sample is determined, so that target data can be screened out based on this characterization information.
In the scheme provided by this embodiment, the recognition error-prone rate characterization information of an initial training sample is determined from the information entropies of its characters: the larger the information entropy, the greater the probability that identically or similarly pronounced characters exist in the training sample, and hence the greater the probability of a recognition error. The probability that the recognition model misrecognizes an initial training sample can therefore be determined accurately from the information entropies of its characters. Initial training samples satisfying the screening condition are selected as target samples from the plurality of initial training samples based on the magnitudes of the character entropies, so that the recognition error rate of the speech recognition model can be reduced based on the selected target samples.
Optionally, for an initial training sample, determining its recognition error-prone rate characterization information based on the information entropy of each character in the sample text may be implemented as follows:
the maximum information entropy among the entropies of all characters in the sample text of the initial training sample is used as the recognition error-prone rate characterization information of that sample.
For each initial training sample, all characters contained in the sample are obtained, the information entropies of those characters are computed by the method provided above, and the largest of these entropies is used as the sample's recognition error-prone rate characterization information.
The magnitude of a character's information entropy represents the probability of that character being misrecognized: if any character has a large information entropy, the probability of a recognition error on that character is high, and if the recognition result of the initial training sample contains an error, the character with the largest entropy is the most likely location of that error.
With reference to Table 2, the character with the largest information entropy among all characters of the sample text is "到", so the entropy 1.2716 corresponding to "到" is used as the information entropy of the initial training sample; this accurately and intuitively represents the probability of the sample being misrecognized.
Using the maximum character entropy of an initial training sample as its recognition error-prone rate characterization information accurately characterizes how error-prone the sample is. Selecting the samples most easily misrecognized by the initial speech recognition model as target samples based on this entropy, and training specifically on these samples with high recognition error-prone rates, improves the recognition performance of the speech recognition model and the efficiency of obtaining the trained model.
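The entropy values in Tables 1 and 2 can be reproduced directly from the pre-normalization scores. The sketch below is an illustration, not the patent's code; it assumes the entropy uses the natural logarithm over a softmax of the top-5 scores, which matches the published figures.

```python
import math

def top_k_entropy(scores):
    """Softmax-normalize the top-K pre-normalization scores, then compute
    the information entropy of the resulting distribution."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs)

# Top-5 pre-normalization scores from Table 1:
print(round(top_k_entropy([9.111, 9.011, 8.444, 7.345, 5.232]), 4))  # 1.2716 for "到"
print(round(top_k_entropy([11.123, 3.4, 1.2, 0.991, 0.7]), 4))       # 0.0052 for "宫"
```

The close scores of the homophone candidates for "到" spread the probability mass and drive the entropy up, while the single dominant score for "宫" keeps it near zero.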
In an optional embodiment, recognizing the sample speech of an initial training sample with the recognition model to obtain the top-ranked at least two recognition probabilities, in descending order, for each character in the sample text may be implemented as follows:
A1: recognize the sample speech of the initial training sample with the recognition model to obtain at least two pre-normalization recognition probabilities corresponding to each character in the sample text, these being the top-ranked values when the pre-normalization recognition probabilities are sorted in descending order;
A2: normalize the pre-normalization recognition probabilities to obtain the at least two recognition probabilities corresponding to each character.
The prediction probabilities of each character of the sample text, produced by the recognition model, are the recognition probabilities before normalization. A preset number of the top-ranked prediction probabilities are selected, where the preset number can be set according to the actual situation and application requirements, e.g., 3, 5, 10, or 20, and the selected prediction probabilities are normalized to obtain the same number of normalized recognition probabilities.
Normalizing the top-ranked preset number of prediction probabilities yields recognition probabilities that can intuitively indicate which characters in a training sample are most prone to recognition errors. This standardizes the recognition probabilities, so the information entropy computed from them characterizes the recognition error-prone rate of each character more accurately.
This embodiment further provides an implementation for obtaining the normalized recognition probabilities and determining the information entropy of the characters and of the initial training sample, as follows:
Assume the set number is 5, i.e., the first 5 in descending order are selected, and assume there are B pairs of training data $\{(x_b, y_b^*)\}_{b=1}^{B}$, where B is the size of the current batch of initial training samples. For a pair of initial training samples $(x_b, y_b^*)$, $x_b$ is the sample speech of initial training sample b and $y_b^*$ is its sample text. The prediction probability $O(y_k \mid x_b, y_{b,<i})$ of a character, i.e., the probability value before normalization, is computed, where k denotes the k-th dimension of the output of the preliminarily trained recognition model and $y_{b,<i} = \{y_1, y_2, \ldots, y_{i-1}\}$ is the prediction result of the characters preceding the i-th character of the b-th initial training sample. The 5 largest prediction probabilities $O(y_k \mid x_b, y_{b,<i})$ are selected, $[O_1, O_2, O_3, O_4, O_5]$, and normalized by the following formula to obtain normalized probability values:

$$P_m = \frac{\exp(O_m)}{\sum_{j=1}^{5} \exp(O_j)}$$

where $O_m$ is the prediction probability of the m-th prediction result for the i-th character, and $P_m$ is the recognition probability after normalization. For the i-th character of the b-th initial training sample, the information entropy Entropy(b, i) is calculated by:

$$\mathrm{Entropy}(b, i) = -\sum_{m=1}^{5} P_m \log P_m$$
for the initial training sample, if the information Entropy of the ith character, Encopy (b, i), is smaller, it means that the recognition model makes a correct prediction on the character with higher confidence, because the recognition probability output by the recognition model is mainly concentrated on the correct recognition result, and the probabilities of the rest recognition results are very low, in which case the calculated information Entropy is smaller. If the output result of a certain character generates larger information entropy, the recognition model makes wrong prediction on the character with larger probability. Therefore, as long as any character of the current training data generates larger information entropy, the recognition error probability of the current training sample is considered to be larger, the current training sample can be selected as a target sample, and the recognition model after the initial training is subjected to optimization training to obtain the voice recognition model after the optimization training.
Preferably, the largest information entropy among all characters of the initial training sample is selected as the recognition error-prone rate characterization information Entropy(b) of initial training sample b, i.e., as its error-susceptibility measure:

$$\mathrm{Entropy}(b) = \max_{i} \mathrm{Entropy}(b, i)$$
and repeating the above processes for all the initial training samples B to be 1,2, … and B, sorting the information entropies of all the initial training samples, selecting the initial training samples with the information entropies larger than a preset information entropy threshold value as target samples, or sorting the initial training samples according to the size of the information entropies, and selecting the initial training samples of the front theta part with the larger information entropies as the target samples to perform optimization training of the recognition model.
The process of screening target samples and training the speech recognition model with them is described with reference to the flowchart in FIG. 3. The training data consists of the initial training samples $\{(x_b, y_b^*)\}_{b=1}^{B}$. For the b-th initial training sample, the information entropy Entropy(b, i) of the top-5 outputs of the preliminarily trained recognition model is computed for each character; for each sample, the maximum entropy is selected as its error-susceptibility measure, denoted Entropy(b). The entropies of all B training samples are sorted by value, the first θ·B samples with larger values are selected for minimum word error rate training, and this process is repeated over the training data set until the model converges, yielding the trained speech recognition model.
Compared with screening target samples by checking whether the predicted text with the maximum recognition probability is consistent with the sample text, using a sample as a target sample in one case and discarding it in the other, the entropy-based screening above is not limited to samples whose current best prediction is already wrong, and therefore obtains a more adequate set of target samples.
According to the scheme provided by the foregoing embodiments, target samples are selected from the initial training samples. Since each initial training sample includes sample speech and the corresponding sample text, each target sample likewise includes sample speech and its corresponding sample text. The following embodiments explain that training the preliminarily trained recognition model based on the target samples may be performed as follows:
performing iterative training on the preliminarily trained recognition model based on the target samples until the total loss function corresponding to the recognition model reaches the convergence condition;
where the input of the recognition model is the sample speech of a target sample, the output is the predicted text corresponding to that sample speech, and the value of the total loss function characterizes the difference between the predicted text and the sample text of each target sample.
The preliminarily trained recognition model is iteratively trained with the target samples until its total loss function reaches the convergence condition. The convergence condition of the loss function, i.e., the condition for ending model training, can be configured according to actual requirements; for example, if the difference is within a preset threshold, i.e., within an acceptable range, the total loss function has reached the convergence condition. Because the target samples are selected based on the recognition error-prone rate, the initial training samples most easily misrecognized serve as target samples, which helps reduce the probability that the speech to be recognized is misrecognized; the recognition error rate of the speech recognition model is thereby optimized, i.e., reduced, and the recognition accuracy improved.
When the total loss function of the model reaches the convergence condition, the error-prone rate of the model in recognizing sample speech meets the preset condition; compared with the initial speech recognition model, the recognition model at convergence greatly reduces the speech recognition error rate.
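A high-level sketch of this iterative loop (an assumption-laden outline, not the patent's implementation; `compute_entropy`, `mwer_train_step`, and `converged` are hypothetical helpers, and `select_target_samples` is the sketch shown earlier):

```python
def train_speech_recognition_model(model, initial_samples, theta,
                                   compute_entropy, mwer_train_step, converged):
    """Score every initial training sample with its maximum per-character
    entropy, keep the top-theta fraction as target samples, and run minimum
    word error rate training on them until the total loss converges."""
    while not converged(model):
        entropies = [compute_entropy(model, sample) for sample in initial_samples]
        targets = select_target_samples(initial_samples, entropies, theta=theta)
        for sample in targets:
            mwer_train_step(model, sample)
    return model
```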
In an alternative embodiment, the total loss function includes a first loss function and a second loss function, where the value of the first loss function characterizes the character difference between the predicted text and the sample text of the sample speech of each target sample, and the value of the second loss function characterizes the text difference between the predicted text and the sample text.
Optionally, the total loss function of the speech recognition model includes these two parts. The value of the first loss function characterizes the character difference between the predicted text of the target sample and its sample text, i.e., it measures, from the character perspective, the difference between each character of the predicted text and the character at the corresponding position of the sample text; the second loss function measures the difference between the predicted text and the sample text at the text level. Measuring the recognition error rate of the model at both the character and text levels helps reduce the number of character recognition errors and improve the recognition accuracy of the entire text.
On this basis, in an alternative embodiment, the iterative training of the preliminarily trained recognition model based on the target samples may be performed as follows; its flowchart is shown in fig. 4 and includes:
Step S410: for each training iteration and for each target sample, performing beam search decoding on the sample speech of the target sample using the recognition model to obtain a set number of candidate recognition results corresponding to the sample speech and the probability corresponding to each candidate recognition result;
step S420, obtaining a prediction text corresponding to each sample voice;
step S430, for each candidate recognition result, determining the number of error characters in the candidate recognition result according to the candidate recognition result and the sample text;
step S440, calculating a value of a first loss function according to the probability corresponding to each candidate recognition result of each target sample and the number of error characters;
step S450, calculating a value of a second loss function according to the predicted text and the sample text of each sample voice;
step S460, calculating a value of the total loss function according to the value of the first loss function and the value of the second loss function.
For each training, the value of the total loss function is obtained as follows:
For each target sample, beam search decoding is performed with the trained recognition model to generate N candidate recognition results; the N candidates output by beam search decoding are the N best recognition results, where N is a positive integer greater than 1. The probability corresponding to each candidate recognition result is the text-level recognition probability of the target sample, which can be obtained from the recognition probabilities of the individual characters.
The predicted text corresponding to the sample speech is the recognition result with the maximum recognition probability output by the current recognition model, and the value of the second loss function is calculated based on this predicted text and the sample text corresponding to the sample speech.
The total loss function of the speech recognition model comprises the first loss function and the second loss function. The value of the first loss function is obtained from the probability corresponding to each candidate recognition result and the number of misrecognized characters in that candidate, the number of error characters being determined by comparing the candidate recognition result with the sample text and counting the characters that differ at corresponding positions. Calculating the value of the first loss function from the probabilities and error-character counts of the candidate recognition results characterizes the character difference between the predicted text and the sample text; computing it over the N best recognition results yields an optimal value of the first loss function.
The value of the second loss function can be calculated from the predicted text and the sample text of the target sample to measure the text difference between them, using at least one of cross entropy, Euclidean distance, and the like.
After the values of the first and second loss functions are obtained, the value of the total loss function is calculated from them, for example by accumulating or interpolating the two values. The combination expresses the actual requirement: the convergence direction of the speech recognition model can be adjusted by changing how the first and second loss functions are combined, and training on a first loss function obtained from the N best recognition results helps improve the recognition precision of the speech recognition model.
In an alternative embodiment, calculating the value of the first loss function according to the probability corresponding to each candidate recognition result of each target sample and the number of error characters may be implemented in the following manner, and a flowchart thereof is shown in fig. 5, and includes:
step S510, for each target sample, determining the average value of the error characters corresponding to the target sample according to the number of the error characters corresponding to each candidate recognition result of the target sample;
step S520, determining the editing distance between each candidate recognition result and the sample text for each candidate recognition result of each target sample;
step S530, for each target sample, determining a training loss value corresponding to the target sample based on the probability and the edit distance corresponding to each candidate recognition result of the target sample and the average value corresponding to the target sample;
step S540, a value of the first loss function is obtained according to the training loss value of each target sample.
Each target sample corresponds to N candidate recognition results. First, the number of misrecognized characters in each candidate recognition result, i.e., the number of error characters, is obtained; the average number of error characters for the target sample is then determined from the error-character counts of the candidates and the number of candidate recognition results.
The edit distance between each candidate recognition result of the target sample and the sample text is also calculated, as shown in the sketch below. The training loss value of the target sample is then determined based on the probability corresponding to each candidate recognition result, the edit distances, and the average number of error characters corresponding to the target sample.
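The edit distance between a candidate recognition result and the sample text can be computed with the standard Levenshtein dynamic program; a sketch (illustrative, operating on character sequences):

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between a candidate recognition result and the
    sample text: minimum number of insertions, deletions and substitutions
    needed to turn hyp into ref."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```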
Specifically, given a pair of target samples $(x, y^*)$, where $x = (x_1, x_2, \ldots, x_T)$ is the sample speech and $y^* = (y_1, y_2, \ldots, y_I)$ is the sample text containing I characters, T and I being the lengths of the sample speech and the sample text respectively, let $Y = \{y^1, y^2, \ldots, y^N\}$ be the N best candidate paths obtained by beam search decoding for the target sample, each path corresponding to one candidate recognition result. The first loss function $L_{\mathrm{MWER}}(x, y^*)$ corresponding to the target sample $(x, y^*)$ is characterized by the following formula:

$$L_{\mathrm{MWER}}(x, y^*) = \sum_{y^n \in Y} \hat{P}(y^n \mid x)\,\bigl[W(y^n, y^*) - \hat{W}\bigr]$$

The right side of the equation represents the training loss value corresponding to the target sample, where $W(y^n, y^*)$ (n ≤ N) is the edit distance between the n-th candidate recognition result and the sample text,

$$\hat{W} = \frac{1}{N} \sum_{n=1}^{N} W(y^n, y^*)$$

is the average number of error characters corresponding to the target sample, $P(y^n \mid x)$ is the pre-normalization recognition probability corresponding to the candidate recognition result, and

$$\hat{P}(y^n \mid x) = \frac{P(y^n \mid x)}{\sum_{y' \in Y} P(y' \mid x)}$$

is the normalized recognition probability of any candidate recognition result. According to the scheme provided by this embodiment, the value of the first loss function is determined based on the probability of each candidate recognition result, the edit distances, and the average number of error characters corresponding to the target sample, so that the first loss function is calculated accurately.
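Putting the pieces together, a minimal sketch of the first loss function for one target sample (illustrative only; it assumes the N-best scores are in the log domain so that normalization is a softmax over the list, and reuses the `edit_distance` sketch above):

```python
import math

def mwer_loss(nbest_log_scores, nbest_texts, sample_text):
    """L_MWER for one target sample: normalize the N-best scores over the
    list, then take the expected deviation of each candidate's edit distance
    from the N-best average number of error characters."""
    exps = [math.exp(s) for s in nbest_log_scores]
    z = sum(exps)
    p_hat = [e / z for e in exps]                        # normalized P(y^n | x)
    w = [edit_distance(t, sample_text) for t in nbest_texts]
    w_avg = sum(w) / len(w)                              # average error characters
    return sum(p * (wn - w_avg) for p, wn in zip(p_hat, w))
```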
In an alternative embodiment, the first loss function is a minimum word error rate loss function, and the second loss function is a cross entropy loss function.
Optionally, the minimum word error rate training process is as follows: for each target sample, beam search decoding is performed with the preliminarily trained recognition model to generate the N best recognition results, called N-Best; for each target sample's N-Best list, i.e., the list of the N best candidate recognition results, the recognition model is trained on the list until the model reaches the convergence condition. Because the target samples have been screened, the number of samples used for minimum word error rate training is greatly reduced compared with the number of initial training samples; especially when the initial training samples amount to tens of thousands of hours of data, a large amount of ineffective training data is filtered out, and the efficiency of minimum word error rate training is greatly improved.
The second loss function is a cross entropy loss function; cross entropy can measure the degree of difference between two probability distributions of the same random variable, and here represents the difference between the predicted text distribution and the sample text distribution.
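For the second loss, a minimal per-character cross entropy sketch (illustrative; `char_distributions[i]` is assumed to map each candidate character to its predicted probability, which is not an API named in the patent):

```python
import math

def cross_entropy_loss(char_distributions, sample_text):
    """Average negative log-probability the model assigns to each character
    of the sample text."""
    return -sum(math.log(dist[char])
                for dist, char in zip(char_distributions, sample_text)) / len(sample_text)
```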
According to the scheme provided by this embodiment, the total loss function is determined from the minimum word error rate loss function and the cross entropy loss function; model training with this total loss function yields a speech recognition model that further reduces the recognition error rate on the speech to be recognized.
Optionally, in the speech recognition method provided by the present application, calculating the value of the total loss function based on the values of the first and second loss functions may be implemented as follows:
b1, obtaining a first weight of the first loss function and a second weight of the second loss function;
b2, calculating the value of the total loss function according to the value of the first loss function, the value of the second loss function, the first weight and the second weight.
A first weight of the first loss function and a second weight of the second loss function are acquired; the weights can be adjusted according to the actual situation or obtained through machine learning.
Optionally, during training of the speech recognition model, the minimum word error rate loss function and the cross entropy loss function $L_{CE}$ are interpolated, and the total loss function L is:

$$L = L_{\mathrm{MWER}}(x, y^*) + \lambda L_{CE}$$

where $L_{\mathrm{MWER}}(x, y^*)$ is the minimum word error rate loss function, $L_{CE}$ is the cross entropy loss function, $(x, y^*)$ is a set of target samples with x the sample speech and $y^*$ the sample text, and λ is the weight of the cross entropy loss function. In this formula, the weight of the minimum word error rate loss function is 1 and the weight of the cross entropy loss function is λ, whose value can be obtained through machine learning.
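Combining the two sketches above into one hedged objective for a single target sample (`model.nbest` and `model.char_distributions` are hypothetical helpers, not from the patent):

```python
def total_loss(model, sample, lam):
    """L = L_MWER + lambda * L_CE for one target sample, with weight 1 on the
    minimum word error rate term and weight lambda on the cross entropy term."""
    speech, text = sample
    log_scores, texts = model.nbest(speech)  # beam-search N-best scores and texts
    l_mwer = mwer_loss(log_scores, texts, text)
    l_ce = cross_entropy_loss(model.char_distributions(speech, text), text)
    return l_mwer + lam * l_ce
```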
Setting separate weights for the first and second loss functions allows different weights to be assigned to the loss functions according to the characteristics of the target samples, so that the difference between the predicted text and the sample text is obtained accurately.
In the scheme provided by this embodiment, the total loss function of the speech recognition model is a fusion of the two loss functions, whose weights can be adjusted according to the actual situation. If the value of the first loss function corresponding to some predicted text is large, the probability of that text being misrecognized is large. The first loss function is obtained from the probability of each candidate recognition result and the number of misrecognized characters; compared with calculating the loss from the probability of a single candidate recognition result, this measures the difference between the predicted text and the sample text more comprehensively and accurately, thereby improving the recognition accuracy of the speech recognition model and the number of correctly recognized characters.
Optionally, the speech recognition model is an end-to-end speech recognition model based on a transformer structure.
The transformer-based end-to-end speech recognition model uses an attention mechanism to understand the current speech to be recognized through its context, and has a strong ability to extract semantic features; for words in a sentence with the same or similar pronunciation, the recognition result can therefore be judged from the surrounding words and the preceding and following sentences, so the accuracy of the recognition result is higher. Moreover, the end-to-end speech recognition model with a transformer structure avoids the problem in traditional speech recognition methods that the individual components are independent and cannot be jointly optimized; the framework of a single neural network is simpler, and the accuracy increases as the model becomes deeper and the training data grows. In addition, the transformer-based end-to-end speech recognition model can better exploit and adapt to the parallel computing capability of new hardware and runs faster, which means that speech of the same duration can be transcribed in a shorter time, better meeting the requirements of real-time transcription.
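As a rough, non-authoritative sketch of such an architecture, a toy encoder-decoder transformer for speech recognition in PyTorch might look as follows; positional encodings, masking, and frame subsampling are omitted, and all dimensions are hypothetical:

    import torch
    import torch.nn as nn

    class ToyTransformerASR(nn.Module):
        # Acoustic feature frames in, per-character vocabulary logits out.
        def __init__(self, feat_dim=80, vocab_size=4000, d_model=256):
            super().__init__()
            self.feat_proj = nn.Linear(feat_dim, d_model)    # acoustic frontend
            self.char_emb = nn.Embedding(vocab_size, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=4,
                num_encoder_layers=6, num_decoder_layers=3,
                batch_first=True)
            self.out = nn.Linear(d_model, vocab_size)        # character logits

        def forward(self, feats, prev_chars):
            # feats: (batch, frames, feat_dim); prev_chars: (batch, chars)
            src = self.feat_proj(feats)
            tgt = self.char_emb(prev_chars)
            dec = self.transformer(src, tgt)
            return self.out(dec)

In the training procedure described above, such a model would first be trained with the cross entropy loss (the preliminary training) and then fine-tuned with the minimum word error rate objective on the screened target samples.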
Based on the solutions provided in the above embodiments of the present application, the inventors performed the following experiments, in which training and testing were performed using 2000 hours and 18000 hours of vehicle-mounted voice data; the results of the experiments are shown in Table 3.
TABLE 3
[Table 3 appears as an image in the original publication; it reports recognition error rates and training durations for minimum word error rate training with 0%, 20%, 40%, and 100% (prior art) of the data on the vehicle-mounted 2000-hour and 18000-hour speech datasets.]
In Table 3, the MWER data percentage denotes the proportion of target samples used for minimum word error rate training. 0% denotes the recognition model without minimum word error rate training, i.e., the preliminarily trained model. 100% (prior art) means that minimum word error rate training is performed using the prior art; for the vehicle-mounted 2000-hour speech data the prior-art training duration is T1, and for the vehicle-mounted 18000-hour speech data it is T2. 20% and 40% mean that minimum word error rate training is performed using the target samples selected by the scheme provided in this application, with training durations of 20% and 40% of the prior-art duration, respectively. As can be seen from the experimental data in Table 3: when the recognition error rates are equal, the training duration of the speech recognition model provided in this application is 40% of that of the prior art, so the training time is greatly reduced; compared with the preliminarily trained recognition model, the recognition error rate of the speech recognition model trained on the target data is reduced; and when training the speech recognition model, the larger the amount of training data, the lower the recognition error rate.
Based on the same principle as the method provided by the embodiment of the present application, the embodiment of the present application further provides a speech recognition apparatus 600, as shown in fig. 6, the apparatus may include: a to-be-recognized speech obtaining module 610 and a recognition result obtaining module 620, where a speech recognition model in the recognition result obtaining module is obtained by training through a training apparatus 700, and the training apparatus 700 includes: a preliminary training module 710, a target sample screening module 720, and a model retraining module 730, wherein:
a to-be-recognized voice acquiring module 610, configured to acquire a to-be-recognized voice;
a recognition result obtaining module 620, configured to recognize the speech to be recognized through the speech recognition model, and obtain a speech recognition result;
wherein, the speech recognition model is obtained by training through the training apparatus 700, and the training apparatus 700 includes:
a preliminary training module 710, configured to train the initial speech recognition model based on each initial training sample, to obtain a recognition model after preliminary training;
the target sample screening module 720 is configured to obtain identification error-prone rate characterization information corresponding to each initial training sample when the initial training sample is identified by the initially trained identification model, and select a target sample from each initial training sample according to the characterization information;
and the model retraining module 730 is configured to train the primarily trained recognition model based on each target sample to obtain a speech recognition model.
The speech recognition apparatus provided by this application screens the initial training samples by using the recognition error-prone rate characterization information, so that initial training samples with a higher recognition error-prone rate can be selected as target samples, and the speech recognition model is trained based on these target samples. Compared with the prior-art approach of training with all of the initial training samples, this greatly reduces the amount of training data for the speech recognition model and improves its recognition accuracy.
Optionally, the initial training sample includes sample speech and a sample text corresponding to the sample speech, and the target sample screening module 720 further includes:
the recognition sample voice unit is used for recognizing the sample voice of the sample by using a recognition model aiming at the initial training sample to obtain at least two recognition probabilities which are ranked in the descending order of the recognition probabilities corresponding to each character in the sample text of the sample;
the character information entropy determining unit is used for determining the information entropy corresponding to each character according to the at least two recognition probabilities corresponding to each character in the sample text;
and the characterization information determining unit is used for determining the identification error-prone rate characterization information corresponding to the initial training sample based on the information entropy corresponding to each character in the sample text.
Optionally, for one initial training sample, the characterization information determining unit is specifically configured to:
take the maximum information entropy among the information entropies corresponding to the characters in the sample text of the initial training sample as the recognition error-prone rate characterization information corresponding to that initial training sample.
Optionally, the sample speech recognition unit is specifically configured to:
recognize the sample speech of an initial training sample by using the recognition model to obtain at least two pre-normalization recognition probabilities corresponding to each character in the sample text of the sample, the at least two pre-normalization recognition probabilities being those ranked at the top when the pre-normalization recognition probabilities are sorted in descending order;
and normalize these pre-normalization recognition probabilities to obtain the at least two recognition probabilities corresponding to each character.
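A minimal sketch of this normalization, assuming it is a softmax over the top-k pre-normalization scores (the softmax choice is an assumption made here for illustration), might be:

    import math

    def topk_normalized_probs(pre_norm_scores, k=2):
        # Take the k largest pre-normalization scores for one character and
        # normalize them with a softmax so that they sum to 1.
        top = sorted(pre_norm_scores, reverse=True)[:k]
        exps = [math.exp(s - top[0]) for s in top]   # shift for numerical stability
        z = sum(exps)
        return [e / z for e in exps]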
Optionally, the initial training sample includes sample speech and a sample text corresponding to the sample speech, and the model retraining module 730 is specifically configured to:
performing iterative training on the preliminarily trained recognition model based on each target sample until a total loss function corresponding to the recognition model reaches a convergence condition;
the input of the recognition model is sample voice of the target sample, the output of the recognition model is predicted text corresponding to the sample voice, and the value of the total loss function represents the difference between the predicted text of the sample voice of each target sample and the sample text.
Optionally, the total loss function in the model retraining module 730 includes a first loss function whose value characterizes a character difference between the predicted text of the sample speech and the sample text of each target sample, and a second loss function whose value characterizes a text difference between the predicted text of the sample speech and the sample text of the target sample.
Optionally, the training device 700 further comprises:
the beam search decoding module is used for performing beam search decoding on the sample voice of the target sample by using the recognition model for each target sample aiming at each training to obtain a set number of candidate recognition results corresponding to the sample voice and the probability corresponding to each candidate recognition result;
the predicted text obtaining module is used for obtaining a predicted text corresponding to the sample voice;
the error character quantity determining module is used for determining the quantity of error characters in the candidate recognition result according to the candidate recognition result and the sample text for each candidate recognition result;
the first loss function module is used for calculating the value of a first loss function according to the probability corresponding to each candidate identification result of each target sample and the number of error characters;
the second loss function module is used for calculating the value of a second loss function according to the predicted text and the sample text of each sample voice;
and the total loss function module is used for calculating the value of the total loss function according to the value of the first loss function and the value of the second loss function.
Optionally, the first loss function module is specifically configured to:
for each target sample, determining the average value of the error characters corresponding to the target sample according to the number of the error characters corresponding to each candidate recognition result of the target sample;
for each candidate recognition result of each target sample, determining an editing distance between the candidate recognition result and the sample text;
for each target sample, determining a training loss value corresponding to the target sample based on the probability and the edit distance corresponding to each candidate recognition result of the target sample and the average value corresponding to the target sample;
and obtaining the value of the first loss function according to the training loss value of each target sample.
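For illustration, a minimal sketch of these steps for one target sample could look as follows; the renormalization of the probabilities over the N-best list and the use of the average as a baseline are common choices for a minimum word error rate loss and are assumed here where the text does not pin down the exact weighting.

    def edit_distance(a, b):
        # Levenshtein distance between a candidate recognition result and
        # the sample text, used as the number of wrong characters.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                         prev + (ca != cb))
        return dp[-1]

    def mwer_loss(nbest, sample_text):
        # nbest: list of (candidate_text, probability) pairs for one target sample.
        total_p = sum(p for _, p in nbest)
        dists = [edit_distance(cand, sample_text) for cand, _ in nbest]
        avg = sum(dists) / len(dists)    # average number of wrong characters
        # Each candidate's excess errors over the average, weighted by its
        # probability renormalized over the N-best list.
        return sum((p / total_p) * (d - avg)
                   for (_, p), d in zip(nbest, dists))

The value of the first loss function is then obtained by accumulating mwer_loss over all target samples.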
Optionally, the first loss function in the training apparatus 700 is a minimum word error rate loss function, and the second loss function is a cross entropy loss function.
Optionally, the training apparatus 700 further comprises:
the obtaining weight module is used for obtaining a first weight of the first loss function and a second weight of the second loss function;
the value of the total loss function is calculated by:
the value of the total loss function is calculated from the first weight, the second weight, the value of the first loss function and the value of the second loss function.
Optionally, the speech recognition model in the speech recognition apparatus is an end-to-end speech recognition model based on a transformer structure.
The speech recognition device of the embodiment of the present application can execute the speech recognition method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by each module and unit in the speech recognition device in each embodiment of the present application correspond to the steps in the speech recognition method in each embodiment of the present application, and for the detailed functional description of each module of the speech recognition device, reference may be specifically made to the description in the corresponding speech recognition method shown in the foregoing, and details are not repeated here.
Based on the same principle as the method shown in the embodiments of the present application, an embodiment of the present application also provides an electronic device, which may include but is not limited to: a processor and a memory, the memory being configured to store a computer program, and the processor being configured to execute the speech recognition method shown in any of the alternative embodiments of the present application by calling the computer program. Compared with the prior art, the initial training samples are screened using the recognition error-prone rate characterization information, so that initial training samples with a high recognition error-prone rate can be selected as target samples, and the speech recognition model is trained based on these target samples; compared with training on all of the initial training samples as in the prior art, this greatly reduces the amount of training data for the speech recognition model and improves its recognition accuracy. Moreover, since the target samples are selected from the initial training samples according to the recognition error-prone rate, targeted training against recognition errors is achieved, improving the efficiency of obtaining the trained speech recognition model.
In an alternative embodiment, an electronic device is provided, as shown in fig. 7, the electronic device 4000 shown in fig. 7 may be a server, including: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 7, but this is not intended to represent only one bus or type of bus.
The memory 4003 may be a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order; unless explicitly stated herein, their execution is not strictly ordered and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times, and whose execution order is not necessarily sequential: they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the speech recognition method provided in the various alternative implementations described above.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The name of the module does not constitute a limitation to the module itself in some cases, and for example, the recognition result obtaining module may also be described as a "obtain speech recognition result module".
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (14)

1. A speech recognition method, comprising:
acquiring a voice to be recognized;
recognizing the voice to be recognized through a voice recognition model to obtain a voice recognition result; wherein the speech recognition model is trained by:
training the initial voice recognition model based on each initial training sample to obtain a preliminarily trained recognition model;
obtaining identification error-prone rate representation information corresponding to each initial training sample when the initial training sample is identified through the identification model after the initial training;
selecting a target sample from each initial training sample according to the characterization information;
and training the preliminarily trained recognition model based on each target sample to obtain the voice recognition model.
2. The method of claim 1, wherein the initial training sample comprises a sample speech and a sample text corresponding to the sample speech;
the obtaining of the identification error-prone rate characterization information corresponding to each initial training sample when the initial training sample is identified by the initially trained identification model includes:
for one of the initial training samples, recognizing the sample voice of the sample by using the recognition model to obtain, for each character in the sample text of the sample, at least two recognition probabilities that are ranked at the top when the recognition probabilities are sorted in descending order;
determining the information entropy corresponding to each character according to the at least two recognition probabilities corresponding to each character in the sample text;
and determining the identification error-prone rate characterization information corresponding to the initial training sample based on the information entropy corresponding to each character in the sample text.
3. The method according to claim 2, wherein, for one of the initial training samples, the determining, based on the information entropy corresponding to each character in the sample text, the identification error-prone rate characterization information corresponding to the initial training sample comprises:
and taking the maximum information entropy in the information entropies corresponding to the characters in the sample text of the initial training sample as the identification error-prone rate characterization information corresponding to the initial training sample.
4. The method according to claim 2, wherein the recognizing the sample speech of the sample by using the recognition model to obtain at least two recognition probabilities that are ranked first in descending order of recognition probability corresponding to each character in the sample text of the sample comprises:
identifying the sample voice of the initial training sample by using the recognition model to obtain at least two pre-normalization recognition probabilities corresponding to each character in the sample text of the sample, the at least two pre-normalization recognition probabilities being those ranked at the top when the pre-normalization recognition probabilities are sorted in descending order;
and normalizing the recognition probabilities before the normalization processing to obtain at least two recognition probabilities corresponding to each character.
5. The method of claim 1, wherein the initial training sample comprises a sample speech and a sample text corresponding to the sample speech;
training the preliminarily trained recognition model based on each target sample, including:
performing iterative training on the preliminarily trained recognition model based on each target sample until a total loss function corresponding to the recognition model reaches a convergence condition;
the input of the recognition model is sample voice of a target sample, the output of the recognition model is predicted text corresponding to the sample voice, and the value of the total loss function represents the difference between the predicted text of the sample voice of each target sample and the sample text.
6. The method of claim 5, wherein the overall loss function comprises a first loss function and a second loss function, wherein a value of the first loss function characterizes a character difference between the predicted text and the sample text of the sample speech of each of the target samples, and wherein a value of the second loss function characterizes a text difference between the predicted text and the sample text of the sample speech of the target samples.
7. The method of claim 6, wherein iteratively training the preliminarily trained recognition model based on each of the target samples comprises:
for each training, for each target sample, performing beam search decoding on the sample voice of the target sample by using the recognition model to obtain a set number of candidate recognition results corresponding to the sample voice and a probability corresponding to each candidate recognition result;
obtaining a predicted text corresponding to the sample voice;
for each candidate recognition result, determining the number of wrong characters in the candidate recognition result according to the candidate recognition result and the sample text;
calculating the value of a first loss function according to the probability corresponding to each candidate identification result of each target sample and the number of error characters;
calculating a value of a second loss function according to the predicted text and the sample text of each sample voice;
the value of the total loss function is calculated from the values of the first and second loss functions.
8. The method of claim 7, wherein calculating the value of the first loss function according to the probability corresponding to each candidate recognition result of each target sample and the number of wrong characters comprises:
for each target sample, determining the average value of the error characters corresponding to the target sample according to the number of the error characters corresponding to each candidate recognition result of the target sample;
for each candidate recognition result of each target sample, determining an editing distance between the candidate recognition result and the sample text;
for each target sample, determining a training loss value corresponding to the target sample based on the probability and the edit distance corresponding to each candidate recognition result of the target sample and the average value corresponding to the target sample;
and obtaining the value of the first loss function according to the training loss value of each target sample.
9. The method according to any one of claims 6 to 8, wherein the first loss function is a minimize word error rate loss function and the second loss function is a cross entropy loss function.
10. The method of any of claims 6 to 8, further comprising:
acquiring a first weight of a first loss function and a second weight of a second loss function;
the value of the total loss function is calculated by:
the value of the total loss function is calculated from the first weight, the second weight, the value of the first loss function and the value of the second loss function.
11. The method of claim 1, wherein the speech recognition model is an end-to-end speech recognition model based on a transformer structure.
12. A speech recognition apparatus, comprising:
the voice to be recognized acquisition module is used for acquiring voice to be recognized;
the recognition result obtaining module is used for recognizing the voice to be recognized through the voice recognition model to obtain a voice recognition result;
wherein the speech recognition model is obtained by training through a training apparatus, the training apparatus comprising:
the initial training module is used for training the initial voice recognition model based on each initial training sample to obtain a recognition model after initial training;
the target sample screening module is used for obtaining identification error-prone rate characterization information corresponding to each initial training sample when the initial training samples are identified through the identification model after the initial training, and selecting a target sample from each initial training sample according to the characterization information;
and the model retraining module is used for training the preliminarily trained recognition model based on each target sample to obtain the voice recognition model.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech recognition method according to any of claims 1-11 when executing the program.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the speech recognition method of any one of claims 1 to 11.
CN202010739778.XA 2020-07-28 2020-07-28 Voice recognition method and device, electronic equipment and storage medium Pending CN114067786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010739778.XA CN114067786A (en) 2020-07-28 2020-07-28 Voice recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010739778.XA CN114067786A (en) 2020-07-28 2020-07-28 Voice recognition method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114067786A true CN114067786A (en) 2022-02-18

Family

ID=80226575

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010739778.XA Pending CN114067786A (en) 2020-07-28 2020-07-28 Voice recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114067786A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188381A (en) * 2022-05-17 2022-10-14 贝壳找房(北京)科技有限公司 Voice recognition result optimization method and device based on click sorting
CN115188381B (en) * 2022-05-17 2023-10-24 贝壳找房(北京)科技有限公司 Voice recognition result optimization method and device based on click ordering
CN115938353A (en) * 2022-11-24 2023-04-07 北京数美时代科技有限公司 Voice sample distributed sampling method, system, storage medium and electronic equipment
CN115687334A (en) * 2023-01-05 2023-02-03 粤港澳大湾区数字经济研究院(福田) Data quality inspection method, device, equipment and storage medium
CN115687334B (en) * 2023-01-05 2023-05-16 粤港澳大湾区数字经济研究院(福田) Data quality inspection method, device, equipment and storage medium
CN117612537A (en) * 2023-11-27 2024-02-27 北京林业大学 Bird song intelligent monitoring system based on cloud limit cooperative control

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US11157698B2 (en) Method of training a descriptive text generating model, and method and apparatus for generating descriptive text
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
WO2019153737A1 (en) Comment assessing method, device, equipment and storage medium
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN103400577B (en) The acoustic model method for building up of multilingual speech recognition and device
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
JP5901001B1 (en) Method and device for acoustic language model training
US11688391B2 (en) Mandarin and dialect mixed modeling and speech recognition
CN110110337B (en) Translation model training method, medium, device and computing equipment
KR20190039817A (en) Neural Machine Translation System
CN110163181B (en) Sign language identification method and device
CN112287670A (en) Text error correction method, system, computer device and readable storage medium
CN114580382A (en) Text error correction method and device
CN111079418B (en) Named entity recognition method, device, electronic equipment and storage medium
CN113053367B (en) Speech recognition method, speech recognition model training method and device
CN110134950B (en) Automatic text proofreading method combining words
CN111061877A (en) Text theme extraction method and device
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN112668333A (en) Named entity recognition method and device, and computer-readable storage medium
CN112101010A (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN114299920A (en) Method and device for training language model for speech recognition and speech recognition method and device
CN111462734B (en) Semantic slot filling model training method and system
CN111554295B (en) Text error correction method, related device and readable storage medium
CN110210035B (en) Sequence labeling method and device and training method of sequence labeling model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination