CN113327581B - Recognition model optimization method and system for improving speech recognition accuracy - Google Patents
- Publication number
- CN113327581B (application CN202110487124.7A)
- Authority
- CN
- China
- Prior art keywords
- model
- voice recognition
- output
- sequence
- ctc
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/005—Language recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/26—Speech to text systems
Abstract
The invention discloses a recognition model optimization method for improving the accuracy of speech recognition, comprising the following steps: inputting speech training data into the CTC model of a DeepSpeech speech recognition system to obtain the speech recognition sequence output by the CTC model; inputting the speech recognition sequence into a preset language model to obtain the output probability value of each individual character in the sequence; and optimizing and adjusting the CTC model according to these output probability values to obtain an optimized speech recognition model. The invention also discloses a recognition model optimization system for improving the accuracy of speech recognition. The invention relates to the technical field of speech recognition, and effectively improves recognition accuracy by combining the recognition model with a language model.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a recognition model optimization method and system for improving voice recognition accuracy.
Background
Speech recognition is generally divided into two stages. 1) Speech recognition stage: an acoustic model converts the natural sound signal into a machine-processable numerical representation of syllables. 2) Speech understanding stage: the syllables produced by the previous stage are converted into Chinese characters, which requires knowledge supplied by a language model. The most important part of speech recognition is therefore establishing a language model that improves recognition accuracy.
The language models in common use today fall into two categories. The first is the statistical language model built on large-scale corpora: it is suited to processing large-scale real corpora, offers consistent data preparation and strong robustness, but because its implementation is limited by system space and time it can only capture adjacent (local) constraints in the language and cannot handle long-distance recursion. The second is the rule-based language model, which classifies the Chinese vocabulary system by grammar and semantics and attempts large-scale disambiguation of homophones by determining the lexical, syntactic and semantic relationships of natural language: it is suited to closed corpora and can capture long-distance constraints and recursion, but it is less robust, ill-suited to open corpora, and its knowledge representation is inconsistent.
In existing speech-to-text methods, characters in the original text are often recognized as characters with the same pronunciation but a different written form, so recognition accuracy is not high.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a recognition model optimization method and system for improving the accuracy of speech recognition, which, in combination with a language model, effectively improves the accuracy of speech recognition.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a recognition model optimization method for improving speech recognition accuracy, including the following steps:
inputting voice training data into a CTC model of a DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
and optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
To solve the prior-art technical problem that recognized text contains characters with the same pronunciation but different written forms (or the same character with different pronunciations), making recognition accuracy low, and to improve the accuracy of speech recognition, the CTC model of the DeepSpeech speech recognition system is combined with a language model, the CTC model is optimized, a reasonable speech recognition model is established, and recognition accuracy is improved. Speech training data is input into the CTC model of the DeepSpeech speech recognition system (an end-to-end automatic speech recognition system); the CTC model (a neural-network-based temporal classification model) is trained and optimized, and its output loss is calculated as L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x); CTC finally computes the most likely output speech recognition sequence z, which is text data. The speech recognition sequence z is then judged in combination with the language model: z is input into the language model, which determines the probability that the current character is a wrongly written character, and the output probability value w of each individual character in z is calculated. Each value w is then fed back into the CTC model, giving the new loss formula L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x). The corresponding parameters of the CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data with the optimized model improves the accuracy of speech recognition.
The method combines the CTC model with an existing language model, optimizes the parameters of the speech recognition model and reduces the loss value, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
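The loss adjustment described above can be sketched numerically. This is a minimal sketch, assuming the CTC model supplies one log-probability ln p(z|x) per training pair and the language model supplies one confidence value w per recognized sequence; all names and values are illustrative, not from the patent:

```python
import math

def ctc_loss(log_probs):
    # Baseline objective: L(S) = -ln prod p(z|x) = -sum ln p(z|x) over S.
    return -sum(log_probs)

def weighted_ctc_loss(log_probs, weights):
    # Adjusted objective: each term ln p(z|x) is scaled by the language-model
    # output probability w, i.e. L(S) = -sum w * ln p(z|x).
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# ln p(z|x) for three (x, z) training pairs, and the language-model
# confidence w attached to each recognized sequence (illustrative values).
log_probs = [math.log(0.8), math.log(0.5), math.log(0.9)]
weights = [0.9, 0.4, 0.95]

print(round(ctc_loss(log_probs), 4))                    # 1.0217
print(round(weighted_ctc_loss(log_probs, weights), 4))  # 0.5782
```

The weighted form simply rescales each sample's contribution by the language-model confidence, which is one way to read the patent's "adding the output probability value into the CTC model".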
Based on the first aspect, in some embodiments of the present invention, the method for inputting the speech recognition sequence into the preset language model to obtain the output probability value of each individual character in the speech recognition sequence includes the following steps:
inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each individual character according to the softmax function. The softmax function is W = e^{y_i} / ∑_{j=1}^{V} e^{y_j}, where W is the output probability value of each individual character, y_i is the output-layer score of the i-th character, and V is the number of individual characters in the speech recognition sequence.
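The output-layer calculation above can be illustrated with the standard softmax. This is a minimal sketch assuming one raw output-layer score per character; the scores and vocabulary size are made up for illustration:

```python
import math

def softmax(scores):
    # Numerically stable softmax: W_i = e^{y_i} / sum_j e^{y_j},
    # with the maximum score subtracted before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative output-layer scores for a vocabulary of four characters.
scores = [2.0, 1.0, 0.1, -1.0]
probs = softmax(scores)
print([round(p, 4) for p in probs])  # each entry is an output probability W
print(round(sum(probs), 4))          # 1.0
```

The probabilities sum to one, and the character with the highest output-layer score receives the largest probability value.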
Based on the first aspect, in some embodiments of the present invention, the method for performing optimization adjustment on the CTC model according to the output probability value of each word to obtain an optimized speech recognition model includes the following steps:
adding the output probability value of each single character into a CTC model, and adjusting a loss function of the CTC model to obtain a new loss function;
and optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model.
Based on the first aspect, in some embodiments of the present invention, the new loss function is: L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
In a second aspect, an embodiment of the present invention provides a recognition model optimization system for improving speech recognition accuracy, including an initial output module, a single-word probability module, and a model optimization module, where:
the initial output module is used for inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
the single character probability module is used for inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
and the model optimization module is used for optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
To solve the prior-art technical problem that recognized text contains characters with the same pronunciation but different written forms (or the same character with different pronunciations), making recognition accuracy low, and to improve the accuracy of speech recognition, the CTC model of the DeepSpeech speech recognition system is combined with a language model, the CTC model is optimized, a reasonable speech recognition model is established, and recognition accuracy is improved. The initial output module inputs the speech training data into the CTC model of the DeepSpeech speech recognition system (an end-to-end automatic speech recognition system), trains and optimizes the CTC model (a neural-network-based temporal classification model), and calculates the CTC output loss L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x); CTC finally computes the most likely output speech recognition sequence z, which is text data. The single-character probability module then judges the sequence z in combination with the language model: z is input into the language model, which determines the probability that the current character is a wrongly written character and calculates the output probability value w of each individual character in z. The model optimization module then feeds each value w back into the CTC model, giving the new loss formula L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x); the corresponding parameters of the CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data with the optimized model improves the accuracy of speech recognition.
The system combines the CTC model with the existing language recognition model, optimizes the parameters of the voice recognition model and reduces the loss value, so that when new voice is input into the voice recognition model, an accurate result can be predicted, and the accuracy of voice recognition is improved.
Based on the second aspect, in some embodiments of the invention, the single-word probability module includes a sequence input sub-module, a matrix output sub-module, and a probability calculation sub-module, where:
the sequence input sub-module is used for inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
the matrix output submodule is used for establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
the probability calculation submodule is used for inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each individual character according to the softmax function. The softmax function is W = e^{y_i} / ∑_{j=1}^{V} e^{y_j}, where W is the output probability value of each individual character, y_i is the output-layer score of the i-th character, and V is the number of individual characters in the speech recognition sequence.
Based on the second aspect, in some embodiments of the invention, the model optimization module includes a function adjustment submodule and an identification model submodule, wherein:
the function adjusting submodule is used for adding the output probability value of each single character into the CTC model and adjusting the loss function of the CTC model to obtain a new loss function;
and the recognition model submodule is used for optimizing and adjusting the CTC model according to the new loss function so as to obtain an optimized voice recognition model.
Based on the second aspect, in some embodiments of the invention, the new loss function is: L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory for storing one or more programs, and a processor; when the one or more programs are executed by the processor, the method of any embodiment of the first aspect described above is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method according to any one of the first aspect described above.
The embodiment of the invention at least has the following advantages or beneficial effects:
the embodiment of the invention provides a recognition model optimization method and system for improving voice recognition accuracy, aiming at solving the technical problem that recognition accuracy is not high due to the fact that homonyms and different characters or homonyms and different voices occur in characters after voice recognition in the prior art, and aiming at improving the accuracy of voice recognition, a CTC model of a deep speech recognition system is combined with a language model to optimize the CTC model, establish a reasonable voice recognition model and improve the accuracy of voice recognition. And adding the output probability value of each single character into the CTC model again to obtain a new loss calculation formula, optimizing parameters in the corresponding CTC voice model, and reducing the loss value, so that when a new voice is input into the voice recognition model, an accurate result can be predicted, and the accuracy of voice recognition is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flowchart of a recognition model optimization method for improving speech recognition accuracy according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a recognition model optimization system for improving speech recognition accuracy according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 100, initial output module; 200, single-character probability module; 210, sequence input submodule; 220, matrix output submodule; 230, probability calculation submodule; 300, model optimization module; 310, function adjustment submodule; 320, recognition model submodule; 101, memory; 102, processor; 103, communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Examples
As shown in fig. 1, in a first aspect, an embodiment of the present invention provides a recognition model optimization method for improving speech recognition accuracy, including the following steps:
S1, inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
In some embodiments of the present invention, the voice training data is input into the CTC model of the DeepSpeech speech recognition system (an end-to-end automatic speech recognition system), the CTC model (a neural-network-based temporal classification model) is trained and optimized, and the CTC output loss is calculated as L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x); CTC finally computes the most likely output speech recognition sequence z, which is text data.
S2, inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
In some embodiments of the present invention, after the speech recognition sequence z is obtained, it is judged in combination with the language model: z is input into the language model, which determines the probability that the current character is a wrongly written character, and the output probability value w of each individual character in z is calculated, so that the speech recognition model can subsequently be optimized according to these probability values.
And S3, optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
In some embodiments of the present invention, after the output probability value w of each individual character is obtained, these values are fed back into the CTC model, giving the new loss formula L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x). The computed loss measures the difference between the result predicted by the model and the ground truth; reducing the output loss value improves recognition accuracy. The corresponding parameters of the CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data with the optimized model improves the accuracy of speech recognition.
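Steps S1 through S3 can be sketched as a small pipeline. Every function here is a hypothetical placeholder standing in for the DeepSpeech CTC model and the preset language model, with made-up return values; it is not an API of either system:

```python
import math

def run_ctc_model(audio):
    # S1 (placeholder): the CTC model outputs a recognized character sequence z
    # together with a log-probability ln p(z|x) for the utterance x.
    return "test sentence", math.log(0.7)

def language_model_confidences(sequence):
    # S2 (placeholder): the preset language model scores every character;
    # a low value flags a likely wrongly written (homophone) character.
    return [0.95] * len(sequence)

def adjusted_loss(log_prob, confidences):
    # S3: weight the CTC loss term by the mean language-model confidence w,
    # following the new loss formula L(S) = -sum w * ln p(z|x).
    w = sum(confidences) / len(confidences)
    return -w * log_prob

sequence, log_prob = run_ctc_model(audio=None)
loss = adjusted_loss(log_prob, language_model_confidences(sequence))
print(round(loss, 4))  # 0.3388
```

In an actual training loop this adjusted loss, rather than the baseline CTC loss, would drive the parameter updates.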
To solve the prior-art technical problem that recognized text contains characters with the same pronunciation but different written forms (or the same character with different pronunciations), making recognition accuracy low, and to improve the accuracy of speech recognition, the CTC model of the DeepSpeech speech recognition system is combined with a language model; the CTC model and its parameters are optimized and the loss value is reduced, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
Based on the first aspect, in some embodiments of the present invention, the method for inputting the speech recognition sequence into the preset language model to obtain the output probability value of each individual character in the speech recognition sequence includes the following steps:
inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each individual character according to the softmax function. The softmax function is W = e^{y_i} / ∑_{j=1}^{V} e^{y_j}, where W is the output probability value of each individual character, y_i is the output-layer score of the i-th character, and V is the number of individual characters in the speech recognition sequence.
Calculating the probability value of each individual character first requires inputting the index sequence of each character in the speech recognition sequence into the preset language model. The speech recognition sequence is text data; a word vector matrix is built from the number of characters in the text and the index sequence corresponding to each character. This word vector matrix is the output matrix of the Embedding layer; tanh is then used as the activation function, and the Embedding layer's output is finally sent to the output layer with softmax, which outputs the probabilities. Batch Normalization in DeepSpeech is replaced by Layer Normalization: Batch Normalization operates across the batch, normalizing over N, H and W, while Layer Normalization operates along the channel direction, normalizing over C, H and W, and is particularly effective for RNNs.
The input is the index sequence of a sequence of words. For example, if the dictionary (of size |V|) assigns the word 'this' index 10, 'is' index 23 and 'test' index 65, then for the sentence "this is a test", with 'test' as the word to predict, the index sequence of the preceding words within the window is 10, 23, 65. The Embedding layer is a matrix of size |V|×K (note that K is chosen by the user); it is equivalent to a randomly initialized word-vector table and is updated during backpropagation, so that once neural-network training completes this part holds the trained word vectors. The matrix formed by concatenating its 10th, 23rd and 65th row vectors is the output of the Embedding layer. The hidden layer accepts the concatenated Embedding-layer output as input, with tanh as the activation function, and the result is finally sent to the output layer with softmax, which outputs the probabilities; the optimization target is to maximize the softmax value corresponding to the word to be predicted.
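The embedding lookup in this example can be sketched as row selection and concatenation on a |V|×K matrix. The values of |V|, K and the random initialization are illustrative, while the indices 10, 23 and 65 follow the example above:

```python
import random

V, K = 100, 8          # dictionary size |V| and embedding width K (K is chosen freely)
random.seed(0)

# Randomly initialized |V| x K word-vector matrix; during training it would be
# updated by backpropagation until it holds the learned word vectors.
embedding = [[random.random() for _ in range(K)] for _ in range(V)]

# Index sequence of the words inside the window: 'this' -> 10, 'is' -> 23, 'test' -> 65.
indices = [10, 23, 65]

# The Embedding-layer output is the concatenation of the selected row vectors.
embedded = [value for i in indices for value in embedding[i]]
print(len(embedded))   # 3 * K = 24
```

The concatenated vector is what the tanh hidden layer would then receive as input.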
Based on the first aspect, in some embodiments of the present invention, the method for optimizing and adjusting the CTC model according to the output probability values of the individual words to obtain an optimized speech recognition model includes the following steps:
adding the output probability value of each individual character into the CTC model, and adjusting the loss function of the CTC model to obtain a new loss function. The new loss function is: L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character. The loss function can be interpreted as follows: over the given samples, it is the negative log of the product of the probabilities of outputting the correct label, where p(z|x) is the probability of outputting the speech recognition sequence z given the input speech training data x, and S is the training set.
And optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model.
The optimization of the speech recognition model mainly aims to reduce the difference between the result predicted by the model and the ground truth, i.e. the loss value: the smaller the loss, the higher the accuracy. By adding the output probability value of each individual character to the loss function, the model can predict an accurate result when new speech is input, finally improving accuracy.
As shown in fig. 2, in a second aspect, an embodiment of the present invention provides a recognition model optimization system for improving speech recognition accuracy, including an initial output module 100, a unigram probability module 200, and a model optimization module 300, wherein:
an initial output module 100, configured to input voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
a single word probability module 200, configured to input the voice recognition sequence into a preset language model, so as to obtain an output probability value of each single word in the voice recognition sequence;
and the model optimization module 300 is configured to optimize and adjust the CTC model according to the output probability values of the individual characters to obtain an optimized speech recognition model.
To solve the prior-art technical problem that recognized text contains characters with the same pronunciation but different written forms (or the same character with different pronunciations), making recognition accuracy low, and to improve the accuracy of speech recognition, the CTC model of the DeepSpeech speech recognition system is combined with a language model, the CTC model is optimized, a reasonable speech recognition model is established, and recognition accuracy is improved. The initial output module 100 inputs the speech training data into the CTC model of the DeepSpeech speech recognition system (an end-to-end automatic speech recognition system), trains and optimizes the CTC model (a neural-network-based temporal classification model), and calculates the CTC output loss L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x); CTC finally computes the most likely output speech recognition sequence z, which is text data. The single-character probability module 200 then judges the sequence z in combination with the language model: z is input into the language model, which determines the probability that the current character is a wrongly written character and calculates the output probability value w of each individual character in z. The model optimization module 300 then feeds each value w back into the CTC model, giving the new loss formula L(S) = −ln ∏_{(x,z)∈S} p(z|x)^w = −∑_{(x,z)∈S} w·ln p(z|x); the corresponding parameters of the CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data with the optimized model improves the accuracy of speech recognition.
The system combines the CTC model with the existing language recognition model, optimizes the parameters of the voice recognition model and reduces the loss value, so that when new voice is input into the voice recognition model, an accurate result can be predicted, and the accuracy of voice recognition is improved.
Based on the second aspect, in some embodiments of the present invention, as shown in fig. 2, the single-word probability module 200 includes a sequence input sub-module 210, a matrix output sub-module 220, and a probability calculation sub-module 230, wherein:
the sequence input sub-module 210 is configured to input the index sequence of each individual character in the speech recognition sequence into a preset language model;
the matrix output sub-module 220 is configured to establish a word vector matrix according to the number of the individual characters in the speech recognition sequence and the index sequence of each individual character through a language model, and use the word vector matrix as output data of an Embedding layer;
the probability calculation submodule 230 is configured to input output data of the Embedding layer into an output layer with a softmax function in the language model, and calculate an output probability value of each individual character according to the softmax function; the softmax function is: W = e^(y_i) / Σ_{j=1}^{V} e^(y_j), where W is the output probability value of each single character, y_i is the output-layer score of the i-th character, and V is the number of the single characters in the voice recognition sequence.
Calculating the probability value of each single character first requires inputting the index sequence of each character in the speech recognition sequence into the preset language model. The speech recognition sequence is text data; a corresponding word vector matrix is established according to the number of characters included in the text data and the index corresponding to each character. This word vector matrix is the output matrix of the Embedding layer. tanh is then used as the activation function, and finally the output data of the Embedding layer is sent to the output layer with softmax to output the probabilities.
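The pipeline just described (character index → Embedding lookup → tanh → softmax output layer) can be sketched as follows. The dimensions and randomly initialized weights are illustrative assumptions, not the patent's actual parameters:

```python
import math
import random

random.seed(0)

V = 5          # number of individual characters in the speech recognition sequence
EMBED_DIM = 8  # assumed width of the word vector matrix (illustrative)

# Word vector matrix of the Embedding layer: one row per character index.
embedding = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(V)]
# Output-layer weights mapping the hidden vector to V character scores.
out_w = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(EMBED_DIM)]

def softmax(ys):
    """W_i = e^(y_i) / Σ_j e^(y_j): normalizes scores into probabilities."""
    m = max(ys)  # subtract the max for numerical stability
    exps = [math.exp(y - m) for y in ys]
    s = sum(exps)
    return [e / s for e in exps]

def char_output_probabilities(index):
    """Embedding lookup, tanh activation, then softmax output layer."""
    hidden = [math.tanh(v) for v in embedding[index]]
    scores = [sum(hidden[d] * out_w[d][j] for d in range(EMBED_DIM)) for j in range(V)]
    return softmax(scores)

probs = char_output_probabilities(2)
print(probs)
```

The returned list sums to 1, and each entry is the output probability value of one candidate character in the sequence.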
Based on the second aspect, as shown in fig. 2, in some embodiments of the present invention, the model optimization module 300 includes a function adjustment submodule 310 and an identification model submodule 320, wherein:
the function adjusting submodule 310 is configured to add the output probability value of each individual character to the CTC model, and adjust the loss function of the CTC model to obtain a new loss function; the new loss function is: L(S) = −ln Π_{(x,z)∈S} w·p(z|x) = −Σ_{(x,z)∈S} ln(w·p(z|x)), where x is given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
And the recognition model sub-module 320 is configured to perform optimization and adjustment on the CTC model according to the new loss function to obtain an optimized speech recognition model.
The optimization of the voice recognition model mainly aims to reduce the difference between the result predicted by the model and the real result, i.e., the loss value: the smaller the loss value, the higher the accuracy. By adding the output probability value of each single character to the loss function, an accurate result can be predicted when new voice is input into the model, finally improving accuracy.
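As a numerical illustration of this point (with made-up probabilities, not the patent's data): two candidate transcriptions with the same CTC probability but different language-model confidences receive different adjusted losses, so training is steered away from same-sound, wrong-character outputs:

```python
import math

p = 0.55               # hypothetical CTC probability p(z|x), equal for both candidates
w_well_formed = 0.95   # language model trusts the characters
w_wrong_char = 0.40    # language model flags a likely wrongly written character

# Adjusted per-sequence loss: -ln(w·p(z|x))
loss_well_formed = -math.log(w_well_formed * p)
loss_wrong_char = -math.log(w_wrong_char * p)

# The flagged candidate incurs the larger loss, pushing optimization
# toward the well-formed transcription.
print(loss_well_formed < loss_wrong_char)
```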
In a third aspect, as shown in fig. 3, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs; a processor 102. The one or more programs, when executed by the processor 102, implement the method of any of the first aspects as described above.
Also included is a communication interface 103, and the memory 101, processor 102 and communication interface 103 are electrically connected to each other, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to thereby execute various functional applications and data processing. The communication interface 103 may be used for communicating signaling or data with other node devices.
The memory 101 may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 102 may be an integrated circuit chip having signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In the embodiments provided in the present application, it should be understood that the disclosed method and system can be implemented in other ways. The embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, systems, and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by the processor 102, implements the method according to any one of the first aspect described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (4)
1. A recognition model optimization method for improving speech recognition accuracy is characterized by comprising the following steps:
inputting voice training data into a CTC model of a DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence; the method comprises the following steps: inputting the index sequence of each single character in the voice recognition sequence into a preset language model; establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through the language model, and taking the word vector matrix as output data of an Embedding layer in the language model; inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each single character according to the softmax function; the softmax function is: W = e^(y_i) / Σ_{j=1}^{V} e^(y_j), wherein W is the output probability value of each single character, y_i is the output-layer score of the i-th character, and V is the number of the single characters in the voice recognition sequence;
optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model; the method comprises the following steps: adding the output probability value of each single character into the CTC model, and adjusting a loss function of the CTC model to obtain a new loss function; optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model; the new loss function is: L(S) = −ln Π_{(x,z)∈S} w·p(z|x) = −Σ_{(x,z)∈S} ln(w·p(z|x)), wherein x is given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each single character.
2. A recognition model optimization system for improving speech recognition accuracy, characterized by comprising an initial output module, a single-character probability module, and a model optimization module, wherein:
the initial output module is used for inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
the single character probability module is used for inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence; the single-word probability module comprises a sequence input submodule, a matrix output submodule and a probability calculation submodule, wherein:
the sequence input sub-module is used for inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
the matrix output submodule is used for establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
the probability calculation submodule is used for inputting the output data of the Embedding layer into an output layer with a softmax function in the language model and calculating the output probability value of each single character according to the softmax function; the softmax function is: W = e^(y_i) / Σ_{j=1}^{V} e^(y_j), wherein W is the output probability value of each single character, y_i is the output-layer score of the i-th character, and V is the number of the single characters in the voice recognition sequence;
the model optimization module is used for optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model; the model optimization module comprises a function adjustment submodule and a recognition model submodule, wherein: the function adjustment submodule is used for adding the output probability value of each single character into the CTC model and adjusting the loss function of the CTC model to obtain a new loss function; the recognition model submodule is used for optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model; the new loss function is: L(S) = −ln Π_{(x,z)∈S} w·p(z|x) = −Σ_{(x,z)∈S} ln(w·p(z|x)), wherein x is given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each single character.
3. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
the one or more programs, when executed by the processor, implement the method of claim 1.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110487124.7A CN113327581B (en) | 2021-05-04 | 2021-05-04 | Recognition model optimization method and system for improving speech recognition accuracy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110487124.7A CN113327581B (en) | 2021-05-04 | 2021-05-04 | Recognition model optimization method and system for improving speech recognition accuracy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113327581A CN113327581A (en) | 2021-08-31 |
CN113327581B true CN113327581B (en) | 2022-05-24 |
Family
ID=77414234
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110487124.7A Active CN113327581B (en) | 2021-05-04 | 2021-05-04 | Recognition model optimization method and system for improving speech recognition accuracy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327581B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117727296B (en) * | 2023-12-18 | 2024-08-09 | 杭州恒芯微电子技术有限公司 | Speech recognition control system based on single fire panel |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109272988A (en) * | 2018-09-30 | 2019-01-25 | 江南大学 | Audio recognition method based on multichannel convolutional neural networks |
CN110287480A (en) * | 2019-05-27 | 2019-09-27 | 广州多益网络股份有限公司 | A kind of name entity recognition method, device, storage medium and terminal device |
CN110390093A (en) * | 2018-04-20 | 2019-10-29 | 普天信息技术有限公司 | A kind of language model method for building up and device |
CN111145729A (en) * | 2019-12-23 | 2020-05-12 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
CN111951785A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Voice recognition method and device and terminal equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11593655B2 (en) * | 2018-11-30 | 2023-02-28 | Baidu Usa Llc | Predicting deep learning scaling |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390093A (en) * | 2018-04-20 | 2019-10-29 | 普天信息技术有限公司 | A kind of language model method for building up and device |
CN109272988A (en) * | 2018-09-30 | 2019-01-25 | 江南大学 | Audio recognition method based on multichannel convolutional neural networks |
CN111951785A (en) * | 2019-05-16 | 2020-11-17 | 武汉Tcl集团工业研究院有限公司 | Voice recognition method and device and terminal equipment |
CN110287480A (en) * | 2019-05-27 | 2019-09-27 | 广州多益网络股份有限公司 | A kind of name entity recognition method, device, storage medium and terminal device |
CN111145729A (en) * | 2019-12-23 | 2020-05-12 | 厦门快商通科技股份有限公司 | Speech recognition model training method, system, mobile terminal and storage medium |
Non-Patent Citations (1)
Title |
---|
Research on an English Acoustic Detection System Based on Speech Recognition; Li Xia et al.; Automation Technology and Application; 2019-12-25 (No. 12); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113327581A (en) | 2021-08-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109887484B (en) | Dual learning-based voice recognition and voice synthesis method and device | |
US20230186912A1 (en) | Speech recognition method, apparatus and device, and storage medium | |
WO2022142041A1 (en) | Training method and apparatus for intent recognition model, computer device, and storage medium | |
US7421387B2 (en) | Dynamic N-best algorithm to reduce recognition errors | |
CN108536670B (en) | Output sentence generation device, method, and program | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN111599340A (en) | Polyphone pronunciation prediction method and device and computer readable storage medium | |
CN110569505A (en) | text input method and device | |
CA3123387C (en) | Method and system for generating an intent classifier | |
CN112528653B (en) | Short text entity recognition method and system | |
CN116956835B (en) | Document generation method based on pre-training language model | |
CN115759119B (en) | Financial text emotion analysis method, system, medium and equipment | |
CN115064154A (en) | Method and device for generating mixed language voice recognition model | |
CN113327581B (en) | Recognition model optimization method and system for improving speech recognition accuracy | |
CN112487813B (en) | Named entity recognition method and system, electronic equipment and storage medium | |
Krantz et al. | Language-agnostic syllabification with neural sequence labeling | |
CN114299930A (en) | End-to-end speech recognition model processing method, speech recognition method and related device | |
CN112632956A (en) | Text matching method, device, terminal and storage medium | |
CN116562275B (en) | Automatic text summarization method combined with entity attribute diagram | |
KR100542757B1 (en) | Automatic expansion Method and Device for Foreign language transliteration | |
CN114444492B (en) | Non-standard word class discriminating method and computer readable storage medium | |
CN115270818A (en) | Intention identification method and device, storage medium and computer equipment | |
CN111626059B (en) | Information processing method and device | |
CN115240712A (en) | Multi-mode-based emotion classification method, device, equipment and storage medium | |
CN114974310A (en) | Emotion recognition method and device based on artificial intelligence, computer equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||