CN113327581B - Recognition model optimization method and system for improving speech recognition accuracy - Google Patents

Recognition model optimization method and system for improving speech recognition accuracy

Info

Publication number
CN113327581B
CN113327581B (application number CN202110487124.7A)
Authority
CN
China
Prior art keywords
model
voice recognition
output
sequence
ctc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110487124.7A
Other languages
Chinese (zh)
Other versions
CN113327581A
Inventor
李传咏
赵莉
卢颖
陈宁
刘睿
Current Assignee
Xi'an Webber Software Co ltd
Original Assignee
Xi'an Webber Software Co ltd
Priority date
Filing date
Publication date
Application filed by Xi'an Webber Software Co ltd filed Critical Xi'an Webber Software Co ltd
Priority to CN202110487124.7A
Publication of CN113327581A
Application granted
Publication of CN113327581B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/26: Speech to text systems

Abstract

The invention discloses a recognition model optimization method for improving speech recognition accuracy, comprising the following steps: inputting speech training data into the CTC model of a DeepSpeech speech recognition system to obtain the speech recognition sequence output by the CTC model; inputting the speech recognition sequence into a preset language model to obtain the output probability value of each individual character in the sequence; and optimizing and adjusting the CTC model according to these output probability values to obtain an optimized speech recognition model. The invention also discloses a recognition model optimization system for improving speech recognition accuracy. The invention relates to the technical field of speech recognition, and effectively improves recognition accuracy by incorporating a language model.

Description

Recognition model optimization method and system for improving speech recognition accuracy
Technical Field
The invention relates to the technical field of voice recognition, in particular to a recognition model optimization method and system for improving voice recognition accuracy.
Background
Speech recognition is generally divided into two stages. 1) The speech recognition stage: an acoustic model of speech converts the natural sound signal into a machine-processable numerical representation of syllables. 2) The speech understanding stage: the syllables produced by the previous stage are converted into Chinese characters, which requires knowledge from a language model. Establishing a good language model is therefore the most important part of improving speech recognition accuracy.
Language models in common use today fall into two categories. The first is the statistical language model built from large-scale corpora: it is well suited to processing large-scale real corpora, offers consistent data preparation and strong robustness, but because its implementation is limited by system space and time it can only capture adjacent (local) constraints in the language and cannot handle long-distance recursion. The second is the rule-based language model, which classifies the Chinese vocabulary system by grammar and semantics and attempts to disambiguate homophones at scale by determining the lexical, syntactic, and semantic relationships of natural language; it is suited to closed corpora and can capture long-distance constraints and recursive phenomena, but it is less robust, unsuitable for open corpora, and its knowledge representation is less consistent.
In existing speech-to-text methods, characters in the original text are often misrecognized as homophones written with different characters, so recognition accuracy is not high.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a recognition model optimization method and system for improving the accuracy of speech recognition, which, in combination with a language model, effectively improves the accuracy of speech recognition.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a recognition model optimization method for improving speech recognition accuracy, including the following steps:
inputting voice training data into a CTC model of a DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
and optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
In the prior art, recognition accuracy is low because recognized text contains characters with the same pronunciation but different written forms (or the same written form with different pronunciations). To improve accuracy, the CTC model of the DeepSpeech speech recognition system is combined with a language model: the CTC model is optimized and a reasonable speech recognition model is established. Speech training data is input into the CTC model of the DeepSpeech system (an end-to-end automatic speech recognition system); the CTC model (a neural-network-based temporal classification model) is trained and optimized, and its output loss is computed as: L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x). CTC finally outputs the most likely speech recognition sequence z, which is text data. Next, the sequence z is checked in combination with the language model: z is input into the language model, which judges the probability that each current character is a wrongly written character and computes the output probability value w of each individual character in z. The value w is then added back into the CTC model, giving the new loss formula: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x). The parameters of the CTC speech model are optimized accordingly to obtain an optimized speech recognition model, and recognizing speech data through this optimized model improves the accuracy of speech recognition.
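The plain and language-model-weighted losses above can be sketched as follows. This is a minimal illustration assuming p(z|x) has already been computed for each training pair, and reading w as a per-pair weight; the function name and this reading are assumptions, not from the patent.

```python
import math

def nll_loss(probs, weights=None):
    """Negative log-likelihood over a training set S.

    probs:   p(z|x) for each (x, z) pair in S, as computed by the CTC model.
    weights: optional weights w derived from the language model's
             per-character output probabilities; None reproduces the plain
             CTC loss L(S) = -sum ln p(z|x).
    """
    if weights is None:
        weights = [1.0] * len(probs)
    return -sum(w * math.log(p) for w, p in zip(weights, probs))

plain = nll_loss([0.5, 0.25])                 # -ln 0.5 - ln 0.25
weighted = nll_loss([0.5, 0.25], [0.9, 0.1])  # down-weights the uncertain pair
```

Down-weighting a pair whose characters the language model flags as likely wrong reduces that pair's pull on the loss, which is one plausible way the w factor steers optimization.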
The method combines the CTC model with an existing language model to optimize the parameters of the speech recognition model and reduce the loss value, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
Based on the first aspect, in some embodiments of the present invention, the method for inputting the speech recognition sequence into the preset language model to obtain the output probability value of each individual character in the speech recognition sequence includes the following steps:
inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each single character according to the softmax function; the softmax function is:
W_i = exp(y_i) / ∑_{j=1}^{V} exp(y_j)

where W_i is the output probability value of the i-th individual character (y_i being that character's score from the preceding layer), and V is the number of individual characters in the speech recognition sequence.
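The softmax computation above can be sketched as follows; this is a generic, numerically stable implementation, with the scores assumed to come from the hidden layer described in the embodiments.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    m = max(scores)  # subtract the max score so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # one probability per candidate character
```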
Based on the first aspect, in some embodiments of the present invention, the method for performing optimization adjustment on the CTC model according to the output probability value of each word to obtain an optimized speech recognition model includes the following steps:
adding the output probability value of each single character into a CTC model, and adjusting a loss function of the CTC model to obtain a new loss function;
and optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model.
Based on the first aspect, in some embodiments of the present invention, the new loss function is: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
In a second aspect, an embodiment of the present invention provides a recognition model optimization system for improving speech recognition accuracy, including an initial output module, a single-word probability module, and a model optimization module, where:
the initial output module is used for inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
the single character probability module is used for inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
and the model optimization module is used for optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
In the prior art, recognition accuracy is low because recognized text contains characters with the same pronunciation but different written forms (or the same written form with different pronunciations). To improve accuracy, the CTC model of the DeepSpeech speech recognition system is combined with a language model: the CTC model is optimized and a reasonable speech recognition model is established. Through the initial output module, speech training data is input into the CTC model of the DeepSpeech system (an end-to-end automatic speech recognition system); the CTC model (a neural-network-based temporal classification model) is trained and optimized, and its output loss is computed as: L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x). CTC finally outputs the most likely speech recognition sequence z, which is text data. Next, the single-character probability module checks the sequence z in combination with the language model: z is input into the language model, which judges the probability that each current character is a wrongly written character and computes the output probability value w of each individual character in z. The model optimization module then adds w back into the CTC model, giving the new loss formula: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x). The parameters of the CTC speech model are optimized accordingly to obtain an optimized speech recognition model, and recognizing speech data through this optimized model improves the accuracy of speech recognition.
The system combines the CTC model with an existing language model to optimize the parameters of the speech recognition model and reduce the loss value, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
Based on the second aspect, in some embodiments of the invention, the single-word probability module includes a sequence input sub-module, a matrix output sub-module, and a probability calculation sub-module, where:
the sequence input sub-module is used for inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
the matrix output submodule is used for establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
the probability calculation submodule is used for inputting the output data of the Embedding layer into an output layer with a softmax function in the language model and calculating the output probability value of each single character according to the softmax function; the softmax function is:
W_i = exp(y_i) / ∑_{j=1}^{V} exp(y_j)

where W_i is the output probability value of the i-th individual character (y_i being that character's score from the preceding layer), and V is the number of individual characters in the speech recognition sequence.
Based on the second aspect, in some embodiments of the invention, the model optimization module includes a function adjustment submodule and an identification model submodule, wherein:
the function adjusting submodule is used for adding the output probability value of each single character into the CTC model and adjusting the loss function of the CTC model to obtain a new loss function;
and the recognition model submodule is used for optimizing and adjusting the CTC model according to the new loss function so as to obtain an optimized voice recognition model.
Based on the second aspect, in some embodiments of the invention, the new loss function is: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
In a third aspect, an embodiment of the present application provides an electronic device comprising a memory for storing one or more programs, and a processor; when the one or more programs are executed by the processor, the method of any one of the first aspect described above is implemented.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method according to any one of the first aspect described above.
The embodiment of the invention at least has the following advantages or beneficial effects:
the embodiment of the invention provides a recognition model optimization method and system for improving voice recognition accuracy, aiming at solving the technical problem that recognition accuracy is not high due to the fact that homonyms and different characters or homonyms and different voices occur in characters after voice recognition in the prior art, and aiming at improving the accuracy of voice recognition, a CTC model of a deep speech recognition system is combined with a language model to optimize the CTC model, establish a reasonable voice recognition model and improve the accuracy of voice recognition. And adding the output probability value of each single character into the CTC model again to obtain a new loss calculation formula, optimizing parameters in the corresponding CTC voice model, and reducing the loss value, so that when a new voice is input into the voice recognition model, an accurate result can be predicted, and the accuracy of voice recognition is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flowchart of a recognition model optimization method for improving speech recognition accuracy according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a recognition model optimization system for improving speech recognition accuracy according to an embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 100. an initial output module; 200. a single-character probability module; 210. a sequence input submodule; 220. a matrix output submodule; 230. a probability calculation submodule; 300. a model optimization module; 310. a function adjustment submodule; 320. a recognition model submodule; 101. a memory; 102. a processor; 103. a communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Examples
As shown in fig. 1, in a first aspect, an embodiment of the present invention provides a recognition model optimization method for improving speech recognition accuracy, including the following steps:
s1, inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
In some embodiments of the present invention, speech training data is input into the CTC model of the DeepSpeech speech recognition system (an end-to-end automatic speech recognition system); the CTC model (a neural-network-based temporal classification model) is trained and optimized, and its output loss is computed as: L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x). CTC finally outputs the most likely speech recognition sequence z, which is text data.
S2, inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence;
In some embodiments of the present invention, after the speech recognition sequence z is obtained, it is recognized and checked in combination with the language model: z is input into the language model, which judges the probability that each current character is a wrongly written character and computes the output probability value w of each individual character in z, so that the speech recognition model can subsequently be optimized according to these probability values.
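As a toy stand-in for the preset language model, the scoring of each character in z can be sketched with a character-bigram model. This is purely illustrative: the patent's language model is the Embedding-plus-softmax network described below, and all names here are assumptions.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count character-bigram frequencies from reference text."""
    counts, follows = Counter(), defaultdict(Counter)
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[a] += 1
            follows[a][b] += 1
    return counts, follows

def char_probs(model, sequence, floor=1e-3):
    """P(next char | previous char) for each character after the first.

    A low probability flags a character the model considers likely wrong."""
    counts, follows = model
    probs = []
    for a, b in zip(sequence, sequence[1:]):
        if counts[a]:
            probs.append(max(follows[a][b] / counts[a], floor))
        else:
            probs.append(floor)
    return probs

model = train_bigram(["abab", "abc"])
scores = char_probs(model, "abx")  # 'x' after 'b' is unseen, so it is floored
```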
And S3, optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model.
In some embodiments of the present invention, after the output probability value w of each individual character is obtained, it is added back into the CTC model, giving the new loss formula: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x). The loss finally computed is the difference between the predicted result and the ground truth; reducing the output loss value improves recognition accuracy. The parameters of the corresponding CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data through this optimized model improves the accuracy of speech recognition.
In order to solve the prior-art problem that recognized characters share a pronunciation but differ in written form (or share a written form but differ in pronunciation), lowering recognition accuracy, the CTC model of the DeepSpeech speech recognition system is combined with a language model: the CTC model's parameters are optimized and the loss value reduced, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
Based on the first aspect, in some embodiments of the present invention, the method for inputting the speech recognition sequence into the preset language model to obtain the output probability value of each individual character in the speech recognition sequence includes the following steps:
inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each single character according to the softmax function; the softmax function is:
W_i = exp(y_i) / ∑_{j=1}^{V} exp(y_j)

where W_i is the output probability value of the i-th individual character (y_i being that character's score from the preceding layer), and V is the number of individual characters in the speech recognition sequence.
Calculating the probability value of each individual character first requires inputting the index sequence of each character in the speech recognition sequence into the preset language model. The speech recognition sequence is text data; a corresponding word vector matrix is built from the number of characters in that text and the index sequence of each character, and this matrix is the output of the Embedding layer. Then tanh is used as the activation function, and finally the Embedding layer's output is sent to the output layer with softmax, which outputs the probabilities. In addition, Batch Normalization in DeepSpeech is replaced by Layer Normalization: Batch Normalization operates over the batch, normalizing across N, H, and W, while Layer Normalization operates along the channel direction, normalizing across C, H, and W, which is mainly effective for RNNs.
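The Layer Normalization mentioned above, applied per sample across its features, can be sketched as follows; this is the generic formulation (without the usual learnable scale and shift), not code from the patent.

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize one sample's feature vector to zero mean, unit variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    return [(v - mean) / math.sqrt(var + eps) for v in features]

normed = layer_norm([1.0, 2.0, 3.0])
```

Unlike Batch Normalization, the statistics here come from a single sample, so the operation behaves identically at any batch size, which is why it suits recurrent networks.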
The input is the index sequence of a sequence of words. For example, in a dictionary of size |V|, suppose the word 'this' has index 10, the word 'is' has index 23, and 'a' has index 65; then for the sentence "this is a test", when predicting the word 'test', the index sequence of the preceding words within the window is 10, 23, 65. The Embedding layer is a matrix of size |V|×K (note that K is chosen by the user); this matrix is equivalent to a randomly initialized word-vector table, is updated during backpropagation, and holds the trained word vectors once training of the neural network completes. The matrix formed by concatenating its 10th, 23rd, and 65th row vectors is the output of the Embedding layer. The hidden layer accepts the concatenated Embedding output as input and uses tanh as the activation function; the result is finally sent to the output layer with softmax, which outputs the probabilities. The optimization objective is to maximize the softmax value corresponding to the word to be predicted.
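The Embedding lookup and concatenation just described can be sketched as follows; |V| = 100 and K = 8 are arbitrary illustrative choices, and the matrix is randomly initialized as the text describes (in training it would then be updated by backpropagation).

```python
import random

V, K = 100, 8          # dictionary size |V| and embedding width K
random.seed(0)
# randomly initialized |V| x K word-vector matrix
embedding = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(V)]

def embed(index_sequence):
    """Concatenate the embedding rows selected by the word indices."""
    out = []
    for i in index_sequence:
        out.extend(embedding[i])
    return out

context = embed([10, 23, 65])  # rows 10, 23, 65 spliced into one vector
```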
Based on the first aspect, in some embodiments of the present invention, the method for optimizing and adjusting the CTC model according to the output probability values of the individual words to obtain an optimized speech recognition model includes the following steps:
adding the output probability value of each individual character into the CTC model, and adjusting the loss function of the CTC model to obtain the new loss function: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x), where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character. The loss function can be interpreted as follows: over the training set, it is the negative log of the product of the probabilities of outputting the correct label, where p(z|x) is the probability of outputting the speech recognition sequence z given the input speech training data x.
And optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model.
The optimization of the speech recognition model mainly aims to reduce the difference between the model's predicted result and the ground truth, i.e., the loss value; the smaller the loss value, the higher the accuracy. By adding the output probability value of each individual character into the loss function, an accurate result can be predicted when new speech is input into the model, finally improving accuracy.
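The claim that optimization reduces the loss can be illustrated with a one-parameter toy model. This is entirely illustrative: a sigmoid stands in for the recognition model's probability of the correct sequence, w is a fixed language-model weight, and none of these names come from the patent.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(theta, w):
    # weighted negative log-probability of the correct output
    return -w * math.log(sigmoid(theta))

def grad(theta, w):
    # d(loss)/d(theta) for the sigmoid model
    return -w * (1.0 - sigmoid(theta))

theta, w, lr = 0.0, 0.8, 0.5
history = [loss(theta, w)]
for _ in range(50):           # plain gradient descent on theta
    theta -= lr * grad(theta, w)
    history.append(loss(theta, w))
```

Each step moves theta against the gradient, so the recorded loss values shrink toward zero, mirroring how optimizing the CTC parameters under the new loss raises the probability of the correct sequence.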
As shown in fig. 2, in a second aspect, an embodiment of the present invention provides a recognition model optimization system for improving speech recognition accuracy, including an initial output module 100, a unigram probability module 200, and a model optimization module 300, wherein:
an initial output module 100, configured to input voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
a single word probability module 200, configured to input the voice recognition sequence into a preset language model, so as to obtain an output probability value of each single word in the voice recognition sequence;
and the model optimization module 300 is configured to optimize and adjust the CTC model according to the output probability values of the individual characters to obtain an optimized speech recognition model.
In the prior art, recognition accuracy is low because recognized text contains characters with the same pronunciation but different written forms (or the same written form with different pronunciations). To improve accuracy, the CTC model of the DeepSpeech speech recognition system is combined with a language model: the CTC model is optimized and a reasonable speech recognition model is established. Through the initial output module 100, speech training data is input into the CTC model of the DeepSpeech system (an end-to-end automatic speech recognition system); the CTC model (a neural-network-based temporal classification model) is trained and optimized, and its output loss is computed as: L(S) = −ln ∏_{(x,z)∈S} p(z|x) = −∑_{(x,z)∈S} ln p(z|x). CTC finally outputs the most likely speech recognition sequence z, which is text data. Next, the single-character probability module 200 checks the sequence z in combination with the language model: z is input into the language model, which judges the probability that each current character is a wrongly written character and computes the output probability value w of each individual character in z. The model optimization module 300 then adds w back into the CTC model, giving the new loss formula: L(S) = −w · ∑_{(x,z)∈S} ln p(z|x). The parameters of the corresponding CTC speech model are optimized to obtain an optimized speech recognition model, and recognizing speech data through this optimized model improves the accuracy of speech recognition.
The system combines the CTC model with an existing language model, optimizes the parameters of the speech recognition model, and reduces the loss value, so that when new speech is input into the speech recognition model an accurate result can be predicted, improving the accuracy of speech recognition.
Based on the second aspect, in some embodiments of the present invention, as shown in fig. 2, the single-word probability module 200 includes a sequence input sub-module 210, a matrix output sub-module 220, and a probability calculation sub-module 230, wherein:
the sequence input sub-module 210 is configured to input the index sequence of each individual character in the speech recognition sequence into a preset language model;
the matrix output sub-module 220 is configured to establish a word vector matrix according to the number of the individual characters in the speech recognition sequence and the index sequence of each individual character through a language model, and use the word vector matrix as output data of an Embedding layer;
the probability calculation submodule 230 is configured to input output data of the Embedding layer into an output layer with a softmax function in the language model, and calculate an output probability value of each individual character according to the softmax function; the softmax function is:
W = e^{y_i} / Σ_{j=1}^{V} e^{y_j}
wherein, W is the output probability value of each single character, and V is the number of the single characters in the voice recognition sequence.
Calculating the probability value of each single character first requires inputting the index sequence of each single character in the speech recognition sequence into the preset language model. The speech recognition sequence is text data; a corresponding word vector matrix is established according to the number of single characters in the text data and the index sequence corresponding to each single character. This word vector matrix is the output matrix of the Embedding layer. tanh is then used as the activation function, and finally the output data of the Embedding layer is fed to the output layer with softmax to produce the probabilities.
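The Embedding → tanh → softmax pipeline just described can be sketched with toy weights. All shapes and parameter values here (the vocabulary size V, embedding dimension D, and the random matrices E and H) are assumptions for illustration; the patent specifies only the pipeline, not the language model's dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 5, 8                       # toy vocabulary size and embedding dimension
E = rng.normal(size=(V, D))       # Embedding table (one row per character)
H = rng.normal(size=(D, V))       # output-layer weights

def char_probabilities(index_sequence):
    """Index sequence -> Embedding lookup -> tanh -> softmax over V characters."""
    emb = E[np.asarray(index_sequence)]   # word vector matrix (Embedding output)
    hidden = np.tanh(emb)                 # tanh activation
    logits = hidden @ H                   # output layer
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)           # softmax over each row

probs = char_probabilities([2, 0, 4])     # probabilities for a 3-character sequence
```

Each row of `probs` sums to 1, giving the output probability value of one character over the V-entry vocabulary.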
Based on the second aspect, as shown in fig. 2, in some embodiments of the present invention, the model optimization module 300 includes a function adjustment submodule 310 and an identification model submodule 320, wherein:
the function adjusting submodule 310 is configured to add the output probability value of each individual character to the CTC model, and adjust the loss function of the CTC model to obtain a new loss function; the new loss function is: L(S) = −ln Π_{(x,z)∈S} p(z|x)·w = −Σ_{(x,z)∈S} ln p(z|x)·w, where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each individual character.
And the recognition model sub-module 320 is configured to perform optimization and adjustment on the CTC model according to the new loss function to obtain an optimized speech recognition model.
The optimization of the speech recognition model mainly aims to reduce the difference between the result predicted by the model and the true result, i.e. the loss value; the smaller the loss value, the higher the accuracy. By incorporating the output probability value of each single character into the loss function, the model can predict an accurate result when new speech is input, finally improving recognition accuracy.
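The idea that driving the loss value down improves prediction can be illustrated with a one-parameter gradient-descent sketch. The sigmoid stand-in for p(z|x), the weight w, and the learning rate are assumptions for illustration; the real system adjusts the CTC network's parameters rather than a single scalar.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def loss(theta, w):
    return -w * math.log(sigmoid(theta))  # weighted negative log-likelihood

theta, w, lr = 0.0, 0.8, 0.5
history = []
for _ in range(20):
    p = sigmoid(theta)
    grad = -w * (1.0 - p)   # d/dtheta of -w * ln(sigmoid(theta))
    theta -= lr * grad      # gradient-descent parameter update
    history.append(loss(theta, w))
```

Each update pushes the parameter toward higher p(z|x), so the recorded loss values decrease monotonically — the same criterion by which the optimized speech recognition model is judged.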
In a third aspect, as shown in fig. 3, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs and a processor 102. When the one or more programs are executed by the processor 102, the method of any of the first aspect described above is implemented.
Also included is a communication interface 103, and the memory 101, processor 102 and communication interface 103 are electrically connected to each other, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to thereby execute various functional applications and data processing. The communication interface 103 may be used for communicating signaling or data with other node devices.
The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 102 may be an integrated circuit chip having signal processing capabilities. The processor 102 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In the embodiments provided in the present application, it should be understood that the disclosed method and system can be implemented in other ways. The embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by the processor 102, implements the method according to any one of the first aspect described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program codes.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (4)

1. A recognition model optimization method for improving speech recognition accuracy is characterized by comprising the following steps:
inputting voice training data into a CTC model of a DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence; the method comprises the following steps: inputting the index sequence of each single character in the voice recognition sequence into a preset language model; establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer in the language model; inputting the output data of the Embedding layer into an output layer with a softmax function in the language model, and calculating the output probability value of each single character according to the softmax function; the softmax function is:
W = e^{y_i} / Σ_{j=1}^{V} e^{y_j}
wherein, W is the output probability value of each single character, and V is the number of the single characters in the voice recognition sequence;
optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model; the method comprises the following steps: adding the output probability value of each single character into the CTC model, and adjusting a loss function of the CTC model to obtain a new loss function; optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model; the new loss function is: L(S) = −ln Π_{(x,z)∈S} p(z|x)·w = −Σ_{(x,z)∈S} ln p(z|x)·w, where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each single character.
2. A recognition model optimization system for improving speech recognition accuracy, characterized by comprising an initial output module, a single-character probability module and a model optimization module, wherein:
the initial output module is used for inputting voice training data into a CTC model of the DeepSpeech voice recognition system to obtain a voice recognition sequence output by the CTC model;
the single character probability module is used for inputting the voice recognition sequence into a preset language model to obtain the output probability value of each single character in the voice recognition sequence; the single-word probability module comprises a sequence input submodule, a matrix output submodule and a probability calculation submodule, wherein:
the sequence input sub-module is used for inputting the index sequence of each single character in the voice recognition sequence into a preset language model;
the matrix output submodule is used for establishing a word vector matrix according to the number of the single characters in the voice recognition sequence and the index sequence of each single character through a language model, and taking the word vector matrix as output data of an Embedding layer;
the probability calculation submodule is used for inputting the output data of the Embedding layer into an output layer with a softmax function in the language model and calculating the output probability value of each single character according to the softmax function; the softmax function is:
W = e^{y_i} / Σ_{j=1}^{V} e^{y_j}
wherein, W is the output probability value of each single character, and V is the number of the single characters in the voice recognition sequence;
the model optimization module is used for optimizing and adjusting the CTC model according to the output probability value of each single character to obtain an optimized voice recognition model; the model optimization module comprises a function adjustment submodule and a recognition model submodule, wherein: the function adjustment submodule is used for adding the output probability value of each single character into the CTC model and adjusting the loss function of the CTC model to obtain a new loss function; the recognition model submodule is used for optimizing and adjusting the CTC model according to the new loss function to obtain an optimized voice recognition model; the new loss function is: L(S) = −ln Π_{(x,z)∈S} p(z|x)·w = −Σ_{(x,z)∈S} ln p(z|x)·w, where x is the given input speech training data, z is the output speech recognition sequence, S is the training set, and w is the output probability value of each single character.
3. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
the one or more programs, when executed by the processor, implement the method of claim 1.
4. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of claim 1.
CN202110487124.7A 2021-05-04 2021-05-04 Recognition model optimization method and system for improving speech recognition accuracy Active CN113327581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110487124.7A CN113327581B (en) 2021-05-04 2021-05-04 Recognition model optimization method and system for improving speech recognition accuracy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110487124.7A CN113327581B (en) 2021-05-04 2021-05-04 Recognition model optimization method and system for improving speech recognition accuracy

Publications (2)

Publication Number Publication Date
CN113327581A CN113327581A (en) 2021-08-31
CN113327581B true CN113327581B (en) 2022-05-24

Family

ID=77414234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110487124.7A Active CN113327581B (en) 2021-05-04 2021-05-04 Recognition model optimization method and system for improving speech recognition accuracy

Country Status (1)

Country Link
CN (1) CN113327581B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
CN110390093A (en) * 2018-04-20 2019-10-29 普天信息技术有限公司 A kind of language model method for building up and device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium
CN111951785A (en) * 2019-05-16 2020-11-17 武汉Tcl集团工业研究院有限公司 Voice recognition method and device and terminal equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11593655B2 (en) * 2018-11-30 2023-02-28 Baidu Usa Llc Predicting deep learning scaling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110390093A (en) * 2018-04-20 2019-10-29 普天信息技术有限公司 A kind of language model method for building up and device
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN111951785A (en) * 2019-05-16 2020-11-17 武汉Tcl集团工业研究院有限公司 Voice recognition method and device and terminal equipment
CN110287480A (en) * 2019-05-27 2019-09-27 广州多益网络股份有限公司 A kind of name entity recognition method, device, storage medium and terminal device
CN111145729A (en) * 2019-12-23 2020-05-12 厦门快商通科技股份有限公司 Speech recognition model training method, system, mobile terminal and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on an English acoustic detection system based on speech recognition; Li Xia et al.; Techniques of Automation and Applications; 2019-12-25 (No. 12); full text *

Also Published As

Publication number Publication date
CN113327581A (en) 2021-08-31

Similar Documents

Publication Publication Date Title
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
US7421387B2 (en) Dynamic N-best algorithm to reduce recognition errors
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
CN108536670B (en) Output sentence generation device, method, and program
WO2022142041A1 (en) Training method and apparatus for intent recognition model, computer device, and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
CA3123387C (en) Method and system for generating an intent classifier
CN110569505A (en) text input method and device
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN112489655A (en) Method, system and storage medium for correcting error of speech recognition text in specific field
CN115064154A (en) Method and device for generating mixed language voice recognition model
CN116956835B (en) Document generation method based on pre-training language model
CN113673228A (en) Text error correction method, text error correction device, computer storage medium and computer program product
Krantz et al. Language-agnostic syllabification with neural sequence labeling
CN112632956A (en) Text matching method, device, terminal and storage medium
KR100542757B1 (en) Automatic expansion Method and Device for Foreign language transliteration
CN112530402A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN113327581B (en) Recognition model optimization method and system for improving speech recognition accuracy
CN116187304A (en) Automatic text error correction algorithm and system based on improved BERT
CN116340507A (en) Aspect-level emotion analysis method based on mixed weight and double-channel graph convolution
CN112966501B (en) New word discovery method, system, terminal and medium
CN115270818A (en) Intention identification method and device, storage medium and computer equipment
CN112528653B (en) Short text entity recognition method and system
CN111626059B (en) Information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant