CN114360525A - Voice recognition method and system - Google Patents

Voice recognition method and system

Info

Publication number: CN114360525A
Application number: CN202210018178.3A
Authority: CN (China)
Prior art keywords: model, voice, sequence, target, signal
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventor: 姜松
Current assignee (listing may be inaccurate): Beijing Hengtianruixun Technology Co ltd
Original assignee: Beijing Hengtianruixun Technology Co ltd
Application filed by Beijing Hengtianruixun Technology Co ltd
Priority to CN202210018178.3A
Publication of CN114360525A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a voice recognition method and system, relating to the technical field of voice recognition. The method comprises the following steps: constructing a voice recognition model based on a Transformer model and a WFST model; acquiring an audio signal to be recognized; detecting the voice signal in the audio signal to be recognized to obtain a target voice signal; converting the target voice signal to obtain a voice feature vector sequence; transforming the voice feature vector sequence to obtain a target voice feature sequence; inputting the target voice feature sequence into the voice recognition model; recognizing the target voice feature sequence through the Transformer model in the voice recognition model and outputting a phoneme sequence; and inputting the phoneme sequence into the WFST model in the voice recognition model, outputting a Chinese character sequence, and completing the voice recognition. The invention effectively reduces data procurement cost while ensuring the accuracy of voice recognition.

Description

Voice recognition method and system
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice recognition method and a voice recognition system.
Background
Speech recognition is essentially the process of converting an audio sequence into a text sequence, i.e. finding the most probable text sequence given a speech input. By Bayes' rule, the speech recognition problem decomposes into the conditional probability of the speech given the text sequence and the prior probability of the text sequence. The model obtained by modeling the conditional probability is the acoustic model, and the model obtained by modeling the prior probability of the text sequence is the language model.
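The decomposition just described can be written out. For an acoustic observation sequence $X$ and a candidate text sequence $W$, the recognizer seeks

```latex
W^{*} = \arg\max_{W} P(W \mid X)
      = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)}
      = \arg\max_{W} P(X \mid W)\,P(W),
```

where $P(X \mid W)$ is the acoustic model, $P(W)$ is the language model, and $P(X)$ can be dropped because it is constant over the candidate text sequences.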
The acoustic model converts speech into an acoustic representation; that is, it gives the probability that a given speech segment corresponds to a given acoustic unit. The most direct acoustic unit is the word, but with an insufficient amount of training data it is difficult to obtain a good word-level model. A word consists of several consecutively pronounced phonemes, and phonemes have clear definitions and are limited in number. Thus, in speech recognition, the acoustic model is commonly decomposed into a model from the speech sequence to the pronunciation (phoneme) sequence, plus a dictionary from the pronunciation sequence to the output text sequence.
The most common acoustic modeling approach in speech recognition is the Hidden Markov Model (HMM). Under the HMM, the states are hidden variables, the speech frames are the observations, and transitions between states obey the Markov assumption. The state durations implied by the transition probabilities follow a geometric distribution, and a Gaussian Mixture Model (GMM) is commonly used to fit the observation probability from hidden state to observation. With the development of deep learning, models such as the Deep Neural Network (DNN), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) have been applied to modeling the observation probability, with very good results.
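Concretely, with initial probabilities $\pi$, transition probabilities $a$, and emission densities $b$ (fitted by a GMM, or by a DNN/CNN/RNN as noted above), the HMM assigns the observed frames $x_1,\dots,x_T$ the likelihood

```latex
P(x_1,\dots,x_T) \;=\; \sum_{s_1,\dots,s_T} \pi_{s_1}\, b_{s_1}(x_1) \prod_{t=2}^{T} a_{s_{t-1} s_t}\, b_{s_t}(x_t),
```

summing over all hidden state sequences; the self-transition probability of each state is what yields the geometric duration distribution mentioned above.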
Prior-art speech recognition is mainly realized within a single technical framework: either an end-to-end neural network model built with a deep learning framework such as PyTorch or Keras, or a highly integrated HCLG model realized with the dedicated ASR toolkit Kaldi. In the former, the model is treated as a black box trained on speech and text labels, and input speech directly yields text. In the latter, training data and parameters must be filled in strictly according to the framework specification; the intermediate links are coupled very tightly, so the framework is difficult to modify and the fusion of other advanced models is limited.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a voice recognition method and system, which can effectively reduce data procurement cost and ensure accuracy of voice recognition.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a speech recognition method, including the following steps:
constructing a voice recognition model based on a Transformer model and a WFST model;
acquiring an audio signal to be recognized;
detecting the voice signal in the audio signal to be recognized by a VAD (voice activity detection) method to obtain a target voice signal;
converting the target voice signal with an FBANK algorithm to obtain a voice feature vector sequence;
transforming the voice feature vector sequence with a convolutional neural network to obtain a target voice feature sequence;
inputting the target voice feature sequence into the voice recognition model;
recognizing the target voice feature sequence through the Transformer model in the voice recognition model, and outputting a phoneme sequence;
and inputting the phoneme sequence into the WFST model in the voice recognition model, outputting a Chinese character sequence, and completing the voice recognition.
To solve the technical problems of the prior art, namely high data procurement cost and a model framework that is difficult to modify, the invention connects a Transformer model and a WFST model. The Transformer, a sequence-to-sequence model based on the attention mechanism, serves as the acoustic model from speech to phoneme sequence; the WFST, with its extremely fast decoding, serves as the language model from phoneme sequence to Chinese character sequence. No large expense on purchased training data is required: the models can be trained on conventional data, greatly reducing data procurement cost. The parameters and input interface of the Transformer model are modified so that it accepts voice data as input features, and the phoneme sequence (the ordered arrangement of initial consonant, final and tone) serves as the symbol set between the two models, ensuring consistency between the output of the first model and the input of the second. The voice recognition model obtained by combining the two models thus greatly improves both recognition efficiency and recognition accuracy.
Based on the first aspect, in some embodiments of the present invention, the above method for constructing a speech recognition model based on a Transformer model and a WFST model includes the following steps:
modifying and adjusting the parameters and input interface of the Transformer model to obtain a target acoustic model;
constructing a target WFST model based on a pronunciation dictionary and an N-Gram statistical model;
and constructing a voice recognition model based on the target acoustic model and the target WFST model.
Based on the first aspect, in some embodiments of the invention, the parameters include the hidden layer dimension and the number of heads of the multi-head attention mechanism.
Based on the first aspect, in some embodiments of the present invention, the method for detecting a speech signal in an audio signal to be recognized by using a VAD detection method to obtain a target speech signal includes the following steps:
judging, by a VAD (voice activity detection) method, whether each frame of the audio signal to be recognized is a voice signal; if so, extracting the frame as part of the target voice signal and proceeding to the next frame; if not, discarding the frame and proceeding to the next frame, until all frames have been judged.
Based on the first aspect, in some embodiments of the present invention, the method for transforming a target speech signal by using an FBANK algorithm to obtain a speech feature vector sequence includes the following steps:
carrying out frequency spectrum calculation on each frame of signal in the target voice signal by adopting an FBANK algorithm to obtain a multidimensional vector corresponding to each frame of signal;
and integrating the multi-dimensional vectors corresponding to the frame signals to obtain a voice characteristic vector sequence.
Based on the first aspect, in some embodiments of the present invention, the speech recognition method further includes the steps of:
and acquiring and training the voice recognition model according to the voice data set and the text data set to obtain a target voice recognition model.
In some embodiments of the invention according to the first aspect, the phoneme sequence comprises an initial, a final and a tone.
In a second aspect, an embodiment of the present invention provides a speech recognition system, including a model building module, an audio acquisition module, a target detection module, a voice transformation module, a feature transformation module, a feature input module, a voice recognition module, and a phoneme recognition module, where:
the model building module is used for building a voice recognition model based on a Transformer model and a WFST model;
the audio acquisition module is used for acquiring an audio signal to be identified;
the target detection module is used for detecting the voice signal in the audio signal to be identified by adopting a VAD detection method so as to obtain a target voice signal;
the voice conversion module is used for converting the target voice signal by adopting an FBANK algorithm to obtain a voice characteristic vector sequence;
the feature transformation module is used for transforming the voice feature vector sequence by adopting a convolutional neural network to obtain a target voice feature sequence;
the characteristic input module is used for inputting the target voice characteristic sequence into the voice recognition model;
the voice recognition module is used for recognizing the target voice characteristic sequence through a Transformer model in the voice recognition model and outputting a phoneme sequence;
and the phoneme recognition module is used for inputting the phoneme sequence into a WFST model in the voice recognition model, outputting the Chinese character sequence and finishing the voice recognition.
To solve the technical problems of the prior art, namely high data procurement cost and a model framework that is difficult to modify, the system links a Transformer model and a WFST model through the combination of the model building module, audio acquisition module, target detection module, voice transformation module, feature transformation module, feature input module, voice recognition module and phoneme recognition module. The Transformer, a sequence-to-sequence model based on the attention mechanism, serves as the acoustic model from speech to phoneme sequence; the WFST, with its extremely fast decoding, serves as the language model from phoneme sequence to Chinese character sequence. No large expense on purchased training data is required: the models can be trained on conventional data, greatly reducing data procurement cost. The parameters and input interface of the Transformer model are modified so that it accepts voice data as input features, and the phoneme sequence (the ordered arrangement of initial consonant, final and tone) serves as the symbol set between the two models, ensuring consistency between the output of the first model and the input of the second. The voice recognition model obtained by combining the two models thus greatly improves both recognition efficiency and recognition accuracy.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory for storing one or more programs, and a processor. The one or more programs, when executed by the processor, implement the method of any one of the first aspect as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method according to any one of the first aspect described above.
The embodiment of the invention at least has the following advantages or beneficial effects:
the embodiment of the invention provides a voice recognition method and a system, which solve the technical problems that the model data purchase cost is high and the model frame is difficult to modify in the prior art, and the invention connects a Transformer model and a WFST model, wherein the Transformer is used as an acoustic model from voice to phoneme sequence, the sequence based on attention mechanism is used as a sequence model, and the WFST with extremely high decoding speed is used as a language model from phoneme sequence to Chinese character sequence, so that the large expense for purchasing training data is not needed, the model training can be carried out based on conventional data, and the data purchase cost is greatly reduced. Modifying parameters and an input interface of the Transformer model to enable the Transformer model to accept voice data as input characteristics; the phoneme sequence (the sequential arrangement of initial consonant, vowel and tone) is used as input and output symbols between the front model and the rear model, so that the consistency between the input and the output of data of the two models is ensured, the recognition efficiency is greatly improved, and the accuracy of voice recognition is greatly improved on the basis of the voice recognition model obtained by combining the two models.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered limiting of the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of model building in a speech recognition method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a voice signal transformation process in a voice recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a speech recognition system according to an embodiment of the present invention;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.
Reference numerals: 100. model building module; 200. audio acquisition module; 300. target detection module; 400. voice transformation module; 500. feature transformation module; 600. feature input module; 700. voice recognition module; 800. phoneme recognition module; 101. memory; 102. processor; 103. communication interface.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between them. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Examples
As shown in fig. 1, in a first aspect, an embodiment of the present invention provides a speech recognition method, including the following steps:
S1, constructing a voice recognition model based on a Transformer model and a WFST model;
S2, acquiring an audio signal to be recognized;
S3, detecting the voice signal in the audio signal to be recognized by a VAD (voice activity detection) method to obtain a target voice signal;
S4, converting the target voice signal with an FBANK algorithm to obtain a voice feature vector sequence;
S5, transforming the voice feature vector sequence with a convolutional neural network to obtain a target voice feature sequence;
S6, inputting the target voice feature sequence into the voice recognition model;
S7, recognizing the target voice feature sequence through the Transformer model in the voice recognition model, and outputting a phoneme sequence; the phoneme sequence comprises an initial consonant, a final and a tone.
S8, inputting the phoneme sequence into the WFST model in the voice recognition model, outputting a Chinese character sequence, and completing the voice recognition.
First, a hybrid speech recognition model comprising a Transformer model and a WFST model is constructed, so that subsequent recognition through the hybrid model is fast and accurate and yields an accurate character sequence. After the model is built, the audio signal to be recognized is acquired. Because the acquired signal may contain noise, a VAD (voice activity detection) method judges each frame and removes the noise, yielding the target voice signal. The FBANK algorithm then converts the target voice signal, producing one multidimensional vector per frame; concatenating these vectors yields the voice feature vector sequence. To ensure that the speech recognition model can accept the input, a convolutional neural network transforms the voice feature vector sequence into a target voice feature sequence matching the model input. The target voice feature sequence is then fed into the voice recognition model: the Transformer model first recognizes it and outputs a phoneme sequence, after which the WFST model analyzes and converts the phoneme sequence and outputs the final recognition result, a Chinese character sequence, completing the voice recognition.
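As a hedged illustration, the flow of steps S1 to S8 can be sketched as follows. Every function body is a deliberately crude placeholder (an energy threshold instead of the GMM-based VAD, spectrum pooling instead of Mel filtering, fixed outputs instead of the trained Transformer and WFST models); only the order of the stages mirrors the method.

```python
import numpy as np

def vad_filter(audio, frame_len=160):
    """S3: keep only frames judged to be speech (placeholder: energy threshold)."""
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > energy.mean() * 0.5]   # crude stand-in for GMM-based VAD

def fbank(frames, n_mels=40):
    """S4: one n_mels-dimensional vector per frame (placeholder spectrum statistic)."""
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # crude stand-in: pool the spectrum into n_mels groups instead of Mel filtering
    groups = np.array_split(spec, n_mels, axis=1)
    return np.stack([g.mean(axis=1) for g in groups], axis=1)

def cnn_transform(feats):
    """S5: subsample in time, as a convolutional front-end typically does."""
    return feats[::2]

def transformer_acoustic_model(feats):
    """S7: placeholder returning a fixed phoneme sequence (initial + final/tone)."""
    return ["n", "i3", "h", "ao3"]

def wfst_language_model(phonemes):
    """S8: placeholder phoneme-pair-to-character lookup."""
    table = {("n", "i3"): "你", ("h", "ao3"): "好"}
    return "".join(table[tuple(phonemes[i:i + 2])] for i in range(0, len(phonemes), 2))

def recognize(audio):
    frames = vad_filter(audio)                    # S3: target voice signal
    feats = fbank(frames)                         # S4: voice feature vector sequence
    feats = cnn_transform(feats)                  # S5: target voice feature sequence
    phonemes = transformer_acoustic_model(feats)  # S6/S7: phoneme sequence
    return wfst_language_model(phonemes)          # S8: Chinese character sequence
```

With these placeholders, `recognize` on any float audio array returns the toy transcription; in a real system each stage would be replaced by the trained component described in the text.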
To solve the technical problems of the prior art, namely high data procurement cost and a model framework that is difficult to modify, the invention connects a Transformer model and a WFST model. The Transformer, a sequence-to-sequence model based on the attention mechanism, serves as the acoustic model from speech to phoneme sequence; the WFST, with its extremely fast decoding, serves as the language model from phoneme sequence to Chinese character sequence. No large expense on purchased training data is required: the models can be trained on conventional data, greatly reducing data procurement cost. The parameters and input interface of the Transformer model are modified so that it accepts voice data as input features, and the phoneme sequence (the ordered arrangement of initial consonant, final and tone) serves as the symbol set between the two models, ensuring consistency between the output of the first model and the input of the second. The voice recognition model obtained by combining the two models thus greatly improves both recognition efficiency and recognition accuracy.
As shown in fig. 2, according to the first aspect, in some embodiments of the present invention, the above method for constructing a speech recognition model based on a Transformer model and a WFST model includes the following steps:
S11, modifying and adjusting the parameters and input interface of the Transformer model to obtain a target acoustic model; the parameters include the hidden layer dimension and the number of heads of the multi-head attention mechanism.
S12, constructing a target WFST model based on a pronunciation dictionary and an N-Gram statistical model;
S13, constructing a voice recognition model based on the target acoustic model and the target WFST model.
The invention constructs the acoustic model with a Transformer structure. The Transformer model was originally used for language translation: it is a sequence-to-sequence model based on the attention mechanism whose original input sequence is a character string. The invention improves the Transformer model by modifying and adjusting its parameters and input interface to obtain the target acoustic model, optimizing parameters such as the hidden layer dimension and the number of heads in the multi-head attention mechanism so that the model better fits the speech recognition task. The target acoustic model comprises an encoder and a decoder: the voice feature data enters the encoder, passes to the decoder, and is output as a phoneme sequence. A phoneme decomposes a pinyin syllable into three parts, initial consonant, final and tone, corresponding to the pronunciation of each character; a piece of speech passed through the target acoustic model is therefore output as a phoneme sequence. At the decoder output, a beam search algorithm is used: after each decoding step, only the 5 highest-probability candidates are kept.
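The top-5 pruning at the decoder output can be illustrated with a generic beam search. Here `step_probs` is a hypothetical stand-in for the Transformer decoder's per-step distribution over the phoneme vocabulary, not the patent's model.

```python
import math

def beam_search(step_probs, vocab, eos, beam_width=5, max_len=20):
    """Keep only the beam_width highest-scoring partial sequences per step,
    as the decoder in the text does with its top-5 pruning."""
    beams = [([], 0.0)]                      # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:       # finished hypotheses pass through
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)          # decoder distribution given this prefix
            for tok, p in zip(vocab, probs):
                candidates.append((seq + [tok], score + math.log(p)))
        # prune to the top beam_width hypotheses by accumulated log-probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(s and s[-1] == eos for s, _ in beams):
            break
    return beams[0][0]
```

A beam width of 5 matches the "top 5 candidates per step" pruning described above; widening the beam trades decoding speed for a lower risk of discarding the globally best sequence.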
The language model constructed by the invention combines an N-Gram statistical model with a pronunciation dictionary. Since the input is a phoneme sequence, a composite model is built: a Weighted Finite State Transducer (WFST) that establishes the conversion from phoneme sequence to Chinese character sequence. The WFST is constructed from two parts: a pronunciation dictionary (denoted L), mapping Chinese characters to phonemes, and an N-gram model (denoted G), mapping the first N Chinese characters to the (N+1)-th. Composing the two yields the model LG, a directed graph in which each node is a Chinese character with its corresponding phonemes, and a directed arc between two nodes represents their front-to-back adjacency. The value attached to an arc is the weight of the connection, a statistic of the data corresponding to the conditional probability of each Chinese character: the larger the weight, the higher the joint probability that the characters form a word or sentence. Finally, the Viterbi algorithm decodes the input phoneme sequence into the Chinese character sequence.
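As an illustration of the LG decoding, the following toy example composes a miniature pronunciation dictionary L with a bigram table G and runs a Viterbi pass over the resulting lattice. All entries and weights are invented for the example (and syllables stand in for the initial/final/tone phoneme triples); this is not the patent's model.

```python
import math

# Toy pronunciation dictionary L: pinyin syllable -> candidate characters.
L = {"gu4": ["故", "固"], "shi4": ["是", "事"]}
# Toy bigram table G: weight of a character following another; "<s>" = sentence start.
G = {("<s>", "故"): 0.6, ("<s>", "固"): 0.2, ("<s>", "是"): 0.1, ("<s>", "事"): 0.1,
     ("故", "事"): 0.7, ("故", "是"): 0.1, ("固", "事"): 0.1, ("固", "是"): 0.1}

def viterbi(phoneme_seq):
    """Best character path through the L∘G lattice; state = last emitted character."""
    states = {"<s>": (0.0, "")}                # last char -> (log score, best path)
    for phon in phoneme_seq:
        new_states = {}
        for prev, (score, path) in states.items():
            for char in L[phon]:               # expand via the dictionary L
                w = G.get((prev, char), 1e-4)  # small backoff for unseen bigrams
                s = score + math.log(w)        # arc weight from the bigram model G
                if s > new_states.get(char, (-math.inf, ""))[0]:
                    new_states[char] = (s, path + char)
        states = new_states
    return max(states.values())[1]             # highest-scoring complete path
```

Here `viterbi(["gu4", "shi4"])` picks "故事" over the competing homophone paths because the ("故", "事") arc carries the largest bigram weight.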
The voice recognition model constructed based on the target acoustic model and the target WFST model greatly improves the adaptability and also improves the accuracy of data recognition.
Based on the first aspect, in some embodiments of the present invention, the method for detecting a speech signal in an audio signal to be recognized by using a VAD detection method to obtain a target speech signal includes the following steps:
judging, by a VAD (voice activity detection) method, whether each frame of the audio signal to be recognized is a voice signal; if so, extracting the frame as part of the target voice signal and proceeding to the next frame; if not, discarding the frame and proceeding to the next frame, until all frames have been judged.
The audio signal to be recognized is divided into a sequence of frames of fixed duration (10 ms), and VAD (voice activity detection) judges whether each frame is voice or noise; only the continuous voice signal extracted from the audio needs to be recognized. The VAD analyzes the energies of 6 sub-bands of the spectrum. Each sub-band energy is treated as a weighted sum of 2 Gaussian random variables, so it is modeled by a Gaussian Mixture Model (GMM), and the parameters of the model are estimated from data samples with the EM algorithm. Once the parameters are determined, the type of a frame can be judged, i.e. whether it is a voice signal or noise; the mean and variance of the current data sample are then used to update the model parameters for analysis of the next frame. Eliminating the noise greatly reduces the redundancy of the subsequent recognition data and provides more accurate voice data for recognition.
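A simplified sketch of this frame decision, under two stated simplifications: a single log-energy feature per frame instead of 6 sub-band energies, and a two-component 1-D Gaussian mixture fitted once by EM instead of the frame-by-frame parameter update. Frames assigned to the higher-mean component are kept as speech.

```python
import numpy as np

def fit_gmm2(x, iters=50):
    """EM for a 1-D two-component Gaussian mixture (speech vs. noise energies)."""
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initialization
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        like = pi / np.sqrt(2 * np.pi * var) * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from responsibilities
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

def vad(frames):
    """Keep frames whose log-energy is more likely under the high-energy component."""
    log_e = np.log((frames ** 2).mean(axis=1) + 1e-10)
    pi, mu, var = fit_gmm2(log_e)
    speech = int(np.argmax(mu))                       # higher-mean component = speech
    like = pi / np.sqrt(2 * np.pi * var) * np.exp(-(log_e[:, None] - mu) ** 2 / (2 * var))
    return frames[like.argmax(axis=1) == speech]
```

On synthetic input with clearly quiet and loud frames, the EM fit separates the two energy clusters and `vad` retains essentially only the loud (speech-like) frames.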
As shown in fig. 3, according to the first aspect, in some embodiments of the present invention, the method for transforming the target speech signal with the FBANK algorithm to obtain a speech feature vector sequence includes the following steps:
S41, performing spectrum calculation on each frame of the target voice signal with the FBANK algorithm to obtain a multidimensional vector corresponding to each frame;
S42, integrating the multidimensional vectors corresponding to the frames to obtain the speech feature vector sequence.
The FBANK algorithm transforms the voice signal, producing one multidimensional vector per frame; the vectors are then connected into the vector sequence. FBANK performs spectrum calculation on each frame to obtain the proportion of each frequency component, with the frequency intervals divided on the Mel scale, which matches the resolution of the human ear across sound frequencies. The spectrum obtained by the FFT is rich in frequency components and still contains redundant information, so it is passed through Mel filtering and grouped statistics over 40 frequency bands to obtain a simplified feature vector.
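A minimal numpy sketch of the FBANK computation just described: magnitude spectrum per frame, a 40-band triangular Mel filterbank, then log energies. The framing parameters (16 kHz, 25 ms frames, 10 ms hop, 512-point FFT) are common defaults assumed for the example, not values specified by the patent.

```python
import numpy as np

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the Mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                    # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                    # falling edge of the triangle
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def fbank_features(signal, sr=16000, frame_len=400, hop=160, n_fft=512, n_filters=40):
    """One 40-dimensional log-Mel vector per 25 ms frame (10 ms hop)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    frames *= np.hamming(frame_len)              # window to reduce spectral leakage
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2
    fb = mel_filterbank(n_filters, n_fft, sr)
    return np.log(power @ fb.T + 1e-10)          # shape: (n_frames, 40)
```

One second of 16 kHz audio yields 98 frames of 40 log-Mel energies, i.e. the "one multidimensional vector per frame" sequence described above.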
Based on the first aspect, in some embodiments of the present invention, the speech recognition method further includes the steps of:
and acquiring and training the voice recognition model according to the voice data set and the text data set to obtain a target voice recognition model.
To ensure the recognition effect of the model, before performing speech recognition with it, a large number of speech and text data sets (such as the AISHELL test set) are first obtained; the model is then trained on these data sets, finally yielding a target speech recognition model capable of accurate recognition.
As shown in fig. 4, in a second aspect, an embodiment of the present invention provides a speech recognition system, which includes a model building module 100, an audio acquisition module 200, a target detection module 300, a voice transformation module 400, a feature transformation module 500, a feature input module 600, a voice recognition module 700, and a phoneme recognition module 800, where:
the model building module 100 is used for building a voice recognition model based on a Transformer model and a WFST model;
the audio acquisition module 200 is configured to acquire an audio signal to be identified;
the target detection module 300 is configured to detect a voice signal in an audio signal to be recognized by using a VAD detection method to obtain a target voice signal;
a voice transformation module 400, configured to transform the target voice signal by using an FBANK algorithm to obtain a voice feature vector sequence;
the feature transformation module 500 is configured to transform the voice feature vector sequence by using a convolutional neural network to obtain a target voice feature sequence;
a feature input module 600, configured to input a target speech feature sequence into a speech recognition model;
the speech recognition module 700 is configured to recognize a target speech feature sequence through a Transformer model in a speech recognition model, and output a phoneme sequence;
and the phoneme recognition module 800 is used for inputting the phoneme sequence into a WFST model in the voice recognition model, outputting the Chinese character sequence and completing voice recognition.
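The embodiment names VAD for the target detection module 300 without fixing a particular detector. A common minimal realization, shown below purely as an illustrative assumption (real systems often use statistical or neural detectors), keeps frames whose energy exceeds a threshold and discards the rest, mirroring the frame-by-frame keep/discard logic of this embodiment:

```python
# Energy-threshold VAD sketch (assumption: the embodiment only names "VAD";
# the frame length and threshold below are illustrative, not claimed values).

def frame_signal(samples, frame_len):
    """Split a sample list into non-overlapping frames (trailing partial frame dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def vad_select(samples, frame_len=160, threshold=0.01):
    """Return the concatenation of frames judged to contain speech."""
    voiced = []
    for frame in frame_signal(samples, frame_len):
        energy = sum(x * x for x in frame) / frame_len   # mean-square energy
        if energy > threshold:                           # speech frame: keep it
            voiced.extend(frame)
        # else: silence frame, discarded
    return voiced

if __name__ == "__main__":
    silence = [0.0] * 160
    speech = [0.5, -0.5] * 80        # one high-energy frame
    target = vad_select(silence + speech + silence)
    print(len(target))               # only the 160 speech samples survive
```

At 16 kHz sampling, a 160-sample frame corresponds to 10 ms, a typical VAD frame size.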
To solve the technical problems in existing speech recognition that model training data are expensive to purchase and the model framework is difficult to modify, the system links a Transformer model and a WFST model through the combination of the model construction module 100, the audio acquisition module 200, the target detection module 300, the speech transformation module 400, the feature transformation module 500, the feature input module 600, the speech recognition module 700 and the phoneme recognition module 800. The Transformer, an attention-based sequence-to-sequence model, serves as the acoustic model that maps speech to a phoneme sequence, and the WFST, whose decoding is extremely fast, serves as the language model that maps the phoneme sequence to a Chinese character sequence. No large expenditure on purchased training data is required: the model can be trained on conventional data, which greatly reduces the data acquisition cost. The parameters and the input interface of the Transformer model are modified so that it accepts speech data as input features, and the phoneme sequence (an ordered arrangement of initial, final and tone) is used as the symbol set exchanged between the two models, which keeps the data input and output of the two models consistent, greatly improves recognition efficiency, and, on the basis of the speech recognition model obtained by combining the two models, greatly improves recognition accuracy.
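The phoneme interface described above (initial, final, tone) can be made concrete with a tone-numbered pinyin syllable. The sketch below is an illustrative assumption, not the pronunciation dictionary of this embodiment; it uses the standard Mandarin initial inventory (treating "y" and "w" as initials, a common but not universal convention) to split a syllable such as "zhong1" into ("zh", "ong", "1"):

```python
# Sketch (assumption): splitting a tone-numbered pinyin syllable into
# (initial, final, tone). Syllables with no initial (e.g. "an4") yield "".

INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)           # try "zh" before "z", etc.

def split_syllable(syllable):
    """'zhong1' -> ('zh', 'ong', '1')."""
    tone = syllable[-1] if syllable[-1].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for ini in INITIALS:
        if body.startswith(ini):
            return ini, body[len(ini):], tone
    return "", body, tone            # zero-initial syllable

if __name__ == "__main__":
    print(split_syllable("zhong1"))  # ('zh', 'ong', '1')
    print(split_syllable("an4"))     # ('', 'an', '4')
```

Emitting such triples from the acoustic model and consuming them in the WFST keeps the symbol tables of the two models aligned, which is the consistency property the paragraph above relies on.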
As shown in fig. 5, in a third aspect, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs and a processor 102. When the one or more programs are executed by the processor 102, the method of any one of the embodiments of the first aspect is implemented.
Also included is a communication interface 103, and the memory 101, processor 102 and communication interface 103 are electrically connected to each other, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to thereby execute various functional applications and data processing. The communication interface 103 may be used for communicating signaling or data with other node devices.
The memory 101 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 102 may be an integrated circuit chip having signal processing capabilities. It may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In the embodiments provided in the present application, it should be understood that the disclosed method and system can be implemented in other ways. The embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in a block may occur out of the order noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks therein, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the program is executed by the processor 102, the method according to any one of the embodiments of the first aspect is implemented. If the functions are implemented in the form of software functional modules and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the portion of the technical solution of the present application that in essence contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A speech recognition method, comprising the steps of:
constructing a speech recognition model based on a Transformer model and a WFST model;
acquiring an audio signal to be recognized;
detecting a speech signal in the audio signal to be recognized by a VAD (voice activity detection) method to obtain a target speech signal;
transforming the target speech signal by an FBANK algorithm to obtain a speech feature vector sequence;
transforming the speech feature vector sequence by a convolutional neural network to obtain a target speech feature sequence;
inputting the target speech feature sequence into the speech recognition model;
recognizing the target speech feature sequence by the Transformer model in the speech recognition model, and outputting a phoneme sequence; and
inputting the phoneme sequence into the WFST model in the speech recognition model, and outputting a Chinese character sequence to complete the speech recognition.
2. The speech recognition method according to claim 1, wherein constructing the speech recognition model based on the Transformer model and the WFST model comprises the following steps:
modifying and adjusting the parameters and the input interface of the Transformer model to obtain a target acoustic model;
constructing a target WFST model based on a pronunciation dictionary and an N-gram statistical model; and
constructing the speech recognition model based on the target acoustic model and the target WFST model.
3. The speech recognition method according to claim 2, wherein the parameters include a hidden-layer dimension and a number of heads in the multi-head attention mechanism.
4. The speech recognition method according to claim 1, wherein detecting the speech signal in the audio signal to be recognized by the VAD detection method to obtain the target speech signal comprises the following steps:
judging, by the VAD detection method, whether each frame of the audio signal to be recognized is a speech signal; if so, extracting the frame as part of the target speech signal and proceeding to judge the next frame, until all frames have been judged; if not, discarding the frame and proceeding to judge the next frame, until all frames have been judged.
5. The speech recognition method according to claim 1, wherein transforming the target speech signal by the FBANK algorithm to obtain the speech feature vector sequence comprises the following steps:
performing spectrum calculation on each frame of the target speech signal by the FBANK algorithm to obtain a multidimensional vector corresponding to each frame; and
integrating the multidimensional vectors corresponding to the frames to obtain the speech feature vector sequence.
6. The speech recognition method according to claim 1, further comprising the steps of:
acquiring a speech data set and a text data set, and training the speech recognition model on them to obtain a target speech recognition model.
7. The speech recognition method according to claim 1, wherein the phoneme sequence comprises initials, finals and tones.
8. A speech recognition system, comprising a model construction module, an audio acquisition module, a target detection module, a speech transformation module, a feature transformation module, a feature input module, a speech recognition module and a phoneme recognition module, wherein:
the model construction module is used for constructing a speech recognition model based on a Transformer model and a WFST model;
the audio acquisition module is used for acquiring an audio signal to be recognized;
the target detection module is used for detecting a speech signal in the audio signal to be recognized by a VAD detection method to obtain a target speech signal;
the speech transformation module is used for transforming the target speech signal by an FBANK algorithm to obtain a speech feature vector sequence;
the feature transformation module is used for transforming the speech feature vector sequence by a convolutional neural network to obtain a target speech feature sequence;
the feature input module is used for inputting the target speech feature sequence into the speech recognition model;
the speech recognition module is used for recognizing the target speech feature sequence by the Transformer model in the speech recognition model and outputting a phoneme sequence; and
the phoneme recognition module is used for inputting the phoneme sequence into the WFST model in the speech recognition model and outputting a Chinese character sequence, thereby completing the speech recognition.
9. An electronic device, comprising:
a memory for storing one or more programs;
a processor;
wherein the one or more programs, when executed by the processor, implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202210018178.3A 2022-01-07 2022-01-07 Voice recognition method and system Withdrawn CN114360525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210018178.3A CN114360525A (en) 2022-01-07 2022-01-07 Voice recognition method and system


Publications (1)

Publication Number Publication Date
CN114360525A true CN114360525A (en) 2022-04-15

Family

ID=81108090


Country Status (1)

Country Link
CN (1) CN114360525A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20220415