CN111951796A - Voice recognition method and device, electronic equipment and storage medium - Google Patents

Voice recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111951796A
CN111951796A (application number CN202010838352.XA)
Authority
CN
China
Prior art keywords
voice
feature
speech
acoustic model
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010838352.XA
Other languages
Chinese (zh)
Other versions
CN111951796B (en)
Inventor
单亚慧
李�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010838352.XA priority Critical patent/CN111951796B/en
Publication of CN111951796A publication Critical patent/CN111951796A/en
Application granted granted Critical
Publication of CN111951796B publication Critical patent/CN111951796B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to a voice recognition method and apparatus, an electronic device, and a storage medium, wherein the voice recognition method includes: acquiring an original voice signal; denoising the original voice signal to obtain an enhanced voice signal; respectively extracting the voice features of the original voice signal to obtain first voice features, and extracting the voice features of the enhanced voice signal to obtain second voice features; jointly processing the first voice characteristic and the second voice characteristic by using a pre-trained acoustic model to obtain a combined state sequence; and decoding the combined state sequence to obtain a voice recognition result. The original voice signal and the voice signal after noise reduction are processed together, so that the accuracy of voice recognition is improved.

Description

Voice recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence, more and more intelligent devices and fields start to apply voice recognition technology to carry out human-computer interaction and the like.
In the related art, to achieve more robust speech recognition, noise reduction is performed on the speech before recognition, and the denoised speech is then sent to the recognition system. Alternatively, the recognition of noisy speech is improved by optimizing the acoustic model in the recognition system.
Although both methods can improve the recognition of noisy speech to a certain extent, in the first method the speech noise reduction system introduces distortion, so the recognition of clean speech and of speech with a high signal-to-noise ratio actually becomes worse, while merely optimizing the acoustic model brings only a very limited improvement for speech with a low signal-to-noise ratio. Therefore, existing voice recognition methods cannot achieve a good recognition effect on both speech with a high signal-to-noise ratio and speech with a low signal-to-noise ratio.
Disclosure of Invention
The present disclosure provides a speech recognition method and apparatus, an electronic device, and a storage medium, to at least solve the problem in the related art that speech with a high signal-to-noise ratio and speech with a low signal-to-noise ratio cannot both be recognized accurately. The technical solutions of the present disclosure are as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
acquiring an original voice signal;
denoising the original voice signal to obtain an enhanced voice signal;
respectively extracting the voice features of the original voice signal to obtain first voice features, and extracting the voice features of the enhanced voice signal to obtain second voice features;
jointly processing the first voice characteristic and the second voice characteristic by using a pre-trained acoustic model to obtain a combined state sequence;
and decoding the combined state sequence to obtain a voice recognition result.
Optionally, in the foregoing speech recognition method, before the jointly processing the first speech feature and the second speech feature by using a pre-trained acoustic model to obtain a combined state sequence, the method further includes:
splicing the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein, the jointly processing the first voice feature and the second voice feature by using the acoustic model trained in advance to obtain a combined state sequence comprises:
and processing the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in the foregoing speech recognition method, the acoustic model includes a public network and two sub-networks, and the jointly processing the first speech feature and the second speech feature by using the acoustic model trained in advance to obtain a combined state sequence includes:
calculating the first voice feature in a sub-network of the acoustic model trained in advance, and calculating the second voice feature in another sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature;
and utilizing a pre-trained public network of the acoustic model to jointly calculate the optimized first voice feature and the optimized second voice feature to obtain a combined state sequence.
Optionally, in the above speech recognition method, the training method of the acoustic model includes:
respectively training two sub-networks of the acoustic model layer by layer; the training sample of one sub-network of the acoustic model is the voice feature of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice feature of an enhanced voice signal after noise reduction;
and taking the outputs of the two sub-networks of the trained acoustic model as the inputs of the public network, and training the public network layer by layer.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
an acquisition unit configured to perform acquisition of an original voice signal;
the noise reduction unit is configured to perform noise reduction on the original voice signal to obtain an enhanced voice signal;
a feature extraction unit configured to perform respective extraction of speech features of the original speech signal to obtain a first speech feature, and extraction of speech features of the enhanced speech signal to obtain a second speech feature;
the feature processing unit is configured to perform common processing on the first voice feature and the second voice feature by using the acoustic model trained in advance to obtain a combined state sequence;
and the decoding unit is configured to decode the combined state sequence to obtain a voice recognition result.
Optionally, in the above speech recognition apparatus, the speech recognition apparatus further includes:
the splicing unit is configured to splice the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein the feature processing unit includes:
and the first feature processing unit is configured to execute processing on the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in the above-mentioned speech recognition apparatus, the acoustic model includes a public network and two sub-networks, and the feature processing unit includes:
a second feature processing unit, configured to compute the first speech feature in one sub-network of the acoustic model trained in advance, compute the second speech feature in another sub-network of the acoustic model to obtain the optimized first speech feature and the optimized second speech feature, and compute the optimized first speech feature and the optimized second speech feature together by using a common network of the acoustic model trained in advance to obtain a combined state sequence.
Optionally, in the above speech recognition apparatus, the speech recognition apparatus further includes:
a sub-network training unit configured to perform layer-by-layer training on two sub-networks of the acoustic model, respectively; the training sample of one sub-network of the acoustic model is the voice feature of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice feature of an enhanced voice signal after noise reduction;
and the public network training unit is configured to perform layer-by-layer training on the public network by taking the outputs of the two sub-networks of the trained acoustic model as the input of the public network.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement a speech recognition method as claimed in any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an electronic device, enable the electronic device to perform a speech recognition method as in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product for performing any one of the above-described speech recognition methods when the computer program product is executed.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of obtaining an original voice signal, reducing noise of the original voice signal to obtain an enhanced voice signal, then respectively extracting voice features of the original voice signal and the enhanced voice signal to obtain voice features corresponding to the original voice signal and voice features corresponding to the enhanced voice signal, commonly processing the voice features corresponding to the original voice signal and the voice features corresponding to the enhanced voice signal by utilizing a pre-trained acoustic model to obtain a combined state sequence, and finally decoding the bound state sequence to obtain a voice recognition result. Because the acoustic model processes the enhanced voice signal after noise reduction and the original voice signal with the retained noise together, the problem of inaccurate identification caused by distortion after noise reduction can be avoided, and the acoustic model trained in advance is optimized, so that accurate identification can be carried out on voice with high signal-to-noise ratio or voice with low signal-to-noise ratio.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of speech recognition according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a process by which a speech recognition method is implemented in a speech recognition system according to an example embodiment;
FIG. 3 is a flow diagram illustrating another method of speech recognition according to an example embodiment;
FIG. 4 is a block diagram illustrating an acoustic model in accordance with an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method of training an acoustic model in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating a speech recognition apparatus according to an example embodiment;
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flow chart illustrating a speech recognition method according to an exemplary embodiment, as shown in fig. 1, the speech recognition method comprising the following steps.
In step S101, an original speech signal is acquired.
The original speech signal refers to an audio signal that has not undergone noise reduction. Specifically, it may be an audio signal recorded by a microphone without any processing, or an audio signal that was recorded in advance and has undergone only simple preprocessing rather than noise reduction, for example an audio signal whose leading and trailing silence has been trimmed, or an audio signal that has been format-converted but not noise-reduced. Accordingly, the original voice signal may be captured by a microphone at the time of recognition, or a pre-stored original voice signal may be read from a hard disk or a memory.
In step S102, noise reduction is performed on the original speech signal to obtain an enhanced speech signal.
It should be noted that, since the subsequent steps also use the original speech signal, noise reduction must produce the enhanced speech signal while the original speech signal is retained. Specifically, the original speech signal is copied, and noise reduction is performed on one copy, so that the enhanced speech signal is obtained while the original speech signal is retained.
Because the environment in which the speech signal is collected is usually not ideal, the obtained original speech signal usually contains some noise, and the lower the signal-to-noise ratio of the original speech signal, that is, the smaller the ratio of the speech to be recognized to the noise, the greater the impact on speech recognition. It is therefore necessary to reduce the noise of the original speech signal to obtain an enhanced speech signal and reduce the influence of noise on the recognition result. Noise reduction, however, causes distortion, so the embodiment of the present disclosure performs speech recognition using both the original speech signal without noise reduction and the enhanced speech signal obtained after noise reduction.
Specifically, the original speech signal may be denoised by a noise reduction model. Optionally, the noise reduction model may perform noise reduction based on an adaptive filter, or it may perform speech noise reduction based on a noise reduction algorithm such as spectral subtraction or Wiener filtering.
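As a purely illustrative sketch of one of the approaches mentioned above, the following Python snippet outlines noise reduction by spectral subtraction; the function name, the frame length, the hop size and the assumption that the first few frames contain only noise are choices made for this example and do not come from the patent.

import numpy as np

def spectral_subtraction(signal, sample_rate, frame_len=400, hop=160, noise_frames=10):
    """Very simplified spectral-subtraction noise reduction (illustrative only)."""
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    magnitude, phase = np.abs(spectra), np.angle(spectra)
    # Estimate the noise magnitude from the first few (assumed non-speech) frames.
    noise_mag = magnitude[:noise_frames].mean(axis=0)
    # Subtract the noise estimate from every frame and floor the result at zero.
    clean_mag = np.maximum(magnitude - noise_mag, 0.0)
    # Rebuild the enhanced time-domain signal by overlap-add.
    enhanced = np.zeros(len(signal))
    for k, mag in enumerate(clean_mag):
        frame = np.fft.irfft(mag * np.exp(1j * phase[k]), n=frame_len)
        enhanced[k * hop:k * hop + frame_len] += frame
    return enhanced

A Wiener filter or an adaptive filter would replace the subtraction step with a frequency-dependent gain, but the overall frame-analyze-resynthesize structure stays the same.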
In step S103, the speech features of the original speech signal are respectively extracted to obtain a first speech feature, and the speech features of the enhanced speech signal are extracted to obtain a second speech feature.
It should be noted that before feature extraction is performed on the original speech signal or the enhanced speech signal, the speech signal needs to be framed, that is, a long speech signal is divided into multiple frames of relatively short speech signals, with a frame length of usually 20 to 50 milliseconds. To avoid weakening the signal at the junction between frames and thereby losing the information there, adjacent frames must overlap. Specifically, the next frame of the preset frame length starts at a position shifted by a preset time length from the start of the previous frame, that is, the start times of two adjacent frames differ by the preset time length. This preset shift (the frame shift) is smaller than the frame length and is generally set to 10 milliseconds.
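For illustration only, a minimal framing sketch consistent with the figures above (25-millisecond frames with a 10-millisecond frame shift); the function name and the exact frame length within the 20 to 50 millisecond range are assumptions.

import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a waveform into overlapping frames (illustrative sketch)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 25 ms per frame
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 10 ms frame shift
    starts = range(0, len(signal) - frame_len + 1, frame_shift)
    # Adjacent frames overlap by (frame_len - frame_shift) samples, so the
    # signal at the junction between frames is not lost.
    return np.stack([signal[s:s + frame_len] for s in starts])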
Specifically, extracting the speech features of the original speech signal means extracting the speech features of each frame of the original speech signal; likewise, extracting the speech features of the enhanced speech signal means extracting the speech features of each frame of the enhanced speech signal. Optionally, the extracted speech features may be Mel-Frequency Cepstral Coefficients (MFCC). Specifically, the framed speech signal is windowed, and a fast Fourier transform is applied to the data within the window to obtain the corresponding spectrum. The spectrum is passed through a Mel filter bank to obtain a Mel spectrum, and cepstral analysis is finally performed on the Mel spectrum to obtain the Mel-frequency cepstral coefficients of the speech signal. Of course, extracting Mel-frequency cepstral coefficients is only one option; other types of features may be extracted, such as FBank features or Perceptual Linear Prediction (PLP) features. However, because the speech features of the original speech signal and those of the enhanced speech signal are processed together, the same type of speech feature must be extracted from both signals.
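The MFCC pipeline just described (windowing, fast Fourier transform, Mel filter bank, cepstral analysis) can be sketched as follows; the filter-bank construction is simplified, and the numbers of filters, cepstral coefficients and FFT points are assumptions made for the example only.

import numpy as np
from scipy.fftpack import dct

def mfcc(frames, sample_rate, n_filters=26, n_mfcc=13, n_fft=512):
    """Compute MFCCs from framed speech (simplified, illustrative)."""
    # Windowing followed by an FFT gives the power spectrum of each frame.
    windowed = frames * np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2

    # Triangular Mel filter bank spaced evenly on the Mel scale.
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    mel_pts = np.linspace(0.0, high_mel, n_filters + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log Mel spectrum, then cepstral analysis via the discrete cosine transform.
    mel_spec = np.log(power @ fbank.T + 1e-10)
    return dct(mel_spec, type=2, axis=1, norm='ortho')[:, :n_mfcc]

FBank features would stop before the DCT step, which is one reason the same code path is often shared between the two feature types.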
In step S104, the first speech feature and the second speech feature are processed together by using the acoustic model trained in advance, so as to obtain a combined state sequence.
It should be noted that, in the embodiment of the present disclosure, the first speech feature and the second speech feature are treated as a whole and processed together by the acoustic model, rather than being processed separately one after the other. Accordingly, when the acoustic model is trained, it is trained on combinations of the first speech features of a plurality of original speech signals and the second speech features of the corresponding noise-reduced enhanced speech signals. The parameters of the acoustic model are adjusted continuously until it can output a combined state sequence consistent with the known state sequence of the training sample, that is, until the acoustic model outputs the state sequence corresponding to the first speech feature and the second speech feature.
Specifically, while processing the speech features, the acoustic model computes the state sequence corresponding to the speech feature of each frame of the speech signal. A state sequence is an ordered sequence of states, where a state in speech recognition can be understood as a speech unit finer than a phoneme. More specifically, the pronunciation of a word is composed of phonemes; for example, a commonly used phoneme set for English is the 39-phoneme set of Carnegie Mellon University, while for Chinese the complete initials and finals are generally used directly as the phoneme set. A state is a phonetic unit one level below the phoneme, that is, a phoneme is composed of states, and one phoneme is generally divided into 3 states.
It should further be noted that when speech recognition is built on monophones, the number of modeling units is small but the pronunciation of a phoneme is influenced by its context, so modeling is nowadays generally based on triphones, that is, the preceding phoneme and the following phoneme of a phoneme are taken into account during modeling, and the state sequence obtained by tying the states corresponding to the three phonemes is the combined state sequence.
Optionally, in order to enable the acoustic model to process the first speech feature and the second speech feature together, in another embodiment of the present disclosure, before performing step S104, the method further includes: and splicing the first voice characteristic and the second voice characteristic to obtain a spliced voice characteristic.
Specifically, the second speech feature may be spliced onto the back of the first speech feature to obtain a higher-dimensional spliced speech feature. For example, if the first speech feature has 1024 dimensions and the corresponding second speech feature also has 1024 dimensions, the spliced speech feature has 2048 dimensions. After the spliced speech feature is obtained, it is input into the pre-trained acoustic model, so that the acoustic model processes the spliced speech feature. Because the input is a single overall speech feature formed by splicing the first speech feature and the second speech feature, joint processing of the two features is effectively guaranteed rather than processing them separately. In this case, step S104 is implemented by processing the spliced speech feature with the acoustic model to obtain the combined state sequence corresponding to the spliced speech feature.
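A minimal sketch of the splicing step, using dummy 1024-dimensional features so the 2048-dimensional result from the example above can be checked; the variable names and frame count are assumptions.

import numpy as np

# Illustrative dummy per-frame features for the original and the enhanced signal.
num_frames = 100
first_feats = np.random.randn(num_frames, 1024)   # features of the original signal
second_feats = np.random.randn(num_frames, 1024)  # features of the enhanced signal

# Splice the second feature onto the back of the first, frame by frame.
spliced_feats = np.concatenate([first_feats, second_feats], axis=1)
assert spliced_feats.shape == (num_frames, 2048)  # 1024 + 1024 dimensions per frame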
In step S105, the combined state sequence is decoded to obtain a speech recognition result.
The decoding process specifically comprises: determining the phoneme corresponding to each state according to the correspondence between phonemes and states, and then looking up the word corresponding to the phoneme using the preset phoneme-to-word correspondence in the dictionary. Because polyphones mean that a phoneme sequence may correspond to several words, and the words obtained in this way are still relatively independent rather than complete phrases and sentences, a language model trained in advance on linguistic theory is further used to compute the word sequence with the maximum probability for the state sequence, which serves as the recognition result. The role of the language model can therefore be understood simply as resolving polyphones: after the acoustic model gives the pronunciation sequence, the character sequence with the highest probability is found among the candidate character sequences. Optionally, decoding the combined state sequence may be implemented with the Viterbi algorithm.
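A bare-bones Viterbi search over per-frame state scores might look as follows; it is a sketch under the assumption that log emission scores, a log transition matrix and log initial probabilities are already available, and it ignores the dictionary lookup and language-model weighting that a real decoder would include.

import numpy as np

def viterbi(log_emission, log_transition, log_initial):
    """Return the most likely state path for per-frame log scores (sketch)."""
    num_frames, num_states = log_emission.shape
    score = log_initial + log_emission[0]
    backpointer = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        # For every target state, keep the best-scoring predecessor state.
        candidates = score[:, None] + log_transition
        backpointer[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0) + log_emission[t]
    # Trace the best path back from the best final state.
    path = [int(score.argmax())]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]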
The process by which the speech recognition method provided by the present disclosure runs in an actual speech recognition system is shown in fig. 2: an original speech signal is first obtained; noise reduction is then performed on it through a noise reduction model to obtain an enhanced speech signal; the enhanced speech signal and the original speech signal are respectively input into a feature processing model for feature extraction, which outputs the first speech feature corresponding to the original speech signal and the second speech feature corresponding to the enhanced speech signal; the first speech feature and the second speech feature are then input into the acoustic model together for joint processing, yielding the state sequence corresponding to the two features; and finally a decoding search is performed on the state sequence based on the language model, that is, the state sequence is decoded, and the final recognition result is output.
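Putting the pieces together, the flow of fig. 2 can be sketched as below; it reuses the hypothetical helpers from the earlier sketches (spectral_subtraction, frame_signal, mfcc), takes the pre-trained acoustic model and the decoder as arguments, and is only an illustration of the data flow rather than the patented system itself.

import numpy as np
import torch

def recognize(raw_signal, sample_rate, acoustic_model, decode_fn):
    """End-to-end flow of fig. 2 (illustrative; helpers are the sketches above)."""
    # 1. Noise reduction: keep the original signal and produce an enhanced copy.
    enhanced = spectral_subtraction(raw_signal, sample_rate)
    # 2. Extract the same type of feature from both signals.
    first_feat = mfcc(frame_signal(raw_signal, sample_rate), sample_rate)
    second_feat = mfcc(frame_signal(enhanced, sample_rate), sample_rate)
    # 3. Process the two features jointly with the pre-trained acoustic model.
    n = min(len(first_feat), len(second_feat))   # align the frame counts
    with torch.no_grad():
        state_scores = acoustic_model(
            torch.tensor(first_feat[:n], dtype=torch.float32).unsqueeze(0),
            torch.tensor(second_feat[:n], dtype=torch.float32).unsqueeze(0))
    # 4. Decode the combined state sequence (e.g. Viterbi search plus language model).
    return decode_fn(state_scores.squeeze(0).numpy())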
The speech recognition method provided by the embodiment of the present disclosure includes, after an original speech signal is obtained, denoising the original speech signal to obtain an enhanced speech signal, then extracting speech features of the original speech signal and the enhanced speech signal respectively to obtain a speech feature corresponding to the original speech signal and a speech feature corresponding to the enhanced speech signal, performing common processing on the speech feature corresponding to the original speech signal and the speech feature corresponding to the enhanced speech signal by using a pre-trained acoustic model to obtain a combined state sequence, and finally decoding the combined state sequence to obtain a speech recognition result. Because the acoustic model is used for carrying out common processing on the enhanced voice signal after noise reduction and the original voice signal with the retained noise, the problem of inaccurate recognition caused by distortion after noise reduction can be avoided, and the acoustic model which is trained in advance is optimized, so that accurate recognition can be carried out on the voice with high signal-to-noise ratio or the voice with low signal-to-noise ratio.
FIG. 3 is a flow diagram illustrating another speech recognition method according to an exemplary embodiment, as shown in FIG. 3, the speech recognition method including the following steps.
In step S301, an original speech signal is acquired.
It should be noted that, the specific implementation manner of step S301 may refer to step S101 in the foregoing method embodiment, and details are not described here.
In step S302, noise reduction is performed on the original speech signal to obtain an enhanced speech signal.
It should be noted that, the specific implementation manner of step S302 may refer to step S102 in the foregoing method embodiment, and details are not described here.
In step S303, the speech features of the original speech signal are respectively extracted to obtain a first speech feature, and the speech features of the enhanced speech signal are extracted to obtain a second speech feature.
It should be noted that, the specific implementation manner of step S303 may refer to step S103 in the foregoing method embodiment, and details are not described here.
In step S304, a first speech feature is calculated in a sub-network of the pre-trained acoustic model, and a second speech feature is calculated in another sub-network of the acoustic model, so as to obtain an optimized first speech feature and an optimized second speech feature.
It should be noted that, in the embodiment of the present disclosure, the acoustic model used to process the first speech feature and the second speech feature is composed of two sub-networks and a public network. Referring to fig. 4, the acoustic model includes a sub-network 401, a sub-network 402, and a public network 403. The two sub-networks respectively optimize the first speech feature and the second speech feature so that they better reflect the characteristics of the noisy original speech signal and of the enhanced speech signal, the state sequence output by the acoustic model becomes more accurate, and the final recognition result becomes more accurate. Each sub-network may be one or more layers of a deep neural network, for example a Time Delay Neural Network (TDNN) or a Convolutional Neural Network (CNN). The public network may be a Long Short-Term Memory network (LSTM), a bidirectional LSTM (BLSTM), or another network that can be used to implement an acoustic model.
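A possible shape of such a model, sketched in PyTorch; the layer sizes, the use of plain fully connected sub-networks instead of TDNN or CNN layers, and the number of tied states are assumptions made for the example, since the patent does not fix them.

import torch
import torch.nn as nn

class JointAcousticModel(nn.Module):
    """Two feature sub-networks feeding one shared public network (a sketch)."""

    def __init__(self, feat_dim=1024, hidden=512, num_states=3000):
        super().__init__()
        # Sub-network 401: optimizes features of the original (noisy) signal.
        self.sub_original = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Sub-network 402: optimizes features of the noise-reduced signal.
        self.sub_enhanced = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Public network 403: processes both optimized features jointly.
        self.public = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_states)

    def forward(self, first_feat, second_feat):
        # first_feat / second_feat: (batch, frames, feat_dim)
        opt_first = self.sub_original(first_feat)
        opt_second = self.sub_enhanced(second_feat)
        joint = torch.cat([opt_first, opt_second], dim=-1)
        out, _ = self.public(joint)
        return self.classifier(out)   # per-frame scores over tied states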
Alternatively, since the first speech feature refers to a speech feature corresponding to the original speech signal and the second speech feature refers to a speech feature corresponding to the enhanced speech signal, the first speech feature is distinguishable from the second speech feature, and thus the sub-networks 401 and 402 can be two different sub-networks. The two sub-networks are respectively constructed and trained for the characteristics of the first voice feature and the second voice feature, so that better optimization effect can be achieved. At this time, the first speech feature and the second speech feature need to be respectively input into the corresponding sub-networks for calculation, and cannot be input at will, so that the optimized first speech feature and the optimized second speech feature are obtained.
Of course, the two sub-networks may also be completely identical; in that case they can process the first speech feature and the second speech feature at the same time, which avoids having to optimize the two features one after the other before the public network carries out the subsequent processing. In this case, a sub-network can be chosen arbitrarily for the first speech feature and for the second speech feature, and each sub-network computes its speech feature to obtain the optimized first speech feature and the optimized second speech feature. In this way the first speech feature and the second speech feature are processed by the two sub-networks separately, without interfering with each other, so that both features are optimized as well as possible.
Specifically, fig. 5 is a flowchart illustrating a training method of the acoustic model according to an exemplary embodiment, and as shown in fig. 5, the method includes the following steps.
In step S501, two sub-networks of the acoustic model are trained layer by layer, where a training sample of one sub-network of the acoustic model is a speech feature of an original speech signal, and a training sample of the other sub-network of the acoustic model is a speech feature of an enhanced speech signal after noise reduction.
In particular, the present disclosure trains the two sub-networks together in a joint-learning manner. During the joint training, one sub-network is trained with the speech features of original speech signals that have not been noise-reduced as training samples, so that it can subsequently optimize the speech features of input original speech signals, and the other sub-network is trained with the speech features of noise-reduced enhanced speech signals as training samples, so that it can subsequently optimize the speech features of input enhanced speech signals.
Optionally, a layer-by-layer greedy training algorithm may be employed to train the two sub-networks of the acoustic model layer by layer. The main idea of layer-wise greedy training is to train only one layer of the network at a time: a network containing only one hidden layer is trained first, and only after its training is finished is a network with two hidden layers trained, and so on until all layers have been trained. At each step the already-trained first k-1 layers are fixed and the k-th layer is added, that is, the output of the trained first k-1 layers is taken as the input of the k-th layer. The training of each layer can be supervised, but it is more commonly done with an unsupervised method, for example with an autoencoder. The weights obtained from training these layers individually are used to initialize the weights of the final network, and the whole network is then fine-tuned, that is, all layers are optimized together to minimize the training error on the labeled training set.
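A compact sketch of greedy layer-wise training of one sub-network; it uses a supervised variant with a temporary output layer and cross-entropy over tied-state labels, which is an assumption, since the text notes that the per-layer step may equally be done with an unsupervised method such as an autoencoder.

import torch
import torch.nn as nn

def train_subnetwork_layerwise(layers, feats, targets, num_states=3000, epochs=5):
    """Greedy layer-by-layer training of one sub-network (illustrative sketch)."""
    stack = []                                    # layers already trained, kept fixed
    for layer in layers:                          # e.g. [nn.Linear(1024, 512), nn.Linear(512, 512)]
        for module in stack:
            for p in module.parameters():
                p.requires_grad = False           # fix the already-trained first k-1 layers
        head = nn.Linear(layer.out_features, num_states)   # temporary output layer
        model = nn.Sequential(*stack, layer, nn.ReLU(), head)
        optimizer = torch.optim.Adam(
            [p for p in model.parameters() if p.requires_grad])
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(feats), targets)
            loss.backward()
            optimizer.step()
        stack += [layer, nn.ReLU()]               # the k-th layer joins the stack
    return nn.Sequential(*stack)

Here feats is assumed to be a (num_frames, feat_dim) float tensor and targets a (num_frames,) tensor of tied-state labels; the temporary head is discarded after each round, and only the weights of the stacked layers are kept to initialize the final network.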
In the embodiment of the disclosure, the acoustic model includes two sub-networks, so that the two sub-networks can be effectively combined through joint training to form an overall acoustic model, and the influence of the two sub-networks is effectively considered. And aiming at a complex neural network model, a layer-by-layer greedy training algorithm is adopted for layer-by-layer training, so that the method can be more convenient, faster and more accurate.
In step S502, the two sub-network outputs of the trained acoustic model are used as the input of the common network, and the common network is trained layer by layer.
Optionally, a layer-by-layer greedy training algorithm may also be employed to train the public network layer by layer. Specifically, after the two sub-networks are trained, the public network still needs to be trained; at this point the two sub-networks and the public network are regarded as one overall network, the two sub-networks can be regarded as the already-trained first k-1 layers, and the first layer of the public network is the k-th layer. The layer-by-layer greedy training algorithm is then used to train the public network layer by layer, that is, the outputs of the two trained sub-networks of the acoustic model are used as the input of the public network, and an optimized acoustic model is obtained. During this training the inputs of the two sub-networks are still the speech features of the original speech signal without noise reduction and the speech features of the noise-reduced enhanced speech signal. After the optimized acoustic model is obtained, the optimized network parameters are used as initial values of the whole acoustic model, and the whole acoustic model is fine-tuned until convergence. Specifically, after the acoustic features of the corresponding training samples are input into the two sub-networks, the error at the output of the public network is obtained, the error is back-propagated to determine the error of each layer in the acoustic model, and every parameter of the whole acoustic model is fine-tuned based on the back-propagation loss function until the loss function converges, yielding the fully trained acoustic model.
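The final fine-tuning step can be sketched as below, assuming the JointAcousticModel sketch given earlier and a data loader that yields the two feature streams together with per-frame tied-state labels; the optimizer, learning rate and loss are assumptions.

import torch
import torch.nn as nn

def fine_tune(model, loader, epochs=3, learning_rate=1e-4):
    """End-to-end fine-tuning of the pre-trained acoustic model (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    for _ in range(epochs):
        for first_feat, second_feat, state_labels in loader:
            optimizer.zero_grad()
            logits = model(first_feat, second_feat)        # (batch, frames, states)
            loss = nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)),       # flatten all frames
                state_labels.reshape(-1))                  # per-frame tied-state targets
            loss.backward()                                # error back-propagation
            optimizer.step()                               # adjust all parameters
    return model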
In step S305, the optimized first speech feature and the optimized second speech feature are jointly calculated by using a pre-trained public network of the acoustic model, so as to obtain a combined state sequence.
Specifically, the optimized first speech feature and the optimized second speech feature output by the two sub-networks of the acoustic model are input as a whole into the public network of the acoustic model and computed together to obtain the combined state sequence.
It should be noted that, the specific implementation manner of step S305 may refer to step S104 in the foregoing method embodiment accordingly, and details are not described here again.
In step S306, the combined state sequence is decoded to obtain a speech recognition result.
It should be noted that, the specific implementation manner of step S306 may refer to step S105 in the foregoing method embodiment, and details are not described here.
The speech recognition method provided by the embodiment of the disclosure includes denoising an original speech signal to obtain an enhanced speech signal after obtaining the original speech signal, extracting speech features of the original speech signal and the enhanced speech signal respectively to obtain speech features corresponding to the original speech signal and speech features corresponding to the enhanced speech signal, optimizing the speech features corresponding to the original speech signal and the speech features corresponding to the enhanced speech signal by using two sub-networks of a pre-trained acoustic model, so that the speech features better embody the features of the corresponding speech signal, thereby improving the accuracy of a recognition result, then jointly processing the speech features corresponding to the optimized original speech signal and the speech features corresponding to the enhanced speech signal through a public network of the acoustic model to obtain a combined state sequence, and finally, decoding the combined state sequence to obtain a voice recognition result. Because the acoustic model is used for carrying out common processing on the enhanced voice signal after noise reduction and the original voice signal with the retained noise, the problem of inaccurate recognition caused by distortion after noise reduction can be avoided, and the acoustic model which is trained in advance is optimized, so that accurate recognition can be carried out on the voice with high signal-to-noise ratio or the voice with low signal-to-noise ratio.
FIG. 6 illustrates a speech recognition apparatus according to an example embodiment. Referring to fig. 6, the apparatus includes: acquisition section 601, noise reduction section 602, feature extraction section 603, feature processing section 604, and decoding section 605.
An acquisition unit 601 configured to perform acquisition of an original speech signal.
A noise reduction unit 602 configured to perform noise reduction on the original speech signal, resulting in an enhanced speech signal.
A feature extraction unit 603 configured to perform respective extraction of speech features of the original speech signal resulting in a first speech feature and extraction of speech features of the enhanced speech signal resulting in a second speech feature.
The feature processing unit 604 is configured to perform joint processing on the first speech feature and the second speech feature by using a pre-trained acoustic model, so as to obtain a combined state sequence.
A decoding unit 605 configured to perform decoding on the combined state sequence to obtain a speech recognition result.
Optionally, in a speech recognition apparatus provided in another embodiment, the speech recognition apparatus further includes: and (7) splicing units.
And the splicing unit is configured to splice the first voice feature and the second voice feature to obtain a spliced voice feature.
The feature processing unit in the speech recognition device specifically includes: and the first feature processing unit is configured to execute processing on the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in another embodiment, the acoustic model includes a public network and two sub-networks, and the feature processing unit of the speech recognition apparatus provided in this embodiment specifically includes: and the second feature processing unit is configured to calculate the first voice feature in one sub-network of the pre-trained acoustic model, calculate the second voice feature in the other sub-network of the acoustic model to obtain an optimized first voice feature and an optimized second voice feature, and calculate the optimized first voice feature and the optimized second voice feature together by using a public network of the pre-trained acoustic model to obtain a combined state sequence.
Optionally, in a speech recognition apparatus provided in another embodiment, the speech recognition apparatus further includes: a sub-network training unit.
A sub-network training unit configured to perform layer-by-layer training on two sub-networks of the acoustic model, respectively.
The training sample of one sub-network of the acoustic model is the speech feature of the original speech signal, and the training sample of the other sub-network of the acoustic model is the speech feature of the enhanced speech signal after noise reduction.
And the public network training unit is configured to perform layer-by-layer training on the public network by taking the outputs of the two sub-networks of the trained acoustic model as the inputs of the public network.
It should be noted that, for the specific working process of each unit in the speech recognition apparatus shown in the foregoing embodiment, reference may be made to the specific implementation process of the corresponding step in the foregoing method embodiment, and details are not described here again.
The speech recognition device provided by the embodiment of the disclosure performs noise reduction on an original speech signal through the noise reduction unit after the acquisition unit acquires the original speech signal to obtain an enhanced speech signal, then extracts speech features of the original speech signal and the enhanced speech signal through the feature extraction unit respectively to obtain speech features corresponding to the original speech signal and speech features corresponding to the enhanced speech signal, performs common processing on the speech features corresponding to the original speech signal and the speech features corresponding to the enhanced speech signal by using a pre-trained acoustic model through the feature processing unit to obtain a combined state sequence, and finally decodes the combined state sequence to obtain a speech recognition result. Because the acoustic model processes the enhanced voice signal after noise reduction and the original voice signal with the retained noise together, the problem of inaccurate identification caused by distortion after noise reduction can be avoided, and the acoustic model trained in advance is optimized, so that accurate identification can be carried out on voice with high signal-to-noise ratio or voice with low signal-to-noise ratio.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment. Referring to fig. 7, the electronic device includes: a processor 701 and a memory 702 for storing processor-executable instructions.
Wherein the processor 701 is configured to execute instructions to implement a speech recognition method as in any of the embodiments described above.
Another embodiment of the present disclosure provides a storage medium, wherein when executed by a processor of an electronic device, instructions of the storage medium enable the electronic device to perform a speech recognition method as in any one of the above embodiments.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Another embodiment of the present disclosure provides a computer program product, which when executed, is configured to perform the speech recognition method provided in any one of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A speech recognition method, comprising:
acquiring an original voice signal;
denoising the original voice signal to obtain an enhanced voice signal;
respectively extracting the voice features of the original voice signal to obtain first voice features, and extracting the voice features of the enhanced voice signal to obtain second voice features;
jointly processing the first voice characteristic and the second voice characteristic by using a pre-trained acoustic model to obtain a combined state sequence;
and decoding the combined state sequence to obtain a voice recognition result.
2. The speech recognition method of claim 1, wherein before jointly processing the first speech feature and the second speech feature using a pre-trained acoustic model to obtain a combined state sequence, the method further comprises:
splicing the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein, the jointly processing the first voice feature and the second voice feature by using the acoustic model trained in advance to obtain a combined state sequence comprises:
and processing the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
3. The speech recognition method of claim 1, wherein the acoustic model comprises a public network and two sub-networks, and wherein the jointly processing the first speech feature and the second speech feature by using the acoustic model trained in advance to obtain a combined state sequence comprises:
calculating the first voice feature in a sub-network of the acoustic model trained in advance, and calculating the second voice feature in another sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature;
and utilizing a pre-trained public network of the acoustic model to jointly calculate the optimized first voice feature and the optimized second voice feature to obtain a combined state sequence.
4. The speech recognition method of claim 3, wherein the training method of the acoustic model comprises:
respectively training two sub-networks of the acoustic model layer by layer; the training sample of one sub-network of the acoustic model is the voice feature of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice feature of an enhanced voice signal after noise reduction;
and taking the outputs of the two sub-networks of the trained acoustic model as the inputs of the public network, and training the public network layer by layer.
5. A speech recognition apparatus, comprising:
an acquisition unit configured to perform acquisition of an original voice signal;
the noise reduction unit is configured to perform noise reduction on the original voice signal to obtain an enhanced voice signal;
a feature extraction unit configured to perform respective extraction of speech features of the original speech signal to obtain a first speech feature, and extraction of speech features of the enhanced speech signal to obtain a second speech feature;
the feature processing unit is configured to perform common processing on the first voice feature and the second voice feature by using the acoustic model trained in advance to obtain a combined state sequence;
and the decoding unit is configured to decode the combined state sequence to obtain a voice recognition result.
6. The speech recognition device of claim 5, further comprising:
the splicing unit is configured to splice the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein the feature processing unit includes:
and the first feature processing unit is configured to execute processing on the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
7. The speech recognition device of claim 5, wherein the acoustic model comprises a common network and two sub-networks, and wherein the feature processing unit comprises:
a second feature processing unit, configured to compute the first speech feature in one sub-network of the acoustic model trained in advance, compute the second speech feature in another sub-network of the acoustic model to obtain the optimized first speech feature and the optimized second speech feature, and compute the optimized first speech feature and the optimized second speech feature together by using a common network of the acoustic model trained in advance to obtain a combined state sequence.
8. The speech recognition apparatus of claim 7, further comprising:
a sub-network training unit configured to perform layer-by-layer training on two sub-networks of the acoustic model, respectively; the training sample of one sub-network of the acoustic model is the voice feature of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice feature of an enhanced voice signal after noise reduction;
and the public network training unit is configured to perform layer-by-layer training on the public network by taking the outputs of the two sub-networks of the trained acoustic model as the input of the public network.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method of any of claims 1 to 4.
10. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition method of any of claims 1 to 4.
CN202010838352.XA 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium Active CN111951796B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010838352.XA CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010838352.XA CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111951796A true CN111951796A (en) 2020-11-17
CN111951796B CN111951796B (en) 2024-03-12

Family

ID=73358824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010838352.XA Active CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111951796B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001175275A (en) * 1999-12-16 2001-06-29 Seiko Epson Corp Acoustic subword model generating method and speech recognizing device
KR20180127890A (en) * 2017-05-22 2018-11-30 삼성전자주식회사 Method and apparatus for user adaptive speech recognition
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
CN110931028A (en) * 2018-09-19 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913845A (en) * 2021-02-09 2022-08-16 北京小米移动软件有限公司 Speech recognition method, and training method and device of speech recognition model
CN114913845B (en) * 2021-02-09 2024-05-24 北京小米移动软件有限公司 Speech recognition method, training method and device of speech recognition model
CN112735397A (en) * 2021-03-18 2021-04-30 北京世纪好未来教育科技有限公司 Voice feature processing method and device, electronic equipment and storage medium
CN112735397B (en) * 2021-03-18 2021-07-23 北京世纪好未来教育科技有限公司 Voice feature processing method and device, electronic equipment and storage medium
CN113938188A (en) * 2021-09-02 2022-01-14 华中科技大学 Construction method and application of optical signal-to-noise ratio monitoring model
CN113938188B (en) * 2021-09-02 2022-09-27 华中科技大学 Construction method and application of optical signal-to-noise ratio monitoring model
CN114333769A (en) * 2021-09-29 2022-04-12 腾讯科技(深圳)有限公司 Speech recognition method, computer program product, computer device and storage medium
CN114333769B (en) * 2021-09-29 2024-03-01 腾讯科技(深圳)有限公司 Speech recognition method, computer program product, computer device and storage medium
CN115424628A (en) * 2022-07-20 2022-12-02 荣耀终端有限公司 Voice processing method and electronic equipment

Also Published As

Publication number Publication date
CN111951796B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
WO2017076222A1 (en) Speech recognition method and apparatus
EP4018437B1 (en) Optimizing a keyword spotting system
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
KR101120765B1 (en) Method of speech recognition using multimodal variational inference with switching state space models
Kinoshita et al. Text-informed speech enhancement with deep neural networks.
JP2006079079A (en) Distributed speech recognition system and its method
CN111640456B (en) Method, device and equipment for detecting overlapping sound
JP6464005B2 (en) Noise suppression speech recognition apparatus and program thereof
CN107093422B (en) Voice recognition method and voice recognition system
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN112908301B (en) Voice recognition method, device, storage medium and equipment
CN102945673A (en) Continuous speech recognition method with speech command range changed dynamically
CN111883181A (en) Audio detection method and device, storage medium and electronic device
CN111081219A (en) End-to-end voice intention recognition method
CN111489754A (en) Telephone traffic data analysis method based on intelligent voice technology
CN114550706A (en) Smart campus voice recognition method based on deep learning
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
JP5713818B2 (en) Noise suppression device, method and program
KR101122590B1 (en) Apparatus and method for speech recognition by dividing speech data
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
CN113793599A (en) Training method of voice recognition model and voice recognition method and device
CN117765932A (en) Speech recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant