CN111951796B - Speech recognition method and device, electronic equipment and storage medium - Google Patents

Speech recognition method and device, electronic equipment and storage medium

Info

Publication number
CN111951796B
CN111951796B (application number CN202010838352.XA; pre-grant publication CN111951796A)
Authority
CN
China
Prior art keywords
voice
feature
acoustic model
sub
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010838352.XA
Other languages
Chinese (zh)
Other versions
CN111951796A (en)
Inventor
单亚慧
李�杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010838352.XA
Publication of CN111951796A
Application granted
Publication of CN111951796B
Legal status: Active


Classifications

    • G - PHYSICS > G10 - MUSICAL INSTRUMENTS; ACOUSTICS > G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING > G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice > G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure relates to a speech recognition method and apparatus, an electronic device, and a storage medium. The speech recognition method includes the following steps: acquiring an original speech signal; performing noise reduction on the original speech signal to obtain an enhanced speech signal; extracting speech features from the original speech signal to obtain a first speech feature, and extracting speech features from the enhanced speech signal to obtain a second speech feature; jointly processing the first speech feature and the second speech feature with a pre-trained acoustic model to obtain a combined state sequence; and decoding the combined state sequence to obtain a speech recognition result. Because the original speech signal and the noise-reduced speech signal are processed together, the accuracy of speech recognition is improved.

Description

Speech recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the technical field of speech recognition, and in particular to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of artificial intelligence, speech recognition technology is being applied in more and more smart devices and fields for human-computer interaction and similar purposes.
In the related art, to make speech recognition more robust, either the speech is denoised before recognition and the denoised speech is then fed into the recognition system, or the acoustic model inside the recognition system is optimized to improve its recognition of noisy speech.
Although both approaches can improve the recognition of noisy speech to some extent, the first introduces distortion through the noise reduction system, which actually degrades recognition of clean speech and of speech with a high signal-to-noise ratio (SNR), while the second, optimizing the acoustic model alone, brings only a very limited improvement for low-SNR speech. Existing speech recognition methods therefore cannot achieve a good recognition effect on both high-SNR and low-SNR speech.
Disclosure of Invention
The disclosure provides a speech recognition method and apparatus, an electronic device, and a storage medium, which at least solve the problem in the related art that high-SNR speech and low-SNR speech signals cannot both be recognized accurately at the same time. The technical solution of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a voice recognition method, including:
acquiring an original voice signal;
noise reduction is carried out on the original voice signal, and an enhanced voice signal is obtained;
extracting the voice features of the original voice signal to obtain a first voice feature, and extracting the voice features of the enhanced voice signal to obtain a second voice feature;
jointly processing the first voice feature and the second voice feature by utilizing a pre-trained acoustic model to obtain a combined state sequence;
and decoding the combined state sequence to obtain a voice recognition result.
Optionally, in the above voice recognition method, before the jointly processing the first voice feature and the second voice feature by using a pre-trained acoustic model to obtain a combined state sequence, the method further includes:
splicing the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein the jointly processing the first voice feature and the second voice feature by utilizing the pre-trained acoustic model to obtain a combined state sequence comprises the following steps:
and processing the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in the above voice recognition method, the acoustic model includes a public network and two sub-networks, and the processing the first voice feature and the second voice feature together by using the pre-trained acoustic model to obtain a combined state sequence includes:
calculating the first voice feature by utilizing one sub-network of the pre-trained acoustic model, and calculating the second voice feature by utilizing the other sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature;
and jointly calculating the optimized first voice feature and the optimized second voice feature by utilizing a public network of the pre-trained acoustic model to obtain a combined state sequence.
Optionally, in the above voice recognition method, the training method of the acoustic model includes:
respectively carrying out layer-by-layer training on the two sub-networks of the acoustic model; the training sample of one sub-network of the acoustic model is the voice characteristic of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice characteristic of an enhanced voice signal after noise reduction;
and taking the two trained sub-network outputs of the acoustic model as the input of the public network, and carrying out layer-by-layer training on the public network.
According to a second aspect of embodiments of the present disclosure, there is provided a voice recognition apparatus, comprising:
an acquisition unit configured to perform acquisition of an original voice signal;
the noise reduction unit is configured to perform noise reduction on the original voice signal to obtain an enhanced voice signal;
a feature extraction unit configured to extract the speech features of the original speech signal to obtain a first speech feature, and to extract the speech features of the enhanced speech signal to obtain a second speech feature;
the feature processing unit is configured to perform joint processing on the first voice feature and the second voice feature by utilizing the pre-trained acoustic model to obtain a combined state sequence;
and the decoding unit is configured to decode the combined state sequence to obtain a voice recognition result.
Optionally, in the above voice recognition device, the voice recognition device further includes:
the splicing unit is configured to splice the first voice feature and the second voice feature to obtain a spliced voice feature;
wherein the feature processing unit includes:
and the first feature processing unit is configured to execute processing on the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in the above voice recognition apparatus, the acoustic model includes a public network and two sub-networks, and the feature processing unit includes:
the second feature processing unit is configured to calculate the first voice feature by using one sub-network of the acoustic model trained in advance, calculate the second voice feature by using the other sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature, and calculate the optimized first voice feature and the optimized second voice feature together by using a public network of the acoustic model trained in advance to obtain a combined state sequence.
Optionally, in the above voice recognition device, the voice recognition device further includes:
a sub-network training unit configured to perform layer-by-layer training of the two sub-networks of the acoustic model, respectively; the training sample of one sub-network of the acoustic model is the voice characteristic of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice characteristic of an enhanced voice signal after noise reduction;
and the public network training unit is configured to perform layer-by-layer training on the public network by taking two sub-network outputs of the trained acoustic model as inputs of the public network.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method as described in any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition method as described in any one of the above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product which, when executed, performs the speech recognition method as described in any one of the above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
after the original speech signal is obtained, it is denoised to obtain an enhanced speech signal; the speech features of the original signal and of the enhanced signal are then extracted respectively, giving the features corresponding to the original signal and those corresponding to the enhanced signal; the two sets of features are jointly processed with a pre-trained acoustic model to obtain a combined state sequence; and the combined state sequence is finally decoded to obtain a speech recognition result. Because the acoustic model jointly processes the denoised enhanced speech signal and the original signal with its noise retained, recognition errors caused by post-denoising distortion are avoided, and with the pre-trained acoustic model optimized, both high-SNR and low-SNR speech can be recognized accurately.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flowchart illustrating a method of speech recognition according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating a process in which a speech recognition method is implemented in a speech recognition system, according to an example embodiment;
FIG. 3 is a flowchart illustrating another speech recognition method according to an exemplary embodiment;
FIG. 4 is a block diagram of an acoustic model shown in accordance with an exemplary embodiment;
FIG. 5 is a flowchart illustrating a method of training an acoustic model according to an exemplary embodiment;
FIG. 6 is a block diagram of a speech recognition device, according to an example embodiment;
fig. 7 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the disclosure described herein can be practiced in sequences other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a voice recognition method according to an exemplary embodiment, and as shown in fig. 1, the voice recognition method includes the following steps.
In step S101, an original speech signal is acquired.
The original speech signal refers to an audio signal that has not undergone noise reduction. Specifically, it may be audio recorded by a microphone without noise reduction, or a pre-recorded audio signal that has not been denoised but has received simple pre-processing, such as trimming the silence at its head and tail or converting its format. Thus, the original speech signal may be captured live through a microphone, or retrieved from a hard disk or memory where it was stored in advance.
In step S102, the original speech signal is noise reduced to obtain an enhanced speech signal.
It should be noted that, since subsequent steps also operate on the original speech signal, the original signal must be retained while it is denoised to obtain the enhanced speech signal. Specifically, the original speech signal is copied and noise reduction is applied to one copy, so that the enhanced signal is obtained while the original is preserved.
Because the environment in which speech is collected is rarely ideal, the obtained original speech signal usually carries some noise; the lower its signal-to-noise ratio, that is, the smaller the ratio of the speech to be recognized to the noise in the original signal, the greater the impact on speech recognition. The original signal is therefore denoised into an enhanced signal to reduce the influence of noise on the recognition result. However, noise reduction causes distortion, so the embodiments of the present disclosure use both the original (non-denoised) signal and the enhanced (denoised) signal for speech recognition.
Specifically, the original speech signal may be denoised with a noise reduction model. Optionally, the noise reduction model may implement noise reduction with an adaptive filter, or with a noise reduction algorithm such as spectral subtraction or Wiener filtering.
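As an illustration of the spectral subtraction option just mentioned, the following is a minimal sketch in Python/NumPy; the noise-only leading frames, the 0.1 spectral floor, and the frame parameters are illustrative assumptions rather than values taken from this disclosure.

```python
import numpy as np

def spectral_subtract(signal, frame_len=400, hop=160, noise_frames=10):
    """Denoise by subtracting a noise magnitude estimate from each frame's spectrum."""
    window = np.hanning(frame_len)
    starts = range(0, len(signal) - frame_len + 1, hop)
    frames = np.stack([signal[s:s + frame_len] * window for s in starts])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)

    # Assume the first few frames contain only noise (illustrative assumption).
    noise_mag = mag[:noise_frames].mean(axis=0)
    clean_mag = np.maximum(mag - noise_mag, 0.1 * mag)  # keep a small spectral floor

    # Overlap-add the denoised frames back into a time-domain signal.
    out = np.zeros(len(signal))
    denoised = np.fft.irfft(clean_mag * np.exp(1j * phase), frame_len, axis=1)
    for i, s in enumerate(starts):
        out[s:s + frame_len] += denoised[i]
    return out
```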
In step S103, speech features are extracted from the original speech signal to obtain a first speech feature, and from the enhanced speech signal to obtain a second speech feature.
It should be noted that before features are extracted from the original speech signal or the enhanced speech signal, the signal must be framed, i.e. a long speech signal is divided into multiple shorter frames, with a frame length usually between 20 ms and 50 ms. In particular, to avoid weakening the signal at the junction between frames after framing, which would lose the information there, adjacent frames must overlap. Specifically, each new frame of the preset frame length starts at a position shifted by a preset duration from the start of the previous frame, so two adjacent frames are offset by that preset duration. This shift (the frame shift) is smaller than the frame length and is typically set to 10 milliseconds. A minimal framing sketch follows.
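This sketch assumes a 25 ms frame length (within the 20-50 ms range above) and the typical 10 ms shift; the 16 kHz sample rate in the comments is illustrative.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames (25 ms frames, 10 ms shift here)."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])
```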
Specifically, extracting the speech features of the original speech signal means extracting the features of each frame of the original signal, yielding per-frame speech features; likewise, the speech features of the enhanced speech signal are extracted frame by frame to obtain the per-frame features of the enhanced signal. Optionally, the extracted features may be Mel-Frequency Cepstral Coefficients (MFCC). Specifically, the framed speech signal is windowed, a fast Fourier transform is applied to the data in each window to obtain the corresponding spectrum, the spectrum is passed through a Mel filter bank to obtain the Mel spectrum, and cepstral analysis of the Mel spectrum yields the MFCCs corresponding to the speech signal. Of course, extracting MFCCs is only one option; other feature types may be extracted instead, such as FBank features or Perceptual Linear Predictive (PLP) features. However, since the features of the original signal and of the enhanced signal are processed together, the same feature type must be extracted from both signals. A sketch of this pipeline follows.
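This sketch uses the librosa library, a common choice that is not named in this disclosure; the 13 coefficients, 16 kHz sample rate, and frame parameters are illustrative assumptions.

```python
import librosa

def extract_mfcc(path, n_mfcc=13):
    # Load audio; 16 kHz mono is a typical ASR setting (illustrative here).
    y, sr = librosa.load(path, sr=16000)
    # Windowed FFT -> Mel filter bank -> log -> DCT, as in the pipeline above.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=400, hop_length=160)  # 25 ms frames, 10 ms shift
    return mfcc.T  # one row of coefficients per frame

# Calling this on the original and the enhanced signal yields the first and
# second speech features; both calls must use the same feature type.
```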
In step S104, the first speech feature and the second speech feature are jointly processed by using a pre-trained acoustic model, so as to obtain a combined state sequence.
It should be noted that in the embodiments of the present disclosure, the acoustic model treats the first speech feature and the second speech feature as one whole and processes them jointly, rather than processing them one after the other. Accordingly, when the acoustic model is trained, it is trained on combinations of the first speech features of many original speech signals and the corresponding second speech features of their denoised enhanced signals, and the model parameters are adjusted continuously until it can output combined state sequences consistent with the known state sequences of the training samples; that is, the acoustic model outputs a state sequence that corresponds jointly to the first and second speech features.
Specifically, the acoustic model processes the speech features by computing the state sequence corresponding to the per-frame features. A state sequence is an ordered sequence of states, where a state in speech recognition can be understood as a speech unit finer than a phoneme. More specifically, the pronunciation of a word is composed of phonemes; for English, a common choice is the 39-phoneme set from Carnegie Mellon University, while for Chinese it is common to use all initials and finals directly as the phoneme set. A state is a phonetic unit one level below a phoneme, i.e. phonemes are composed of states, with a phoneme typically divided into three states.
It should further be noted that speech recognition built on single phonemes (monophones) suffers both from a small number of modeling units and from the fact that a phoneme's pronunciation is affected by its context, so triphone modeling is generally used today: each phoneme is modeled taking its preceding and following phonemes into account, and tying the states corresponding to these triphones yields the combined state sequence.
Optionally, in order to enable the acoustic model to jointly process the first speech feature and the second speech feature, in another embodiment of the present disclosure, before performing step S104, the method further includes: and splicing the first voice feature and the second voice feature to obtain a spliced voice feature.
Specifically, the second speech feature may be spliced onto the end of the first speech feature to obtain a spliced speech feature of higher dimension. For example, if the first speech feature has 1024 dimensions and the corresponding second speech feature also has 1024 dimensions, the spliced feature obtained after splicing has 2048 dimensions. After the spliced feature is obtained, it is input into the pre-trained acoustic model so that the model processes the spliced feature. Because a single feature combining the first and second speech features is input, the two features are guaranteed to be processed jointly rather than separately. In this case, the specific embodiment of step S104 is as follows: the spliced speech feature is processed with the acoustic model to obtain the combined state sequence corresponding to it. A sketch of the splicing step follows.
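The splicing amounts to a concatenation along the feature axis; this minimal sketch assumes per-frame feature matrices of equal shape, matching the 1024-dimension example above.

```python
import numpy as np

def splice(first_feat, second_feat):
    """Concatenate per-frame features: (n_frames, 1024) x 2 -> (n_frames, 2048)."""
    return np.concatenate([first_feat, second_feat], axis=-1)
```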
In step S105, the combined state sequence is decoded to obtain a speech recognition result.
The decoding process is specifically as follows: the phonemes corresponding to the states are determined from the correspondence between phonemes and states, and the words corresponding to the phonemes are looked up using the phoneme-to-word correspondence in a preset dictionary. Because homophones exist, one phoneme sequence may correspond to several words, and the words obtained this way are relatively independent rather than forming complete phrases and sentences; a language model, trained in advance on linguistic regularities, is therefore used to compute the most probable word sequence corresponding to the state sequence as the recognition result. The role of the language model can thus be understood simply as resolving this ambiguity: after the acoustic model gives the pronunciation sequence, the most probable word string is found among the candidate sequences. Optionally, the combined state sequence may be decoded with the Viterbi algorithm, sketched below.
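A minimal Viterbi sketch over per-frame state scores; a production decoder searches a graph composed with the dictionary and language model, which this toy version omits.

```python
import numpy as np

def viterbi(log_emissions, log_trans, log_init):
    """Most likely state path given log-domain scores.

    log_emissions: (T, S) per-frame acoustic scores for each state
    log_trans:     (S, S) state transition log-probabilities
    log_init:      (S,)   initial state log-probabilities
    """
    T, S = log_emissions.shape
    delta = log_init + log_emissions[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # score of each predecessor per state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emissions[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```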
Therefore, as shown in fig. 2, when the speech recognition method provided by the present disclosure runs in an actual speech recognition system, an original speech signal is first obtained; the original signal is then denoised by a noise reduction model to obtain an enhanced speech signal; the enhanced signal and the original signal are each fed through feature extraction, outputting the first speech feature corresponding to the original signal and the second speech feature corresponding to the enhanced signal; the first and second speech features are then input together into the acoustic model for processing, giving the state sequence they jointly correspond to; and finally this state sequence is decoded and searched with the language model, i.e. the state sequence is decoded, to output the final recognition result.
According to the speech recognition method provided by the embodiments of the present disclosure, after the original speech signal is obtained, it is denoised to obtain an enhanced speech signal; speech features are then extracted from the original signal and from the enhanced signal, giving the features corresponding to each; the features corresponding to the original signal and those corresponding to the enhanced signal are jointly processed with a pre-trained acoustic model to obtain a combined state sequence; and the combined state sequence is finally decoded to obtain the speech recognition result. Because the acoustic model jointly processes the denoised enhanced signal and the original signal with its noise retained, recognition errors caused by noise-reduction distortion are avoided, and with the pre-trained acoustic model optimized, both high-SNR and low-SNR speech can be recognized accurately.
Fig. 3 is a flowchart illustrating another voice recognition method according to an exemplary embodiment, and as shown in fig. 3, the voice recognition method includes the following steps.
In step S301, an original speech signal is acquired.
It should be noted that, the specific implementation manner of step S301 may refer to step S101 in the above method embodiment accordingly, and will not be described herein again.
In step S302, the original speech signal is noise reduced to obtain an enhanced speech signal.
It should be noted that, the specific implementation manner of step S302 may refer to step S102 in the above method embodiment accordingly, and will not be described herein again.
In step S303, the speech features of the original speech signal are extracted to obtain a first speech feature, and the speech features of the enhanced speech signal are extracted to obtain a second speech feature.
It should be noted that, the specific implementation manner of step S303 may refer to step S103 in the above method embodiment accordingly, and will not be described herein again.
In step S304, the first speech feature is calculated by using one sub-network of the pre-trained acoustic model, and the second speech feature is calculated by using another sub-network of the acoustic model, so as to obtain the optimized first speech feature and the optimized second speech feature.
It should be noted that in the embodiments of the present disclosure, the acoustic model that processes the first and second speech features consists of two sub-networks and one public (shared) network. Referring to fig. 4, the acoustic model includes sub-network 401, sub-network 402, and public network 403. The two sub-networks optimize the first speech feature and the second speech feature respectively, so that the features better reflect the characteristics of the noisy original signal and of the enhanced signal; the state sequence output by the acoustic model is then more accurate, and so is the final recognition result. A sub-network may be a deep neural network of one or more layers, for example a Time Delay Neural Network (TDNN) or a Convolutional Neural Network (CNN). The public network may be a Long Short-Term Memory network (LSTM), a BLSTM (bidirectional LSTM), or another network suitable for implementing the acoustic model.
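A minimal PyTorch sketch of this two-sub-network-plus-public-network structure; the linear sub-network layers, layer sizes, and number of tied states are illustrative assumptions (the disclosure allows TDNN or CNN sub-networks and an LSTM or BLSTM public network).

```python
import torch
import torch.nn as nn

class DualBranchAcousticModel(nn.Module):
    """Two sub-networks feeding one shared ("public") network, as in fig. 4."""

    def __init__(self, feat_dim=1024, hidden=512, n_states=3000):
        super().__init__()
        # Sub-network 401: optimizes features of the noisy original signal.
        self.sub_orig = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Sub-network 402: optimizes features of the enhanced (denoised) signal.
        self.sub_enh = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Public network 403: joint modeling over both optimized features.
        self.public = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_states)  # per-frame tied-state scores

    def forward(self, first_feat, second_feat):
        # first_feat, second_feat: (batch, frames, feat_dim)
        h = torch.cat([self.sub_orig(first_feat), self.sub_enh(second_feat)], dim=-1)
        y, _ = self.public(h)
        return self.out(y)
```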
Alternatively, since the first speech feature refers to the speech feature corresponding to the original speech signal and the second speech feature refers to the speech feature corresponding to the enhanced speech signal, the first speech feature is different from the second speech feature, and thus the subnetwork 401 and the subnetwork 402 may be two different subnetworks. The two sub-networks are respectively constructed and trained for the characteristics of the first voice characteristic and the second voice characteristic, so that a better optimization effect can be achieved. At this time, the first voice feature and the second voice feature need to be input into corresponding sub-networks respectively for calculation, and cannot be input at will, so that the optimized first voice feature and the optimized second voice feature are obtained.
Of course, the two sub-networks may also be completely identical, in which case they can process the first and second speech features simultaneously, avoiding having to optimize the two features one after the other before the public network carries out the subsequent processing. In that case either feature may be fed into either sub-network for calculation, yielding the optimized first speech feature and the optimized second speech feature. Processing the two features in separate sub-networks in this way prevents mutual interference and gives each feature the best possible optimization.
Specifically, fig. 5 is a flowchart illustrating a training method of the acoustic model according to an exemplary embodiment, as shown in fig. 5, including the following steps.
In step S501, the two sub-networks of the acoustic model are trained layer by layer, where the training sample of one sub-network of the acoustic model is the speech feature of the original speech signal, and the training sample of the other sub-network of the acoustic model is the speech feature of the enhanced speech signal after noise reduction.
Specifically, the present disclosure trains the two sub-networks together through joint learning. In the joint training process, one sub-network is trained with the speech features of the original (non-denoised) speech signal as samples, so that at the later stage it optimizes the input original-signal features, and the other sub-network is trained with the speech features of the denoised enhanced signal as samples, so that at the later stage it optimizes the input enhanced-signal features.
Optionally, a layer-by-layer greedy training algorithm may be employed to train the two sub-networks of the acoustic model layer by layer. The main idea of the algorithm is to train only one layer of the network at a time: a network with a single hidden layer is trained first, and once that layer's training is finished a network with two hidden layers is trained, and so on until all layers are trained. At each step the already-trained first k-1 layers are fixed and the k-th layer is added, with the output of the trained first k-1 layers serving as the input of the k-th layer. The training of each layer can be supervised, but is more typically unsupervised, for example using an autoencoder. The weights obtained by training the layers individually are used to initialize the final network weights, after which the whole network is fine-tuned, i.e. all layers are put together and the training error on the labeled training set is optimized.
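A minimal sketch of this layer-by-layer greedy procedure, using a supervised loss with a temporary head for each new layer; as noted above, unsupervised pretraining (e.g. with an autoencoder) is the more typical choice, and the loss, optimizer, and epoch count here are illustrative assumptions.

```python
import torch
import torch.nn as nn

def greedy_layerwise_train(layer_sizes, features, targets, epochs=5):
    """Train a stack one layer at a time: layers 1..k-1 stay fixed while layer k trains."""
    trained, inputs = [], features
    for in_dim, out_dim in zip(layer_sizes, layer_sizes[1:]):
        layer = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        head = nn.Linear(out_dim, targets.shape[-1])  # temporary supervision head
        opt = torch.optim.Adam(list(layer.parameters()) + list(head.parameters()))
        for _ in range(epochs):
            opt.zero_grad()
            loss = nn.functional.mse_loss(head(layer(inputs)), targets)
            loss.backward()
            opt.step()
        trained.append(layer)
        inputs = layer(inputs).detach()  # output of layers 1..k feeds layer k+1
    # These weights initialize the full network, which is then fine-tuned end to end.
    return nn.Sequential(*trained)
```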
In the embodiments of the present disclosure, the acoustic model comprises two sub-networks, and joint training effectively combines the two sub-networks into an integral acoustic model in which the influence of both is properly taken into account. For such a complex neural network model, training layer by layer with a greedy algorithm makes the training more convenient and accurate.
In step S502, the two sub-network outputs of the trained acoustic model are used as inputs to the public network, and the public network is trained layer by layer.
Optionally, a layer-by-layer greedy training algorithm may also be employed to train the public network layer by layer. Specifically, after the two sub-networks are trained, the public network needs to be trained in turn. At this point the two sub-networks and the public network are regarded as one overall network, in which the two sub-networks can be viewed as the already-trained first k-1 layers and the first layer of the public network as the k-th layer; the public network is then trained layer by layer with the greedy algorithm, i.e. the outputs of the two trained sub-networks of the acoustic model are used as the inputs of the public network, yielding the optimized acoustic model. During this training, the inputs to the two sub-networks are again the speech features of the original (non-denoised) signal and the speech features of the denoised enhanced signal. After the optimized acoustic model is obtained, the optimized network parameters are used as initial values of the whole acoustic model, and the whole model is fine-tuned until convergence. Specifically, after the acoustic features of the corresponding training samples are fed into the two sub-networks, the error at the public network's output is obtained and back-propagated to determine the error of every layer in the acoustic model, and all parameters of the whole model are fine-tuned based on the back-propagated loss function until it converges, giving the fully trained acoustic model.
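A sketch of this final fine-tuning stage, assuming the DualBranchAcousticModel from the earlier sketch and a cross-entropy loss over tied-state targets; both the loss and the data loader interface are assumptions for illustration.

```python
import torch
import torch.nn as nn

def fine_tune(model, loader, epochs=3, lr=1e-4):
    """Jointly adjust all pretrained parameters until the loss converges."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for first_feat, second_feat, states in loader:
            opt.zero_grad()
            logits = model(first_feat, second_feat)            # (batch, frames, n_states)
            loss = ce(logits.flatten(0, 1), states.flatten())  # backprop through all layers
            loss.backward()
            opt.step()
    return model
```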
In step S305, the optimized first speech feature and the optimized second speech feature are jointly calculated by using the public network of the pre-trained acoustic model, so as to obtain a combined state sequence.
Specifically, the optimized first voice features and the optimized second voice features output by the two sub-networks of the acoustic model are taken as a whole and input into a public network of the acoustic model together for common calculation, so that a combined state sequence is obtained.
It should be noted that, the specific implementation manner of step S305 may refer to step S104 in the above method embodiment accordingly, and will not be described herein again.
In step S306, the combined state sequence is decoded to obtain a speech recognition result.
It should be noted that, the specific implementation manner of step S306 may refer to step S105 in the above method embodiment accordingly, and will not be described herein again.
According to the speech recognition method provided by the embodiments of the present disclosure, after the original speech signal is obtained, it is denoised to obtain an enhanced speech signal, and speech features are then extracted from the original signal and from the enhanced signal. The two sub-networks of the pre-trained acoustic model optimize the respective features so that they better embody the characteristics of their corresponding signals, which improves the accuracy of the recognition result; the public network of the acoustic model then jointly processes the optimized features to obtain a combined state sequence; and the combined state sequence is finally decoded to obtain the speech recognition result. Because the acoustic model jointly processes the denoised enhanced signal and the original signal with its noise retained, recognition errors caused by noise-reduction distortion are avoided, and with the pre-trained acoustic model optimized, both high-SNR and low-SNR speech can be recognized accurately.
Fig. 6 is a block diagram of a speech recognition apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes: an acquisition unit 601, a noise reduction unit 602, a feature extraction unit 603, a feature processing unit 604, and a decoding unit 605.
The acquisition unit 601 is configured to perform acquisition of an original speech signal.
The noise reduction unit 602 is configured to perform noise reduction on the original speech signal, resulting in an enhanced speech signal.
The feature extraction unit 603 is configured to perform extracting the speech features of the original speech signal, respectively, to obtain a first speech feature, and extracting the speech features of the enhanced speech signal, to obtain a second speech feature.
The feature processing unit 604 is configured to perform a common processing of the first speech feature and the second speech feature with a pre-trained acoustic model, resulting in a combined state sequence.
The decoding unit 605 is configured to perform decoding on the combined state sequence to obtain a speech recognition result.
Optionally, in the voice recognition device provided in another embodiment, the voice recognition device further includes: and a splicing unit.
And the splicing unit is configured to splice the first voice feature and the second voice feature to obtain spliced voice features.
The characteristic processing unit in the voice recognition device specifically comprises: the first feature processing unit is configured to execute processing on the spliced voice features by using the acoustic model to obtain a combined state sequence corresponding to the spliced voice features.
Optionally, in another embodiment, the acoustic model includes a public network and two sub-networks, and the feature processing unit of the voice recognition device provided in this embodiment specifically includes: the second feature processing unit is configured to calculate the first voice feature by using one sub-network of the pre-trained acoustic model, calculate the second voice feature by using the other sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature, and calculate the optimized first voice feature and the optimized second voice feature together by using the public network of the pre-trained acoustic model to obtain the combined state sequence.
Optionally, in the voice recognition device provided in another embodiment, the voice recognition device further includes: and a sub-network training unit.
And a sub-network training unit configured to perform layer-by-layer training of the two sub-networks of the acoustic model, respectively.
The training samples of one sub-network of the acoustic model are the voice characteristics of the original voice signal, and the training samples of the other sub-network of the acoustic model are the voice characteristics of the enhanced voice signal after noise reduction.
And the public network training unit is configured to perform layer-by-layer training on the public network by taking two sub-network outputs of the trained acoustic model as inputs of the public network.
It should be noted that, the specific working process of each unit in the voice recognition device shown in the foregoing embodiment may correspondingly refer to the specific implementation process of the corresponding step in the foregoing method embodiment, which is not described herein again.
According to the speech recognition apparatus provided by the embodiments of the present disclosure, after the acquisition unit obtains the original speech signal, the noise reduction unit denoises it to obtain an enhanced speech signal; the feature extraction unit then extracts speech features from the original signal and from the enhanced signal, giving the features corresponding to each; the feature processing unit jointly processes the two sets of features with the pre-trained acoustic model to obtain a combined state sequence; and the combined state sequence is finally decoded to obtain the speech recognition result. Because the acoustic model jointly processes the denoised enhanced signal and the original signal with its noise retained, recognition errors caused by noise-reduction distortion are avoided, and with the pre-trained acoustic model optimized, both high-SNR and low-SNR speech can be recognized accurately.
Fig. 7 is a block diagram of an electronic device, according to an example embodiment. Referring to fig. 7, the electronic device includes: the processor 701 and a memory 702 for storing processor executable instructions.
Wherein the processor 701 is configured to execute instructions to implement the speech recognition method as in any of the embodiments described above.
Another embodiment of the present disclosure provides a storage medium that, when executed by a processor of an electronic device, enables the electronic device to perform a method of speech recognition as in any of the embodiments described above.
Alternatively, the storage medium may be a non-transitory computer-readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Another embodiment of the present disclosure provides a computer program product for performing the speech recognition method provided by any one of the embodiments above, when the computer program product is executed.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (6)

1. A method of speech recognition, comprising:
acquiring an original voice signal;
noise reduction is carried out on the original voice signal, and an enhanced voice signal is obtained;
extracting the voice features of the original voice signal to obtain a first voice feature, and extracting the voice features of the enhanced voice signal to obtain a second voice feature;
jointly processing the first voice feature and the second voice feature by utilizing a pre-trained acoustic model to obtain a combined state sequence; wherein the acoustic model comprises a public network and two sub-networks, the first voice feature is calculated by one sub-network of the pre-trained acoustic model and the second voice feature is calculated by the other sub-network of the acoustic model, so that the optimized first voice feature and the optimized second voice feature are obtained, and the optimized first voice feature and the optimized second voice feature are jointly calculated by utilizing the public network of the pre-trained acoustic model to obtain the combined state sequence;
and decoding the combined state sequence to obtain a voice recognition result.
2. The method of claim 1, wherein the training method of the acoustic model comprises:
respectively carrying out layer-by-layer training on the two sub-networks of the acoustic model; the training sample of one sub-network of the acoustic model is the voice characteristic of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice characteristic of an enhanced voice signal after noise reduction;
and taking the two trained sub-network outputs of the acoustic model as the input of the public network, and carrying out layer-by-layer training on the public network.
3. A speech recognition apparatus, comprising:
an acquisition unit configured to perform acquisition of an original voice signal;
the noise reduction unit is configured to perform noise reduction on the original voice signal to obtain an enhanced voice signal;
a feature extraction unit configured to extract the speech features of the original speech signal to obtain a first speech feature, and to extract the speech features of the enhanced speech signal to obtain a second speech feature;
the feature processing unit is configured to perform joint processing on the first voice feature and the second voice feature by utilizing a pre-trained acoustic model to obtain a combined state sequence;
the decoding unit is configured to perform decoding on the combined state sequence to obtain a voice recognition result;
wherein the acoustic model comprises a public network and two sub-networks, the feature processing unit comprises:
the second feature processing unit is configured to calculate the first voice feature by using one sub-network of the acoustic model trained in advance, calculate the second voice feature by using the other sub-network of the acoustic model to obtain the optimized first voice feature and the optimized second voice feature, and calculate the optimized first voice feature and the optimized second voice feature together by using a public network of the acoustic model trained in advance to obtain a combined state sequence.
4. A speech recognition device according to claim 3, characterized in that the speech recognition device further comprises:
a sub-network training unit configured to perform layer-by-layer training of the two sub-networks of the acoustic model, respectively; the training sample of one sub-network of the acoustic model is the voice characteristic of an original voice signal, and the training sample of the other sub-network of the acoustic model is the voice characteristic of an enhanced voice signal after noise reduction;
and the public network training unit is configured to perform layer-by-layer training on the public network by taking two sub-network outputs of the trained acoustic model as inputs of the public network.
5. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech recognition method of any one of claims 1 to 2.
6. A storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech recognition method of any one of claims 1 to 2.
CN202010838352.XA 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium Active CN111951796B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010838352.XA CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010838352.XA CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111951796A CN111951796A (en) 2020-11-17
CN111951796B (en) 2024-03-12

Family

ID=73358824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010838352.XA Active CN111951796B (en) 2020-08-19 2020-08-19 Speech recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111951796B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114913845B (en) * 2021-02-09 2024-05-24 北京小米移动软件有限公司 Speech recognition method, training method and device of speech recognition model
CN112735397B (en) * 2021-03-18 2021-07-23 北京世纪好未来教育科技有限公司 Voice feature processing method and device, electronic equipment and storage medium
CN113938188B (en) * 2021-09-02 2022-09-27 华中科技大学 Construction method and application of optical signal-to-noise ratio monitoring model
CN114333769B (en) * 2021-09-29 2024-03-01 腾讯科技(深圳)有限公司 Speech recognition method, computer program product, computer device and storage medium
CN115424628B (en) * 2022-07-20 2023-06-27 荣耀终端有限公司 Voice processing method and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001175275A (en) * 1999-12-16 2001-06-29 Seiko Epson Corp Acoustic subword model generating method and speech recognizing device
CN108564940A (en) * 2018-03-20 2018-09-21 平安科技(深圳)有限公司 Audio recognition method, server and computer readable storage medium
KR20180127890A (en) * 2017-05-22 2018-11-30 삼성전자주식회사 Method and apparatus for user adaptive speech recognition
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN110931028A (en) * 2018-09-19 2020-03-27 北京搜狗科技发展有限公司 Voice processing method and device and electronic equipment


Also Published As

Publication number Publication date
CN111951796A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN111951796B (en) Speech recognition method and device, electronic equipment and storage medium
CN106683677B (en) Voice recognition method and device
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
KR101120765B1 (en) Method of speech recognition using multimodal variational inference with switching state space models
Kinoshita et al. Text-informed speech enhancement with deep neural networks.
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
CN113192535B (en) Voice keyword retrieval method, system and electronic device
CN111640456A (en) Overlapped sound detection method, device and equipment
CN112908301B (en) Voice recognition method, device, storage medium and equipment
CN111883181A (en) Audio detection method and device, storage medium and electronic device
JP5713818B2 (en) Noise suppression device, method and program
CN111968622A (en) Attention mechanism-based voice recognition method, system and device
CN110648655B (en) Voice recognition method, device, system and storage medium
Nguyen et al. Feature adaptation using linear spectro-temporal transform for robust speech recognition
CN112216270B (en) Speech phoneme recognition method and system, electronic equipment and storage medium
CN114550706A (en) Smart campus voice recognition method based on deep learning
CN103035244B (en) Voice tracking method capable of feeding back loud-reading progress of user in real time
KR101122591B1 (en) Apparatus and method for speech recognition by keyword recognition
Do et al. Multiple-hypothesis CTC-based semi-supervised adaptation of end-to-end speech recognition
CN113160796B (en) Language identification method, device and equipment for broadcast audio and storage medium
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN112420022B (en) Noise extraction method, device, equipment and storage medium
Koc Acoustic feature analysis for robust speech recognition
Ebrahim Kafoori et al. Robust recognition of noisy speech through partial imputation of missing data
JP2006145694A (en) Voice recognition method, system implementing the method, program, and recording medium for the same

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant