CN113516973B - Non-autoregressive speech recognition network, method and equipment based on bidirectional context - Google Patents

Non-autoregressive speech recognition network, method and equipment based on bidirectional context

Info

Publication number
CN113516973B
CN113516973B (application CN202111066812.2A)
Authority
CN
China
Prior art keywords
decoder
recognition result
speech
encoder
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111066812.2A
Other languages
Chinese (zh)
Other versions
CN113516973A (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN202111066812.2A priority Critical patent/CN113516973B/en
Publication of CN113516973A publication Critical patent/CN113516973A/en
Application granted granted Critical
Publication of CN113516973B publication Critical patent/CN113516973B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention is applicable to the technical field of human language processing and provides a non-autoregressive speech recognition network, method, device and storage medium based on bidirectional context. The speech recognition network adopts a Transformer encoder-decoder structure: the encoder performs preliminary recognition on input speech features to obtain a preliminary recognition result, and the decoder adjusts the preliminary recognition result using the bidirectional language information that this result itself provides, outputting the final speech recognition result. The decoder exploits the bidirectional language information through a preset attention mask applied to each of its multi-head self-attention layers, thereby making full use of language information and improving the speech recognition effect; compared with using two unidirectional decoders to exploit the two unidirectional language directions separately, the structure is more efficient and unified.

Description

Non-autoregressive speech recognition network, method and equipment based on bidirectional context
Technical Field
The invention belongs to the technical field of human language processing, and particularly relates to a non-autoregressive speech recognition network, method, device and storage medium based on bidirectional context.
Background
Speech recognition is widely applied in scenarios such as in-vehicle applications, voice wake-up, human-machine dialogue and smart homes. The input of a speech recognition model is speech, and the output is the text of the speech content. Traditional speech recognition generally uses autoregressive decoding, i.e., characters are output serially; this approach has high accuracy, but its speed falls far short of real-time requirements. In contrast, non-autoregressive methods predict characters in parallel and can meet real-time requirements, but they model language information poorly and generally need to determine the output sequence length in advance before decoding; compared with autoregressive methods, this length prediction is difficult and the recognition accuracy is lower. A large number of methods for improving non-autoregressive speech recognition have emerged in academia and industry. The method most commonly applied in industry is based on CTC (Connectionist Temporal Classification) (Alex Graves, Santiago Fernández, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]. International Conference on Machine Learning, 2006: 369-376.), but CTC models only the input speech features, which imposes a strong conditional-independence assumption between output characters, so the mutual language information between output characters cannot be used; moreover, the computational complexity of the CTC method is quadratic in the input speech frame length, which is high. In recent years, with the cross-pollination of methods across fields, the Transformer (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, et al. Attention is all you need [C]. Conference and Workshop on Neural Information Processing Systems, 2017: 5998-6008.) has also been applied to speech recognition.
The present invention focuses on the non-autoregressive speech recognition problem. In speech recognition, whether speech information and language information can be fully utilized determines the quality of the recognition result, but non-autoregressive methods generally make poor use of language information and yield worse results. To improve the performance of non-autoregressive methods, Yosuke et al. (Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi. Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict [C]. Conference of the International Speech Communication Association, 2020: 3655-3659.) proposed using part of the text output by the encoder as decoder input: low-confidence characters are masked, and the masked characters are re-predicted using the language information provided by the unmasked characters. Tian et al. (Zhengkun Tian, Jiangyan Yi, Jianhua Tao, Ye Bai, Shuai Zhang, et al. Spike-triggered non-autoregressive transformer for end-to-end speech recognition [C]. Conference of the International Speech Communication Association, 2020: 5026-5030.) directly used part of the speech coding features output by the encoder as decoder input in order to speed up recognition. To reduce length prediction errors, Yosuke et al. (Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, and Tetsunori Kobayashi. Improved Mask-CTC for non-autoregressive end-to-end ASR [C]. International Conference on Acoustics, Speech and Signal Processing, 2021: 8363-8367.) designed a length-prediction decoder that dynamically adjusts the output sequence length during decoding, thereby reducing length prediction errors. Furthermore, to further reduce the decoder's modeling difficulty and increase the text information available to it, Song et al. (Xingchen Song, Zhiyong Wu, Yiheng Huang, Chao Weng, Dan Su, Helen Meng. Non-autoregressive transformer ASR with CTC-enhanced decoder input [C]. International Conference on Acoustics, Speech and Signal Processing, 2021: 5894-5898.) proposed feeding a CTC-enhanced text sequence to the decoder as input. However, the above methods either mask part of the language information probabilistically or directly use only unidirectional language information, which limits the language information available to the decoder and results in wasted language information.
In autoregressive and streaming speech recognition, there are already studies on bidirectional language information, such as the Transformer with a bidirectional decoder proposed by Dong et al. (Dong M, He D, Luo C, et al. Transformer with bidirectional decoder for speech recognition [C]. Conference of the International Speech Communication Association, 2020: 1773.). Other methods use two completely independent decoders to model the two unidirectional language directions separately, but such structures are complex, and each decoder still loses the reverse-direction language information.
Disclosure of Invention
The invention aims to provide a non-autoregressive speech recognition network, method, device and storage medium based on bidirectional context, so as to solve the problem that existing non-autoregressive speech recognition methods cannot fully utilize language information because they predict using only a unidirectional context.
In one aspect, the present invention provides a bidirectional-context-based non-autoregressive speech recognition network, wherein the speech recognition network employs a Transformer encoder-decoder structure, wherein:
the encoder of the voice recognition network is used for carrying out primary recognition on the input voice characteristics to obtain a primary recognition result;
and the decoder of the speech recognition network is used for adjusting the preliminary recognition result by using the bidirectional language information provided by the preliminary recognition result and outputting a final speech recognition result, wherein the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder.
Preferably, the attention mask is a two-dimensional matrix, and the elements of the main diagonal of the two-dimensional matrix are all 0, and the elements outside the main diagonal are all 1.
Preferably, the decoder takes a position code as Q of the decoder's first multi-headed self-attention layer and inputs the same K and V into each of the decoder's multi-headed self-attention layers.
In another aspect, the present invention further provides a speech recognition method based on a bidirectional context non-autoregressive speech recognition network, where the method includes:
performing primary recognition on input voice features through a trained encoder of the voice recognition network to obtain a primary recognition result;
and adjusting the initial recognition result through a decoder of the trained voice recognition network, and outputting a final voice recognition result, wherein the decoder adjusts the initial recognition result by utilizing the bidirectional language information provided by the initial recognition result.
Preferably, before the initial recognition of the input speech by the trained speech recognition network encoder, the method further includes:
and performing joint training on a decoder and an encoder of the voice recognition network by using a training set until the loss value of the voice recognition network is minimum, so as to obtain the trained voice recognition network.
Preferably, the encoder and the decoder joint loss function is as follows:
$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{CE}}$

wherein $\mathcal{L}$ is the joint loss function, $\mathcal{L}_{\mathrm{CTC}}$ is the connectionist temporal classification (CTC) loss of the encoder, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss of the decoder, and $\lambda$ is a hyper-parameter.
Preferably, the decoder adjusts the preliminary recognition result by using an adaptive iteration-stopping mechanism.
Preferably, the preliminary recognition result includes a word sequence length, and the decoder outputs the voice recognition result in parallel based on the word sequence length.
In another aspect, the present invention also provides a speech recognition device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method as described above when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on input speech features to obtain a preliminary recognition result; the decoder adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, where the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. Language information is thereby fully utilized and the speech recognition effect improved, and the structure is more efficient and unified than using two unidirectional decoders to exploit unidirectional language information separately.
Drawings
FIG. 1A is a schematic structural diagram of a bi-directional context-based non-autoregressive speech recognition network according to an embodiment of the present invention;
FIG. 1B is a schematic diagram comparing learning a bidirectional context with a decoder against other methods, according to an embodiment of the present invention;
FIG. 1C is a diagram illustrating an exemplary structure of the improved Transformer-based decoder according to an embodiment of the present invention;
FIG. 1D is a diagram illustrating an example of a structure of an attention mask according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an implementation of a bi-directional context-based non-autoregressive speech recognition method according to a second embodiment of the present invention; and
fig. 3 is a schematic structural diagram of a speech recognition device according to a third embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
fig. 1A illustrates a structure of a bidirectional context-based non-autoregressive speech recognition network according to an embodiment of the present invention, and for convenience of description, only the relevant portions of the embodiment of the present invention are shown, which is detailed as follows:
in the speech recognition problem, the decoder in the transform structure can use the linguistic information in the input text sequence to predict the output. The autoregressive speech recognition method is to predict the speech by using the language information provided by the characters output before the current position, and because the autoregressive method is a serial decoding method, only the character information output before the time can be used. Many non-autoregressive methods use unidirectional linguistic information for prediction, however, non-autoregressive methods output text sequences in parallel, and the use of unidirectional linguistic information results in waste of reverse linguistic information. Therefore, the non-autoregressive speech recognition network based on bidirectional context provided by the embodiment adopts a Transformer encoder-decoder structure to perform speech recognition by using bidirectional language information, and the whole speech recognition network can be trained and tested end to end. The bidirectional context, i.e., bidirectional language information, proposed in this embodiment includes two directions, i.e., left to right and right to left.
As shown in fig. 1A, the speech recognition network provided in this embodiment mainly comprises an encoder 11 and a decoder 12 connected in sequence. The encoder 11 performs preliminary recognition on the input speech to obtain a preliminary recognition result, and the decoder 12 adjusts the preliminary recognition result using the bidirectional language information it provides and outputs the final speech recognition result, where the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. FIG. 1B compares learning a bidirectional context with a decoder against other methods: y1, y2 and y3 are characters, eos (end of sequence) is an end marker, and sos (start of sequence) is a start marker. FIG. 1B(a) uses a unidirectional decoder to learn a left-to-right context, FIG. 1B(b) uses a unidirectional decoder to learn a right-to-left context, and FIG. 1B(c) uses the decoder provided in this embodiment to learn a bidirectional context.
In a specific implementation, the encoder 11 of the speech recognition network may adopt the encoder structure of the Speech Transformer (Linhao Dong, Shuang Xu, and Bo Xu. Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition [C]. International Conference on Acoustics, Speech and Signal Processing, 2018: 5884-5888.), which mainly consists of self-attention layers and fully connected layers and outputs encoded speech features and a text sequence by extracting global features of the input speech features. The decoder takes the encoder's encoded speech features and the text sequence as input and predicts the recognition result by further extracting speech and language information. During this prediction, each position can update itself using bidirectional language information, so even if the original input sequence contains erroneous recognition results, the decoder can correct its own result according to the other characters in the input sequence. Further, the decoder's output sequence can be fed back into the decoder and recognized iteratively to further reduce the character error rate, at the cost of slightly lower decoding speed. Because the encoder does not model language information, its output characters carry a strong conditional-independence assumption; the decoder removes this assumption by using the language information provided by the input character sequence and outputs a more accurate recognition result.
The number of iterations may be preset. Preferably, the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism to improve decoding speed: when the output of the current decoder iteration is identical to its input, iteration stops automatically, because at the next iteration each position would see exactly the same language information, so the result could not change. The adaptive iteration-stop mechanism effectively improves the decoding speed.
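As a minimal sketch of this stopping rule, assuming a decoder callable that maps the current token sequence and the encoder output to per-position logits (the names and signature here are illustrative, not the patent's API):

```python
import torch

@torch.no_grad()
def iterative_decode(decoder, enc_out, tokens, max_iters=10):
    """Refine the preliminary token sequence until it reaches a fixed point."""
    for _ in range(max_iters):
        logits = decoder(tokens, enc_out)    # (batch, length, vocab)
        new_tokens = logits.argmax(dim=-1)   # parallel re-prediction
        if torch.equal(new_tokens, tokens):  # output == input: a further pass
            break                            # would see identical context
        tokens = new_tokens
    return tokens
```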
Preferably, the speech recognition network further comprises a convolutional downsampling layer, which downsamples the input speech signal and feeds the resulting speech features to the encoder; downsampling removes redundant frames from the speech signal and reduces the computational complexity of the whole network. In a specific implementation, the speech signal may first pass through the convolutional downsampling layer for N-fold downsampling, for example 4-fold downsampling, and the downsampled speech features serve as the encoder input.
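One common way to realize the 4-fold downsampling is two stride-2 convolutions; the sketch below makes that assumption (the layer sizes are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class ConvDownsampling(nn.Module):
    """4-fold downsampling along time via two stride-2 convolutions."""
    def __init__(self, feat_dim=80, d_model=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # the frequency axis is also reduced 4x by the two stride-2 convs
        self.proj = nn.Linear(d_model * ((feat_dim + 3) // 4), d_model)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        x = self.conv(x.unsqueeze(1))      # (batch, d_model, time/4, feat/4)
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
```

With an input of shape (batch, time, feat_dim) = (8, 400, 80), the output has shape (8, 100, 256).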
In the embodiment of the invention, speech recognition must solve two problems: predicting the output length and recognizing the output characters. In non-autoregressive speech recognition all characters are output in parallel, so the decoder needs to determine the length of the whole output sequence before decoding. Preferably, the preliminary recognition result includes a text sequence length, and the decoder outputs the speech recognition result in parallel based on that length, enabling real-time operation. The encoder may use a CTC (Connectionist Temporal Classification) loss so that it can predict the sequence length at test time, and the decoder may use a CE (Cross Entropy) loss so that it can re-predict using the bidirectional language information provided by the sequence output by the encoder. Since the decoder takes the speech features output by the encoder as part of its input, the encoder and decoder can be trained jointly; preferably, the joint loss function of the encoder and decoder is as follows:
$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{CE}}$

wherein $\mathcal{L}$ is the joint loss function, $\mathcal{L}_{\mathrm{CTC}}$ is the connectionist temporal classification (CTC) loss of the encoder, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss of the decoder, and $\lambda$ is a hyper-parameter. That is, the CTC loss and the CE loss are used for joint training.
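A sketch of this joint objective in PyTorch, assuming blank id 0 for the CTC loss and padding id -1 for the cross-entropy targets; lam corresponds to the CTC weight λ (set to 0.3 in the experiments below):

```python
import torch
import torch.nn.functional as F

def joint_loss(enc_logits, enc_lens, dec_logits, targets, target_lens, lam=0.3):
    """lam * L_CTC(encoder) + (1 - lam) * L_CE(decoder).

    enc_logits: (batch, time, vocab)     frame-level encoder outputs
    dec_logits: (batch, max_len, vocab)  decoder outputs, aligned with targets
    targets:    (batch, max_len)         character ids, padded with -1
    """
    # ctc_loss wants (time, batch, vocab) log-probabilities
    log_probs = F.log_softmax(enc_logits, dim=-1).transpose(0, 1)
    # clamp removes the -1 padding; entries beyond target_lens are ignored
    l_ctc = F.ctc_loss(log_probs, targets.clamp(min=0),
                       enc_lens, target_lens, blank=0)
    # cross-entropy over the decoder's parallel per-position predictions
    l_ce = F.cross_entropy(dec_logits.reshape(-1, dec_logits.size(-1)),
                           targets.reshape(-1), ignore_index=-1)
    return lam * l_ctc + (1.0 - lam) * l_ce
```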
In the embodiment of the present invention, each position in the decoder can be predicted using a bidirectional context, but a bidirectional context raises the problem of information leakage. In short, information leakage means that during training a decoder output position can see information about its own input; with such leakage the decoder cannot re-predict its input at test time and loses the ability to adjust the result using language information.
To prevent information leakage, the decoder preferably takes the position code as the Q of its first multi-head self-attention layer and inputs the same K and V into each multi-head self-attention layer of the decoder. In a specific implementation, as shown in FIG. 1C, the queries (Q), keys (K) and values (V) input to the decoder can be modified on the basis of the original Transformer decoder; FIG. 1C comprises character encoding, position encoding, a multi-head source attention layer and multi-head self-attention layers. Since there are residual connections in the decoder, a linear mapping of the position encoding is obtained,

$\tilde{P} = \mathrm{Linear}(P)$

and $\tilde{P}$ is taken as the Q of the first multi-head self-attention layer of the decoder, where $\mathrm{Linear}(\cdot)$ is a linear mapping and $P$ is the positional encoding.
Then, the same K and V are input to each multi-head self-attention layer of the decoder:

$K_i = V_i = C,\quad 1 \le i \le I$

where $I$ is the total number of multi-head self-attention layers of the decoder, $i$ is the index of the current multi-head self-attention layer, and $C$ is the character embedding. The K and V of the multi-head source attention layers of the decoder may be determined based on the encoder's output states.
In addition, to prevent information leakage, the attention mask described above also makes the language information at each position's own location invisible to that position. As shown in fig. 1D, the attention mask is preferably a two-dimensional matrix whose main-diagonal elements are all 0 and whose remaining elements are all 1; that is, the mask forces each position's attention weight to itself to be 0, thereby preventing information leakage.
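The sketch below combines the two measures: the query of the first self-attention layer is a linear mapping of the positional encoding, K and V are the character embedding, and a diagonal mask blocks each position from attending to itself. All sizes are illustrative; note that PyTorch's attn_mask convention marks blocked entries as True, the inverse of the 0/1 matrix of FIG. 1D:

```python
import torch
import torch.nn as nn

d_model, n_heads, L = 256, 4, 7        # illustrative sizes

P = torch.randn(L, 1, d_model)         # positional encoding P
C = torch.randn(L, 1, d_model)         # character embedding C

q_proj = nn.Linear(d_model, d_model)   # linear mapping of P
Q = q_proj(P)                          # Q of the first self-attention layer

# Each position may attend to every other position (bidirectional
# context) but never to itself: True entries are blocked.
mask = torch.eye(L, dtype=torch.bool)

attn = nn.MultiheadAttention(d_model, n_heads)
out, _ = attn(Q, C, C, attn_mask=mask)  # K = V = C in every layer
print(out.shape)                        # torch.Size([7, 1, 256])
```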
In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on input speech features to obtain a preliminary recognition result; the decoder adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, where the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. Language information is thereby fully utilized and the speech recognition effect improved, and the structure is more efficient and unified than using two unidirectional decoders to exploit unidirectional language information separately.
Example two:
fig. 2 shows an implementation flow of a bidirectional context-based non-autoregressive speech recognition method according to a second embodiment of the present invention, which is implemented according to the first embodiment of the present invention, and for convenience of description, only the relevant parts of the second embodiment of the present invention are shown, and the following details are described:
in step S201, an encoder of the trained speech recognition network performs a preliminary recognition on the input speech features to obtain a preliminary recognition result.
The embodiment of the invention is applicable to a voice recognition device, the voice recognition device can be a mobile phone, a tablet personal computer, a wearable device, an intelligent sound box, a vehicle-mounted device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a Personal Digital Assistant (PDA) and other devices, and the embodiment of the invention does not limit the specific type of the voice recognition device.
In the embodiment of the invention, the speech recognition network must be trained before the encoder of the trained network performs preliminary recognition on the input speech features. Preferably, a training set is used to jointly train the decoder and encoder of the speech recognition network until the loss value of the network is minimal, yielding the trained network; this enables end-to-end training and reduces training complexity and running time. Joint training trains the encoder and the decoder simultaneously.
When the speech recognition network is trained, the loss value of the network can be computed as a weighted sum of the encoder and decoder losses; different weights correspond to different degrees of parameter updating, and the best model is obtained by tuning the weights. Preferably, the joint loss function of the encoder and decoder is as follows:
$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{CE}}$

wherein $\mathcal{L}$ is the joint loss function, $\mathcal{L}_{\mathrm{CTC}}$ is the connectionist temporal classification (CTC) loss of the encoder, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss of the decoder, and $\lambda$ is a hyper-parameter. That is, the CTC loss and the CE loss are used for joint training.
When the speech recognition network is trained, preferably two data augmentation strategies, spectrum augmentation and speed perturbation, are applied to the training set to enhance the robustness of the speech recognition network.
In step S202, the preliminary recognition result is adjusted by the decoder of the trained speech recognition network, and the final speech recognition result is output, wherein the decoder adjusts the preliminary recognition result by using the bidirectional language information provided by the preliminary recognition result.
In the embodiment of the invention, when the decoder of the trained speech recognition network adjusts the preliminary recognition result, each position can update itself using bidirectional language information; even if the original input sequence contains erroneous recognition results, the decoder can correct its own result according to the other characters in the input sequence. Further, the decoder's output sequence can be fed back into the decoder and recognized iteratively to further reduce the character error rate, at the cost of slightly lower decoding speed. Preferably, the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism to improve decoding speed; specifically, when the output and input of the current decoder iteration are identical, iteration stops automatically.
Preferably, the preliminary recognition result includes a text sequence length, and the decoder outputs the voice recognition result in parallel based on the text sequence length, thereby implementing real-time operation.
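The text sequence length can be obtained, for example, by greedy CTC decoding of the encoder output for a single utterance; a sketch under that assumption (blank id 0 assumed):

```python
import torch

def ctc_greedy_length(enc_logits, blank=0):
    """Greedy CTC decoding of one utterance: collapse repeats, drop blanks.
    The length of the returned sequence fixes the decoder's output length."""
    frames = enc_logits.argmax(dim=-1)   # (time,) best label per frame
    tokens, prev = [], blank
    for t in frames.tolist():
        if t != blank and t != prev:
            tokens.append(t)
        prev = t
    return tokens, len(tokens)
```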
In the embodiment of the invention, the encoder of the trained speech recognition network performs preliminary recognition on the input speech features to obtain a preliminary recognition result, and the decoder of the trained network adjusts the preliminary recognition result and outputs the final speech recognition result, where the decoder adjusts the preliminary result using the bidirectional language information it provides. Language information is thereby fully utilized and the recognition effect improved, and the structure is more efficient and unified than using two unidirectional decoders to exploit unidirectional language information separately.
The speech recognition method provided by the embodiment is further verified and explained below with reference to an experimental example:
(1) corpus used in this experimental example:
the Aishell1 corpus is a Hill Shell Mandarin open source Speech corpus, which is part of the Hill Shell Mandarin Chinese Speech database AISHELL-ASR 0009. In a quiet indoor environment, 400 speakers from different accent areas in China participate in recording, a high-fidelity microphone (44.1 kHz, 16-bit) with 16kHz audio down-sampling is used for manufacturing, and the recording time is 178 hours. And (4) transcription and labeling by professional voice proofreaders, and passing strict quality inspection. The text accuracy of the corpus is more than 95%, and the corpus is divided into a training set, a development set and a test set.
The Magicdata corpus, published by Magic Data Technology, is a Mandarin Chinese speech corpus. Recording took place in a quiet indoor environment with 1,000 native Mandarin speakers from mainland China, captured on mobile phones, for a total of 755 hours. The text accuracy of the corpus is above 98%, and the corpus is likewise divided into a training set, a development set and a test set.
(2) Description of the experiments:
during model training, the experimental example uses two data amplification strategies of frequency spectrum enhancement and speed disturbance to enhance the robustness of the model. The experimental example adopts a pytorch1.7.0 deep learning framework, and is trained by using an Adam optimization strategy and a gradient accumulation strategy, wherein momentum parameters are set to be beta _1=0.9 and beta _2= 0.999. The initial learning rate was set to 0.0001 and the training batch was 32. All experiments were performed on a machine containing 4 NVIDIA Titan XP GPUs.
The corpora used in this example are the two open-source Mandarin corpora described above, and the speech in both is clean speech. Training uses joint CTC and CE loss, with the CTC loss weight set to 0.3 and the CE loss weight set to 0.7. The network contains multiple dropout layers, each with drop probability set to 0.1.
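The following sketch shows the corresponding optimizer and gradient-accumulation setup; the toy model and the accumulation step count are illustrative stand-ins, while the learning rate, betas and batch size follow the text above:

```python
import torch

model = torch.nn.Linear(80, 256)                 # stand-in for the network
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,            # initial learning rate
                             betas=(0.9, 0.999)) # beta_1, beta_2

accum_steps = 4                                  # gradient accumulation
optimizer.zero_grad()
for step in range(100):
    loss = model(torch.randn(32, 80)).pow(2).mean()  # training batch of 32
    (loss / accum_steps).backward()                  # accumulate gradients
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```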
(3) The experimental results are as follows:
to evaluate the effectiveness of this example, this example performed speech recognition tests in the above-mentioned corpus. The method provided by the embodiment is compared with the existing mainstream autoregressive and non-autoregressive voice recognition method, and comprises KERMIT, LASO, ST-NAR, Masked-NAT, CASS-NAT, CTC-enhanced Transformer, TS-NAT and AR Transformer.
The experimental results are shown in Tables 1 and 2, which give the results on the Aishell1 and Magicdata corpora respectively, where NAT-BC (Non-Autoregressive Transformer with Bidirectional Context) denotes the method described in this embodiment. The results show that the character error rate of the method described in this embodiment is lower than that of all the other non-autoregressive methods on both corpora, and that, while keeping a character error rate similar to the autoregressive method, its recognition speed is significantly faster and can meet real-time requirements. The character error rate covers insertion, substitution and deletion errors; the real-time rate is numerically equal to the time the computer needs to process one unit time of speech signal; and the relative speed is the speed of a model relative to the autoregressive Transformer model.
TABLE 1 (experimental results on the Aishell1 corpus; the table is reproduced as an image in the original document)
TABLE 2 (experimental results on the Magicdata corpus; the table is reproduced as an image in the original document)
To further verify the advantage of the bidirectional context in the method described in this example, the bidirectional context in the decoder was replaced with a unidirectional context, and comparative experiments were performed on the Magicdata corpus. The results are shown in Table 3. As can be seen from Table 3, the bidirectional context achieves a lower character error rate for every number of iterations, and the character error rate with multiple iterations is lower than with a single iteration. Notably, with multiple iterations the character error rate of the bidirectional context drops more than that of the unidirectional context, which further highlights the superiority of the bidirectional context.
TABLE 3 (comparison of bidirectional and unidirectional context on the Magicdata corpus; the table is reproduced as an image in the original document)
Example three:
fig. 3 shows a structure of a speech recognition apparatus according to a third embodiment of the present invention, and for convenience of explanation, only the parts related to the third embodiment of the present invention are shown.
The speech recognition device 3 of an embodiment of the present invention comprises a processor 30, a memory 31 and a computer program 32 stored in the memory 31 and executable on the processor 30. The processor 30, when executing the computer program 32, implements the steps in the above-described method embodiments, such as the steps S201 to S202 shown in fig. 2.
In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on input speech features to obtain a preliminary recognition result; the decoder adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, where the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. Language information is thereby fully utilized and the speech recognition effect improved, and the structure is more efficient and unified than using two unidirectional decoders to exploit unidirectional language information separately.
Example four:
in an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps in the above-described method embodiment, for example, steps S201 to S202 shown in fig. 2.
In the embodiment of the invention, the speech recognition network adopts a Transformer encoder-decoder structure. The encoder performs preliminary recognition on input speech features to obtain a preliminary recognition result; the decoder adjusts the preliminary recognition result using the bidirectional language information provided by that result and outputs the final speech recognition result, where the decoder utilizes the bidirectional language information through a preset attention mask applied to each multi-head self-attention layer of the decoder. Language information is thereby fully utilized and the speech recognition effect improved, and the structure is more efficient and unified than using two unidirectional decoders to exploit unidirectional language information separately.
The computer readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, a recording medium, such as a ROM/RAM, a magnetic disk, an optical disk, a flash memory, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A bi-directional context based non-autoregressive speech recognition network, wherein the speech recognition network employs a Transformer encoder-decoder architecture, and wherein:
the encoder of the voice recognition network is used for carrying out primary recognition on the input voice characteristics to obtain a primary recognition result;
and the decoder of the voice recognition network is used for adjusting the preliminary recognition result by utilizing the bidirectional language information provided by the preliminary recognition result and outputting a final voice recognition result, wherein the decoder utilizes the bidirectional language information through preset attention masks applied to each multi-head self-attention layer of the decoder.
2. The speech recognition network of claim 1, wherein the attention mask is a two-dimensional matrix whose main-diagonal elements are all 0 and whose elements outside the main diagonal are all 1.
3. The speech recognition network of claim 1, wherein the decoder uses a position code as a query Q for a first multi-headed self-attention layer of the decoder, and inputs the same key K and value V into each multi-headed self-attention layer of the decoder.
4. A method of speech recognition based on the bi-directional context based non-autoregressive speech recognition network of any of claims 1-3, the method comprising:
performing primary recognition on input voice features through a trained encoder of the voice recognition network to obtain a primary recognition result;
and adjusting the initial recognition result through a decoder of the trained voice recognition network, and outputting a final voice recognition result, wherein the decoder adjusts the initial recognition result by utilizing the bidirectional language information provided by the initial recognition result.
5. The method of claim 4, wherein prior to performing the preliminary recognition of the input speech by the trained speech recognition network encoder, further comprising:
and performing joint training on a decoder and an encoder of the voice recognition network by using a training set until the loss value of the voice recognition network is minimum, so as to obtain the trained voice recognition network.
6. The method of claim 5, wherein the encoder and the decoder joint loss function is as follows:
$\mathcal{L} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{CE}}$

wherein $\mathcal{L}$ is the joint loss function, $\mathcal{L}_{\mathrm{CTC}}$ is the connectionist temporal classification (CTC) loss of the encoder, $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss of the decoder, and $\lambda$ is a hyper-parameter.
7. The method of claim 4, wherein the decoder adjusts the preliminary recognition result using an adaptive iteration-stop mechanism.
8. The method of claim 4, wherein the preliminary recognition result includes a word sequence length, the decoder outputting the speech recognition results in parallel based on the word sequence length.
9. A speech recognition device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 4 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 4 to 8.
CN202111066812.2A 2021-09-13 2021-09-13 Non-autoregressive speech recognition network, method and equipment based on bidirectional context Active CN113516973B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111066812.2A CN113516973B (en) 2021-09-13 2021-09-13 Non-autoregressive speech recognition network, method and equipment based on bidirectional context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111066812.2A CN113516973B (en) 2021-09-13 2021-09-13 Non-autoregressive speech recognition network, method and equipment based on bidirectional context

Publications (2)

Publication Number Publication Date
CN113516973A CN113516973A (en) 2021-10-19
CN113516973B true CN113516973B (en) 2021-11-16

Family

ID=78063283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111066812.2A Active CN113516973B (en) 2021-09-13 2021-09-13 Non-autoregressive speech recognition network, method and equipment based on bidirectional context

Country Status (1)

Country Link
CN (1) CN113516973B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114220432A (en) * 2021-11-15 2022-03-22 交通运输部南海航海保障中心广州通信中心 Maritime single-side-band-based voice automatic monitoring method and system and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111914178A (en) * 2020-08-19 2020-11-10 腾讯科技(深圳)有限公司 Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN112417901A (en) * 2020-12-03 2021-02-26 内蒙古工业大学 Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN113362813A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11604956B2 (en) * 2017-10-27 2023-03-14 Salesforce.Com, Inc. Sequence-to-sequence prediction using a neural network model

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970031A (en) * 2019-12-16 2020-04-07 苏州思必驰信息科技有限公司 Speech recognition system and method
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111914178A (en) * 2020-08-19 2020-11-10 腾讯科技(深圳)有限公司 Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN112417901A (en) * 2020-12-03 2021-02-26 内蒙古工业大学 Non-autoregressive Mongolian machine translation method based on look-around decoding and vocabulary attention
CN113362813A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN113516973A (en) 2021-10-19

Similar Documents

Publication Publication Date Title
WO2021104102A1 (en) Speech recognition error correction method, related devices, and readable storage medium
CN114444479B (en) End-to-end Chinese speech text error correction method, device and storage medium
CN112786005B (en) Information synthesis method, apparatus, electronic device, and computer-readable storage medium
WO2021143206A1 (en) Single-statement natural language processing method and apparatus, computer device, and readable storage medium
CN111539199B (en) Text error correction method, device, terminal and storage medium
CN113516973B (en) Non-autoregressive speech recognition network, method and equipment based on bidirectional context
Lohrenz et al. Multi-encoder learning and stream fusion for transformer-based end-to-end automatic speech recognition
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
CN114242071A (en) Low-resource voice recognition method and system and voice model training method
CN115455946A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
Zhang et al. Non-autoregressive transformer with unified bidirectional decoder for automatic speech recognition
Wei et al. Leveraging acoustic contextual representation by audio-textual cross-modal learning for conversational asr
CN112489651B (en) Voice recognition method, electronic device and storage device
CN116665675B (en) Voice transcription method, system, electronic equipment and storage medium
Tan et al. Four-in-One: a joint approach to inverse text normalization, punctuation, capitalization, and disfluency for automatic speech recognition
CN115270771B (en) Fine-grained self-adaptive Chinese spelling error correction method assisted by word-sound prediction task
CN112989794A (en) Model training method and device, intelligent robot and storage medium
CN114005434A (en) End-to-end voice confidence calculation method, device, server and medium
Lan et al. Dialogue act recognition using maximum entropy
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN116453507B (en) Confidence model-based voice recognition optimization method, system and storage medium
Miyazaki et al. Structured state space decoder for speech recognition and synthesis
CN109543151B (en) Method for improving wording accuracy of Laos language
WO2024022541A1 (en) Voice recognition method and apparatus, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant