CN112509564A - End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism - Google Patents

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Download PDF

Info

Publication number
CN112509564A
CN112509564A CN202011101902.6A
Authority
CN
China
Prior art keywords
voice
encoder
training
decoder
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011101902.6A
Other languages
Chinese (zh)
Other versions
CN112509564B (en)
Inventor
庞伟
王亮
陆生礼
狄敏
姚志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Original Assignee
Southeast University Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University Wuxi Institute Of Integrated Circuit Technology, Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd filed Critical Southeast University Wuxi Institute Of Integrated Circuit Technology
Priority to CN202011101902.6A priority Critical patent/CN112509564B/en
Publication of CN112509564A publication Critical patent/CN112509564A/en
Application granted granted Critical
Publication of CN112509564B publication Critical patent/CN112509564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses an end-to-end speech recognition method based on connection time sequence classification (connectionist temporal classification, CTC) and a self-attention mechanism. The method shares a single encoder network: the output of the encoder is trained with the CTC criterion and is also used as the input of the decoder, realizing the attention relationship between the encoder and the decoder; the decoder is trained with the cross-entropy criterion; finally, the two training criteria are combined by assigning them different weights. The method accelerates model convergence, obtains more accurate alignment, captures the internal relationships within the input, and improves the accuracy and robustness of the speech recognition system.

Description

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
Technical Field
The invention discloses an end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism, relates to speech recognition technology, and belongs to the technical field of computing, calculating and counting.
Background
In recent years, with the improvement of computing power, the accumulation of data and progress in algorithms, deep learning has gradually been replacing traditional probability-based machine learning research, and fields such as computer vision, natural language processing and computer audition have become the most important research focuses in artificial intelligence. The development of speech recognition technology has benefited from the dividend brought by the rapid development of deep learning. Since the Deep Neural Network (DNN) replaced the Gaussian Mixture Model (GMM) for modelling the observation probability of speech in 2011, speech recognition has succeeded in large-vocabulary continuous speech recognition, and the recognition accuracy has made its biggest breakthrough of the past 10 years. To date, many techniques have been developed in the search for better speech recognition, so that recognition accuracy now exceeds the human level. Behind this result, however, lie a number of complex techniques deployed on the server side, which occupy a large amount of storage space and computational resources and also consume a large amount of energy.
Existing speech recognition methods mainly use a Long Short-Term Memory network (LSTM) to model speech. The drawbacks of this approach are that training cannot be parallelized, since the next input can only be processed after the current input has been handled, so the training time is too long, and problems such as vanishing or dispersing gradients easily arise. Moreover, when the input sequence is long, only information on the order of 100 steps can be remembered, not information or sequences on the order of 1000 steps. The biggest drawback is the particularly high hardware requirement: the computation is bound by memory bandwidth, which is a nightmare for hardware designers and ultimately limits the applicability of the solution.
Disclosure of Invention
The purpose of the invention is as follows: in order to overcome the defects in the prior art, the invention provides an end-to-end voice recognition method based on connection time sequence classification and a self-attention mechanism.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
An end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism adopts an encoder-decoder structure as its main framework. The encoder-decoder structure comprises an encoder sub-network and a decoder sub-network; both tasks share the same encoder network, and after the encoder network has been trained, a speech recognition decoder directly outputs the corresponding decoded sequence. Meanwhile, the high-level abstract features extracted by the encoder are used as part of the input of the decoder and trained jointly with it. The two sub-networks use different training criteria; different weights are then given to the outputs of the loss functions, and the model parameters are updated to different degrees through the joint optimization of the two sub-networks.
The encoder-decoder structure is implemented with a Transformer architecture, which models the input speech and the input labels in depth by stacking multiple Transformer layers and mines the relationship between speech and text.
The speech recognition decoder decodes using beam search, and the decoded result, obtained from the CTC output, corresponds directly to the correct English word or Chinese character sequence. The acoustic model is modelled with large-granularity units such as English words or Chinese characters; the Chinese characters used as output classes are generally the common characters that appear in the training and test sets.
The method specifically comprises the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; feature extraction is first performed on the speech in the data set with a Mel filter-bank algorithm; the speech is then converted from the time domain to the frequency domain by a discrete Fourier transform so that its characteristics can be observed more easily; the finally extracted features serve as the input of the network;
step 2, training an acoustic model: the encoder sub-network is regarded as an acoustic model; the speech features extracted from the training set are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through a self-attention mechanism, feature classification is performed after a fully connected mapping, and training is finally carried out with the CTC training criterion;
the encoder sub-network comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization, and a CTC loss function finally aligns the input speech frames with the output of the encoder; position information is added to each frame of features before the features are input into the encoder sub-network; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is computed first in order to obtain multi-level characteristics of the speech signal; a fully connected layer then performs feature mapping, one part serving as input to the decoder and the other part as the output of the encoder; the output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this fully connected layer equals the number of modelled Chinese characters plus one, i.e. a blank unit;
step 3, training a language model: the decoder sub-network is regarded as a language model and is trained with the cross-entropy training criterion; the first self-attention layer at the input of the decoder sub-network must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed; the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is the weighted sum of the respective loss values of the encoder and the decoder, different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights;
and step 4, decoding: during training, the model parameters with the smallest loss value are saved as the trained model; decoding uses a beam-search mode, which each time keeps the n most probable of the paths passing through the current node; when the last node is reached, the sequence with the greatest path probability is taken as the final recognition result; during testing, features are extracted from the speech to be tested, which is then input into the trained model for processing and finally sent to the decoder to obtain the final recognition result.
Preferably: the encoder network uses the CTC training criterion and the decoder network uses the cross-entropy training criterion; the two outputs are weighted, and different emphases in updating the model parameters are realized by adjusting the weighting coefficient. The loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
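As an illustration only (the patent does not specify an implementation framework; PyTorch, the blank index and the padding index below are assumptions), the weighted objective loss = λ * loss_ctc + (1 - λ) * loss_sa could be computed along the following lines:

    import torch.nn.functional as F

    def joint_loss(enc_log_probs, targets, input_lengths, target_lengths,
                   dec_logits, dec_targets, lam=0.3):
        """Weighted sum of the CTC loss (encoder) and cross-entropy loss (decoder).

        enc_log_probs: (T, batch, classes) log-probabilities from the encoder head.
        dec_logits:    (batch, seq_len, vocab) decoder outputs before softmax.
        lam:           the weighting adjustment factor lambda from the claim.
        """
        loss_ctc = F.ctc_loss(enc_log_probs, targets, input_lengths, target_lengths,
                              blank=0, zero_infinity=True)          # assumed blank index 0
        loss_sa = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets,
                                  ignore_index=-100)                 # assumed padding index
        return lam * loss_ctc + (1.0 - lam) * loss_sa

Setting lam to 1 reduces the objective to pure CTC training, and setting it to 0 leaves only the cross-entropy branch, matching the two degenerate cases described in the detailed description below.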
Preferably: in step 1, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms.
Preferably: windowing is applied to the speech extracted in step 1, so that the data of adjacent speech frames remain continuous and spectral leakage is prevented.
Preferably: the features extracted in step 1 are 40-dimensional Mel features.
Preferably: data enhancement is performed on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the speech.
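For illustration only, such deletion in the time and frequency domains can be realized as SpecAugment-style masking; the sketch below assumes PyTorch, and the mask widths are arbitrary values not taken from the patent:

    import torch

    def mask_time_and_frequency(feats, max_time_mask=20, max_freq_mask=8):
        """Zero out a random span of frames and a random band of Mel channels.

        feats: (num_frames, num_mels) features of one utterance.
        The patent only states that deletion is applied in the time and frequency
        domains of the speech; the mask widths here are illustrative assumptions.
        """
        feats = feats.clone()
        num_frames, num_mels = feats.shape

        t = torch.randint(0, max_time_mask + 1, (1,)).item()
        t0 = torch.randint(0, max(1, num_frames - t), (1,)).item()
        feats[t0:t0 + t, :] = 0.0                     # deletion in the time domain

        f = torch.randint(0, max_freq_mask + 1, (1,)).item()
        f0 = torch.randint(0, max(1, num_mels - f), (1,)).item()
        feats[:, f0:f0 + f] = 0.0                     # deletion in the frequency domain
        return feats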
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention provides an end-to-end voice recognition method based on connection time sequence classification and a self-attention mechanism.
(2) The invention uses two training criteria during training but only one decoding mode during decoding, thereby improving the accuracy of speech recognition without increasing the decoding burden.
Drawings
Fig. 1 is a schematic structural view of the present invention.
FIG. 2 is a flow chart of speech recognition according to the present invention.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and specific embodiments. It is to be understood that these examples are given solely for the purpose of illustration and are not intended to limit the scope of the invention; various equivalent modifications will occur to those skilled in the art upon reading the present invention, and these fall within the scope of the appended claims.
An end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism uses a hybrid mechanism of connectionist temporal classification (CTC) and self-attention (SA) to model English words or Chinese characters directly; no pre-processing or post-processing is needed, and the output directly corresponds to the correct English word sequence or Chinese character sequence. The method shares the same encoder network: the output of the encoder is trained with the CTC criterion and is also used as input to the decoder, realizing the attention relationship between the encoder and the decoder; the decoder is trained with the cross-entropy criterion; finally, the two training criteria are given different weights, which represent how strongly each criterion updates the network parameters during back-propagation. When the weight is 0 or 1, the system can be regarded as two entirely different network models. By controlling the weight, the invention adjusts the modelling capacity of the model autonomously, while the pre-alignment property of CTC accelerates convergence and reduces the training time of the model. The model structure is shown in Fig. 1 and comprises an acoustic model, a language model and a decoder. The recognition method comprises the following four steps:
Step 1, data preparation and feature extraction: a publicly available speech data set, the Aishell-1 corpus, is used and divided into a training set, a validation set and a test set; a feature extraction algorithm, specifically a Mel filter-bank algorithm, then extracts representative features from every utterance. Because speech is stationary over short intervals, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms. Windowing is applied to the extracted speech so that the data of adjacent frames remain continuous and spectral leakage is prevented. The speech is then transformed from the time domain to the frequency domain with a discrete Fourier transform so that its characteristics can be observed better. Since the perception of sound by the human ear follows the Mel-frequency scale, that is, a perceived doubling of a sound corresponds to a doubling on the Mel frequency axis, feature extraction is carried out on each frame of the speech signal with a Mel-frequency filter bank. The finally extracted features are 40-dimensional Mel features, which are used as the input of the network.
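A possible realization of this front end, sketched here with the librosa library (the toolkit, FFT size, window type and sampling rate are assumptions; the patent only fixes the 25 ms / 10 ms framing and the 40 Mel dimensions):

    import librosa
    import numpy as np

    def extract_fbank(wav_path, n_mels=40):
        """Extract 40-dimensional log Mel filter-bank features.

        Frames are 25 ms long with a 10 ms shift; a Hann window keeps adjacent
        frames continuous and limits spectral leakage, and the DFT inside the
        Mel-spectrogram computation moves each frame to the frequency domain.
        """
        y, sr = librosa.load(wav_path, sr=16000)                # assumed 16 kHz sampling rate
        win_length = int(0.025 * sr)                            # 25 ms frame length
        hop_length = int(0.010 * sr)                            # 10 ms frame shift
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=512, win_length=win_length,
            hop_length=hop_length, n_mels=n_mels, window="hann")
        return librosa.power_to_db(mel).T.astype(np.float32)    # (num_frames, n_mels)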
Step 2, training the acoustic model. As shown in Fig. 1, the encoder sub-network is regarded as the acoustic model: the extracted speech features are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through the self-attention mechanism, feature classification is performed after a fully connected mapping, and training finally uses the CTC training criterion. Because the CTC criterion does not require the input speech to be aligned with the output sequence in advance, a large amount of manpower and material resources are saved and development efficiency is improved.
The input speech features have undergone data enhancement, which includes deletion operations in the time domain and the frequency domain of the speech. Before entering the network, a position encoding must be superimposed on the feature dimension to indicate the relative position of each speech frame; the frames are then processed inside the network, and the output dimension of the encoder is 6308, representing the number of Chinese character classes in the final output.
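The exact form of the position encoding is not spelled out in the patent; a common choice, shown here purely as an assumption, is the sinusoidal encoding of the original Transformer (the sketch assumes PyTorch and an even feature dimension):

    import math
    import torch

    def add_positional_encoding(features):
        """Superimpose sinusoidal position information on each speech frame.

        features: (batch, num_frames, feat_dim) tensor; feat_dim is assumed even.
        The patent only states that a position encoding is superimposed on the
        feature dimension; the sinusoidal form here is an assumption.
        """
        _, num_frames, dim = features.shape
        position = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)        # (T, 1)
        div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                             * (-math.log(10000.0) / dim))                           # (dim/2,)
        pe = torch.zeros(num_frames, dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return features + pe.unsqueeze(0).to(features.device)                        # broadcast over batch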
The encoder structure specifically comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, two residual connections and layer normalization; finally, the CTC loss function aligns the input speech frames with the output of the encoder, which accelerates the overall training. Position information must be added to each frame feature before it is input into the network, because the attention mechanism relates features across the whole utterance and does not itself retain position information when computing the relationships between input features. After the position information has been added, the features can be input into the network; multi-head attention is computed first, similar in spirit to the multiple kernels of a convolution, in order to obtain multi-level characteristics of the speech signal. Residual connections are added to avoid vanishing gradients when the parameters are updated during back-propagation, and layer normalization keeps the data within a suitable range as it flows through the different layers of the network. A fully connected layer is then used for feature mapping, one part serving as input to the decoder and the other as the output of the encoder. The output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this layer equals the number of modelled Chinese characters plus one, i.e. a blank unit.
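For illustration, one encoder layer of this kind can be assembled from standard PyTorch modules; the model width, number of heads and feed-forward size below are assumptions, while the 6308-way output and the CTC loss follow the embodiment:

    import torch.nn as nn

    class SpeechEncoderLayer(nn.Module):
        """Multi-head self-attention and feed-forward block with two residual
        connections and layer normalization, as described for the encoder."""

        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            attn_out, _ = self.attn(x, x, x)     # relate every speech frame to every other frame
            x = self.norm1(x + attn_out)         # residual connection 1
            x = self.norm2(x + self.ff(x))       # residual connection 2
            return x

    # Encoder classification head: a fully connected layer whose width equals the
    # number of modelled Chinese characters plus one blank unit (6308 in the
    # embodiment), trained with the CTC loss.
    ctc_head = nn.Linear(256, 6308)
    ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)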
Step 3, training the language model. As shown in Fig. 1, the decoder sub-network is regarded as the language model and is trained with the cross-entropy training criterion. The decoder sub-network comprises a masked multi-head attention layer, a feed-forward network layer, three residual connections and layer normalization; its structure and processing flow are similar to those of the encoder sub-network. The input of the decoder is the text sequence corresponding to the current training utterance. It enters the decoder sub-network through an embedding, the text features likewise acquire the internal relationships between the tokens through the self-attention mechanism, and finally feature fusion and classification are performed through a fully connected layer with a softmax activation, trained according to the cross-entropy training criterion.
The input is the superposition of the embedding and the position encoding of the text sequence. The special point of this step is that, when the current moment is being decoded, the inputs at later moments must be covered to prevent data leakage and inaccurate recognition results.
The decoder sub-network differs from the encoder sub-network in two points. First, the first self-attention layer at the decoder input must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed, which prevents data leakage from degrading the model. Second, the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder. The end of the network uses a softmax output, and the network model is trained with the cross-entropy loss function. The overall loss value of the model is the weighted sum of the loss values of the encoder and the decoder; different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights.
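The two attention layers of the decoder can be sketched in the same illustrative style (again PyTorch, with assumed hyper-parameters); the boolean causal mask implements the covering of later moments described above:

    import torch
    import torch.nn as nn

    class SpeechDecoderLayer(nn.Module):
        """Masked self-attention over the text, attention to the encoder output,
        and a feed-forward block, each followed by residual addition and layer norm."""

        def __init__(self, d_model=256, n_heads=4, d_ff=1024):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                    nn.Linear(d_ff, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, text, enc_out):
            seq_len = text.size(1)
            # Boolean causal mask: True entries are positions a step may NOT attend to,
            # so inputs at later moments stay hidden from the current moment.
            causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                           device=text.device), diagonal=1)
            sa, _ = self.self_attn(text, text, text, attn_mask=causal)
            x = self.norm1(text + sa)
            ca, _ = self.cross_attn(x, enc_out, enc_out)   # attention between decoder and encoder
            x = self.norm2(x + ca)
            return self.norm3(x + self.ff(x))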
Step 4, decoding. Through the joint optimization of the two criteria, the model parameters with the smallest loss are obtained, the test-set data are decoded with beam search, and the recognition result is output; accordingly, the model parameters with the smallest loss value are saved during training. Decoding uses beam search, which each time keeps the n most probable of the paths passing through the current node, so that correct paths that temporarily have a low probability are not lost. When the last node is reached, the sequence with the greatest path probability is taken as the final recognition result. During testing, features are extracted from the speech, input into the network for processing, and finally sent to the decoder to obtain the final recognition result.
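A simplified beam-search illustration over given per-step log-probabilities (in the actual decoder each step's distribution depends on the previously decoded tokens; this sketch only shows how the n best partial paths are kept and the most probable sequence is returned):

    import torch

    def beam_search(step_log_probs, beam_width=5):
        """Keep the beam_width most probable partial paths at every step.

        step_log_probs: (num_steps, vocab_size) log-probabilities per decoding step,
        assumed to be given; the sequence with the largest total log-probability
        among the surviving paths is returned at the end.
        """
        beams = [([], 0.0)]                                   # (token sequence, log-probability)
        for step in step_log_probs:
            candidates = []
            for seq, score in beams:
                top_p, top_idx = step.topk(beam_width)        # n best extensions of this path
                for p, idx in zip(top_p.tolist(), top_idx.tolist()):
                    candidates.append((seq + [idx], score + p))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = candidates[:beam_width]                   # keep only the n best paths
        return beams[0][0]                                    # sequence with the maximal probability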
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements an end-to-end speech recognition method based on connection timing classification and a self-attention mechanism.
In summary, the present invention provides an end-to-end speech recognition method based on connection time sequence classification and a self-attention mechanism. It uses the conditional-independence assumption of CTC to align speech and text quickly, thereby speeding up convergence, and uses SA to capture global associations, solving the long-distance dependency problem. Without increasing the complexity of the model, a recognition result better than that of either single model is obtained by adjusting the weighting factor and combining the advantages of both.
The invention combines the mature CTC technique with the SA-based Transformer model, which not only accelerates the training of the model but also captures the relationships between all frames of the speech, improving the stability and recognition rate of the speech recognition system.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and these also fall within the scope of protection of the invention.

Claims (8)

1. An end-to-end speech recognition method based on connection timing classification and a self-attention mechanism is characterized by comprising the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; feature extraction is first performed on the speech in the data set with a Mel filter-bank algorithm; the speech is then converted from the time domain to the frequency domain by a discrete Fourier transform so that its characteristics can be observed more easily; the finally extracted features serve as the input of the network;
step 2, training an acoustic model: the encoder sub-network is regarded as an acoustic model; the speech features extracted from the training set are input into the encoder sub-network, the internal relationship between the frames of the speech signal is obtained through a self-attention mechanism, feature classification is performed after a fully connected mapping, and training is finally carried out with the CTC training criterion;
the encoder sub-network comprises a position-encoding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization, and a CTC loss function finally aligns the input speech frames with the output of the encoder; position information is added to each frame of features before the features are input into the encoder sub-network; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is computed first in order to obtain multi-level characteristics of the speech signal; a fully connected layer then performs feature mapping, one part serving as input to the decoder and the other part as the output of the encoder; the output of the encoder is followed by one fully connected layer that gives the classification result of the final modelling units, and the number of neurons of this fully connected layer equals the number of modelled Chinese characters plus one, i.e. a blank unit;
step 3, training a language model: the decoder sub-network is regarded as a language model and is trained with the cross-entropy training criterion; the first self-attention layer at the input of the decoder sub-network must be masked, i.e. inputs at later moments cannot be seen while the current moment is being processed; the second attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is the weighted sum of the respective loss values of the encoder and the decoder, different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights;
and step 4, decoding: during training, the model parameters with the smallest loss value are saved as the trained model; decoding uses a beam-search mode, which each time keeps the n most probable of the paths passing through the current node; when the last node is reached, the sequence with the greatest path probability is taken as the final recognition result; during testing, features are extracted from the speech to be tested, which is then input into the trained model for processing and finally sent to the decoder to obtain the final recognition result.
2. The end-to-end speech recognition method based on connection timing classification and a self-attention mechanism of claim 1, wherein: the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
3. The method of claim 2, wherein: in step 1, extraction is performed in units of speech frames with a frame length of 25 ms and a frame shift of 10 ms.
4. The method of claim 3, wherein: windowing is applied to the speech extracted in step 1, so that the data of adjacent speech frames remain continuous and spectral leakage is prevented.
5. The method of claim 4, wherein: the features extracted in step 1 are 40-dimensional Mel features.
6. The method of claim 5, wherein: data enhancement is performed on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the speech.
7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the speech recognition method of claim 1.
8. A terminal device, comprising: a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the speech recognition method of claim 1.
CN202011101902.6A 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Active CN112509564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Publications (2)

Publication Number Publication Date
CN112509564A true CN112509564A (en) 2021-03-16
CN112509564B CN112509564B (en) 2024-04-02

Family

ID=74953853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101902.6A Active CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112509564B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
US20200135174A1 (en) * 2018-10-24 2020-04-30 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767926B (en) * 2021-04-09 2021-06-25 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112767926A (en) * 2021-04-09 2021-05-07 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112863489A (en) * 2021-04-26 2021-05-28 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113241075A (en) * 2021-05-06 2021-08-10 西北工业大学 Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN113436616A (en) * 2021-05-28 2021-09-24 中国科学院声学研究所 Multi-field self-adaptive end-to-end voice recognition method, system and electronic device
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113257239A (en) * 2021-06-15 2021-08-13 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113409772A (en) * 2021-06-15 2021-09-17 西北工业大学 Encoder and end-to-end voice recognition system based on local generation type attention mechanism and adopting same
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113257248A (en) * 2021-06-18 2021-08-13 中国科学院自动化研究所 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
CN113488028A (en) * 2021-06-23 2021-10-08 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on rapid skip decoding
CN113488028B (en) * 2021-06-23 2024-02-27 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on fast jump decoding
CN113362812A (en) * 2021-06-30 2021-09-07 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113362812B (en) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113762356B (en) * 2021-08-17 2023-06-16 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113762356A (en) * 2021-08-17 2021-12-07 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method
CN113782029A (en) * 2021-09-22 2021-12-10 广东电网有限责任公司 Training method, device and equipment of speech recognition model and storage medium
CN113782029B (en) * 2021-09-22 2023-10-27 广东电网有限责任公司 Training method, device, equipment and storage medium of voice recognition model
CN113707136A (en) * 2021-10-28 2021-11-26 南京南大电子智慧型服务机器人研究院有限公司 Audio and video mixed voice front-end processing method for voice interaction of service robot
CN113936641B (en) * 2021-12-17 2022-03-25 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113936641A (en) * 2021-12-17 2022-01-14 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113990296B (en) * 2021-12-24 2022-05-27 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN113990296A (en) * 2021-12-24 2022-01-28 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN116781417A (en) * 2023-08-15 2023-09-19 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition
CN116781417B (en) * 2023-08-15 2023-11-17 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition

Also Published As

Publication number Publication date
CN112509564B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN112509564A (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
US11062699B2 (en) Speech recognition with trained GMM-HMM and LSTM models
WO2018227781A1 (en) Voice recognition method, apparatus, computer device, and storage medium
CN110444208A (en) A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm
CN109767759A (en) End-to-end speech recognition methods based on modified CLDNN structure
CN110189749A (en) Voice keyword automatic identifying method
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN111899757B (en) Single-channel voice separation method and system for target speaker extraction
CN103345923A (en) Sparse representation based short-voice speaker recognition method
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN117095694B (en) Bird song recognition method based on tag hierarchical structure attribute relationship
CN114141238A (en) Voice enhancement method fusing Transformer and U-net network
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN115394287A (en) Mixed language voice recognition method, device, system and storage medium
Gao et al. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese.
CN108806725A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN114944150A (en) Dual-task-based Conformer land-air communication acoustic model construction method
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Monteiro et al. On the performance of time-pooling strategies for end-to-end spoken language identification
Li et al. Bidirectional LSTM Network with Ordered Neurons for Speech Enhancement.
CN108831486B (en) Speaker recognition method based on DNN and GMM models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant