CN112509564B - End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism - Google Patents

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Info

Publication number
CN112509564B
CN112509564B (application CN202011101902.6A)
Authority
CN
China
Prior art keywords
voice
encoder
training
self
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011101902.6A
Other languages
Chinese (zh)
Other versions
CN112509564A (en)
Inventor
庞伟
王亮
陆生礼
狄敏
姚志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Original Assignee
Southeast University-Wuxi Institute Of Integrated Circuit Technology
Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University-Wuxi Institute Of Integrated Circuit Technology, Jiangsu Province Nanjing University Of Science And Technology Electronic Information Technology Co ltd filed Critical Southeast University-Wuxi Institute Of Integrated Circuit Technology
Priority to CN202011101902.6A priority Critical patent/CN112509564B/en
Publication of CN112509564A publication Critical patent/CN112509564A/en
Application granted granted Critical
Publication of CN112509564B publication Critical patent/CN112509564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models

Abstract

The invention discloses an end-to-end speech recognition method based on connectionist temporal classification (CTC) and a self-attention mechanism (SA). A hybrid CTC/SA mechanism models English words or Chinese characters directly, without preprocessing or post-processing, so that the output corresponds directly to the correct English word sequence or Chinese character sequence. The two branches share the same encoder network: the encoder output is trained with the CTC criterion and is also fed to the decoder, realizing the attention relationship between encoder and decoder; the decoder is trained with the cross-entropy criterion; finally the two training criteria are combined with different weights. The invention not only accelerates model convergence and yields more accurate alignments, but also captures the internal relations among the inputs, thereby improving the accuracy and robustness of the speech recognition system.

Description

End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
Technical Field
The invention discloses an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, relates to speech recognition technology, and belongs to the technical field of computing, calculating and counting.
Background
In recent years, with the growth of computing power, the accumulation of data and the improvement of algorithms, deep learning has gradually been replacing research based on traditional machine learning; computer vision, natural language processing, computer audition and related fields have become the most important research hot spots in artificial intelligence, and the development of speech recognition in particular has benefited from the rapid development of deep learning. Since 2011, when the deep neural network (Deep Neural Network, DNN) replaced the Gaussian mixture model (Gaussian Mixture Model, GMM) for modelling the observation probability of speech, speech recognition has achieved success in large-vocabulary continuous speech recognition, the biggest breakthrough in recognition performance of the last ten years. On the road of exploring speech recognition, many techniques have since been developed that push recognition accuracy beyond the human level. The price, however, is that many complex techniques are stacked together and deployed on the server side, consuming large amounts of storage space, computing resources and energy.
Existing speech recognition methods mainly use the Long Short-Term Memory (LSTM) network to model speech. The drawbacks of this approach are that training cannot be parallelized, because the next input can only be processed after the current input has been handled, so training takes too long and gradient vanishing or dispersion occurs easily; moreover, when the input sequence is long, only information on the order of 100 time steps can be remembered, not sequences on the order of 1000 steps. The biggest disadvantage is that the hardware requirements are particularly high: the computation is bound by memory bandwidth, which is a nightmare for the hardware designer and ultimately limits the applicability of the solution.
Disclosure of Invention
The invention aims to: in order to overcome the defects in the prior art, the invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, which uses CTC to align the speech and SA to analyse the relations within the speech; combining the advantages of CTC and SA accelerates training, simplifies the model and improves robustness.
Technical scheme: in order to achieve the above purpose, the invention adopts the following technical scheme:
an end-to-end speech recognition method based on connection time sequence classification and self-attention mechanism adopts the structure of an encoder-decoder as a main body framework, wherein the structure of the encoder-decoder comprises an encoder sub-network and a decoder sub-network; at the same time, the high-level abstract features extracted by the encoder are used as a part of the input of the decoder to be trained jointly with the decoder. The two sub-networks respectively use different training criteria, then give different weights to the output of the loss function, and update the model parameters to different degrees through the joint optimization of the two sub-networks.
The encoder-decoder architecture is implemented with the Transformer architecture, which performs deep modelling of the input speech and the input labels by stacking multiple Transformer layers and mines the links between speech and text.
The speech recognition decoder decodes by means of beam search; the decoding result corresponds directly to the correct English word or Chinese character sequence, and CTC decoding is used. The acoustic model is built on large-granularity units such as English words or Chinese characters; the output classification set is generally chosen as the characters appearing in the training set and the test set together with common Chinese characters.
The method specifically comprises the following steps:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; firstly, extracting features from the voice in the voice data set using a Mel filter bank algorithm; then, using the discrete Fourier transform to convert the voice from the time domain to the frequency domain, so that the characteristics of the voice can be observed better; the finally extracted features are used as the input of the network;

step 2, training an acoustic model: the encoder sub-network is regarded as the acoustic model; the voice features extracted from the training set are input into the encoder sub-network, the inherent relations between the frames of the voice signal are obtained through a self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out using the CTC training criterion;

the encoder sub-network comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization layers, and finally a CTC loss function is used to align the input voice frames with the output of the encoder; before the features are input into the encoder sub-network, position information is added to each frame of features; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is first computed to obtain multi-level features of the voice signal; a fully connected layer then performs feature mapping, one part being used as input to the decoder and the other part as the output of the encoder; a further fully connected layer after the encoder output gives the classification result over the final modelling units, the number of neurons of this fully connected layer being equal to the number of Chinese characters to be modelled plus one, namely a blank unit;

step 3, training a language model: the decoder sub-network is regarded as the language model; the language model is trained using the cross-entropy training criterion; the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed; the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is calculated by weighting and summing the loss values of the encoder and the decoder, different weights representing different degrees of parameter updating, and the best model is obtained by adjusting the weights;

step 4, decoding: during training, the model parameters with the minimum loss value are saved as the trained model; decoding uses beam search, which at each step keeps the n most probable paths leading to the current node; when the last node is reached, the sequence with the maximum path probability is taken as the final recognition result; during testing, the voice to be tested goes through feature extraction, is input into the trained model for processing, and is finally sent to the decoder to obtain the final recognition result.
Preferably: the encoder network uses the CTC training criterion and the decoder network uses the cross-entropy training criterion; the outputs of the two loss functions are weighted, and different emphases of the model parameter update are realized by adjusting the weighting coefficient; the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
Preferably: in step 1, the features are extracted in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms.
Preferably: in step 1, windowing is performed on the extracted voice, so that the data continuity of adjacent voice frames is ensured, and frequency spectrum leakage is prevented.
Preferably: the features extracted in step 1 are 40-dimensional mel features.
Preferably: data enhancement is carried out on the voice data collected in step 1, the data enhancement comprising deletion (masking) operations in the time domain and the frequency domain of the voice.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements an end-to-end speech recognition method based on connection timing classification and self-attention mechanisms.
A terminal device, comprising: the system comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor realizes an end-to-end voice recognition method based on connection time sequence classification and self-attention mechanism when executing the program.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism. By creatively combining the CTC and SA mechanisms, it not only accelerates the convergence speed of the model and obtains more accurate alignments, but also captures the internal relations between the inputs, thereby improving the accuracy and robustness of the speech recognition system.
(2) The invention uses two training criteria during training but only one decoding mode during decoding, improving the accuracy of speech recognition without increasing the decoding burden.
Drawings
Fig. 1 is a schematic structural view of the present invention.
Fig. 2 is a flow chart of speech recognition of the present invention.
Detailed Description
The present invention is further illustrated by the accompanying drawings and the detailed description below, which are to be understood as merely illustrative of the invention and not limiting of its scope; after reading the invention, various equivalent modifications made by the skilled person fall within the scope defined by the appended claims.
An end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism uses a hybrid mechanism of connectionist temporal classification (Connectionist Temporal Classification, CTC) and self-attention (Self-Attention Mechanism, SA) to model English words or Chinese characters directly, without preprocessing or post-processing; the output corresponds directly to the correct English word sequence or Chinese character sequence. The two branches share the same encoder network: the encoder output uses the CTC training criterion and also serves as input to the decoder, realizing the attention relationship between encoder and decoder; the decoder is trained with the cross-entropy criterion; finally the two training criteria are given different weights, the weights representing how strongly each criterion drives the network parameter updates in back-propagation. When the weight is 0 or 1, the system reduces to two entirely different network models. The invention adjusts the modelling capability of the model by controlling the weight, and at the same time uses the pre-alignment property of CTC to accelerate the convergence rate of the model and reduce training time. The model structure is shown in fig. 1 and comprises an acoustic model, a language model and a decoder. The recognition method comprises the following four steps:
Step 1, data preparation and feature extraction: a publicly available voice data set, the Aishell-1 corpus, is used and divided into a training set, a validation set and a test set; representative features are extracted from each utterance with a feature extraction algorithm, specifically a Mel filter bank algorithm. Because speech is short-time stationary, extraction is performed in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms. Windowing is applied to the extracted frames to keep the data of adjacent frames continuous and to prevent spectral leakage. The speech is then converted from the time domain to the frequency domain using the discrete Fourier transform, so that its characteristics can be observed better. Since the human ear perceives sound on the Mel-frequency scale, i.e. if two sounds are perceived as differing by a factor of two, their Mel frequencies also differ by a factor of two, each frame of the speech signal is passed through a Mel-frequency filter bank for feature extraction. The finally extracted feature is a 40-dimensional Mel feature, which is taken as the input of the network.
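As a concrete illustration of this step, a minimal feature-extraction sketch in Python is given below; it assumes 16 kHz mono audio and uses torchaudio's Kaldi-compatible filter bank routine, and the file name and the window type are placeholders rather than values fixed by the invention.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Hypothetical utterance path; the patent itself uses the Aishell-1 corpus (16 kHz mono).
waveform, sample_rate = torchaudio.load("example_utterance.wav")

# 40-dimensional Mel filter-bank features: 25 ms frame length, 10 ms frame shift,
# a window applied per frame to keep adjacent frames continuous and limit spectral leakage.
features = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    frame_length=25.0,      # milliseconds
    frame_shift=10.0,       # milliseconds
    num_mel_bins=40,
    window_type="hamming",  # example window choice, not specified by the patent
)
print(features.shape)       # (num_frames, 40) -> input to the encoder network
```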
Step 2, training the acoustic model. As shown in fig. 1, the encoder sub-network is regarded as the acoustic model: the extracted speech features are input to the encoder sub-network, the inherent relations between the frames of the speech signal are obtained through the self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out with the CTC training criterion. The output of the acoustic model uses the CTC training criterion; this training method does not require the input speech to be pre-aligned with the output sequence, which saves a great deal of manpower and material resources and improves development efficiency.
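A minimal sketch of applying the CTC criterion to the encoder output with PyTorch's built-in nn.CTCLoss follows; the batch size, sequence lengths and the blank index 0 are illustrative assumptions, not values prescribed by the patent.

```python
import torch
import torch.nn as nn

vocab_size = 6308 + 1                  # Chinese character classes plus one blank unit
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

T, N, S = 200, 8, 30                   # frames, batch size, target length (example values)
# Encoder output as (T, N, C) log-probabilities over the modelling units.
log_probs = torch.randn(T, N, vocab_size).log_softmax(dim=-1).requires_grad_()

# Target character sequences and lengths; no frame-level alignment labels are needed.
targets = torch.randint(1, vocab_size, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss_ctc = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss_ctc.backward()
```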
The input speech features are features after data enhancement; the data enhancement comprises deletion operations in the time domain and the frequency domain of the speech. Before entering the network, positional encoding is superimposed on the feature dimension to represent the relative position of each frame, and the frames are then processed by the network. The final output dimension of the encoder is 6308, the number of Chinese character classes to be output; since CTC is used for decoding, a blank is added on top of the character classes, meaning that during decoding, if the network cannot recognize a frame of speech or the frame is a silent segment, the network outputs a blank as the marker.
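The deletion operations in the time and frequency domains correspond to SpecAugment-style masking; a small sketch using torchaudio's masking transforms is given below, with arbitrary example mask widths that the patent does not specify.

```python
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=8)   # zero out up to 8 Mel channels
time_mask = T.TimeMasking(time_mask_param=20)        # zero out up to 20 frames

# features: (batch, num_mel_bins, num_frames), e.g. the 40-dimensional Mel features
features = torch.randn(1, 40, 300)
augmented = time_mask(freq_mask(features))           # deletion in frequency, then in time
```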
The structure of the encoder specifically comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, two residual connections and layer normalization; finally, a CTC loss function aligns the input voice frames with the encoder output, which speeds up the overall training. Position information has to be added to each frame of features before they are input to the network, because the self-attention mechanism relates features over the whole utterance when acquiring the relationships between inputs and does not itself remember position information. After the position information has been added, the features are input to the network; multi-head attention is computed first, which, similar to the multiple convolution kernels of a convolutional network, obtains multi-level features of the speech signal. A residual connection is added to avoid the gradient vanishing problem when back-propagating the parameter updates. Layer normalization keeps the data within a suitable range as it flows through the different layers of the network. A fully connected layer then performs feature mapping, partly for the input of the decoder and partly for the output of the encoder. After the encoder output, a further fully connected layer gives the classification result over the final modelling units; the number of neurons of this layer equals the number of Chinese characters to be modelled plus one, namely a blank unit.
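The encoder sub-network described above can be sketched with PyTorch's stock Transformer encoder layers, as below; the model width, head count and layer count are illustrative assumptions, not the hyper-parameters specified by the patent, and only the output size (6308 characters plus one blank) comes from the text.

```python
import math
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Positional encoding + stacked multi-head self-attention layers (each with a
    feed-forward network, residual connections and layer normalization) + a fully
    connected output layer over the modelling units plus one blank unit."""

    def __init__(self, feat_dim=40, d_model=256, nhead=4, num_layers=6, vocab_size=6309):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024, dropout=0.1)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.output_fc = nn.Linear(d_model, vocab_size)   # 6308 characters + 1 blank

    def forward(self, feats):                     # feats: (T, N, feat_dim)
        x = self.input_proj(feats)
        x = x + self._positional_encoding(x)      # add position information per frame
        hidden = self.encoder(x)                  # also fed to the decoder's cross-attention
        log_probs = self.output_fc(hidden).log_softmax(dim=-1)   # consumed by the CTC loss
        return hidden, log_probs

    @staticmethod
    def _positional_encoding(x):
        length, _, d_model = x.shape
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(length, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(pos * div)
        pe[:, 0, 1::2] = torch.cos(pos * div)
        return pe.to(x.device)

encoder = SpeechEncoder()
hidden, log_probs = encoder(torch.randn(200, 8, 40))   # 200 frames, batch of 8
```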
Step 3, training the language model. As shown in fig. 1, the decoder sub-network is regarded as the language model and is trained with the cross-entropy training criterion. The decoder sub-network comprises a masked multi-head attention layer, a feed-forward network layer, three residual connections and layer normalization; its structure and processing flow are similar to those of the encoder sub-network. The input of the decoder is the text sequence corresponding to the current training utterance; it enters the decoder sub-network through an embedding layer, the text features acquire the internal relations between the characters through the self-attention mechanism, and finally feature fusion and classification are performed by a fully connected layer with softmax activation, so that training can be carried out with the cross-entropy training criterion.
The input text sequence is passed through the embedding layer and superimposed with the positional encoding. The special point of this step is that, when decoding the current moment, the inputs at later moments must be masked, to prevent data leakage from distorting the recognition results.
The decoder sub-network differs from the encoder sub-network in two points: first, the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed, which prevents data leakage from affecting the model; second, the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder. The end of the network uses a softmax output, and the network is trained with a cross-entropy loss function. The overall loss value of the model is calculated as the weighted sum of the encoder and decoder loss values; different weights represent different degrees of parameter updating, and the best model is obtained by adjusting the weights.
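A minimal sketch of the decoder sub-network and of the weighted joint objective loss = λ * loss_ctc + (1 - λ) * loss_sa is given below; it reuses PyTorch's TransformerDecoder for the masked self-attention and the encoder-decoder attention, and the sizes, the example λ value and the dummy tensors are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class TextDecoder(nn.Module):
    """Embedded text input, masked self-attention, cross-attention over the encoder
    output, and a final classification layer trained with cross entropy."""

    def __init__(self, vocab_size=6309, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024, dropout=0.1)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, encoder_hidden):    # tokens: (S, N), encoder_hidden: (T, N, d_model)
        s = tokens.size(0)
        # Upper-triangular -inf mask: the current step cannot attend to later inputs.
        tgt_mask = torch.triu(torch.full((s, s), float("-inf")), diagonal=1)
        x = self.embed(tokens)                    # positional encoding omitted for brevity
        x = self.decoder(x, encoder_hidden, tgt_mask=tgt_mask)
        return self.classifier(x)                 # logits fed to softmax / cross entropy

decoder = TextDecoder()
encoder_hidden = torch.randn(200, 8, 256)          # from the encoder sub-network
tokens = torch.randint(1, 6309, (30, 8))           # text of the current utterance (teacher
logits = decoder(tokens, encoder_hidden)           #  forcing; shifted targets omitted)
loss_sa = nn.CrossEntropyLoss()(logits.reshape(-1, 6309), tokens.reshape(-1))

lam = 0.3                                          # example weighting factor λ
loss_ctc = torch.tensor(0.0)                       # placeholder; comes from the encoder branch
total_loss = lam * loss_ctc + (1 - lam) * loss_sa  # different weights, different update emphasis
```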
Step 4, decoding. Through the joint optimization of the two branches, the model parameters with the minimum loss are obtained; the test-set data are then decoded with beam search and the recognition result is output. Therefore, during training, the model parameters with the minimum loss value are saved. Decoding uses beam search, which at each step keeps the n most probable paths leading to the current node, so that correct paths with a temporarily lower probability are not discarded. When the last node is reached, the sequence with the highest path probability is taken as the final recognition result. During testing, the speech goes through feature extraction, is input into the network for processing, and is finally sent to the decoder to obtain the final recognition result.
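The beam search described in this step can be sketched as follows; the scoring callback, beam width, maximum length and end-of-sequence identifier are assumptions made purely for illustration.

```python
def beam_search(step_fn, vocab_size, beam_size=5, max_len=50, eos_id=1):
    """Keep the n most probable partial paths at every step and return the
    sequence whose total path probability is highest.

    step_fn(prefix) is assumed to return a list of log-probabilities over the
    vocabulary for the next token given the partial hypothesis `prefix`.
    """
    beams = [([], 0.0)]                 # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_probs[tok]))
        # Keep only the top-n paths reaching the current step, so that correct
        # paths with a temporarily lower probability are not discarded too early.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:                   # every surviving path has ended
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```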
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements an end-to-end speech recognition method based on connection timing classification and self-attention mechanisms.
A terminal device, comprising: the system comprises a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor realizes an end-to-end voice recognition method based on connection time sequence classification and self-attention mechanism when executing the program.
In summary, the invention provides an end-to-end speech recognition method based on connectionist temporal classification and a self-attention mechanism, which uses the conditional-independence assumption of CTC to quickly align speech and text, thereby accelerating convergence, and at the same time uses SA to capture global relations, solving the long-distance dependency problem. Without increasing the complexity of the model, a better recognition result than either single model is obtained by adjusting the weighting factor and combining the advantages of both.
The invention combines the mature CTC technique with the SA-based Transformer model, which not only accelerates the training of the model but also captures the relations between all frames of the speech, improving the stability and recognition rate of the speech recognition system.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (8)

1. An end-to-end speech recognition method based on connection timing classification and self-attention mechanism, comprising the steps of:
step 1, data preparation and feature extraction: collecting voice data to obtain a voice data set; firstly, extracting features from the voice in the voice data set using a Mel filter bank algorithm; then, using the discrete Fourier transform to convert the voice from the time domain to the frequency domain, so that the characteristics of the voice can be observed better; the finally extracted features are used as the input of the network;

step 2, training an acoustic model: the encoder sub-network is regarded as the acoustic model; the voice features extracted from the training set are input into the encoder sub-network, the inherent relations between the frames of the voice signal are obtained through a self-attention mechanism, feature classification is then performed after fully connected mapping, and finally training is carried out using the CTC training criterion;

the encoder sub-network comprises a position coding layer, a multi-head attention layer, a feed-forward network layer, residual connections and layer normalization layers, and finally a CTC loss function is used to align the input voice frames with the output of the encoder; before the features are input into the encoder sub-network, position information is added to each frame of features; after the position information has been added, the features are input into the encoder sub-network, where multi-head attention is first computed to obtain multi-level features of the voice signal; a fully connected layer then performs feature mapping, one part being used as input to the decoder and the other part as the output of the encoder; a further fully connected layer after the encoder output gives the classification result over the final modelling units, the number of neurons of this fully connected layer being equal to the number of Chinese characters to be modelled plus one, namely a blank unit;

step 3, training a language model: the decoder sub-network is regarded as the language model; the language model is trained using the cross-entropy training criterion; the first self-attention layer of the decoder sub-network input must be masked, i.e. inputs at later moments cannot be seen when the current moment is processed; the second self-attention layer of the decoder sub-network receives the output of the encoder, realizing the attention relationship between the encoder and the decoder; the end of the decoder sub-network uses a softmax output, and the language model is trained with a cross-entropy loss function; the overall loss value is calculated by weighting and summing the loss values of the encoder and the decoder, different weights representing different degrees of parameter updating, and the best model is obtained by adjusting the weights;

step 4, decoding: during training, the model parameters with the minimum loss value are saved as the trained model; decoding uses beam search, which at each step keeps the n most probable paths leading to the current node; when the last node is reached, the sequence with the maximum path probability is taken as the final recognition result; during testing, the voice to be tested goes through feature extraction, is input into the trained model for processing, and is finally sent to the decoder to obtain the final recognition result.
2. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 1, wherein: the loss function is loss = λ * loss_ctc + (1 - λ) * loss_sa, where loss represents the total loss value of the model, loss_ctc represents the loss value of the CTC training criterion, loss_sa represents the loss value of the cross-entropy training criterion, and λ is the weighting adjustment factor.
3. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 2, wherein: in step 1, the features are extracted in units of voice frames with a frame length of 25 ms and a frame shift of 10 ms.
4. The end-to-end speech recognition method based on connection timing classification and self-attention mechanism of claim 3, wherein: in step 1, windowing is performed on the extracted voice, so that the data continuity of adjacent voice frames is ensured, and frequency spectrum leakage is prevented.
5. The method for end-to-end speech recognition based on connection timing classification and self-attention mechanism of claim 4, wherein: the features extracted in step 1 are 40-dimensional mel features.
6. The method for end-to-end speech recognition based on connection timing classification and self-attention mechanism of claim 5, wherein: data enhancement is carried out on the voice data collected in step 1, the data enhancement comprising deletion operations in the time domain and the frequency domain of the voice.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the speech recognition method of claim 1.
8. A terminal device, comprising: a memory, a processor and a computer program stored on the memory and running on the processor, characterized in that the processor implements the speech recognition method according to claim 1 when executing the program.
CN202011101902.6A 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism Active CN112509564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011101902.6A CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Publications (2)

Publication Number Publication Date
CN112509564A CN112509564A (en) 2021-03-16
CN112509564B true CN112509564B (en) 2024-04-02

Family

ID=74953853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011101902.6A Active CN112509564B (en) 2020-10-15 2020-10-15 End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112509564B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767926B (en) * 2021-04-09 2021-06-25 北京世纪好未来教育科技有限公司 End-to-end speech recognition two-pass decoding method and device
CN112863489B (en) * 2021-04-26 2021-07-27 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device and medium
CN113241075A (en) * 2021-05-06 2021-08-10 西北工业大学 Transformer end-to-end speech recognition method based on residual Gaussian self-attention
CN113436616B (en) * 2021-05-28 2022-08-02 中国科学院声学研究所 Multi-field self-adaptive end-to-end voice recognition method, system and electronic device
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113257239B (en) * 2021-06-15 2021-10-08 深圳市北科瑞声科技股份有限公司 Voice recognition method and device, electronic equipment and storage medium
CN113409772A (en) * 2021-06-15 2021-09-17 西北工业大学 Encoder and end-to-end voice recognition system based on local generation type attention mechanism and adopting same
CN113257248B (en) * 2021-06-18 2021-10-15 中国科学院自动化研究所 Streaming and non-streaming mixed voice recognition system and streaming voice recognition method
CN113488028B (en) * 2021-06-23 2024-02-27 中科极限元(杭州)智能科技股份有限公司 Speech transcription recognition training decoding method and system based on fast jump decoding
CN113362812B (en) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 Voice recognition method and device and electronic equipment
CN113808573B (en) * 2021-08-06 2023-11-07 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN113762356B (en) * 2021-08-17 2023-06-16 中山大学 Cluster load prediction method and system based on clustering and attention mechanism
CN113688822A (en) * 2021-09-07 2021-11-23 河南工业大学 Time sequence attention mechanism scene image identification method
CN113782007A (en) * 2021-09-07 2021-12-10 上海企创信息科技有限公司 Voice recognition method and device, voice recognition equipment and storage medium
CN113782029B (en) * 2021-09-22 2023-10-27 广东电网有限责任公司 Training method, device, equipment and storage medium of voice recognition model
CN113707136B (en) * 2021-10-28 2021-12-31 南京南大电子智慧型服务机器人研究院有限公司 Audio and video mixed voice front-end processing method for voice interaction of service robot
CN113936641B (en) * 2021-12-17 2022-03-25 中国科学院自动化研究所 Customizable end-to-end system for Chinese-English mixed speech recognition
CN113990296B (en) * 2021-12-24 2022-05-27 深圳市友杰智新科技有限公司 Training method and post-processing method of voice acoustic model and related equipment
CN114596839A (en) * 2022-03-03 2022-06-07 网络通信与安全紫金山实验室 End-to-end voice recognition method, system and storage medium
CN116781417B (en) * 2023-08-15 2023-11-17 北京中电慧声科技有限公司 Anti-cracking voice interaction method and system based on voice recognition

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11257481B2 (en) * 2018-10-24 2022-02-22 Tencent America LLC Multi-task training architecture and strategy for attention-based speech recognition system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN110070879A (en) * 2019-05-13 2019-07-30 吴小军 A method of intelligent expression and phonoreception game are made based on change of voice technology

Also Published As

Publication number Publication date
CN112509564A (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN112509564B (en) End-to-end voice recognition method based on connection time sequence classification and self-attention mechanism
CN110491416B (en) Telephone voice emotion analysis and identification method based on LSTM and SAE
CN107331384B (en) Audio recognition method, device, computer equipment and storage medium
CN108899051B (en) Speech emotion recognition model and recognition method based on joint feature representation
CN110459225B (en) Speaker recognition system based on CNN fusion characteristics
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
Zhang et al. Seq2seq attentional siamese neural networks for text-dependent speaker verification
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN108922513A (en) Speech differentiation method, apparatus, computer equipment and storage medium
CN110189749A (en) Voice keyword automatic identifying method
WO2018166316A1 (en) Speaker's flu symptoms recognition method fused with multiple end-to-end neural network structures
CN103117060A (en) Modeling approach and modeling system of acoustic model used in speech recognition
US20160358599A1 (en) Speech enhancement method, speech recognition method, clustering method and device
CN109256118B (en) End-to-end Chinese dialect identification system and method based on generative auditory model
CN110349588A (en) A kind of LSTM network method for recognizing sound-groove of word-based insertion
CN111429938A (en) Single-channel voice separation method and device and electronic equipment
CN107993663A (en) A kind of method for recognizing sound-groove based on Android
CN109754790A (en) A kind of speech recognition system and method based on mixing acoustic model
CN107039036A (en) A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN113488060B (en) Voiceprint recognition method and system based on variation information bottleneck
CN111899757A (en) Single-channel voice separation method and system for target speaker extraction
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
Sheng et al. GANs for children: A generative data augmentation strategy for children speech recognition
CN116230019A (en) Deep emotion clustering method based on semi-supervised speech emotion recognition framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant