CN111243578A - Chinese mandarin character-voice conversion method based on self-attention mechanism - Google Patents
- Publication number: CN111243578A
- Application number: CN202010027248.2A
- Authority
- CN
- China
- Prior art keywords: attention, layer, classification, output, binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/26: Speech to text systems
Abstract
An embodiment of the invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention connectionist temporal classification (CTC), which fuses a position-dependent attention mechanism with the CTC criterion. The attention window length is selected according to the influence of different window lengths on the recognition result, and a self-attention CTC criterion is further proposed.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to an end-to-end speech recognition method based on time-restricted self-attention CTC.
Background
Speech recognition technology converts input speech into text. Among its approaches, the end-to-end speech recognition framework has become an important research direction owing to its simple structure, strong generality, independence from linguistic knowledge, and fast inference.
Although traditional speech recognition algorithms based on hidden Markov models and deep neural networks have achieved high recognition accuracy, they suffer from a complex pipeline, inconsistent optimization targets, frame-level conditional independence assumptions, complex decoding, and a reliance on expert knowledge. End-to-end speech recognition has therefore become a research hotspot; it completes the conversion from speech to text with a single unified neural network. The current mainstream end-to-end recognition frameworks are: end-to-end speech recognition based on CTC, and end-to-end speech recognition based on attention-based encoder-decoder networks.
The end-to-end architecture of attention-based encoder-decoder networks treats speech recognition as a sequence-mapping problem, i.e., mapping input features into corresponding words. The decoding network uses an attention mechanism to find the correspondence between each output word and the encoder states. For each output word, the distribution of attention weights is calculated from the decoder state and the encoder state information, and the weighted sum of the encoder states serves as input to the decoder. Although this structure has the advantages of end-to-end speech recognition and makes no conditional independence assumption, the attention coefficients are insufficiently constrained, and discontinuous attention weights are learned in actual training. To better constrain the attention weights, researchers therefore add a CTC criterion to training for joint optimization, which greatly reduces the occurrence of irregular attention coefficients.
However, the end-to-end modeling framework based on the CTC criterion assumes that frames are mutually independent, whereas actual speech is a continuous time sequence that does not satisfy this assumption.
Disclosure of Invention
The invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention CTC, which fuses a position-dependent attention mechanism with the CTC criterion, where the attention window length is selected according to the influence of different window lengths on the recognition result; a self-attention CTC criterion is further proposed.
The technical solution adopted by the present invention to solve the above technical problems is an end-to-end speech recognition method implemented by a neural network model comprising an encoding layer, a decoding layer, and an attention CTC layer, the method comprising:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into a high-dimensional vector;
the decoding layer calculating an attention distribution probability over the high-dimensional vector and converting the high-dimensional vector into a first output symbol sequence representing characters;
the attention CTC layer converting the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism;
and combining the first output symbol sequence and the second output symbol sequence to obtain the character-representing output symbol sequence of the neural network model.
Preferably, the mathematical expression of the classification criterion of the neural network model is:

L_MTL = λ L_ctc + (1 − λ) L_attention

where λ is an interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC layer and the decoding layer, respectively.
Specifically, the mathematical expression of the classification criterion of the attention CTC layer is:

L_CTC = −ln P(y | ph_u)
ph_u = W_proj c_u + b
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length. Attend() is the attention function, and the attention weight a_ut is computed from the score:

e_ut = Score(s_{u−1}, a_{u−1}, h_t)

where Score() is content-based or location-based attention; for location-based attention the score may be:

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t)
specifically, the mathematical expression of the classification criterion of the attention-binding meaning classification layer is as follows:
LCTC=-lnP(y|phu),
phu=Wprojcu+b
wherein, WprojAnd b represents the weight and bias matrix of the output mapping layer of the associative classification criterion, phuRepresenting u time binding meaning classification criterion inputOutput from the mapping layer, autRepresents the attention weight, cuRepresents the result of the weighted summation of the hidden layers, τ represents the window length of attention,
wherein the content of the first and second substances,
qt=Qbt,t=u
kt=Kbt,t=u-τ,...,u+τ
vt=Vbt,t=u-τ,...,u+τ
bt=Wembdht,t=u-τ,...,u+τ
bt is an input vector for mapping the input ht of the coding network into an attention mechanism through an input mapping matrix Wembd, k, v and q are keys, values and queries, and K, V, Q is a parameter matrix.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a structural diagram of an end-to-end speech recognition neural network model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the structure of the end-to-end speech recognition neural network:
Fig. 1 is a structural diagram of an end-to-end speech recognition neural network model according to an embodiment of the present invention. As shown, it comprises an encoding layer (Shared Encoder), a decoding layer (Decoder), and an attention CTC layer (CTC Attention).
The encoding layer maps the input features into high-dimensional vectors: it receives the input speech features and converts them into high-dimensional vectors.
The decoding layer decodes the high-dimensional vectors into an output symbol sequence: it converts the high-dimensional vector into a first output symbol sequence representing characters, calculating the attention distribution probability over the speech features during the conversion.
The attention CTC layer converts the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism.
The first output symbol sequence and the second output symbol sequence are combined to obtain the character-representing output symbol sequence of the neural network model.
Second, detailed discussion and embodiments of the model:
The method aims to solve the problem that, in end-to-end speech recognition based on attention encoder-decoder networks, insufficiently constrained attention coefficients lead to discontinuous attention weights being learned during actual training. The invention therefore adopts a multi-task learning mechanism, i.e., joint optimization of the CTC criterion and the criterion of the encoder-decoder network.
Specifically, during training, the forward-backward algorithm of the CTC criterion is used to force a monotonic alignment between the input speech features and the output labels.
In one embodiment, the mathematical expression of the joint optimization criterion is:

L_MTL = λ L_ctc + (1 − λ) L_attention    (1)

where λ is an interpolation coefficient, and L_ctc and L_attention are respectively the CTC criterion and the attention-based encoder-decoder criterion (e.g., the classification criteria used by the CTC Attention layer and the Decoder layer in Fig. 1, respectively).
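A minimal sketch of the interpolation in equation (1); the default value λ = 0.3 is only an example, not taken from the source:

```python
def multitask_loss(l_ctc: float, l_attention: float, lam: float = 0.3) -> float:
    """L_MTL = lam * L_ctc + (1 - lam) * L_attention, equation (1)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("interpolation coefficient must lie in [0, 1]")
    return lam * l_ctc + (1.0 - lam) * l_attention

# lam = 0 recovers the pure attention criterion, lam = 1 the pure CTC criterion
assert multitask_loss(2.0, 1.0, lam=0.0) == 1.0
assert multitask_loss(2.0, 1.0, lam=1.0) == 2.0
assert multitask_loss(2.0, 1.0, lam=0.5) == 1.5
```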
For the CTC criterion, in order to handle the fact that the output sequence is shorter than the input sequence, a blank symbol is added to the output symbol set, and symbols (including the blank) are allowed to occur repeatedly in a frame-level path.
In another embodiment, the conditional probability with which the CTC criterion predicts the entire output sequence is:

P(y|x) = Σ_{π ∈ B^{−1}(y)} P(π_{1:T} | x)    (2)

By assuming mutual independence between frames, the above formula can be decomposed into:

P(π_{1:T} | x) = Π_{t=1}^{T} P(π_t | x)    (3)

where x represents the input speech features and y the output sequence; L is the output symbol set and T the total number of speech frames; π_{1:T} = (π_1, ..., π_T) is the frame-level output path with π_t ∈ L′, where L′ = L ∪ {blank}; P(π_t | x) is the conditional probability at time t; and B is the mapping function that maps an output path to an output symbol sequence.
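For a toy symbol set, the path sum of equations (2) and (3) and the mapping B can be checked by brute-force enumeration of all frame-level paths (feasible only at this scale; the 3-frame posteriors below are made up for illustration):

```python
import itertools
import numpy as np

BLANK = 0

def B(path):
    """Mapping function B: collapse repeated symbols, then remove blanks."""
    collapsed = [s for s, _ in itertools.groupby(path)]
    return tuple(s for s in collapsed if s != BLANK)

def ctc_prob(probs, y):
    """Equation (2): P(y|x) as the sum over all paths pi with B(pi) = y of the
    per-frame product in equation (3). Brute force over |L'|^T paths."""
    T, S = probs.shape
    total = 0.0
    for path in itertools.product(range(S), repeat=T):
        if B(path) == tuple(y):
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t, s]
            total += p
    return total

# made-up 3-frame posteriors over L' = {blank, 'a'=1, 'b'=2}
probs = np.array([[0.2, 0.7, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.1, 0.7]])
assert B((1, 1, 0, 2)) == (1, 2)   # repeats collapsed, blank removed
p_ab = ctc_prob(probs, [1, 2])     # probability of output "ab"
```

Five length-3 paths map to "ab" here ((1,2,2), (1,2,0), (1,1,2), (1,0,2), (0,1,2)); their probabilities sum to 0.588. In practice this sum is computed by the CTC forward-backward algorithm rather than enumeration.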
For the attention-based encoder-decoder network, the final posterior probability is estimated directly, without any conditional independence assumption, using two networks: an encoding network (e.g., the Encoder layer in Fig. 1), whose role is to map the input features x into hidden-layer vectors h (the high-dimensional vectors), and a decoding network (e.g., the Decoder layer in Fig. 1), whose role is to decode the hidden-layer vectors h into the output symbol sequence y.
In one embodiment, the posterior probability can be expressed as:

P(y|x) = Π_{u=1}^{U} P(y_u | y_{1:u−1}, c_u)    (4)

where c_u is a context vector computed from the input features x, and U is the length of the output sequence, which need not equal the number of input frames. P(y_u | y_{1:u−1}, c_u) can be expressed as:

P(y_u | y_{1:u−1}, c_u) = Decoder(y_{u−1}, s_{u−1}, c_u)    (5)
c_u = Σ_{t=1}^{T} a_ut h_t    (6)
h_t = Encoder(x)    (7)
a_ut = Attend(s_{u−1}, a_{u−1}, h_t)    (8)

where Encoder() and Decoder() denote the encoding network and the decoding network respectively, s is the hidden state vector of the decoding network, h is the hidden state vector of the encoding network, and Attend() is the attention network. The attention weight a_ut is calculated as:

a_ut = exp(e_ut) / Σ_{t′} exp(e_ut′)    (9)
e_ut = Score(s_{u−1}, a_{u−1}, h_t)    (10)

where Score() may be either content-based or location-based attention. In another embodiment (content-based):

e_ut = v^T tanh(K s_{u−1} + W h_t)    (11)

In a still further embodiment (location-based):

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t)    (12)
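The location-based score of equation (12) can be sketched as follows; all dimensions, the convolution width, and the small random parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
T, H, A, C = 20, 32, 16, 8   # frames, encoder dim, attention dim, conv channels (assumed)

v = rng.normal(0, 0.1, A)
K = rng.normal(0, 0.1, (A, H))    # projects the decoder state s_{u-1}
W = rng.normal(0, 0.1, (A, H))    # projects the encoder state h_t
Q = rng.normal(0, 0.1, (A, C))    # projects the convolved previous weights
F = rng.normal(0, 0.1, (C, 5))    # 1-D conv filters over a_{u-1}, width 5 (assumed)

def location_attention(s_prev, a_prev, h):
    """e_ut = v^T tanh(K s_{u-1} + Q (F * a_{u-1}) + W h_t); a_ut = softmax over t."""
    # F * a_{u-1}: same-length 1-D convolution of the previous attention weights
    f = np.stack([np.convolve(a_prev, F[c], mode="same") for c in range(C)])  # (C, T)
    e = np.array([v @ np.tanh(K @ s_prev + Q @ f[:, t] + W @ h[t]) for t in range(T)])
    e -= e.max()                   # numerical stability before the softmax
    a = np.exp(e)
    return a / a.sum()

h = rng.normal(size=(T, H))
a = location_attention(rng.normal(size=H), np.full(T, 1 / T), h)
assert a.shape == (T,) and np.isclose(a.sum(), 1.0)
```

The convolution term lets the score at frame t see where attention was placed at the previous output step, which is what makes the mechanism location-aware.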
according to the above description, the learning of attention weight can be effectively restricted by adding joint-meaning classification criterion for joint optimization, so that the learned attention weight keeps monotonous characteristic, however, for the joint-meaning classification criterion, the joint probability is decomposed into products of a series of probabilities by the assumption that the frame condition is independent, and the actual speech does not satisfy the assumption that the frames are independent.
To solve this problem, the present invention proposes an end-to-end speech recognition algorithm based on time-limited self-attention-binding classification, as shown in fig. 1, which incorporates a time-limited attention module before binding-meaning classification criteria, so that the output is not only dependent on the encoded network output at the current moment, but also is related to the encoded network output over a period of time.
In one embodiment, L_CTC can be expressed as:

L_CTC = −ln P(y | ph)    (13)
ph = W_proj h + b    (14)

where W_proj and b are respectively the weight and bias matrices of the output mapping layer of the CTC criterion, and ph is the input to the CTC criterion.

In another embodiment, attention weights are added to the CTC criterion, and the mathematical expression becomes:

L_CTC = −ln P(y | ph_u)    (15)
ph_u = W_proj c_u + b    (16)
c_u = Σ_{t=u−τ}^{u+τ} a_ut h_t    (17)
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)    (18)

where ph_u is the output of the output mapping layer of the CTC criterion at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states within the window (computed inside the classification layer, i.e., a network layer with a classification function, such as the CTC Attention layer or the decoding layer in Fig. 1), and τ is the window length of the attention.
In one embodiment, the attention weight is a location-based attention weight, with the mathematical expressions shown in equations (9), (10), and (12). However, this attention mechanism must learn dependency relationships between sequences, which increases the modeling difficulty to some extent. To alleviate this problem, in another embodiment a CTC criterion based on a self-attention mechanism is presented.
First, the output of the encoding network is mapped into the input vectors of the attention mechanism through an input mapping matrix:

b_t = W_embd h_t,  t = u − τ, ..., u + τ    (19)

Secondly, b_t in equation (19) is mapped to keys, values, and queries through a linear mapping layer:

q_t = Q b_t,  t = u    (20)
k_t = K b_t,  t = u − τ, ..., u + τ    (21)
v_t = V b_t,  t = u − τ, ..., u + τ    (22)

Finally, the attention coefficients obtained by self-attention and the resulting context can be expressed as:

a_ut = exp(q_u^T k_t) / Σ_{t′} exp(q_u^T k_t′)    (23)
c_u = Σ_{t=u−τ}^{u+τ} a_ut v_t    (24)
as can be seen from the above embodiments, the embodiments of the present invention provide an end-to-end speech recognition algorithm for time-limited self-attention binding meaning classification, which fuses a position-dependent attention mechanism classification and a binding meaning classification, wherein the attention window length is taken according to the influence of different attention window lengths on a recognition result, and further provides a self-attention binding meaning classification criterion, and by combining the self-attention mechanism and the binding meaning classification criterion, the problem that the assumption that frames are independent from each other due to the binding meaning classification is not true is solved, and the performance of the end-to-end speech recognition system can be improved.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (4)
1. An end-to-end speech recognition method, the end-to-end speech recognition being performed by a neural network model comprising an encoding layer, a decoding layer, and an attention CTC (connectionist temporal classification) layer, the method comprising:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into a high-dimensional vector;
the decoding layer calculating an attention distribution probability over the high-dimensional vector and converting the high-dimensional vector into a first output symbol sequence representing characters;
the attention CTC layer converting the high-dimensional vector into a second output symbol sequence representing characters, using a connectionist temporal classifier together with an attention mechanism;
and combining the first output symbol sequence and the second output symbol sequence to obtain the character-representing output symbol sequence of the neural network model.
2. The method of claim 1, wherein the mathematical expression of the classification criterion of the neural network model is:

L_MTL = λ L_ctc + (1 − λ) L_attention

where λ is an interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC layer and the decoding layer, respectively.
3. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC layer is:

ph_u = W_proj c_u + b
a_ut = Attend(ph_{u−1}, a_{u−1}, h_t)

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length; Attend() is the attention function, and the attention weight a_ut is computed from the score:

e_ut = Score(s_{u−1}, a_{u−1}, h_t)

where Score() is content-based or location-based attention; for location-based attention the score may be:

e_ut = v^T tanh(K s_{u−1} + Q(F ∗ a_{u−1}) + W h_t).
4. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC layer is:

ph_u = W_proj c_u + b

where W_proj and b are the weight and bias matrices of the output mapping layer of the CTC criterion, ph_u is the output of that mapping layer at time u, a_ut is the attention weight, c_u is the weighted sum over the hidden-layer states, and τ is the attention window length, with:

q_t = Q b_t,  t = u
k_t = K b_t,  t = u − τ, ..., u + τ
v_t = V b_t,  t = u − τ, ..., u + τ
b_t = W_embd h_t,  t = u − τ, ..., u + τ

where b_t is the input vector obtained by mapping the encoder output h_t into the attention mechanism through the input mapping matrix W_embd; k, v, and q are the keys, values, and queries; and K, V, Q are parameter matrices.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010027248.2A CN111243578A (en) | 2020-01-10 | 2020-01-10 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111243578A true CN111243578A (en) | 2020-06-05 |
Family
ID=70864134
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010027248.2A Pending CN111243578A (en) | 2020-01-10 | 2020-01-10 | Chinese mandarin character-voice conversion method based on self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111243578A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107293291A (en) * | 2016-03-30 | 2017-10-24 | 中国科学院声学研究所 | A kind of audio recognition method end to end based on autoadapted learning rate |
US20170372200A1 (en) * | 2016-06-23 | 2017-12-28 | Microsoft Technology Licensing, Llc | End-to-end memory networks for contextual language understanding |
US20180330718A1 (en) * | 2017-05-11 | 2018-11-15 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for End-to-End speech recognition |
US20180374486A1 (en) * | 2017-06-23 | 2018-12-27 | Microsoft Technology Licensing, Llc | Speaker recognition |
CN109272990A (en) * | 2018-09-25 | 2019-01-25 | 江南大学 | Audio recognition method based on convolutional neural networks |
US20190189115A1 (en) * | 2017-12-15 | 2019-06-20 | Mitsubishi Electric Research Laboratories, Inc. | Method and Apparatus for Open-Vocabulary End-to-End Speech Recognition |
CN110211574A (en) * | 2019-06-03 | 2019-09-06 | 哈尔滨工业大学 | Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism |
Non-Patent Citations (2)

Title |
---|
Watanabe, S.; Hori, T.; et al.: "ESPnet: End-to-End Speech Processing Toolkit", in Proceedings of Interspeech 2018. |
Wu, Long; et al.: "Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition", Applied Sciences. |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113763933A (en) * | 2021-05-06 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Speech recognition method, and training method, device and equipment of speech recognition model |
CN113763933B (en) * | 2021-05-06 | 2024-01-05 | 腾讯科技(深圳)有限公司 | Speech recognition method, training method, device and equipment of speech recognition model |
CN113450761A (en) * | 2021-06-17 | 2021-09-28 | 清华大学深圳国际研究生院 | Parallel speech synthesis method and device based on variational self-encoder |
CN113450761B (en) * | 2021-06-17 | 2023-09-22 | 清华大学深圳国际研究生院 | Parallel voice synthesis method and device based on variation self-encoder |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200605 |