CN111243578A - Chinese mandarin character-voice conversion method based on self-attention mechanism - Google Patents


Info

Publication number
CN111243578A
Authority
CN
China
Prior art keywords
attention
layer
classification
output
binding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010027248.2A
Other languages
Chinese (zh)
Inventor
张鹏远
黎塔
邬龙
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202010027248.2A priority Critical patent/CN111243578A/en
Publication of CN111243578A publication Critical patent/CN111243578A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 - Training
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention connectionist temporal classification (CTC). It fuses a position-dependent attention mechanism with CTC, selects the attention window length according to the influence of different window lengths on the recognition result, and further provides a self-attention CTC classification criterion.

Description

Chinese mandarin character-voice conversion method based on self-attention mechanism
Technical Field
The invention relates to the field of speech synthesis, and in particular to an end-to-end speech recognition method based on time-restricted self-attention connectionist temporal classification (CTC).
Background
Speech recognition is a technology for converting input speech into text. The end-to-end speech recognition framework has become an important research direction because of its simple structure, strong generality, independence from linguistic knowledge, and fast inference.
Although traditional speech recognition algorithms based on hidden Markov models and deep neural networks have achieved high recognition accuracy, they suffer from a complex pipeline, inconsistent optimization, a frame-independence assumption, complex decoding, the need for expert knowledge, and so on. End-to-end speech recognition has therefore become a research hotspot: it completes the conversion from speech to text with a single unified neural network. The current mainstream end-to-end recognition frameworks are: end-to-end speech recognition based on connectionist temporal classification (CTC), and end-to-end speech recognition based on attention-based encoder-decoder networks.
The attention-based encoder-decoder architecture treats speech recognition as a sequence-mapping problem, i.e. mapping input features to the corresponding words. The decoding network uses an attention mechanism to find the correspondence between each output word and the encoder states. For each output word, the attention weight distribution is computed from the decoder state and the encoder states, and the encoder states are weighted and summed to form the decoder input. Although this structure has the advantages of end-to-end speech recognition and makes no conditional-independence assumption, the attention coefficients are insufficiently constrained, and discontinuous attention weights can be learned during training. Therefore, to better constrain the attention weights, researchers add the CTC criterion to training for joint optimization, which greatly reduces the occurrence of irregular attention coefficients.
However, the end-to-end modeling framework based on the CTC criterion assumes that frames are mutually independent, whereas actual speech is a continuous time sequence that does not satisfy this assumption.
Disclosure of Invention
The invention provides an end-to-end speech recognition algorithm based on time-restricted self-attention CTC. It fuses a position-dependent attention mechanism with CTC, selects the attention window length according to the influence of different window lengths on the recognition result, and further provides a self-attention CTC classification criterion.
To solve the above technical problems, the invention provides an end-to-end speech recognition method implemented by a neural network model that comprises an encoding layer, a decoding layer and an attention CTC classification layer. The method includes:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into high-dimensional vectors;
the decoding layer computes an attention distribution over the high-dimensional vectors and converts them into a first output symbol sequence representing characters;
the attention CTC classification layer converts the high-dimensional vectors into a second output symbol sequence representing characters, using a CTC classifier together with an attention mechanism;
and the first and second output symbol sequences are combined to obtain the character output symbol sequence of the neural network model.
Preferably, the mathematical expression of the classification criterion of the neural network model is:
L_MTL = λ·L_ctc + (1 - λ)·L_attention
where λ is the interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC classification layer and the decoding layer, respectively.
Specifically, the mathematical expression of the classification criterion of the attention CTC classification layer is:
L_CTC = -ln P(y | ph_u)
ph_u = W_proj·c_u + b
c_u = Σ_{t=u-τ}^{u+τ} a_ut·h_t
a_ut = Attend(ph_{u-1}, a_{u-1}, h_t)
where W_proj and b are the weight and bias matrix of the CTC output mapping layer, ph_u is the output of that mapping layer at step u, a_ut is the attention weight, c_u is the weighted sum of the hidden-layer outputs, and τ is the attention window length.
Attend() is the attention function; the attention weight a_ut is computed as:
a_ut = exp(e_ut) / Σ_{t'} exp(e_ut')
e_ut = Score(s_{u-1}, a_{u-1}, h_t)
where Score() is content-based or location-based attention; with location-based attention the score can be written as:
e_ut = v^T·tanh(K·s_{u-1} + Q·(F * a_{u-1}) + W·h_t)
specifically, the mathematical expression of the classification criterion of the attention-binding meaning classification layer is as follows:
LCTC=-lnP(y|phu),
phu=Wprojcu+b
Figure BDA0002362919770000033
Figure BDA0002362919770000034
Figure BDA0002362919770000035
wherein, WprojAnd b represents the weight and bias matrix of the output mapping layer of the associative classification criterion, phuRepresenting u time binding meaning classification criterion inputOutput from the mapping layer, autRepresents the attention weight, cuRepresents the result of the weighted summation of the hidden layers, τ represents the window length of attention,
wherein the content of the first and second substances,
qt=Qbt,t=u
kt=Kbt,t=u-τ,...,u+τ
vt=Vbt,t=u-τ,...,u+τ
bt=Wembdht,t=u-τ,...,u+τ
bt is an input vector for mapping the input ht of the coding network into an attention mechanism through an input mapping matrix Wembd, k, v and q are keys, values and queries, and K, V, Q is a parameter matrix.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a structural diagram of an end-to-end speech recognition neural network model according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
First, the structure of the end-to-end speech recognition neural network:
Fig. 1 is a structural diagram of the end-to-end speech recognition neural network model according to an embodiment of the invention. As shown, it includes an encoding layer (Shared Encoder), a decoding layer (Decoder), and an attention CTC classification layer (CTC Attention).
The encoding layer maps the input features into high-dimensional vectors.
The decoding network decodes the high-dimensional vectors into an output symbol sequence.
The attention CTC classification layer likewise decodes the high-dimensional vectors into an output symbol sequence, using a CTC classifier together with an attention mechanism.
The overall flow is as follows (a minimal sketch is given after this list):
the encoding layer receives the input speech features and converts them into high-dimensional vectors;
the decoding layer converts the high-dimensional vectors into a first output symbol sequence representing characters, computing the attention distribution over the speech features during the conversion;
the attention CTC classification layer converts the high-dimensional vectors into a second output symbol sequence representing characters, using a CTC classifier together with an attention mechanism;
and the first and second output symbol sequences are combined to obtain the character output symbol sequence of the neural network model.
Second, the detailed discussion and embodiments of the model:
the method aims to solve the problem that discontinuous attention weight can be learned in the actual training process because the attention coefficient is not restrained sufficiently by an attention coding and decoding network-based end-to-end speech recognition algorithm. The invention provides a multi-task learning mechanism, namely joint optimization is carried out on a joint meaning classification criterion and a criterion of an encoding and decoding network.
Specifically, during the training process, a forward and backward algorithm of a joint sense classification criterion is used to force monotonic alignment between the input speech features and the output labels.
In one embodiment, the mathematical expression for the joint optimization criterion is:
L_MTL = λ·L_ctc + (1 - λ)·L_attention    (1)
where λ is the interpolation coefficient, and L_ctc and L_attention are the CTC criterion and the attention-based encoder-decoder criterion, respectively (in Fig. 1 these are the classification criteria used by the CTC Attention layer and the Decoder layer).
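As a concrete illustration of equation (1), the following sketch simply interpolates two pre-computed loss values; the variable names and the value of λ are hypothetical, and in practice the two losses would be produced by the CTC Attention layer and the attention decoder during training.

```python
def multi_task_loss(l_ctc: float, l_attention: float, lam: float = 0.3) -> float:
    """Equation (1): L_MTL = lam * L_ctc + (1 - lam) * L_attention."""
    assert 0.0 <= lam <= 1.0
    return lam * l_ctc + (1.0 - lam) * l_attention

# Example: hypothetical loss values from the CTC branch and the attention decoder branch.
print(multi_task_loss(l_ctc=45.2, l_attention=30.7, lam=0.3))
```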
For the CTC criterion, in order to handle the fact that the output sequence is shorter than the input sequence, a blank symbol is added to the output symbol set, and repeated occurrences of symbols are allowed in the frame-level output path.
In another embodiment, the conditional probability with which the CTC criterion predicts the entire output sequence is:
P(y|x) = Σ_{π ∈ B⁻¹(y)} P(π|x)    (2)
By assuming mutual independence between frames, the above formula can be decomposed into:
P(π|x) = Π_{t=1}^{T} P(π_t|x)    (3)
where x is the input speech feature, y is the output sequence, L is the output symbol set, and T is the total number of speech frames. π_{1:T} = (π_1, ..., π_T) is the frame-level output symbol path, with π_t ∈ L' and L' = L ∪ {blank}. P(π_t|x) is the conditional probability at time t, and B is the mapping function that maps an output path to the output symbol sequence.
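The mapping function B can be illustrated with a short sketch: it merges consecutive repeated symbols and then deletes blanks, so that many frame-level paths map to the same output sequence. The blank token string and the example paths below are hypothetical.

```python
from itertools import groupby

BLANK = "<b>"   # hypothetical blank token

def B(path):
    """Collapse a frame-level path to an output symbol sequence:
    first merge consecutive repeats, then delete blanks."""
    merged = [sym for sym, _ in groupby(path)]
    return [sym for sym in merged if sym != BLANK]

# Two different 8-frame paths that both map to the output sequence ['你', '好']:
print(B(["<b>", "你", "你", "<b>", "<b>", "好", "好", "<b>"]))
print(B(["你", "你", "你", "<b>", "好", "<b>", "<b>", "<b>"]))
```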
The attention-based encoder-decoder network estimates the final posterior probability directly, without any conditional-independence assumption, using two networks: an encoding network (the Encoder layer in Fig. 1), whose role is to map the input features x into hidden-layer vectors h (high-dimensional vectors), and a decoding network (the Decoder layer in Fig. 1), whose role is to decode the hidden-layer vectors h into the output symbol sequence y.
In one embodiment, the posterior probability can be expressed as:
P(y|x) = Π_{u=1}^{U} P(y_u | y_{1:u-1}, c_u)    (4)
where c_u is a function of the input features x, and U is the length of the output sequence, which is in general not equal to the number of input frames. P(y_u | y_{1:u-1}, c_u) can be expressed as:
P(y_u | y_{1:u-1}, c_u) = Decoder(y_{u-1}, s_{u-1}, c_u)    (5)
c_u = Σ_{t=1}^{T} a_ut·h_t    (6)
h_t = Encoder(x)    (7)
a_ut = Attend(s_{u-1}, a_{u-1}, h_t)    (8)
where Encoder() and Decoder() denote the encoding network and the decoding network respectively, s is the hidden state vector of the decoding network, h is the hidden state vector of the encoding network, and Attend() is the attention network. The attention weight a_ut is computed as:
a_ut = exp(e_ut) / Σ_{t'=1}^{T} exp(e_ut')    (9)
e_ut = Score(s_{u-1}, a_{u-1}, h_t)    (10)
where Score() may be either content-based or location-based attention. In one embodiment (content-based attention),
e_ut = v^T·tanh(K·s_{u-1} + W·h_t)    (11)
In a further embodiment (location-based attention),
e_ut = v^T·tanh(K·s_{u-1} + Q·(F * a_{u-1}) + W·h_t)    (12)
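A minimal NumPy sketch of equations (9), (10) and (12) follows. It assumes randomly initialized parameter matrices v, K, Q, W and a convolution filter bank F of hypothetical sizes; in a real system these are learned jointly with the encoder and decoder, and the context vector of equation (6) is then fed to the decoder.

```python
import numpy as np

rng = np.random.default_rng(1)
T, enc_dim, dec_dim, att_dim, n_filt, filt_len = 50, 64, 64, 32, 8, 15  # hypothetical

v = rng.standard_normal(att_dim) * 0.1
K = rng.standard_normal((att_dim, dec_dim)) * 0.1
W = rng.standard_normal((att_dim, enc_dim)) * 0.1
Q = rng.standard_normal((att_dim, n_filt)) * 0.1
F = rng.standard_normal((n_filt, filt_len)) * 0.1

def location_attention(s_prev, a_prev, h):
    """Equations (10)/(12): score each encoder state h_t against the previous
    decoder state s_{u-1} and the previous attention weights a_{u-1}."""
    # F * a_{u-1}: convolve the previous attention weights with each filter.
    f = np.stack([np.convolve(a_prev, F[i], mode="same") for i in range(n_filt)], axis=1)  # (T, n_filt)
    e = np.array([v @ np.tanh(K @ s_prev + Q @ f[t] + W @ h[t]) for t in range(len(h))])
    a = np.exp(e - e.max()); a /= a.sum()          # equation (9): softmax over t
    return a

h = rng.standard_normal((T, enc_dim))              # encoder hidden states
s_prev = rng.standard_normal(dec_dim)              # previous decoder state
a_prev = np.full(T, 1.0 / T)                       # uniform attention at the first step
a = location_attention(s_prev, a_prev, h)
c = a @ h                                          # equation (6): context vector c_u
print(a.shape, c.shape)
```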
according to the above description, the learning of attention weight can be effectively restricted by adding joint-meaning classification criterion for joint optimization, so that the learned attention weight keeps monotonous characteristic, however, for the joint-meaning classification criterion, the joint probability is decomposed into products of a series of probabilities by the assumption that the frame condition is independent, and the actual speech does not satisfy the assumption that the frames are independent.
To solve this problem, the present invention proposes an end-to-end speech recognition algorithm based on time-limited self-attention-binding classification, as shown in fig. 1, which incorporates a time-limited attention module before binding-meaning classification criteria, so that the output is not only dependent on the encoded network output at the current moment, but also is related to the encoded network output over a period of time.
In one embodiment, L_CTC can be expressed as:
L_CTC = -ln P(y | ph)    (13)
ph = W_proj·h + b    (14)
where W_proj and b are the weight and bias matrix of the CTC output mapping layer, respectively, and ph is the input of the CTC criterion.
In another embodiment, attention weights are added to the CTC criterion, and the mathematical expression becomes:
L_CTC = -ln P(y | ph_u)    (15)
ph_u = W_proj·c_u + b    (16)
c_u = Σ_{t=u-τ}^{u+τ} a_ut·h_t    (17)
a_ut = Attend(ph_{u-1}, a_{u-1}, h_t)    (18)
where ph_u is the output of the CTC output mapping layer at step u, a_ut is the attention weight, c_u is the weighted sum of the hidden-layer outputs (the hidden layer feeds the classification layers, i.e. the network layers with a classification function, such as the CTC Attention and Decoder layers in Fig. 1), and τ is the attention window length.
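A minimal sketch of equations (16) and (17) follows. The attention weights inside the window are taken as given (in the patent they come from Attend() as in equation (18)); the matrix sizes, the value of τ and the treatment of sequence boundaries are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
T, hid_dim, vocab, tau = 50, 64, 100, 5            # tau: attention window half-length

W_proj = rng.standard_normal((vocab + 1, hid_dim)) * 0.1   # +1 output class for the CTC blank
b = np.zeros(vocab + 1)
h = rng.standard_normal((T, hid_dim))               # encoder hidden states h_t

def attention_ctc_output(u, att_weights):
    """Equations (16)-(17): weighted sum of h_t over the window [u-tau, u+tau],
    then the CTC output mapping ph_u = W_proj c_u + b."""
    lo, hi = max(0, u - tau), min(T, u + tau + 1)
    w = att_weights[: hi - lo]
    w = w / w.sum()                                 # normalized attention weights a_ut
    c_u = w @ h[lo:hi]                              # c_u = sum_t a_ut h_t over the window
    return W_proj @ c_u + b                         # ph_u: scores over vocab + blank

ph_u = attention_ctc_output(u=20, att_weights=rng.random(2 * tau + 1))
print(ph_u.shape)
```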
In one embodiment, the attention weight is a location-based attention weight, with the mathematical expressions shown in equations (9), (10) and (12). However, this attention mechanism has to learn the dependency between two different sequences, which increases the modeling difficulty to some extent. To alleviate this problem, another embodiment presents a CTC criterion based on a self-attention mechanism.
First, the encoding-network output is mapped into the input vector of the attention mechanism by an input mapping matrix:
b_t = W_embd·h_t,   t = u-τ, ..., u+τ    (19)
Second, b_t in equation (19) is mapped to keys, values and queries through linear mapping layers:
q_t = Q·b_t,   t = u    (20)
k_t = K·b_t,   t = u-τ, ..., u+τ    (21)
v_t = V·b_t,   t = u-τ, ..., u+τ    (22)
Finally, the self-attention coefficients and the attention result can be expressed as:
e_ut = (q_u^T·k_t) / √d    (23)
a_ut = exp(e_ut) / Σ_{t'=u-τ}^{u+τ} exp(e_ut')    (24)
c_u = Σ_{t=u-τ}^{u+τ} a_ut·v_t    (25)
where d is the dimension of the keys and queries.
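A compact sketch of equations (19) to (25) follows: the encoder outputs inside the window are mapped to keys and values, the centre frame is mapped to a query, and scaled dot-product attention produces c_u. All parameter matrices and dimensions are hypothetical, and the 1/√d scaling in equation (23) is an assumption based on standard self-attention.

```python
import numpy as np

rng = np.random.default_rng(3)
T, hid_dim, att_dim, tau = 50, 64, 32, 5            # hypothetical sizes

W_embd = rng.standard_normal((att_dim, hid_dim)) * 0.1
Q = rng.standard_normal((att_dim, att_dim)) * 0.1
K = rng.standard_normal((att_dim, att_dim)) * 0.1
V = rng.standard_normal((att_dim, att_dim)) * 0.1
h = rng.standard_normal((T, hid_dim))               # encoder outputs h_t

def time_restricted_self_attention(u):
    """Equations (19)-(25): self-attention restricted to the window [u-tau, u+tau]."""
    lo, hi = max(0, u - tau), min(T, u + tau + 1)
    b_win = (W_embd @ h[lo:hi].T).T                  # (19): b_t = W_embd h_t for the window
    q = Q @ b_win[u - lo]                            # (20): query from the centre frame
    k = (K @ b_win.T).T                              # (21): keys for the window
    v = (V @ b_win.T).T                              # (22): values for the window
    e = k @ q / np.sqrt(att_dim)                     # (23): scaled dot-product scores
    a = np.exp(e - e.max()); a /= a.sum()            # (24): attention weights a_ut
    return a @ v                                     # (25): c_u = sum_t a_ut v_t

c_u = time_restricted_self_attention(u=20)
print(c_u.shape)
```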
as can be seen from the above embodiments, the embodiments of the present invention provide an end-to-end speech recognition algorithm for time-limited self-attention binding meaning classification, which fuses a position-dependent attention mechanism classification and a binding meaning classification, wherein the attention window length is taken according to the influence of different attention window lengths on a recognition result, and further provides a self-attention binding meaning classification criterion, and by combining the self-attention mechanism and the binding meaning classification criterion, the problem that the assumption that frames are independent from each other due to the binding meaning classification is not true is solved, and the performance of the end-to-end speech recognition system can be improved.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. An end-to-end speech recognition method, wherein the end-to-end speech recognition is performed by a neural network model comprising an encoding layer, a decoding layer, and an attention connectionist temporal classification (CTC) layer, the method comprising:
inputting speech features into the encoding layer of the neural network model, wherein the encoding layer converts the speech features into high-dimensional vectors;
the decoding layer computes an attention distribution over the high-dimensional vectors and converts them into a first output symbol sequence representing characters;
the attention CTC classification layer converts the high-dimensional vectors into a second output symbol sequence representing characters, using a CTC classifier together with an attention mechanism;
and the first and second output symbol sequences are combined to obtain the character output symbol sequence of the neural network model.
2. The method of claim 1, wherein the mathematical expression of the classification criterion of the neural network model is:
L_MTL = λ·L_ctc + (1 - λ)·L_attention
where λ is the interpolation coefficient, and L_ctc and L_attention are the classification criteria of the attention CTC classification layer and the decoding layer, respectively.
3. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC classification layer is:
L_CTC = -ln P(y | ph_u)
ph_u = W_proj·c_u + b
c_u = Σ_{t=u-τ}^{u+τ} a_ut·h_t
a_ut = Attend(ph_{u-1}, a_{u-1}, h_t)
where W_proj and b are the weight and bias matrix of the CTC output mapping layer, ph_u is the output of that mapping layer at step u, a_ut is the attention weight, c_u is the weighted sum of the hidden-layer outputs, and τ is the attention window length,
Attend() is the attention function, and the attention weight a_ut is computed as:
a_ut = exp(e_ut) / Σ_{t'} exp(e_ut')
e_ut = Score(s_{u-1}, a_{u-1}, h_t)
where Score() is content-based or location-based attention; with location-based attention the above equation may be:
e_ut = v^T·tanh(K·s_{u-1} + Q·(F * a_{u-1}) + W·h_t).
4. The method of claim 2, wherein the mathematical expression of the classification criterion of the attention CTC classification layer, using a self-attention mechanism, is:
L_CTC = -ln P(y | ph_u)
ph_u = W_proj·c_u + b
c_u = Σ_{t=u-τ}^{u+τ} a_ut·v_t
a_ut = exp(e_ut) / Σ_{t'=u-τ}^{u+τ} exp(e_ut')
e_ut = (q_u^T·k_t) / √d
where W_proj and b are the weight and bias matrix of the CTC output mapping layer, ph_u is the output of that mapping layer at step u, a_ut is the attention weight, c_u is the weighted sum over the window, τ is the attention window length, and d is the dimension of the keys and queries,
and where
q_t = Q·b_t,   t = u
k_t = K·b_t,   t = u-τ, ..., u+τ
v_t = V·b_t,   t = u-τ, ..., u+τ
b_t = W_embd·h_t,   t = u-τ, ..., u+τ
b_t is the encoding-network output h_t mapped into the attention space by the input mapping matrix W_embd; k, v and q are the keys, values and queries; and K, V and Q are the corresponding parameter matrices.
CN202010027248.2A 2020-01-10 2020-01-10 Chinese mandarin character-voice conversion method based on self-attention mechanism Pending CN111243578A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010027248.2A CN111243578A (en) 2020-01-10 2020-01-10 Chinese mandarin character-voice conversion method based on self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010027248.2A CN111243578A (en) 2020-01-10 2020-01-10 Chinese mandarin character-voice conversion method based on self-attention mechanism

Publications (1)

Publication Number Publication Date
CN111243578A true CN111243578A (en) 2020-06-05

Family

ID=70864134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010027248.2A Pending CN111243578A (en) 2020-01-10 2020-01-10 Chinese mandarin character-voice conversion method based on self-attention mechanism

Country Status (1)

Country Link
CN (1) CN111243578A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293291A (en) * 2016-03-30 2017-10-24 中国科学院声学研究所 A kind of audio recognition method end to end based on autoadapted learning rate
US20170372200A1 (en) * 2016-06-23 2017-12-28 Microsoft Technology Licensing, Llc End-to-end memory networks for contextual language understanding
US20180330718A1 (en) * 2017-05-11 2018-11-15 Mitsubishi Electric Research Laboratories, Inc. System and Method for End-to-End speech recognition
US20180374486A1 (en) * 2017-06-23 2018-12-27 Microsoft Technology Licensing, Llc Speaker recognition
US20190189115A1 (en) * 2017-12-15 2019-06-20 Mitsubishi Electric Research Laboratories, Inc. Method and Apparatus for Open-Vocabulary End-to-End Speech Recognition
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Watanabe, S.; Hori, T.; et al.: "ESPnet: End-to-End Speech Processing Toolkit", in Proceedings of Interspeech 2018. *
Wu Long, et al.: "Improving Hybrid CTC/Attention Architecture with Time-Restricted Self-Attention CTC for End-to-End Speech Recognition", Applied Sciences. *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113763933A (en) * 2021-05-06 2021-12-07 腾讯科技(深圳)有限公司 Speech recognition method, and training method, device and equipment of speech recognition model
CN113763933B (en) * 2021-05-06 2024-01-05 腾讯科技(深圳)有限公司 Speech recognition method, training method, device and equipment of speech recognition model
CN113450761A (en) * 2021-06-17 2021-09-28 清华大学深圳国际研究生院 Parallel speech synthesis method and device based on variational self-encoder
CN113450761B (en) * 2021-06-17 2023-09-22 清华大学深圳国际研究生院 Parallel voice synthesis method and device based on variation self-encoder

Similar Documents

Publication Publication Date Title
CN110263323B (en) Keyword extraction method and system based on barrier type long-time memory neural network
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN108763284B (en) Question-answering system implementation method based on deep learning and topic model
CN113158665B (en) Method for improving dialog text generation based on text abstract generation and bidirectional corpus generation
CN110738370B (en) Novel moving object destination prediction algorithm
CN109858044B (en) Language processing method and device, and training method and device of language processing system
CN110059324B (en) Neural network machine translation method and device based on dependency information supervision
CN112926303A (en) Malicious URL detection method based on BERT-BiGRU
CN112906397B (en) Short text entity disambiguation method
CN111581970B (en) Text recognition method, device and storage medium for network context
CN114443827A (en) Local information perception dialogue method and system based on pre-training language model
CN112612881B (en) Chinese intelligent dialogue method based on Transformer
CN111930952A (en) Method, system, equipment and storage medium for long text cascade classification
CN111243578A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN109933773B (en) Multiple semantic statement analysis system and method
WO2023231513A1 (en) Conversation content generation method and apparatus, and storage medium and terminal
Duan et al. A study of pre-trained language models in natural language processing
CN108363685B (en) Self-media data text representation method based on recursive variation self-coding model
CN115688879A (en) Intelligent customer service voice processing system and method based on knowledge graph
CN113177113B (en) Task type dialogue model pre-training method, device, equipment and storage medium
CN113297374B (en) Text classification method based on BERT and word feature fusion
Yu et al. Neural network language model compression with product quantization and soft binarization
CN110717343B (en) Optimal alignment method based on transformer attention mechanism output
CN113901758A (en) Relation extraction method for knowledge graph automatic construction system
CN112182162A (en) Personalized dialogue method and system based on memory neural network

Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
WD01  Invention patent application deemed withdrawn after publication (application publication date: 20200605)