CN113470620A - Speech recognition method - Google Patents

Speech recognition method

Info

Publication number
CN113470620A
CN113470620A
Authority
CN
China
Prior art keywords
voice
layer
speech recognition
data preprocessing
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110761056.9A
Other languages
Chinese (zh)
Inventor
张玉腾
宁新
杜静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Dongting Intelligent Technology Co ltd
Original Assignee
Qingdao Dongting Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Dongting Intelligent Technology Co ltd filed Critical Qingdao Dongting Intelligent Technology Co ltd
Priority to CN202110761056.9A priority Critical patent/CN113470620A/en
Publication of CN113470620A publication Critical patent/CN113470620A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a speech recognition method, which comprises the following steps: performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing obtains FBank feature data from the speech file, and the text data preprocessing obtains the text content of the speech file and extracts the characters appearing in the text content to create a dictionary; constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm and recognizes the segmented fragments based on an attention mechanism; training the speech recognition model based on the FBank feature data and the dictionary data; and recognizing the speech file with the trained speech recognition model and splicing the recognition results into a speech recognition result. In this way, streaming speech recognition results can be improved by taking the preceding context into account.

Description

Speech recognition method
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a speech recognition method.
Background
Due to the rapid development of deep learning technology, more and more end-to-end speech recognition methods based on deep learning have appeared in the field of speech recognition. Compared with traditional methods, an end-to-end speech recognition method simplifies the system architecture: it consists only of neural networks, takes audio data as input, and directly outputs graphemes of the target language. This avoids the need for language-specific experts when building such systems and lowers the implementation threshold.
Streaming speech recognition is the core technology of systems such as dialogue systems, simultaneous interpretation, and real-time subtitles. In recent years, the performance of end-to-end speech recognition systems has outperformed highly optimized hybrid systems. The streaming end-to-end speech recognition systems proposed so far fall mainly into two categories: CTC-based neural network models and RNN-T (Recurrent Neural Network Transducer) based neural network models. These models perform recognition frame by frame, so streaming speech recognition is easy to achieve. Streaming speech recognition is therefore in great demand in real life, but existing streaming approaches cannot effectively exploit context information during recognition.
Disclosure of Invention
To solve the problem that the prior art cannot effectively use context during recognition, the invention provides a speech recognition method that can effectively exploit context and achieves higher recognition efficiency and accuracy.
A speech recognition method according to an embodiment of the present invention includes:
performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary;
constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result.
Further, the speech data preprocessing comprises:
converting the speech file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio file.
Further, the text data preprocessing comprises:
extracting the characters appearing in the audio files according to their text content, and creating a dictionary; assigning each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
Furthermore, the speech recognition model comprises a down-sampling layer; the down-sampling layer takes the FBank feature data as input, sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation, then adds a position feature to the convolution output, and the sum of the convolution output and the position feature is taken as the output of the down-sampling layer.
Further, the speech recognition model further includes an encoding layer that encodes the output of the down-sampling layer using a plurality of Transformer Encoder network blocks based on the attention mechanism.
Furthermore, the speech recognition model further comprises a trigger layer; the trigger layer is composed of a CTC module and is used for identifying the time points at which graphemes are output in the sequence produced by the encoding layer, and for segmenting the encoding layer output into output blocks.
Further, the speech recognition model further comprises a decoding layer composed of a plurality of Transformer Decoder network blocks based on the attention mechanism; the output blocks of the encoding layer segmented by the trigger layer are sequentially fed into the decoding layer to obtain the output of the decoding layer.
Further, the decoding layer decodes based on a beam search algorithm.
Further, the CTC-based trigger layer generates segmentation events, and whether to activate the decoding layer is determined according to the softmax result of each segmentation event.
The invention has the following beneficial effects: after data preprocessing of a speech file, the attention-based speech recognition method is applied to streaming speech recognition; a CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account, making the recognition result more accurate and reliable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of a method of speech recognition provided in accordance with an exemplary embodiment;
FIG. 2 is a block diagram of a speech recognition model provided in accordance with an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a speech recognition method, which specifically includes the following steps:
101. Performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary. Data set preparation thus consists mainly of two steps: speech data preprocessing and text data preprocessing.
102. Constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism.
103. Training the speech recognition model based on the FBank feature data and the dictionary data.
104. Recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result. Streaming audio is decoded in a frame-synchronous manner.
The attention-based speech recognition method is applied to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account, making the recognition result more accurate and reliable.
As a possible implementation of the above embodiment, data set preparation first includes two steps, speech data preprocessing and text data preprocessing:
Speech data preprocessing: the speech file is converted to WAV format with a sampling rate of 8 kHz and a single channel. To extract the FBank features of each audio file, the speech signal is first pre-emphasized to boost the high-frequency components and flatten the signal spectrum; pre-emphasis is implemented with a first-order FIR high-pass filter, with the following formula:
y(n)=x(n)-ax(n-1),0.9<a<1.0
where a is the pre-emphasis coefficient. The speech signal is then framed, that is, cut into short time segments of fixed length within which it can be treated as a stationary signal; the frame length is usually set to 20-50 ms. The framed speech is then windowed to reduce spectral-leakage errors and make the time-domain signal better satisfy the periodicity requirement of the Fourier transform; a Hamming window is usually chosen as the window function:
w(n) = 0.54 - 0.46 cos(2πn/(N-1)), 0 ≤ n ≤ N-1
where N is the number of samples in a frame. A short-time Fourier transform is then applied to each windowed speech segment to obtain frequency-domain information, and the power spectrum is computed as:
P(k) = |X(k)|^2 / N
Finally, the FBank features are obtained by applying Mel filtering to the power spectrum and taking the logarithm of the result. The conversion between the frequency f and the Mel frequency m is:
m = 2595 log10(1 + f/700)
f = 700 (10^(m/2595) - 1)
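As an illustrative sketch of this preprocessing chain (pre-emphasis, framing, Hamming windowing, power spectrum, Mel filtering, and logarithm), the following NumPy code computes log-FBank features. The frame length, frame shift, FFT size, and number of Mel filters are assumed values rather than parameters stated in the patent.

```python
import numpy as np

def fbank_features(signal, sample_rate=8000, pre_emph=0.97,
                   frame_len_ms=25, frame_shift_ms=10, n_fft=512, n_mels=40):
    """Compute log Mel filterbank (FBank) features from a mono waveform."""
    # Pre-emphasis: y(n) = x(n) - a * x(n - 1)
    y = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])

    # Framing (20-50 ms frames; 25 ms frames with a 10 ms shift assumed here)
    flen = int(sample_rate * frame_len_ms / 1000)
    fshift = int(sample_rate * frame_shift_ms / 1000)
    y = np.pad(y, (0, max(0, flen - len(y))))
    n_frames = 1 + (len(y) - flen) // fshift
    frames = np.stack([y[i * fshift:i * fshift + flen] for i in range(n_frames)])

    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))
    frames = frames * np.hamming(flen)

    # Short-time Fourier transform and power spectrum P(k) = |X(k)|^2 / N
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Mel filterbank: m = 2595 log10(1 + f/700), f = 700 (10^(m/2595) - 1)
    high_mel = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, high_mel, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)

    # Log of the Mel-filtered power spectrum
    return np.log(np.maximum(power @ fbank.T, 1e-10))
```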
Text data preprocessing: the characters appearing in the data set are extracted from the text content corresponding to the audio, and a dictionary is created. Three special symbols are then added to the dictionary, <blank>, <eos/sos>, and <unk>, which respectively denote the blank in CTC, the text start/end symbols, and the unknown-character symbol. Each character in the dictionary is assigned an index starting from 0, each character in the original text is replaced with its corresponding index, and the text to be trained is generated. Note that the <blank> index is 0, the <unk> index is 1, the <eos/sos> index is 2, and the indices of the other characters can be set arbitrarily.
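A minimal sketch of this text preprocessing step follows; the function names are illustrative, and only the index convention (<blank> = 0, <unk> = 1, <eos/sos> = 2) is taken from the description above.

```python
def build_dictionary(transcripts):
    """Create the character dictionary with the three special symbols:
    <blank> = 0 (CTC blank), <unk> = 1 (unknown character), <eos/sos> = 2."""
    vocab = {"<blank>": 0, "<unk>": 1, "<eos/sos>": 2}
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)  # remaining indices assigned in order of appearance
    return vocab

def text_to_indices(text, vocab):
    """Replace each character of the original text with its dictionary index."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]
```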
Referring to the structure diagram of the speech recognition model shown in FIG. 2, the down-sampling layer takes the processed FBank features of the speech as input and passes them through two two-dimensional convolution operations whose parameters are, in order: convolution kernel 3, stride 2; convolution kernel 5, stride 3. After each two-dimensional convolution operation, a nonlinear transformation is applied. A position feature is then added to the convolution output; this feature is generated by absolute positional encoding, defined as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the input position, PE(pos, 2i) is the position code for the even dimensions, PE(pos, 2i+1) is the position code for the odd dimensions, and d_model is the dimension of the position feature. The output of the two-dimensional convolution and the position feature are added to obtain the output of the down-sampling layer.
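The following PyTorch sketch illustrates such a down-sampling layer under stated assumptions: the model dimension, the ReLU nonlinearity, and the linear projection that flattens the convolution output per frame are choices made here, not details given in the patent; only the convolution parameters (kernel 3 / stride 2, kernel 5 / stride 3) and the sinusoidal position feature follow the description.

```python
import math
import torch
import torch.nn as nn

class DownsamplingLayer(nn.Module):
    """Two 2-D convolutions (kernel 3 / stride 2, then kernel 5 / stride 3),
    each followed by a nonlinearity, plus absolute sinusoidal position features."""
    def __init__(self, d_model=256, max_len=5000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, d_model, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, kernel_size=5, stride=3), nn.ReLU(),
        )
        self.proj = nn.LazyLinear(d_model)        # flattens the conv output per frame
        # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(...)
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, fbank):                     # fbank: (batch, time, n_mels)
        x = self.conv(fbank.unsqueeze(1))         # (batch, d_model, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, T', d_model * F')
        x = self.proj(x)                          # (batch, T', d_model)
        return x + self.pe[: x.size(1)]           # add the position feature
```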
Encoding layer: this layer is composed of N Transformer Encoder network blocks, where N is an integer greater than 2. The input of the first Transformer block is the output of the down-sampling layer, and the input of every other Transformer block is the output of the previous block. A Transformer block comprises, in order, a multi-head attention layer, a normalization layer, a residual connection, and a feed-forward layer; it can be constructed according to the existing structure and is not described further here.
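For the encoding layer, an off-the-shelf Transformer encoder can serve as a stand-in in a sketch; the values of N, the number of attention heads, and the feed-forward size below are assumptions, not values given in the patent.

```python
import torch.nn as nn

# N Transformer Encoder blocks (multi-head attention, layer norm, residual
# connections, feed-forward); N = 6 and nhead = 4 are assumed values.
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4,
                                           dim_feedforward=1024, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# enc_out = encoder(downsampled_features)   # (batch, T', d_model)
```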
Activation layer (trigger layer): this layer is composed of a CTC module. It identifies the time points at which graphemes are output in the sequence produced by the encoding layer and controls whether the decoding-layer network is activated. Feeding the encoding layer output into the activation layer yields an output sequence; owing to the characteristics of CTC, this sequence is obtained by selecting the maximum value at each time step. For each run of consecutive identical non-<blank> symbols in the sequence, only the first is kept and the remainder are replaced with <blank>; the encoding layer output is then segmented at the indices i of the non-<blank> symbols, producing H output blocks of the encoding layer. When segmenting, the cut position can be controlled by a parameter e and set to i - e, so that the model can see more history information.
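One possible reading of this segmentation rule is sketched below: trigger points are the first frames of non-blank runs in the greedy CTC path, and each cut is moved back by e frames (the `history` parameter) to expose extra left context. The variable names and the value of e are illustrative.

```python
def trigger_segments(enc_out, ctc_logits, blank_id=0, history=4):
    """Segment the encoding layer output at CTC trigger points.
    enc_out: (T, d_model) encoder frames; ctc_logits: (T, vocab) CTC scores.
    A trigger is the first frame of a run of identical non-blank symbols in
    the greedy CTC path; each cut is placed at i - e (`history` frames back)
    so that the decoder sees extra left context."""
    path = ctc_logits.argmax(dim=-1).tolist()     # maximum symbol at each time step
    triggers, prev = [], blank_id
    for t, sym in enumerate(path):
        if sym != blank_id and sym != prev:       # keep only the first of a run
            triggers.append(t)
        prev = sym
    cuts = [max(0, t - history) for t in triggers] + [enc_out.size(0)]
    blocks = [enc_out[cuts[h]:cuts[h + 1]] for h in range(len(triggers))]
    return triggers, blocks                       # H trigger points and H blocks
```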
Decoding layer: this layer is composed of M Transformer Decoder network blocks, where M is an integer greater than 2. The output blocks of the encoding layer segmented by the activation layer are sequentially fed into the decoding layer to obtain the output of the decoding layer. The results of all the blocks are spliced to obtain the recognition result.
Finally, the streaming audio is decoded in a frame-synchronous manner. The CTC-based activation layer generates segmentation events, and whether to activate the decoding layer is decided according to the softmax result of each event; in our experiments, events with a result greater than 0.6 activate the decoding layer. The encoding layer output between the two most recent events is sent to the decoding layer, and the decoding process uses the conventional beam search algorithm. During decoding, because the sequence fed to the decoding layer each time corresponds to one symbol of the CTC result, there is no misalignment problem, and the length-constraint penalty factor used in label-synchronous decoding is not required.
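The frame-synchronous decoding loop can be sketched as follows. Here `decode_block` stands in for the Transformer decoder with beam search, the 0.6 activation threshold is the value quoted above, and the whole utterance is processed in one pass for simplicity; this is a hedged sketch rather than the patent's implementation.

```python
import torch

ACTIVATION_THRESHOLD = 0.6   # events above this softmax value activate the decoder

def frame_synchronous_decode(fbank, downsample, encoder, ctc_head, decode_block):
    """Whenever the CTC softmax at a frame assigns more than the threshold to a
    non-blank symbol, the encoder output between the two most recent events is
    sent to the decoder; decode_block wraps the Transformer decoder plus beam
    search and returns the graphemes for one block."""
    enc_out = encoder(downsample(fbank))                  # (1, T', d_model)
    probs = torch.softmax(ctc_head(enc_out), dim=-1)      # (1, T', vocab)
    hypothesis, last_event = [], 0
    for t in range(enc_out.size(1)):
        top_prob, top_sym = probs[0, t].max(dim=-1)
        if top_sym.item() != 0 and top_prob.item() > ACTIVATION_THRESHOLD:
            segment = enc_out[:, last_event:t + 1]        # output between two events
            hypothesis.extend(decode_block(segment, hypothesis))
            last_event = t + 1
    return hypothesis
```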
In some embodiments of the present invention, for training of the speech recognition model, let S = (s_1, ..., s_T) denote a CTC frame sequence of length T, where s_t ∈ E ∪ {<blank>}, E denotes the set of distinct graphemes, and <blank> denotes the blank symbol. Let K = (k_1, ..., k_L), where k_l ∈ E, denote a grapheme sequence of length L, and assume that when repeated labels are collapsed into single instances and blank symbols are removed, the sequence S reduces to K. The CTC derivation is as follows:
p_ctc(K|H) = Σ_{S → K} p(S|K) p(S|H)
(the sum runs over all frame-level sequences S that reduce to K)
where p (S | K) represents transition probability and p (S | H) represents acoustic model.
With the alignment information provided by the activation layer, the probability modeled by the decoding layer is derived as follows:
p_att(K|H) = Π_{l=1..L} p(k_l | k_1, ..., k_(l-1), h_1, ..., h_(t_l))
(t_l denotes the trigger frame of the l-th grapheme and h_t the t-th frame of the encoding layer output)
The activation layer and the decoding layer are co-trained using a multi-objective loss function, with the following formula:
L_MTL = λ log p_ctc(K|H) + (1 - λ) log p_att(K|H)
where λ is a hyperparameter controlling the relative weights of p_ctc and p_att.
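A PyTorch sketch of this joint objective, written in its negative log-likelihood form as a loss to minimize, is shown below; λ = 0.3 and the target/padding conventions are assumptions, not values taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F

ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def multi_objective_loss(ctc_log_probs, enc_lens, ctc_targets, tgt_lens,
                         dec_logits, dec_targets, lam=0.3):
    """lam * L_ctc + (1 - lam) * L_att, the loss form of the objective above."""
    # CTC branch of the activation layer: ctc_log_probs has shape (T, batch, vocab)
    l_ctc = ctc_criterion(ctc_log_probs, ctc_targets, enc_lens, tgt_lens)
    # Attention branch of the decoding layer: dec_logits (batch, L, vocab)
    l_att = F.cross_entropy(dec_logits.transpose(1, 2), dec_targets)
    return lam * l_ctc + (1 - lam) * l_att
```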
The speech recognition method provided by the above embodiments of the present invention applies attention-based speech recognition to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes the result of each fragment, and the fragment recognition results are finally spliced to obtain a complete result, so that the streaming speech recognition result can be improved by taking the preceding context into account.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. A speech recognition method, comprising:
performing data preprocessing on a speech file, wherein the data preprocessing comprises speech data preprocessing and text data preprocessing; the speech data preprocessing is used for obtaining FBank feature data from the speech file, and the text data preprocessing is used for obtaining the text content of the speech file and extracting the characters appearing in the text content to create a dictionary;
constructing a speech recognition model, wherein the speech recognition model segments the speech sequence based on the CTC algorithm, and recognizes the segmented fragments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the speech file with the trained speech recognition model, and splicing the recognition results into a speech recognition result.
2. The speech recognition method of claim 1, wherein the speech data preprocessing comprises:
converting the speech file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio file.
3. The speech recognition method of claim 1, wherein the text data preprocessing comprises:
extracting the characters appearing in the audio files according to their text content, and creating a dictionary; assigning each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
4. The speech recognition method according to claim 1, wherein the speech recognition model comprises a down-sampling layer; the down-sampling layer takes the FBank feature data as input, sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation, then adds a position feature to the convolution output, and adds the convolution output and the position feature to obtain the output of the down-sampling layer.
5. The speech recognition method of claim 4, wherein the speech recognition model further comprises an encoding layer that encodes the output of the down-sampling layer using a plurality of Transformer Encoder network blocks based on the attention mechanism.
6. The speech recognition method of claim 5, wherein the speech recognition model further comprises a trigger layer, the trigger layer being composed of a CTC module and used for identifying the time points at which graphemes are output in the sequence produced by the encoding layer and for segmenting the encoding layer output into output blocks.
7. The speech recognition method of claim 6, wherein the speech recognition model further comprises a decoding layer composed of a plurality of Transformer Decoder network blocks based on the attention mechanism, and the output blocks of the encoding layer segmented by the trigger layer are sequentially fed into the decoding layer to obtain the output of the decoding layer.
8. The speech recognition method of claim 7, wherein the decoding layer decodes based on a beam search algorithm.
9. The speech recognition method of claim 7, wherein the CTC-based trigger layer further generates segmentation events, and whether to activate the decoding layer is determined according to the softmax result of each segmentation event.
CN202110761056.9A 2021-07-06 2021-07-06 Speech recognition method Pending CN113470620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110761056.9A CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110761056.9A CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Publications (1)

Publication Number Publication Date
CN113470620A true CN113470620A (en) 2021-10-01

Family

ID=77878353

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110761056.9A Pending CN113470620A (en) 2021-07-06 2021-07-06 Speech recognition method

Country Status (1)

Country Link
CN (1) CN113470620A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190279614A1 (en) * 2018-03-09 2019-09-12 Microsoft Technology Licensing, Llc Advancing word-based speech recognition processing
CN110992941A (en) * 2019-10-22 2020-04-10 国网天津静海供电有限公司 Power grid dispatching voice recognition method and device based on spectrogram
CN111048082A (en) * 2019-12-12 2020-04-21 中国电子科技集团公司第二十八研究所 Improved end-to-end speech recognition method
CN111415667A (en) * 2020-03-25 2020-07-14 极限元(杭州)智能科技股份有限公司 Stream-type end-to-end speech recognition model training and decoding method
CN111429889A (en) * 2019-01-08 2020-07-17 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN112037798A (en) * 2020-09-18 2020-12-04 中科极限元(杭州)智能科技股份有限公司 Voice recognition method and system based on trigger type non-autoregressive model
CN112382278A (en) * 2020-11-18 2021-02-19 北京百度网讯科技有限公司 Streaming voice recognition result display method and device, electronic equipment and storage medium
CN112489637A (en) * 2020-11-03 2021-03-12 北京百度网讯科技有限公司 Speech recognition method and device

Similar Documents

Publication Publication Date Title
CN110827801B (en) Automatic voice recognition method and system based on artificial intelligence
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN111477221A (en) Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111429889A (en) Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention
CN111968629A (en) Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC
CN110797002B (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112217947B (en) Method, system, equipment and storage medium for transcribing text by customer service telephone voice
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN111028824A (en) Method and device for synthesizing Minnan
CN112489616A (en) Speech synthesis method
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN113436612A (en) Intention recognition method, device and equipment based on voice data and storage medium
CN113724718A (en) Target audio output method, device and system
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113470620A (en) Speech recognition method
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN115240645A (en) Stream type voice recognition method based on attention re-scoring
CN114783428A (en) Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium
CN114512121A (en) Speech synthesis method, model training method and device
CN115565547A (en) Abnormal heart sound identification method based on deep neural network
CN114550741A (en) Semantic recognition method and system
CN117095674B (en) Interactive control method and system for intelligent doors and windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination