CN113470620A - Speech recognition method - Google Patents
- Publication number
- CN113470620A (application CN202110761056.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- layer
- speech recognition
- data preprocessing
- recognition model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/26—Speech to text systems
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a speech recognition method comprising the following steps: performing data preprocessing on a voice file, wherein the data preprocessing comprises voice data preprocessing and text data preprocessing; the voice data preprocessing acquires FBank feature data from the voice file, and the text data preprocessing acquires the text content of the voice file and extracts the characters appearing in it to create a dictionary; constructing a speech recognition model that segments the speech sequence based on the CTC algorithm and recognizes the segmented fragments based on an attention mechanism; training the speech recognition model on the FBank feature data and the dictionary data; and recognizing the voice file with the trained model and splicing the per-fragment results into the final recognition result. In this way, the streaming speech recognition result can be improved by taking prior context information into account.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a voice recognition method.
Background
Due to the rapid development of deep learning, more and more end-to-end speech recognition methods based on deep learning have appeared in the field. Compared with traditional methods, end-to-end speech recognition simplifies the system architecture: the system consists only of a neural network that takes audio data as input and directly outputs graphemes of the target language. This removes the need for language-specific experts when building such systems and lowers the implementation threshold.
Streaming speech recognition is the core technology of systems such as dialogue systems, simultaneous interpretation and real-time subtitling. In recent years, the performance of end-to-end speech recognition systems has surpassed that of highly optimized hybrid systems. The streaming end-to-end systems proposed so far fall mainly into two categories: CTC-based neural network models and RNN-T (Recurrent Neural Network Transducer) based models. These models recognize on a frame-by-frame basis, so streaming recognition is easy to achieve. Streaming speech recognition is therefore in huge demand in real life, but existing streaming systems cannot effectively exploit context information during recognition.
Disclosure of Invention
In order to solve the problem that the prior art cannot effectively use context during recognition, the invention provides a speech recognition method that can effectively link context information and achieves higher recognition efficiency and accuracy.
A speech recognition method according to an embodiment of the present invention includes:
performing data preprocessing on a voice file, wherein the data preprocessing comprises voice data preprocessing and text data preprocessing, the voice data preprocessing is used for acquiring FBank feature data in the voice file, and the text data preprocessing is used for acquiring the text content in the voice file and extracting the characters appearing in the text content to create a dictionary;
constructing a voice recognition model, wherein the voice recognition model carries out segmentation of a voice sequence based on a CTC algorithm; the voice recognition model recognizes the segmented segments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the voice file by using the trained voice recognition model, and splicing the recognition result into a voice recognition result.
Further, the voice data preprocessing comprises:
and converting the voice file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio clip.
Further, the text data preprocessing comprises:
extracting the characters appearing in the audio file according to its text content and creating a dictionary; giving each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
Furthermore, the speech recognition model comprises a down-sampling layer; the down-sampling layer takes the FBank feature data as input and sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation; position features are then added to the convolution output, and the sum of the convolution output and the position features is the output of the down-sampling layer.
Further, the speech recognition model further includes an encoding layer that encodes the output of the down-sampling layer using a plurality of attention-based Transformer Encoder network blocks.
Furthermore, the speech recognition model further comprises a trigger layer, the trigger layer being composed of a CTC module for identifying the time points at which graphemes are output in the sequence produced by the coding layer and segmenting the coding layer output into output blocks.
Further, the speech recognition model further comprises a decoding layer composed of a plurality of attention-based Transformer Decoder network blocks; the output blocks of the coding layer segmented by the trigger layer are sequentially input into the decoding layer to obtain the output of the decoding layer.
Further, the decoding layer decodes based on a beam search algorithm.
Further, the trigger layer generates a segmentation event based on CTC, and whether to activate the decoding layer is judged according to the softmax result of the segmentation event.
The invention has the beneficial effects that: after data preprocessing of the voice file, the attention-based speech recognition method is applied to streaming speech recognition; the CTC model cuts the input sequence into small fragments, the attention-based model recognizes each fragment, and the fragment recognition results are finally spliced into a complete result, so that the streaming recognition result can be improved by taking prior context information into account and is more accurate and reliable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a method of speech recognition provided in accordance with an exemplary embodiment;
FIG. 2 is a block diagram of a speech recognition model provided in accordance with an exemplary embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a speech recognition method, which specifically includes the following steps:
101. Performing data preprocessing on a voice file, wherein the data preprocessing comprises voice data preprocessing and text data preprocessing, the voice data preprocessing is used for acquiring FBank feature data in the voice file, and the text data preprocessing is used for acquiring the text content in the voice file and extracting the characters appearing in it to create a dictionary. Data set preparation thus mainly consists of the two steps of voice data preprocessing and text data preprocessing.
102. Constructing a voice recognition model, wherein the voice recognition model carries out segmentation of a voice sequence based on a CTC algorithm; the voice recognition model recognizes the segmented segments based on an attention mechanism;
103. training the speech recognition model based on FBank feature data and dictionary data;
104. and recognizing the voice file by using the trained voice recognition model, and splicing the recognition result into a voice recognition result. The decoding of the streaming audio is realized by means of frame synchronization.
The attention-based speech recognition method is thus applied to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes each fragment, and the fragment recognition results are finally spliced into a complete result, so that the streaming recognition result can be improved by taking prior context information into account and is more accurate and reliable.
As a possible implementation of the above embodiment, first, the data set preparation includes two steps of voice data preprocessing and text data preprocessing:
Voice data preprocessing: the voice file is converted into WAV format with a sampling rate of 8 kHz and a single channel. To extract the FBank features of each audio clip, the speech signal is first pre-emphasized to boost the high-frequency components and flatten the signal spectrum; pre-emphasis is realized with a first-order FIR high-pass filter:
y(n) = x(n) − αx(n − 1), 0.9 < α < 1.0
where α is the pre-emphasis coefficient. The speech signal is then framed, i.e. cut into segments of fixed short length within which it can be treated as stationary; the frame length is typically set to 20–50 ms. The framed signal is windowed to reduce spectral-leakage error and to make the time-domain signal better satisfy the periodicity requirement of the Fourier transform; a Hamming window is usually chosen as the window function:
w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1
where N is the total number of samples in the window. A short-time Fourier transform X(k) is then applied to each windowed segment to obtain frequency-domain information, and the power spectrum is computed as
P(k) = |X(k)|² / N
Finally, the FBank features are obtained by applying Mel filtering and taking the logarithm of the result. The conversion between frequency f and Mel frequency m is:
m = 2595 log₁₀(1 + f / 700)
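The voice preprocessing chain above (pre-emphasis, framing, Hamming windowing, power spectrum, Mel filtering, logarithm) can be sketched as follows. Only the 8 kHz sampling rate comes from the patent; the frame length, hop, FFT size and filter count below are illustrative assumptions.

```python
import numpy as np

def fbank(signal, sample_rate=8000, alpha=0.97, frame_ms=25, hop_ms=10,
          n_mels=5, n_fft=256):
    """Sketch of the FBank pipeline described above; parameter values
    are illustrative assumptions, not the patent's configuration."""
    # 1. Pre-emphasis: y(n) = x(n) - alpha * x(n-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. Framing into fixed short segments
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    # 3. Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (N-1))
    frames = frames * np.hamming(frame_len)
    # 4. Short-time Fourier transform, then power spectrum P = |X|^2 / N
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 5. Triangular Mel filterbank; m = 2595 * log10(1 + f / 700)
    high = 2595 * np.log10(1 + (sample_rate / 2) / 700)
    hz_points = 700 * (10 ** (np.linspace(0.0, high, n_mels + 2) / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    # 6. Apply the filters and take the logarithm
    return np.log(np.maximum(power @ fb.T, 1e-10))

# One second of a 440 Hz tone at the 8 kHz rate used in the patent
t = np.arange(8000) / 8000
feats = fbank(np.sin(2 * np.pi * 440 * t))
```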
Text data preprocessing: extract the characters appearing in the transcripts corresponding to the audio in the data set and create a dictionary; then add three special symbols to the dictionary: <blank>, <eos/sos> and <unk>, denoting respectively the blank symbol in CTC, the text start/end symbol and the unknown-character symbol. Give each character in the dictionary an index starting from 0, replace each character in the original text with its index, and generate the text to be trained. Note: here the <blank> index is 0, the <unk> index is 1 and the <eos/sos> index is 2; the indexes of the other characters can be assigned arbitrarily.
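The dictionary step can be illustrated with a short sketch; the function names are hypothetical, and only the special-symbol indices (<blank>=0, <unk>=1, <eos/sos>=2) follow the description above.

```python
def build_dictionary(transcripts):
    """Create a character dictionary with the three special symbols
    fixed at the indices described in the patent text."""
    vocab = {"<blank>": 0, "<unk>": 1, "<eos/sos>": 2}
    for text in transcripts:
        for ch in text:
            if ch not in vocab:
                vocab[ch] = len(vocab)  # remaining indices are arbitrary
    return vocab

def encode(text, vocab):
    # Replace each character with its index; unknown characters map to <unk>
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

vocab = build_dictionary(["hello", "holla"])
ids = encode("hozza", vocab)  # 'z' is not in the dictionary
```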
Referring to the structure diagram of the speech recognition model shown in fig. 2, the down-sampling layer takes the processed FBank features of the speech as input and passes them through two two-dimensional convolution operations in sequence, with parameters: first a convolution kernel of 3 with stride 2, then a convolution kernel of 5 with stride 3. Each two-dimensional convolution operation is followed by a nonlinear transformation. Position features are then added to the convolution output; these features are generated by absolute position encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
where pos is the input position, PE(pos, 2i) is the position code for the even dimensions, PE(pos, 2i+1) is the position code for the odd dimensions, and d is the dimension of the position feature. The output of the down-sampling layer is obtained by adding the convolution output and the position features.
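A minimal sketch of the absolute position encoding and its elementwise addition to the convolution output, using assumed toy dimensions:

```python
import numpy as np

def positional_encoding(max_pos, d_model):
    """Sinusoidal absolute position features:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_pos, d_model))
    pe[:, 0::2] = np.sin(angle)  # even dimensions
    pe[:, 1::2] = np.cos(angle)  # odd dimensions
    return pe

# Down-sampled output and position features are added elementwise;
# (10 frames, model dimension 8) are toy sizes, not the patent's.
conv_out = np.random.randn(10, 8)
out = conv_out + positional_encoding(10, 8)
```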
Encoding layer: this layer is composed of N Transformer Encoder network blocks, where N is an integer greater than 2; the input of the first block is the output of the down-sampling layer, and the input of each subsequent block is the output of the previous one. Each Transformer block comprises, in sequence, a multi-head attention layer, a normalization layer, a residual connection and a feed-forward layer; these can be built according to existing structures and are not described further here.
Trigger (activation) layer: this layer is composed of a CTC module. It identifies the time points at which graphemes are output in the sequence produced by the encoding layer and controls whether the decoding-layer network is activated. The encoding-layer output is fed into the trigger layer, and, owing to the properties of CTC, the output sequence is obtained by taking the maximum value at each time step. For each run of consecutive identical non-<blank> labels, only the first is kept and the rest are replaced with <blank>; the encoding-layer output is then segmented according to the indices i of the non-<blank> labels into H output blocks. When cutting, the cut position can be controlled by a parameter e and set to i − e, so that the model can see more history information.
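The trigger-layer behaviour described above (collapse the frame-level CTC argmax sequence, keep the first frame of each run of identical non-blank labels, and cut the encoder output at index i - e) can be sketched as follows. The exact cutting semantics are an interpretation; only the i - e look-back comes from the description.

```python
import numpy as np

BLANK = 0  # index of <blank> in the dictionary

def trigger_points(frame_ids):
    """Return the indices of trigger frames: the first frame of each
    run of identical non-blank labels in the CTC argmax sequence."""
    triggers, prev = [], BLANK
    for t, label in enumerate(frame_ids):
        if label != BLANK and label != prev:
            triggers.append(t)
        prev = label
    return triggers

def split_encoder_output(enc_out, triggers, e=2):
    """Cut the encoder output at each trigger index i, moving the cut
    back to i - e so the decoder sees extra history (assumed split
    semantics for illustration)."""
    blocks, start = [], 0
    for i in triggers[1:]:
        cut = max(i - e, start + 1)
        blocks.append(enc_out[start:cut])
        start = cut
    blocks.append(enc_out[start:])
    return blocks

frame_ids = [0, 0, 3, 3, 0, 0, 5, 5, 5, 0, 7, 0]  # per-frame argmax labels
enc_out = np.arange(12)                            # toy encoder frames
tr = trigger_points(frame_ids)                     # one trigger per grapheme
blocks = split_encoder_output(enc_out, tr)
```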
Decoding layer: this layer is composed of M Transformer Decoder network blocks, where M is an integer greater than 2. The output blocks of the encoding layer segmented by the trigger layer are input into the decoding layer in sequence to obtain the decoding-layer output, and the results of all blocks are spliced to obtain the recognition result.
Finally, the streaming audio is decoded in a frame-synchronous manner. The CTC-based trigger layer generates segmentation events, and whether to activate the decoding layer is judged according to the softmax result of each event; in our experiments, events with a result greater than 0.6 activate the decoding layer. The encoding-layer output between the two most recent events is sent to the decoding layer, and decoding uses the conventional beam search algorithm. Because the sequence input to the decoding layer each time corresponds to one grapheme in the CTC result, there is no misalignment problem, and no length-constraint penalty factor is needed as in label-synchronous decoding.
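The activation decision can be sketched as below; gating on the softmax probability of the best non-blank label is an interpretation of the description, with the 0.6 threshold taken from the experiment mentioned above.

```python
import numpy as np

def should_activate(logits, blank=0, threshold=0.6):
    """Activate the decoder only when the softmax probability of the
    best non-blank label exceeds the threshold (0.6 in the patent's
    experiments); the gating rule itself is an assumed interpretation."""
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return bool(best != blank and probs[best] > threshold)

fire = should_activate(np.array([0.1, 3.0, 0.5]))  # one confident label
hold = should_activate(np.array([0.5, 0.6, 0.4]))  # no confident label
```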
In some embodiments of the present invention, for training of the speech recognition model, let S = (s_1, …, s_T) denote a CTC frame sequence of length T, where s_t ∈ E ∪ {<blank>}, E denotes the set of distinct graphemes and <blank> denotes the blank symbol. Let K = (k_1, …, k_L), where k_l ∈ E, denote a grapheme sequence of length L, and assume that when repeated labels are folded into a single instance and blank symbols are removed, the sequence S reduces to K; write β⁻¹(K) for the set of frame sequences that reduce to K in this way. The CTC probability is then derived as:
p_ctc(K | H) = Σ_{S ∈ β⁻¹(K)} p(S | K) · p(S | H)
where p(S | K) represents the transition probability and p(S | H) represents the acoustic model.
The alignment information provided by the trigger layer allows the decoding layer to operate on truncated encoder output; its probability can be written as:
p_att(K | H) = Π_{l=1}^{L} p(k_l | k_1, …, k_{l−1}, h_1, …, h_{n_l + e})
where h_1, …, h_{n_l + e} are the encoding-layer outputs up to the trigger frame n_l of the l-th grapheme plus the look-back parameter e described above.
The trigger layer and the decoding layer are trained jointly with a multi-objective loss function:
L = −λ log p_ctc(K | H) − (1 − λ) log p_att(K | H)
where λ ∈ [0, 1] is a hyperparameter controlling the weights of p_ctc and p_att.
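A sketch of the multi-objective combination; the λ value below is an assumed example, not a value fixed by the patent:

```python
import numpy as np

def joint_loss(log_p_ctc, log_p_att, lam=0.3):
    """Lambda-weighted interpolation of the CTC and attention
    log-likelihoods, as in the multi-objective loss above;
    lam=0.3 is an illustrative assumption."""
    return -(lam * log_p_ctc + (1.0 - lam) * log_p_att)

# Toy per-utterance likelihoods under each branch
loss = joint_loss(np.log(0.5), np.log(0.8))
```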
The speech recognition method provided by the above embodiment of the present invention applies attention-based recognition to streaming speech recognition: the CTC model cuts the input sequence into small fragments, the attention-based model recognizes each fragment, and the fragment results are finally spliced into a complete result, so that the streaming recognition result can be improved by taking prior context information into account.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (9)
1. A speech recognition method, comprising:
performing data preprocessing on a voice file, wherein the data preprocessing comprises voice data preprocessing and text data preprocessing, the voice data preprocessing is used for acquiring FBank feature data in the voice file, and the text data preprocessing is used for acquiring the text content in the voice file and extracting the characters appearing in the text content to create a dictionary;
constructing a voice recognition model, wherein the voice recognition model carries out segmentation of a voice sequence based on a CTC algorithm; the voice recognition model recognizes the segmented segments based on an attention mechanism;
training the speech recognition model based on the FBank feature data and the dictionary data;
and recognizing the voice file by using the trained voice recognition model, and splicing the recognition result into a voice recognition result.
2. The speech recognition method of claim 1, wherein the speech data preprocessing comprises:
and converting the voice file into WAV format with a sampling rate of 8 kHz and a single channel, and extracting the FBank features of each audio clip.
3. The speech recognition method of claim 1, wherein the text data preprocessing comprises:
extracting the characters appearing in the audio file according to its text content and creating a dictionary; giving each character in the dictionary an index starting from 0, replacing each character in the original text with its corresponding index, and generating the text to be trained.
4. The speech recognition method according to claim 1, wherein the speech recognition model comprises a down-sampling layer, the down-sampling layer takes the FBank feature data as input and sequentially performs two-dimensional convolution operations, each followed by a nonlinear transformation; position features are then added to the convolution output, and the sum of the convolution output and the position features is the output of the down-sampling layer.
5. The speech recognition method of claim 4, wherein the speech recognition model further comprises an encoding layer that encodes the output of the down-sampling layer using a plurality of attention-based Transformer Encoder network blocks.
6. The speech recognition method of claim 5, wherein the speech recognition model further comprises a trigger layer, the trigger layer being composed of a CTC module for identifying the time points at which graphemes are output in the sequence produced by the coding layer and segmenting the coding layer output into output blocks.
7. The speech recognition method of claim 6, wherein the speech recognition model further comprises a decoding layer composed of a plurality of attention-based Transformer Decoder network blocks, and the output blocks of the coding layer segmented by the trigger layer are sequentially input into the decoding layer to obtain the output of the decoding layer.
8. The speech recognition method of claim 7, wherein the decoding layer decodes based on a beam search algorithm.
9. The speech recognition method of claim 7, wherein the trigger layer generates a segmentation event based on CTC, and whether to activate the decoding layer is determined according to the softmax result of the segmentation event.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110761056.9A CN113470620A (en) | 2021-07-06 | 2021-07-06 | Speech recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113470620A true CN113470620A (en) | 2021-10-01 |
Family
ID=77878353
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110761056.9A Pending CN113470620A (en) | 2021-07-06 | 2021-07-06 | Speech recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113470620A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190279614A1 (en) * | 2018-03-09 | 2019-09-12 | Microsoft Technology Licensing, Llc | Advancing word-based speech recognition processing |
CN110992941A (en) * | 2019-10-22 | 2020-04-10 | 国网天津静海供电有限公司 | Power grid dispatching voice recognition method and device based on spectrogram |
CN111048082A (en) * | 2019-12-12 | 2020-04-21 | 中国电子科技集团公司第二十八研究所 | Improved end-to-end speech recognition method |
CN111415667A (en) * | 2020-03-25 | 2020-07-14 | 极限元(杭州)智能科技股份有限公司 | Stream-type end-to-end speech recognition model training and decoding method |
CN111429889A (en) * | 2019-01-08 | 2020-07-17 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention |
CN112037798A (en) * | 2020-09-18 | 2020-12-04 | 中科极限元(杭州)智能科技股份有限公司 | Voice recognition method and system based on trigger type non-autoregressive model |
CN112382278A (en) * | 2020-11-18 | 2021-02-19 | 北京百度网讯科技有限公司 | Streaming voice recognition result display method and device, electronic equipment and storage medium |
CN112489637A (en) * | 2020-11-03 | 2021-03-12 | 北京百度网讯科技有限公司 | Speech recognition method and device |
- 2021-07-06: CN application CN202110761056.9A, published as CN113470620A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110827801B (en) | Automatic voice recognition method and system based on artificial intelligence | |
CN111968679B (en) | Emotion recognition method and device, electronic equipment and storage medium | |
CN111477221A (en) | Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network | |
CN111429889A (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
CN111968629A (en) | Chinese speech recognition method combining Transformer and CNN-DFSMN-CTC | |
CN110797002B (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN112217947B (en) | Method, system, equipment and storage medium for transcribing text by customer service telephone voice | |
CN113674732B (en) | Voice confidence detection method and device, electronic equipment and storage medium | |
CN114360557B (en) | Voice tone conversion method, model training method, device, equipment and medium | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN111028824A (en) | Method and device for synthesizing Minnan | |
CN112489616A (en) | Speech synthesis method | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN113436612A (en) | Intention recognition method, device and equipment based on voice data and storage medium | |
CN113724718A (en) | Target audio output method, device and system | |
CN113793599B (en) | Training method of voice recognition model, voice recognition method and device | |
CN113782042B (en) | Speech synthesis method, vocoder training method, device, equipment and medium | |
CN113470620A (en) | Speech recognition method | |
CN114626424B (en) | Data enhancement-based silent speech recognition method and device | |
CN115240645A (en) | Stream type voice recognition method based on attention re-scoring | |
CN114783428A (en) | Voice translation method, voice translation device, voice translation model training method, voice translation model training device, voice translation equipment and storage medium | |
CN114512121A (en) | Speech synthesis method, model training method and device | |
CN115565547A (en) | Abnormal heart sound identification method based on deep neural network | |
CN114550741A (en) | Semantic recognition method and system | |
CN117095674B (en) | Interactive control method and system for intelligent doors and windows |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||