CN113763939B - Mixed voice recognition system and method based on end-to-end model - Google Patents
Mixed voice recognition system and method based on end-to-end model
- Publication number
- CN113763939B CN113763939B CN202111041405.6A CN202111041405A CN113763939B CN 113763939 B CN113763939 B CN 113763939B CN 202111041405 A CN202111041405 A CN 202111041405A CN 113763939 B CN113763939 B CN 113763939B
- Authority
- CN
- China
- Prior art keywords
- model
- acoustic
- audio data
- training
- encoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L2015/081 Search algorithms, e.g. Baum-Welch or Viterbi
Abstract
The invention relates to a hybrid speech recognition system and method based on an end-to-end model, comprising a feature extraction module, a language model, an acoustic model based on the end-to-end model, a decoder, a word graph re-estimation module and an output module. The invention adopts acoustic-language end-to-end modeling technology to model massive speech data, takes the encoding network of the end-to-end model as the acoustic model, and embeds it into the hybrid speech recognition system, which not only further improves speech recognition accuracy but also solves the problem that a pure end-to-end speech recognition system is difficult to customize for projects. In addition, the invention can further improve recognition accuracy by continuing discriminative acoustic model training (SMBR, MPE, etc.) on the basis of the encoding network of the end-to-end model.
Description
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a mixed voice recognition system and method based on an end-to-end model.
Background
In recent years, with the continuous development of AI technology and computer hardware, the field of speech recognition has developed rapidly. The speech recognition system framework has gone through three stages in succession. The first stage is the template matching system; its most representative algorithm is Dynamic Time Warping (DTW), which realizes a simple isolated-word recognition system by computing the similarity of two templates while warping along the time axis. The second stage is the hybrid speech recognition system, which is based on the hidden Markov model (HMM) framework and is modularized according to the Bayesian formula into five modules: feature extraction, decoder, language model, acoustic model, and post-processing. Feature extraction converts the speech signal from a time-domain signal into frequency-domain features, generally MFCC or FBank. The decoder generally adopts a static decoder based on a weighted finite state transducer (Weighted Finite State Transducer, WFST) and uses the Viterbi algorithm to search the decoding network for the optimal path as the recognition result; the static decoder uniformly represents the language model, pronunciation dictionary, and phoneme modeling of the speech recognition system in WFST form, and then fully optimizes the decoding network using finite-state-machine algorithms such as composition, determinization, and minimization to improve decoding efficiency. In addition, a WFST-based decoder can adopt techniques such as class-based language models, hotword enhancement, and pronunciation dictionary optimization to realize project customization and further improve the recognition rate. The acoustic model has evolved from the traditional Gaussian mixture model (GMM) to deep neural networks (DNN, RNN, LSTM, CNN, and other neural network structures), and the training loss function has evolved from Cross Entropy (CE) to Connectionist Temporal Classification (CTC), with discriminative training (SMBR, MPE, etc.) helping to improve the recognition rate. The third stage is the pure end-to-end speech recognition system, which jointly optimizes the acoustic model and the language model and completely discards the HMM framework; it comprises an Encoder and a Decoder, where the Encoder is responsible for learning high-level features of the speech signal and the Decoder is responsible for learning semantic features and producing the decoding result.
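The first-stage template matching described above can be illustrated with a minimal Dynamic Time Warping sketch (Python, for illustration only; the Euclidean frame distance and the symmetric step pattern are simplifying assumptions, not details taken from the patent):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences.

    a, b: 2-D arrays of shape (frames, dims), e.g. hypothetical MFCC templates.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # warp along the time axis: match, insertion, or deletion
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    return cost[n, m]

# a template compared with itself has zero warped distance
template = np.array([[0.0], [1.0], [2.0]])
print(dtw_distance(template, template))  # -> 0.0
```

Because the path may repeat frames, a time-stretched copy of the same template also yields zero distance, which is what makes DTW usable for isolated-word matching.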
In the related art, pure end-to-end speech recognition systems have developed rapidly, with LAS, RNN-T, and CT (Conformer-Transducer) proposed in succession; the CT structure considers both the global and local features of the speech signal, adopts a joint CTC/Attention optimization mechanism during training, trains stably, and has achieved good results. However, in project optimization, a pure end-to-end speech recognition system faces two bottlenecks: first, if the training set does not match the project field, the recognition effect is poor; second, the recognition rate of certain keywords in the project cannot be optimized quickly.
Disclosure of Invention
In view of the above, the invention aims to overcome the defects of the prior art by providing a hybrid speech recognition system and method based on an end-to-end model, so as to solve the prior-art problems of poor recognition effect when the training set does not match the project field and the difficulty of rapidly optimizing the keyword recognition rate in projects.
In order to achieve the above purpose, the invention adopts the following technical scheme: a hybrid speech recognition system based on an end-to-end model, comprising: the device comprises a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph re-estimation module and an output module;
the characteristic extraction module is used for extracting acoustic characteristics in the audio data;
the language model is used for obtaining the language model score of the candidate text corresponding to the acoustic feature;
the acoustic model based on the end-to-end model is used for acquiring posterior probability of each modeling unit of the acoustic feature; wherein the modeling unit comprises words, single words, pinyin with or without tone and phonemes;
the decoder is used for carrying out weighting processing on the language model scores and the posterior probabilities of the corresponding modeling units, and then carrying out searching and sorting according to the scores after the weighting processing;
the word graph re-estimation module is used for re-estimating and re-ordering the sequenced recognition results;
the output module is used for outputting the reordered identification result.
Further, the method for constructing the acoustic model based on the end-to-end model comprises the following steps:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure, so as to obtain the encoder of the pure end-to-end model;
inputting the training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder through the word graph file and the forced alignment file to obtain the final acoustic model based on the end-to-end model.
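The joint CTC/attention optimization mechanism mentioned above amounts to interpolating the two training losses. A minimal sketch in Python; the interpolation weight `lam` and its 0.3 default are illustrative assumptions, not values specified in the patent:

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, lam=0.3):
    """Interpolate the CTC branch loss with the attention-decoder branch loss.

    lam is the CTC weight; 0.3 is an assumed illustrative default.
    """
    return lam * ctc_loss + (1.0 - lam) * attention_loss

# with lam = 1.0 the objective reduces to pure CTC training
print(joint_ctc_attention_loss(2.0, 1.0, lam=1.0))  # -> 2.0
```

In practice `ctc_loss` and `attention_loss` would be the per-batch losses of the two branches of the pure end-to-end model; the CTC branch stabilizes the alignment learned by the attention decoder.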
Further, the decoder adopts a Viterbi algorithm.
Further, modeling units corresponding to the voice data are modeled in advance to generate a plurality of modeling units; wherein the modeling unit comprises words, single words, pinyin with or without tone, and phonemes.
Further, preprocessing, windowing, FFT transformation and Mel filtering are carried out on the pre-labeled audio data to obtain the acoustic features, or the audio data is used directly as the acoustic features.
Further, preprocessing the pre-labeled audio data includes:
and carrying out noise reduction processing or amplitude adjustment on the pre-marked audio data.
The embodiment of the application provides a mixed voice recognition method based on an end-to-end model, which comprises the following steps:
extracting acoustic features in the audio data;
obtaining the language model score of the candidate text corresponding to the acoustic feature;
acquiring the posterior probability of each modeling unit of the acoustic feature; wherein the modeling unit comprises single words or toned pinyin;
weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and sorting according to the weighted scores;
re-estimating and re-ordering the sequenced recognition results;
and outputting the reordered identification result.
Further, the method for constructing the acoustic model based on the end-to-end model comprises the following steps:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure, so as to obtain the encoder of the pure end-to-end model;
inputting the training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder through the word graph file and the forced alignment file to obtain the final acoustic model based on the end-to-end model.
By adopting the technical scheme, the invention has the following beneficial effects:
the invention provides a mixed voice recognition system and a method based on an end-to-end model, which adopts an acoustic language end-to-end modeling technology to model mass voice data, takes a coding network of the end-to-end model as an acoustic model, embeds the coding network into the mixed voice recognition system, not only further improves the voice recognition accuracy, but also solves the problem that a pure end-to-end voice recognition system is difficult to customize in a project. In addition, the invention can further improve the recognition accuracy by continuously performing the differential acoustic model training (SMBR, MPE, etc.) on the basis of the coding network of the end-to-end model.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the steps of a hybrid speech recognition system based on an end-to-end model according to the present invention;
FIG. 2 is a schematic step-by-step diagram of a method of constructing an end-to-end model-based acoustic model according to the present invention;
fig. 3 is a schematic diagram illustrating steps of a hybrid speech recognition method based on an end-to-end model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It will be apparent that the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the protection scope of the invention.
A specific end-to-end model-based hybrid speech recognition system and method provided in embodiments of the present application are described below with reference to the accompanying drawings.
As shown in fig. 1, a hybrid speech recognition system based on an end-to-end model provided in an embodiment of the present application includes: the device comprises a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph re-estimation module and an output module;
the characteristic extraction module is used for extracting acoustic characteristics in the audio data;
the language model is used for obtaining the language model score of the candidate text corresponding to the acoustic feature;
the acoustic model based on the end-to-end model is used for acquiring posterior probability of each modeling unit of the acoustic feature; wherein the modeling unit comprises words, single words, pinyin with or without tone and phonemes;
the decoder is used for carrying out weighting processing on the language model scores and the posterior probabilities of the corresponding modeling units, and then carrying out searching and sorting according to the scores after the weighting processing;
the word graph re-estimation module is an optional module and is used for re-estimating and re-ordering the sequenced recognition results;
the output module is used for outputting the reordered identification result.
Preferably, modeling is performed on the single word or the pinyin with tone corresponding to the voice data in advance, so as to generate a plurality of modeling units; wherein the modeling unit comprises words, single words, pinyin with or without tone, and phonemes.
The working principle of the hybrid speech recognition system based on the end-to-end model is as follows: the feature extraction module extracts acoustic features from the audio data; the language model obtains the language model score of the candidate text corresponding to the acoustic features; the acoustic model based on the end-to-end model acquires the posterior probability of each modeling unit of the acoustic features, where the modeling unit includes, but is not limited to, single words or toned pinyin comprising initials, finals and tones; the decoder weights the language model scores and the posterior probabilities of the corresponding modeling units and then searches and ranks according to the weighted scores; the word graph re-estimation module re-estimates and re-orders the ranked recognition results; and the output module outputs the re-ordered recognition results, or can output the search-ranked results directly if the search ranking is already accurate. Other modeling units may also be included, which is not limited herein.
Preferably, pre-processing, windowing, FFT conversion and Mel filter processing can be performed on the pre-marked audio data to obtain acoustic features, or the audio data can be directly used as the acoustic features.
Preferably, preprocessing the pre-labeled audio data includes:
and carrying out noise reduction processing or amplitude adjustment on the pre-marked audio data.
Specifically, in the present application, the audio data needs to be processed to obtain the acoustic features of the audio data, where the processing may be implemented with existing technology; for example, preprocessing, windowing, FFT transformation, Mel filtering and other steps are performed on the audio data to extract the acoustic features of the speech to be recognized. The preprocessing may be noise reduction or amplitude adjustment.
Both the language model score and the acoustic model output can be represented as scores; the posterior probability is a basic concept of information theory: in a communication system, the probability, known at the receiving end after a message is received, that this particular message was transmitted is its posterior probability. After the decoder weights the two scores, the scores of a plurality of candidate texts are obtained and ranked from high to low; the word graph re-estimation module then re-evaluates the weighted results and re-orders them where necessary. The word graph re-estimation module may adopt a model from the prior art and can be implemented with existing technology, which is not repeated here.
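The weighting of language model scores with acoustic posteriors and the subsequent ranking can be sketched on a toy n-best list (the `lm_weight` value and the log-domain combination are assumptions for illustration, not parameters given in the patent):

```python
def combine_scores(candidates, lm_weight=0.8):
    """Weight acoustic log-posteriors with language-model log-scores and rank.

    candidates: list of (text, acoustic_logprob, lm_logprob) tuples.
    Returns (text, combined_score) pairs sorted best-first.
    """
    scored = [(text, am + lm_weight * lm) for text, am, lm in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# a hypothetical n-best list: the acoustically better hypothesis loses
# after the language model penalizes the implausible word sequence
nbest = [
    ("speech one", -10.0, -4.0),
    ("speech won", -9.5, -7.0),
]
print(combine_scores(nbest)[0][0])  # -> speech one
```

This is the step the decoder performs before search ranking; the word graph re-estimation module would then re-score the ranked candidates, typically with a stronger language model.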
The invention adopts acoustic-language end-to-end modeling technology to model massive speech data, takes the encoding network of the end-to-end model as the acoustic model, and embeds it into the hybrid speech recognition system, which not only further improves speech recognition accuracy but also solves the problem that a pure end-to-end speech recognition system is difficult to customize in projects. In addition, on the basis of end-to-end model training, the invention further performs discriminative acoustic model training (SMBR, MPE, etc.) to further improve recognition accuracy.
Preferably, as shown in fig. 2, the method for constructing an acoustic model based on an end-to-end model includes:
S101, extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC, Connectionist Temporal Classification) and an attention structure (Attention), so as to obtain the encoder of the pure end-to-end model;
S102, inputting the training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder through the word graph file and the forced alignment file to obtain the final acoustic model based on the end-to-end model.
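Discriminative criteria such as SMBR and MPE minimize the recognizer's expected error over competing hypotheses. A toy sketch of that expected-risk quantity over an n-best list (word mismatches serve as an error proxy; this is a simplified stand-in for the lattice-based computation, not the patent's implementation):

```python
import math

def expected_risk(hyps, ref):
    """Posterior-weighted expected error over competing hypotheses.

    hyps: list of (log_score, word_list); ref: reference word list.
    This is the quantity discriminative training drives toward zero.
    """
    # turn log scores into a posterior distribution via softmax
    m = max(s for s, _ in hyps)
    exps = [math.exp(s - m) for s, _ in hyps]
    z = sum(exps)

    def err(h):
        # toy per-hypothesis error: word mismatches plus length difference
        return sum(a != b for a, b in zip(h, ref)) + abs(len(h) - len(ref))

    return sum(e / z * err(h) for e, (_, h) in zip(exps, hyps))

hyps = [(-1.0, ["hello", "world"]), (-2.0, ["hello", "word"])]
print(expected_risk(hyps, ["hello", "world"]))
```

The correct hypothesis carries most of the posterior mass here, so the expected risk is small (about 0.27); training adjusts the encoder so that correct paths in the word graph gain posterior mass, lowering this value further.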
Preferably, the decoder employs a Viterbi algorithm.
Specifically, the specific implementation steps of the mixed voice recognition system based on the end-to-end model are as follows:
Step 1: model the speech signal with an end-to-end model and the related objective function. Acoustic features usable for model training are extracted from audio data collected and labeled in advance (for example, through traditional signal processing steps such as preprocessing, windowing, FFT (fast Fourier transform) and Mel filtering). The acoustic features serve as the input of model training and the labeled texts serve as the training targets; the model parameters are trained on massive data by deep learning, yielding a usable acoustic-language end-to-end model.
Step 2: extract the encoder of the acoustic-language end-to-end model trained in step 1) and use it as the acoustic model of the hybrid speech recognition system; decode to obtain the word graph file and forced alignment file corresponding to the training set, and perform discriminative training (such as SMBR, MPE, etc.) on this basis.
Step 3: take the acoustic model trained in step 2) as the acoustic model of the final hybrid speech recognition system based on the end-to-end model and use it to calculate posterior probabilities. The input speech signal undergoes the same acoustic feature extraction as in step 1) and is fed into the language model and the end-to-end-model-based acoustic model; the acoustic model outputs the posterior probabilities of all modeling units for each frame, and the language model outputs the language model score of the text.
Step 4: combine the posterior probabilities of the modeling units with the language model score, and use a decoder based on the Viterbi algorithm to search the decoding network for the path with the highest score as the recognition result.
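The Viterbi search of step 4 can be sketched on a toy state graph (log-domain dynamic programming; the two-state graph is a hypothetical stand-in for the real WFST decoding network, and the combined scores would in practice already include the language model weight):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Highest-scoring state path through a small decoding graph.

    log_emit: (T, S) per-frame log posteriors; log_trans: (S, S); log_init: (S,).
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)          # backpointers
    for t in range(1, T):
        cand = score[:, None] + log_trans       # score of (prev -> cur) moves
        back[t] = np.argmax(cand, axis=0)       # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_emit[t]
    # trace the best path backwards from the best final state
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# two states; emissions strongly favor state 0 then state 1
emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
trans = np.log(np.full((2, 2), 0.5))
print(viterbi(emit, trans, np.log(np.array([0.5, 0.5]))))  # -> [0, 1]
```

A production decoder performs the same maximization over a WFST with beam pruning; the recovered path corresponds to the recognition result.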
As shown in fig. 3, an embodiment of the present application provides a method for recognizing mixed speech based on an end-to-end model, including:
s201, extracting acoustic features in audio data;
s202, obtaining language model scores of candidate texts corresponding to the acoustic features;
s203, acquiring posterior probability of each modeling unit of the acoustic feature; wherein the modeling unit includes, but is not limited to, words, toned or non-toned pinyin, and phonemes;
s204, weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then carrying out search sorting according to the weighted scores;
s205, reevaluating and reordering the sequenced recognition results;
s206, outputting the reordered identification result.
The working principle of the hybrid speech recognition method based on the end-to-end model provided by the embodiment of the application is as follows: extract the acoustic features from the audio data; obtain the language model score of the candidate text corresponding to the acoustic features; acquire the posterior probability of each modeling unit of the acoustic features, where the modeling unit comprises words, single words, pinyin with or without tone, and phonemes; weight the language model scores and the posterior probabilities of the corresponding modeling units and rank by the weighted scores; re-estimate and re-order the ranked recognition results; and output the re-ordered recognition result.
Preferably, the method for constructing the acoustic model based on the end-to-end model comprises the following steps:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC, Connectionist Temporal Classification) and an attention structure (Attention), so as to obtain the encoder of the pure end-to-end model;
inputting the training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder through the word graph file and the forced alignment file to obtain the acoustic model based on the end-to-end model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program, and the computer program is used for running the hybrid speech recognition system based on the end-to-end model provided by any of the above embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the present invention provides a hybrid speech recognition system and method based on an end-to-end model, comprising a feature extraction module, a language model, an acoustic model based on the end-to-end model, a decoder, a word graph re-estimation module, and an output module. The invention adopts acoustic-language end-to-end modeling technology to model massive speech data, takes the encoding network of the end-to-end model as the acoustic model, and embeds it into the hybrid speech recognition system, which not only further improves speech recognition accuracy but also solves the problem that a pure end-to-end speech recognition system is difficult to customize in projects. In addition, on the basis of end-to-end model training, the invention further performs discriminative acoustic model training (SMBR, MPE, etc.) to further improve recognition accuracy.
It can be understood that the system embodiments provided above correspond to the method embodiments described above, and the corresponding specific details may be referred to each other, which is not described herein again.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of systems, apparatuses (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any person skilled in the art could readily conceive of variations or substitutions within the technical scope disclosed by the invention, and these should be covered by the protection scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A hybrid speech recognition system based on an end-to-end model, comprising: a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph re-estimation module, and an output module;
the feature extraction module is configured to extract acoustic features from audio data;
the language model is configured to obtain language model scores of candidate texts corresponding to the acoustic features;
the acoustic model based on the end-to-end model is configured to obtain the posterior probability of each modeling unit for the acoustic features; the modeling units comprise words, single characters, pinyin with or without tones, and phonemes;
the decoder is configured to weight the language model scores and the posterior probabilities of the corresponding modeling units, and then search and sort according to the weighted scores;
the word graph re-estimation module is configured to rescore and reorder the sorted recognition results;
the output module is configured to output the reordered recognition results;
wherein the acoustic model based on the end-to-end model is constructed by:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure, so as to obtain an encoder of the pure end-to-end model;
inputting the training set into the encoder and decoding to obtain a word graph (lattice) file and a forced-alignment file corresponding to the training set, and performing discriminative training on the encoder with the word graph file and the forced-alignment file, so as to obtain the final acoustic model based on the end-to-end model.
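The joint CTC/attention optimization named in the claim can be sketched as a weighted multi-task objective: a CTC loss computed by the forward algorithm, interpolated with an attention-decoder cross-entropy. The snippet below is an illustrative toy implementation, not the patent's training code: the interpolation weight `lam = 0.3`, the toy dimensions, and the function names are assumptions, and `ctc_neg_log_likelihood` is a textbook CTC forward pass over already-computed log-probabilities rather than a full encoder.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm. log_probs: (T, V) log-softmax outputs;
    target: label ids without blanks. Returns the negative log-likelihood."""
    T, V = log_probs.shape
    # extended label sequence with blanks interleaved: [a, b] -> [-, a, -, b, -]
    ext = [blank]
    for t in target:
        ext += [t, blank]
    S = len(ext)
    NEG_INF = -1e30
    alpha = np.full(S, NEG_INF)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]
    for t in range(1, T):
        new = np.full(S, NEG_INF)
        for s in range(S):
            cands = [alpha[s]]                      # stay on same state
            if s > 0:
                cands.append(alpha[s - 1])          # advance by one
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip a blank
            new[s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
        alpha = new
    total = np.logaddexp(alpha[S - 1], alpha[S - 2]) if S > 1 else alpha[S - 1]
    return -total

def joint_ctc_attention_loss(ctc_nll, attention_nll, lam=0.3):
    """Multi-task interpolation of CTC and attention losses; lam is illustrative."""
    return lam * ctc_nll + (1.0 - lam) * attention_nll
```

In practice both loss terms would be computed from the same shared encoder output and back-propagated jointly; the value of the interpolation weight is a tuning choice not specified in the claim.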
2. The system of claim 1, wherein the decoder employs a Viterbi algorithm.
3. The system of claim 1, wherein modeling units corresponding to the speech data are modeled in advance to generate a plurality of modeling units.
4. The system of claim 1, wherein the pre-labeled audio data is subjected to preprocessing, windowing, FFT, and Mel filtering to obtain the acoustic features, or the audio data is used directly as the acoustic features.
5. The system of claim 4, wherein preprocessing the pre-labeled audio data comprises: performing noise reduction or amplitude adjustment on the pre-labeled audio data.
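The pipeline of claims 4 and 5 (preprocessing, windowing, FFT, Mel filtering) is the standard log-Mel filterbank front end. The sketch below is a minimal illustration of that pipeline, not the patent's implementation: the 25 ms window, 10 ms hop, 40 filters, and pre-emphasis coefficient 0.97 are common defaults assumed here, not values taken from the patent.

```python
import numpy as np

def mel_filterbank_features(signal, sr=16000, n_fft=512, win_len=400,
                            hop=160, n_mels=40):
    """Pre-emphasis -> framing/windowing -> FFT -> Mel filterbank -> log.
    Returns an array of shape (n_frames, n_mels)."""
    # simple preprocessing: pre-emphasis to flatten the spectrum
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # split into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emph) - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(win_len)
    # power spectrum via FFT
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # triangular filters spaced evenly on the Mel scale
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    # log-Mel energies (small floor avoids log(0))
    return np.log(power @ fbank.T + 1e-10)
```

The claim's alternative branch (feeding raw audio directly as the acoustic feature) would simply skip this function and pass the waveform to the model.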
6. A hybrid speech recognition method based on an end-to-end model, comprising:
extracting acoustic features from audio data;
obtaining language model scores of candidate texts corresponding to the acoustic features;
obtaining the posterior probability of each modeling unit for the acoustic features; the modeling units comprise words, single characters, pinyin with or without tones, and phonemes;
weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and sorting according to the weighted scores;
rescoring and reordering the sorted recognition results;
outputting the reordered recognition results;
wherein the acoustic model based on the end-to-end model is constructed by:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure, so as to obtain an encoder of the pure end-to-end model;
inputting the training set into the encoder and decoding to obtain a word graph (lattice) file and a forced-alignment file corresponding to the training set, and performing discriminative training on the encoder with the word graph file and the forced-alignment file, so as to obtain the final acoustic model based on the end-to-end model.
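The weighting, sorting, and reordering steps of claim 6 can be sketched as an n-best rescoring pass: each hypothesis carries an acoustic log-posterior and a language-model log-score, the two are combined with a weight, and the list is re-ranked. This is a hedged illustration, not the patent's decoder: the weight value and the flat n-best representation (the patent operates on a word graph) are simplifying assumptions.

```python
def rescore_nbest(hypotheses, lm_weight=0.5):
    """Combine acoustic and language-model scores and re-rank.
    hypotheses: list of (text, acoustic_logprob, lm_logprob) tuples.
    lm_weight is an illustrative tuning value, not taken from the patent."""
    scored = [(text, ac + lm_weight * lm) for text, ac, lm in hypotheses]
    # higher combined log-score = better hypothesis
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

A word-graph re-estimation module would apply the same score combination to lattice arcs rather than whole hypotheses, but the ranking principle is identical.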
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111041405.6A CN113763939B (en) | 2021-09-07 | 2021-09-07 | Mixed voice recognition system and method based on end-to-end model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113763939A CN113763939A (en) | 2021-12-07 |
CN113763939B true CN113763939B (en) | 2024-04-16 |
Family
ID=78793279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111041405.6A Active CN113763939B (en) | 2021-09-07 | 2021-09-07 | Mixed voice recognition system and method based on end-to-end model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113763939B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | A neural network acoustic model training method
CN110349573A (en) * | 2019-07-04 | 2019-10-18 | 广州云从信息科技有限公司 | A speech recognition method, apparatus, machine-readable medium, and device
CN110942763A (en) * | 2018-09-20 | 2020-03-31 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN111968622A (en) * | 2020-08-18 | 2020-11-20 | 广州市优普科技有限公司 | Attention mechanism-based voice recognition method, system and device |
CN112489635A (en) * | 2020-12-03 | 2021-03-12 | 杭州电子科技大学 | Multi-mode emotion recognition method based on attention enhancement mechanism |
CN112712804A (en) * | 2020-12-23 | 2021-04-27 | 哈尔滨工业大学(威海) | Speech recognition method, system, medium, computer device, terminal and application |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11107463B2 (en) * | 2018-08-01 | 2021-08-31 | Google Llc | Minimum word error rate training for attention-based sequence-to-sequence models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |