CN113763939A - Mixed speech recognition system and method based on end-to-end model - Google Patents
- Publication number: CN113763939A (application CN202111041405.6A)
- Authority: CN (China)
- Prior art keywords: model, acoustic, tones, audio data, training
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/183 — Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
- G10L15/063 — Training
- G10L2015/081 — Search algorithms, e.g. Baum-Welch or Viterbi
Abstract
The invention relates to a hybrid speech recognition system and method based on an end-to-end model, comprising a feature extraction module, a language model, an acoustic model based on the end-to-end model, a decoder, a word graph reestimation module and an output module. The invention uses end-to-end acoustic-language modeling technology to model massive speech data and embeds the encoder network of the end-to-end model into the hybrid speech recognition system as its acoustic model, which not only further improves speech recognition accuracy but also solves the difficulty of customizing a pure end-to-end speech recognition system for a given project. In addition, on the basis of the encoder network of the end-to-end model, the invention continues with discriminative acoustic model training (sMBR, MPE, etc.), further improving recognition accuracy.
Description
Technical Field
The invention belongs to the technical field of speech recognition, and particularly relates to a hybrid speech recognition system and method based on an end-to-end model.
Background
In recent years, with the continuous development of AI technology and computer hardware, the field of speech recognition has advanced rapidly. The speech recognition system framework has gone through three stages. The first stage was template matching; its most representative algorithm is Dynamic Time Warping (DTW), which realizes simple isolated-word recognition by warping two templates in time and computing their similarity. The second stage is the hybrid speech recognition system, which is built on the hidden Markov model (HMM) framework and modularized according to the Bayesian formula into five modules: feature extraction, decoder, language model, acoustic model and post-processing. Feature extraction converts the speech signal from the time domain into frequency-domain features, typically MFCC or FBank. The decoder is generally a static decoder based on weighted finite-state transducers (WFST), which searches the decoding network for the optimal path with the Viterbi algorithm and takes it as the recognition result. The static decoder models the language model, pronunciation dictionary and phonemes of the speech recognition system, expresses them uniformly in WFST form, and then fully optimizes the decoding network using finite-state operations such as composition, determinization and minimization, thereby improving decoding efficiency. The acoustic model evolved from the traditional Gaussian mixture model (GMM) to deep neural networks (structures such as DNN, RNN, LSTM and CNN), and the training loss moved from cross entropy (CE) to Connectionist Temporal Classification (CTC); discriminative training (sMBR, MPE, etc.) further helps the recognition rate.
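The first-stage template-matching approach described above can be sketched in a few lines. The toy templates, the query and the absolute-difference local distance are illustrative assumptions, not part of the patent:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    # D[i][j]: minimal accumulated cost of aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local distance
            D[i][j] = cost + min(D[i - 1][j],      # stretch a in time
                                 D[i][j - 1],      # stretch b in time
                                 D[i - 1][j - 1])  # advance both
    return D[n][m]

# Isolated-word recognition: pick the template with the smallest warped distance.
templates = {"yes": [1.0, 3.0, 4.0, 3.0], "no": [1.0, 1.0, 2.0, 2.0]}
query = [1.0, 3.0, 3.0, 4.0, 3.0]
best = min(templates, key=lambda w: dtw_distance(query, templates[w]))
```

Because the query is a time-stretched copy of the "yes" template, its warped distance is zero and that word wins.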
The third stage is the pure end-to-end speech recognition system, which jointly optimizes the acoustic and language models and completely abandons the HMM framework. It consists of an encoder and a decoder: the encoder learns high-level features of the speech signal, while the decoder learns semantic features and produces the decoding result.
In the related art, pure end-to-end speech recognition systems have developed rapidly: LAS, RNN-T and CT (convolution-augmented Transformer, i.e. Conformer) were proposed in turn. The CT structure considers the global and local characteristics of the speech signal at the same time, and training with a joint CTC/Attention optimization mechanism is stable and achieves good results. However, when deployed and optimized in real projects, the pure end-to-end speech recognition system faces two bottlenecks: first, if the training set does not match the project domain, recognition is poor; second, the recognition rate of particular keywords in the project cannot be optimized quickly.
Disclosure of Invention
In view of the above, the present invention provides a hybrid speech recognition system and method based on an end-to-end model to overcome the above shortcomings of the prior art, namely poor recognition when the training set does not match the project domain, and the inability to quickly optimize the keyword recognition rate within a project.
In order to achieve this purpose, the invention adopts the following technical scheme: a hybrid speech recognition system based on an end-to-end model, comprising a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph reestimation module and an output module;
the feature extraction module is used for extracting acoustic features in the audio data;
the language model is used for acquiring the language model score of the candidate text corresponding to the acoustic features;
the acoustic model based on the end-to-end model is used for acquiring the posterior probability of each modeling unit of the acoustic features; the modeling units include words, single characters, pinyin with or without tones, and tonal phonemes;
the decoder is used for weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
the word graph reestimation module is used for rescoring and reordering the ranked recognition results;
and the output module is used for outputting the reordered recognition result.
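The decoder's weighting step above — combining a language model score with an acoustic posterior for each candidate and ranking by the weighted sum — can be sketched as follows. The candidate texts, log-scores and the `lm_weight` value are hypothetical illustrations; a real system tunes the weight on a development set:

```python
def combine_scores(candidates, lm_weight=0.6):
    """Rank candidate texts by a weighted sum of acoustic and language-model
    log-scores. `candidates` maps text -> (acoustic_logprob, lm_logprob);
    the 0.6 weight is an illustrative assumption."""
    scored = {
        text: (1.0 - lm_weight) * ac + lm_weight * lm
        for text, (ac, lm) in candidates.items()
    }
    # Highest combined log-probability first.
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

nbest = combine_scores({
    "ni hao":    (-2.0, -1.0),
    "ni hao ma": (-2.5, -0.5),
})
```

Here the language model's preference outweighs the slightly worse acoustic score, so "ni hao ma" ranks first.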
Further, the method of constructing the acoustic model based on the end-to-end model comprises:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint Connectionist Temporal Classification (CTC) and attention optimization mechanism to obtain the encoder of the pure end-to-end model;
inputting a training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder with the word graph file and forced alignment file to obtain the final acoustic model based on the end-to-end model.
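The joint CTC/attention optimization named above interpolates a CTC loss with an attention (cross-entropy) loss. As an illustration, the following computes a CTC negative log-likelihood with the standard forward algorithm on a toy two-frame example and combines it with an assumed attention loss; the interpolation weight `lam = 0.3` is a common choice in the literature, not a value fixed by the patent:

```python
import math

NEG_INF = float("-inf")

def log_add(x, y):
    """log(exp(x) + exp(y)) computed stably."""
    if x == NEG_INF:
        return y
    if y == NEG_INF:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """CTC forward algorithm: -log P(labels | log_probs).
    log_probs is a T x V list of per-frame log-probabilities."""
    ext = [blank]
    for l in labels:               # interleave labels with blanks
        ext += [l, blank]
    S = len(ext)
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                      # stay on the same position
            if s > 0:
                a = log_add(a, alpha[s - 1])  # advance one position
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = log_add(a, alpha[s - 2])  # skip over a blank
            new[s] = a + log_probs[t][ext[s]]
        alpha = new
    return -log_add(alpha[S - 1], alpha[S - 2] if S > 1 else NEG_INF)

def joint_loss(ctc_nll, attention_nll, lam=0.3):
    """Joint objective: lam * L_ctc + (1 - lam) * L_attention."""
    return lam * ctc_nll + (1.0 - lam) * attention_nll

# Two frames, vocabulary {blank, 1}, uniform probabilities: the frame-level
# paths collapsing to [1] are "1 1", "blank 1" and "1 blank", so P = 3/4.
lp = [[math.log(0.5), math.log(0.5)]] * 2
nll = ctc_neg_log_likelihood(lp, [1])
total = joint_loss(nll, attention_nll=1.0)
```

In practice both losses come from the shared encoder's two branches; the sketch only fixes the arithmetic of the combination.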
Further, the decoder employs a Viterbi algorithm.
Further, modeling units corresponding to the speech data are generated in advance; the modeling units include words, single characters, pinyin with or without tones, and phonemes.
Further, the pre-labeled audio data is subjected to preprocessing, windowing, FFT and mel filtering to obtain the acoustic features, or the audio data is used directly as the acoustic features.
Further, preprocessing the pre-labeled audio data comprises:
performing noise reduction or amplitude adjustment on the pre-labeled audio data.
An embodiment of the present application provides a hybrid speech recognition method based on an end-to-end model, comprising the following steps:
extracting acoustic features in the audio data;
acquiring a language model score of a candidate text corresponding to the acoustic feature;
obtaining the posterior probability of each modeling unit of the acoustic features, wherein the modeling units include single characters or toned pinyin;
weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
rescoring and reordering the ranked recognition results;
and outputting the reordered recognition result.
Further, the method of constructing the acoustic model based on the end-to-end model comprises:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint Connectionist Temporal Classification (CTC) and attention optimization mechanism to obtain the encoder of the pure end-to-end model;
inputting a training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder with the word graph file and forced alignment file to obtain the final acoustic model based on the end-to-end model.
By adopting the technical scheme, the invention can achieve the following beneficial effects:
the invention provides a mixed voice recognition system and a method based on an end-to-end model, which adopts an acoustic language end-to-end modeling technology to model mass voice data, and takes a coding network of the end-to-end model as an acoustic model to be embedded into the mixed voice recognition system, thereby not only further improving the voice recognition accuracy, but also solving the problem that the pure end-to-end voice recognition system is difficult to customize in the project. In addition, on the basis of the coding network of the end-to-end model, the invention continues to carry out discriminant acoustic model training (SMBR, MPE and the like), thereby further improving the identification accuracy.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of the structure of the end-to-end model-based hybrid speech recognition system according to the present invention;
FIG. 2 is a schematic diagram of the steps of the method of constructing an end-to-end model-based acoustic model according to the present invention;
FIG. 3 is a schematic diagram illustrating the steps of the end-to-end model-based hybrid speech recognition method according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.
A specific end-to-end model-based hybrid speech recognition system and method provided in the embodiments of the present application will be described with reference to the accompanying drawings.
As shown in fig. 1, the hybrid speech recognition system based on an end-to-end model provided in the embodiment of the present application comprises a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph reestimation module and an output module;
the feature extraction module is used for extracting acoustic features in the audio data;
the language model is used for acquiring the language model score of the candidate text corresponding to the acoustic features;
the acoustic model based on the end-to-end model is used for acquiring the posterior probability of each modeling unit of the acoustic features; the modeling units include words, single characters, pinyin with or without tones, and tonal phonemes;
the decoder is used for weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
the word graph reestimation module is an optional module used for rescoring and reordering the ranked recognition results;
and the output module is used for outputting the reordered recognition result.
Preferably, the single characters or toned pinyin corresponding to the speech data are modeled in advance to generate a plurality of modeling units; the modeling units include words, single characters, pinyin with or without tones, and phonemes.
The working principle of the hybrid speech recognition system based on the end-to-end model is as follows. The feature extraction module extracts acoustic features from the audio data. The language model acquires the language model score of the candidate text corresponding to the acoustic features. The acoustic model based on the end-to-end model acquires the posterior probability of each modeling unit of the acoustic features; the modeling units include, but are not limited to, single characters and toned pinyin, where a toned pinyin syllable consists of an initial, a final and a tone. The decoder weights the language model scores and the posterior probabilities of the corresponding modeling units and then searches and ranks according to the weighted scores. The word graph reestimation module rescores and reorders the ranked recognition results, and the output module outputs the reordered recognition result; if the search ranking is already accurate, the ranked result can be output directly. Other modeling units may also be used; the present application is not limited in this respect.
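The initial/final/tone structure of toned pinyin mentioned above can be illustrated with a simple parser. The initial table here is abbreviated and the function is a hypothetical helper for illustration, not a component of the patent:

```python
# Longest initials first so that 'zh'/'ch'/'sh' match before 'z'/'c'/'s'.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split a toned pinyin syllable such as 'zhong1' into (initial, final, tone).
    Tone 0 marks a missing tone digit; only regular cases are covered."""
    tone = 0
    if syllable and syllable[-1].isdigit():
        tone = int(syllable[-1])
        syllable = syllable[:-1]
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):], tone
    return "", syllable, tone  # zero-initial syllable such as 'an'
```

For example, `split_pinyin("zhong1")` yields the initial "zh", the final "ong" and tone 1; a production system would use a full syllable table instead.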
Preferably, the pre-labeled audio data may be subjected to preprocessing, windowing, FFT and mel filtering to obtain the acoustic features, or the audio data may be used directly as the acoustic features.
Preferably, preprocessing the pre-labeled audio data comprises:
performing noise reduction or amplitude adjustment on the pre-labeled audio data.
Specifically, the audio data needs to be processed to obtain its acoustic features. The processing may use the prior art; for example, the acoustic features of the speech to be recognized are extracted through preprocessing, windowing, FFT, mel filtering and similar steps. The preprocessing may be noise reduction or amplitude adjustment.
Both the language model score and the acoustic model's posterior probability can be expressed as scores; the posterior probability is one of the basic concepts of information theory. In a communication system, the probability known at the receiving end after a message has been received is the posterior probability. After the decoder weights the two scores, the scores of several candidate texts are obtained and sorted from high to low; the word graph reestimation module can then rescore the weighted results and, if the ranking is not yet accurate, reorder them. The word graph reestimation module may adopt a model from the prior art and be implemented with the prior art, so it is not described further here.
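The reestimation-and-reordering step can be approximated on a flat n-best list (a real word graph module walks the lattice instead); the candidate scores and the second-pass language model below are invented for illustration:

```python
def rescore_nbest(nbest, second_pass_lm, weight=0.5):
    """Second-pass reestimation on a flat n-best list: interpolate first-pass
    combined scores with a stronger language model's log-probabilities.
    The interpolation weight is an illustrative assumption."""
    rescored = [
        (text, (1.0 - weight) * score + weight * second_pass_lm.get(text, float("-inf")))
        for text, score in nbest
    ]
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

# First-pass ranking (best first) and a second-pass LM that prefers "ni hao".
first_pass = [("ni hao ma", -1.3), ("ni hao", -1.4)]
lm_scores = {"ni hao": -0.2, "ni hao ma": -2.0}
reordered = rescore_nbest(first_pass, lm_scores)
```

The stronger second-pass model overturns the first-pass winner, which is exactly the reordering behavior the module provides.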
The invention uses end-to-end acoustic-language modeling technology to model massive speech data and embeds the encoder network of the end-to-end model into the hybrid speech recognition system as its acoustic model, which not only further improves speech recognition accuracy but also solves the difficulty of customizing a pure end-to-end speech recognition system for a given project. In addition, on the basis of end-to-end model training, the invention further performs discriminative acoustic model training (sMBR, MPE, etc.), further improving recognition accuracy.
Preferably, as shown in fig. 2, the method for constructing an acoustic model based on an end-to-end model includes:
S101, extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint Connectionist Temporal Classification (CTC) and Attention optimization mechanism to obtain the encoder of the pure end-to-end model;
S102, inputting a training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the encoder with the word graph file and forced alignment file to obtain the final acoustic model based on the end-to-end model.
Preferably, the decoder employs a Viterbi algorithm.
Specifically, the implementation steps of the hybrid speech recognition system based on the end-to-end model are as follows:
1. An end-to-end model with its associated objective function is used to model the speech signal. Acoustic features for model training are extracted from audio data collected and labeled in advance (for example, by traditional signal processing: preprocessing, windowing, FFT, mel filtering and similar steps). The acoustic features serve as the input for model training and the labeled text as the training target; model parameters are trained by deep learning on massive data, yielding a usable end-to-end acoustic-language model.
2. The encoder of the acoustic-language end-to-end model trained in step 1 is extracted as the acoustic model of the hybrid speech recognition system. Decoding with this acoustic model yields the word graph file and forced alignment file corresponding to the training set, on which discriminative training (e.g. sMBR, MPE) is performed.
3. The acoustic model based on the end-to-end model trained in step 2 is extracted as the acoustic model of the final end-to-end model-based hybrid speech recognition system and used to calculate posterior probabilities. The input speech signal undergoes the same acoustic feature extraction as in step 1 and is fed to the language model and the end-to-end acoustic model; the acoustic model outputs, for each frame, the posterior probabilities of all modeling units, and the language model outputs the language model score of the text.
4. A decoder based on the Viterbi algorithm combines the posterior probabilities of the modeling units with the language model score and searches the decoding network for the highest-scoring path as the recognition result.
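The Viterbi search of step 4 can be sketched on a toy two-state model. The emission, transition and initial log-probabilities are illustrative assumptions; a real decoder applies the same recursion over a WFST decoding network:

```python
def viterbi(obs_logprobs, trans_logprobs, init_logprobs):
    """Highest-scoring state path through a small HMM (log-domain Viterbi)."""
    T, N = len(obs_logprobs), len(init_logprobs)
    score = [init_logprobs[s] + obs_logprobs[0][s] for s in range(N)]
    backptrs = []
    for t in range(1, T):
        new, ptrs = [], []
        for s in range(N):
            prev = max(range(N), key=lambda p: score[p] + trans_logprobs[p][s])
            new.append(score[prev] + trans_logprobs[prev][s] + obs_logprobs[t][s])
            ptrs.append(prev)
        score = new
        backptrs.append(ptrs)
    # Trace the best final state back to the start.
    state = max(range(N), key=lambda s: score[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return list(reversed(path))

# Toy two-state example: emissions favor state 0, then 1, then 0.
init = [0.0, -10.0]
trans = [[-1.0, -1.0], [-1.0, -1.0]]   # uniform transitions in the log domain
obs = [[0.0, -5.0], [-5.0, 0.0], [0.0, -5.0]]
path = viterbi(obs, trans, init)
```

With uniform transitions the emissions dominate, so the recovered path follows the per-frame preferences.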
As shown in fig. 3, an embodiment of the present application provides a hybrid speech recognition method based on an end-to-end model, including:
s201, extracting acoustic features in audio data;
s202, acquiring language model scores of candidate texts corresponding to the acoustic features;
S203, obtaining the posterior probability of each modeling unit of the acoustic features; the modeling units include, but are not limited to, words, single characters, pinyin with or without tones, and phonemes;
S204, weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
S205, rescoring and reordering the ranked recognition results;
and S206, outputting the recognition result after reordering.
The working principle of the hybrid speech recognition method based on the end-to-end model provided by the embodiment of the application is as follows: acoustic features are extracted from the audio data; the language model score of the candidate text corresponding to the acoustic features is acquired; the posterior probability of each modeling unit of the acoustic features is obtained, where the modeling units include words, single characters, pinyin with or without tones, and tonal phonemes; the language model scores and the posterior probabilities of the corresponding modeling units are weighted and ranked; the ranked recognition results are rescored and reordered; and the reordered recognition result is output.
Preferably, the method of constructing the acoustic model based on the end-to-end model comprises:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model with a joint Connectionist Temporal Classification (CTC) and Attention optimization mechanism to obtain the encoder of the pure end-to-end model;
inputting a training set into the encoder, decoding to obtain the word graph file and forced alignment file corresponding to the training set, and performing discriminative training on the acoustic-language end-to-end model with the word graph file and forced alignment file to obtain the acoustic model based on the end-to-end model.
The embodiment of the application provides computer equipment, which comprises a processor and a memory connected with the processor;
the memory is used for storing a computer program used for executing the hybrid speech recognition system based on the end-to-end model provided by any one of the embodiments;
the processor is used to call and execute the computer program in the memory.
In summary, the present invention provides a hybrid speech recognition system and method based on an end-to-end model, comprising a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph reestimation module and an output module. The invention uses end-to-end acoustic-language modeling technology to model massive speech data and embeds the encoder network of the end-to-end model into the hybrid speech recognition system as its acoustic model, which not only further improves speech recognition accuracy but also solves the difficulty of customizing a pure end-to-end speech recognition system for a given project. In addition, on the basis of end-to-end model training, the invention further performs discriminative acoustic model training (sMBR, MPE, etc.), further improving recognition accuracy.
It is to be understood that the system embodiments provided above correspond to the method embodiments described above, and corresponding specific contents may be referred to each other, which are not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of systems, devices (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description covers only specific embodiments of the present invention, but the scope of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (8)
1. A hybrid speech recognition system based on an end-to-end model, comprising: the system comprises a feature extraction module, a language model, an acoustic model based on an end-to-end model, a decoder, a word graph reestimation module and an output module;
the feature extraction module is used for extracting acoustic features in the audio data;
the language model is used for acquiring the language model score of the candidate text corresponding to the acoustic features;
the acoustic model based on the end-to-end model is used for acquiring the posterior probability of each modeling unit of the acoustic features; the modeling units include words, single characters, pinyin with or without tones, and tonal phonemes;
the decoder is used for weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
the word graph reestimation module is used for rescoring and reordering the ranked recognition results;
and the output module is used for outputting the reordered recognition result.
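The decoder's score combination in claim 1 can be illustrated with a minimal sketch: acoustic and language-model log-scores are weighted and summed per hypothesis, and candidates are ranked by the combined score. The weights and candidate scores below are hypothetical values, not taken from the patent.

```python
def rank_hypotheses(candidates, lm_weight=0.5, am_weight=1.0):
    """Weight acoustic-model and language-model log-scores, then rank.

    candidates: list of (text, am_logprob, lm_logprob) tuples.
    Returns (text, combined_score) pairs sorted best-first.
    """
    scored = [(text, am_weight * am + lm_weight * lm)
              for text, am, lm in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 2-best list: (candidate text, acoustic log-prob, LM log-prob).
hyps = [("hello world", -12.0, -4.0),
        ("hollow word", -11.5, -9.0)]
ranking = rank_hypotheses(hyps)
```

Here the acoustically slightly worse "hello world" wins because the language model strongly prefers it, which is the point of weighting the two scores before searching and ranking.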
2. The system of claim 1, wherein the method of constructing the acoustic model based on the end-to-end model comprises:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model using a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure to obtain the encoder of the pure end-to-end model;
inputting a training set into the encoder and decoding to obtain a word-graph file and a forced-alignment file corresponding to the training set, then performing discriminative training on the encoder with the word-graph file and the forced-alignment file to obtain the final acoustic model based on the end-to-end model.
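The joint optimization of connectionist temporal classification (CTC) and an attention structure described in claim 2 is commonly realized as a weighted interpolation of the two training losses. A minimal sketch; the loss values and the 0.3 interpolation weight are assumptions for illustration, not values from the patent:

```python
def joint_ctc_attention_loss(ctc_loss, attention_loss, ctc_weight=0.3):
    """Interpolate the CTC and attention-decoder losses for joint training.

    ctc_weight=0.3 is a commonly used setting, assumed here; the patent
    does not specify the interpolation weight.
    """
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss

# Hypothetical per-batch loss values from the two branches.
loss = joint_ctc_attention_loss(ctc_loss=2.0, attention_loss=1.0)
```

The CTC branch enforces monotonic alignment between audio frames and labels, while the attention branch models label dependencies; interpolating the losses trains one encoder to satisfy both.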
3. The system of claim 1,
the decoder employs a Viterbi algorithm.
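The Viterbi algorithm named in claim 3 can be sketched on a toy hidden Markov model; the two weather states and the transition and emission probabilities below are a standard textbook example, not data from the patent.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for an observation sequence."""
    # trellis[t][s] = (best probability of a path ending in state s at time t,
    #                  predecessor state on that path)
    trellis = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        trellis.append({})
        for s in states:
            prob, prev = max(
                (trellis[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states)
            trellis[t][s] = (prob, prev)
    # Backtrack from the best final state.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = trellis[t][state][1]
        path.append(state)
    return path[::-1]

states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
```

In the recognition system the same dynamic program runs over the decoding graph, with acoustic posteriors and language-model scores supplying the probabilities.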
4. The system of claim 1, wherein
modeling units corresponding to the speech data are generated in advance; the modeling units comprise words, single characters, pinyin with or without tones, and phonemes.
5. The system of claim 1, wherein
the pre-labeled audio data are preprocessed, windowed, transformed by an FFT (fast Fourier transform), and passed through a Mel filter bank to obtain the acoustic features; or the audio data are used directly as the acoustic features.
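The windowing, FFT, and Mel-filter steps of claim 5 can be sketched as follows. The 25 ms frame, 10 ms hop, and 40 filters are common defaults assumed for illustration, not values stated in the patent.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def log_mel_features(signal, sr=16000, frame_len=400, hop=160, n_mels=40):
    """Frame the signal, apply a Hamming window and FFT, then Mel-filter and log."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    fbank = mel_filterbank(n_mels, frame_len, sr)
    feats = np.empty((n_frames, n_mels))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        feats[i] = np.log(fbank @ power + 1e-10)  # floor avoids log(0)
    return feats

# Illustrative input: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
feats = log_mel_features(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

When raw audio is used directly as the feature (the claim's alternative), these steps are skipped and the waveform itself is fed to the encoder.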
6. The system of claim 5, wherein preprocessing the pre-labeled audio data comprises:
performing noise reduction or amplitude adjustment on the pre-labeled audio data.
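A minimal sketch of the amplitude adjustment in claim 6, with a crude gate standing in for noise reduction. The 0.9 peak target and 0.01 noise floor are assumed parameters; real systems typically use spectral noise-reduction rather than this simple gate.

```python
def preprocess(samples, target_peak=0.9, noise_floor=0.01):
    """Peak-normalize the waveform, then zero samples below a crude noise floor."""
    peak = max(abs(s) for s in samples) or 1.0  # avoid division by zero on silence
    scaled = [s * target_peak / peak for s in samples]
    return [0.0 if abs(s) < noise_floor else s for s in scaled]

cleaned = preprocess([0.5, -0.25, 0.001])
```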
7. A hybrid speech recognition method based on an end-to-end model, comprising the following steps:
extracting acoustic features from the audio data;
obtaining a language model score for each candidate text corresponding to the acoustic features;
obtaining the posterior probability of each modeling unit for the acoustic features, the modeling units comprising words, single characters, pinyin with or without tones, and toned phonemes;
weighting the language model scores and the posterior probabilities of the corresponding modeling units, and then searching and ranking according to the weighted scores;
re-estimating and reordering the ranked recognition results;
and outputting the reordered recognition results.
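The re-estimation and reordering step of claim 7 amounts to re-scoring a ranked hypothesis list with additional knowledge, such as a stronger language model applied over the word graph. A minimal n-best sketch; the toy language model and the 0.7 weight are hypothetical:

```python
def rescore_nbest(nbest, strong_lm, lm_weight=0.7):
    """Add a stronger LM's log-score to each hypothesis and re-rank."""
    rescored = [(text, base + lm_weight * strong_lm(text))
                for text, base in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Hypothetical first-pass 2-best list with combined first-pass log-scores.
nbest = [("hello there world", -1.0), ("hello world", -1.2)]
# Toy "stronger LM": prefers shorter hypotheses (a stand-in for a real model).
strong_lm = lambda text: -float(len(text.split()))
reranked = rescore_nbest(nbest, strong_lm)
```

In practice the second pass walks the word-graph (lattice) rather than a flat n-best list, but the scoring principle is the same.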
8. The method of claim 7, wherein the method of constructing the acoustic model based on the end-to-end model comprises:
extracting acoustic features from pre-labeled audio data, taking the acoustic features and the corresponding modeling units as input, and training a pre-constructed pure end-to-end model using a joint optimization mechanism of connectionist temporal classification (CTC) and an attention structure to obtain the encoder of the pure end-to-end model;
inputting a training set into the encoder and decoding to obtain a word-graph file and a forced-alignment file corresponding to the training set, then performing discriminative training on the encoder with the word-graph file and the forced-alignment file to obtain the final acoustic model based on the end-to-end model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111041405.6A CN113763939B (en) | 2021-09-07 | 2021-09-07 | Mixed voice recognition system and method based on end-to-end model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113763939A true CN113763939A (en) | 2021-12-07 |
CN113763939B CN113763939B (en) | 2024-04-16 |
Family
ID=78793279
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111041405.6A Active CN113763939B (en) | 2021-09-07 | 2021-09-07 | Mixed voice recognition system and method based on end-to-end model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113763939B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114564564A (en) * | 2022-02-25 | 2022-05-31 | 山东新一代信息产业技术研究院有限公司 | Hot word enhancement method, equipment and medium for voice recognition |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | A kind of neutral net acoustic training model method |
CN110349573A (en) * | 2019-07-04 | 2019-10-18 | 广州云从信息科技有限公司 | A kind of audio recognition method, device, machine readable media and equipment |
US20200043483A1 (en) * | 2018-08-01 | 2020-02-06 | Google Llc | Minimum word error rate training for attention-based sequence-to-sequence models |
CN110942763A (en) * | 2018-09-20 | 2020-03-31 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
CN111968622A (en) * | 2020-08-18 | 2020-11-20 | 广州市优普科技有限公司 | Attention mechanism-based voice recognition method, system and device |
CN112489635A (en) * | 2020-12-03 | 2021-03-12 | 杭州电子科技大学 | Multi-mode emotion recognition method based on attention enhancement mechanism |
CN112712804A (en) * | 2020-12-23 | 2021-04-27 | 哈尔滨工业大学(威海) | Speech recognition method, system, medium, computer device, terminal and application |
Also Published As
Publication number | Publication date |
---|---|
CN113763939B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111429889B (en) | Method, apparatus, device and computer readable storage medium for real-time speech recognition based on truncated attention | |
CN110534095B (en) | Speech recognition method, apparatus, device and computer readable storage medium | |
CN109410914B (en) | Method for identifying Jiangxi dialect speech and dialect point | |
CN111210807B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN111145729B (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN104681036B (en) | A kind of detecting system and method for language audio | |
CN109065032B (en) | External corpus speech recognition method based on deep convolutional neural network | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN111916064A (en) | End-to-end neural network speech recognition model training method | |
CN116303966A (en) | Dialogue behavior recognition system based on prompt learning | |
CN113763939B (en) | Mixed voice recognition system and method based on end-to-end model | |
Sakamoto et al. | Stargan-vc+ asr: Stargan-based non-parallel voice conversion regularized by automatic speech recognition | |
Bhatta et al. | Nepali speech recognition using CNN, GRU and CTC | |
Zhou et al. | Extracting unit embeddings using sequence-to-sequence acoustic models for unit selection speech synthesis | |
KR101727306B1 (en) | Languange model clustering based speech recognition apparatus and method | |
Tasnia et al. | An overview of bengali speech recognition: Methods, challenges, and future direction | |
Banjara et al. | Nepali speech recognition using cnn and sequence models | |
Bhatia et al. | Speech-to-text conversion using GRU and one hot vector encodings | |
Savitha | Deep recurrent neural network based audio speech recognition system | |
Yakubovskyi et al. | Speech Models Training Technologies Comparison Using Word Error Rate | |
Ghadekar et al. | ASR for Indian regional language using Nvidia’s NeMo toolkit | |
CN112530414B (en) | Iterative large-scale pronunciation dictionary construction method and device | |
US20240119924A1 (en) | Techniques for improved audio processing using acoustic and language identification models | |
Chauhan et al. | Speech Recognition System-Review | |
SUDHAKARAN et al. | AN END-TO-END DEEP LEARNING APPROACH FOR AN INDIAN ENGLISH REPOSITORY |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||