CN117456987B - Voice recognition method and system - Google Patents
Voice recognition method and system
- Publication number
- CN117456987B (application CN202311627049.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- offline
- training
- scenes
- voice recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/05—Word boundary detection
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- Y02T10/40—Engine management systems
Abstract
The invention discloses a voice recognition method comprising the following steps: collecting training voice data for a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline voice recognition models for the plurality of scenes; evaluating the offline voice recognition models of the plurality of scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model; and, in response to voice recognition instructions from the scenes, outputting scene voice recognition results by matching them against the associated optimal offline voice recognition model. The disclosed method and system can perform high-accuracy voice recognition across a number of different voice scenes, and address the poor recognition accuracy, low robustness and poor user experience of existing voice recognition models.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and system.
Background
With the development of speech recognition technology, speech-recognition-based applications have become increasingly widespread and now reach into home life, office work, entertainment and other areas. Speech is captured through an external or built-in microphone on a mobile phone, smart computer, notebook computer, learning machine, smart-home terminal and the like, and speech-to-text conversion is then completed by a speech recognition model embedded in the input device.
At present, the recognition accuracy of speech recognition models is a key concern. In certain everyday voice scenes, such as gymnasiums, beauty parlors, kitchens and schools, a real-time network connection is often unavailable, so there is a particular need for offline speech recognition. In these offline scenarios, if high-accuracy recognition results cannot be provided, the user's site of use is disturbed and the interactive experience is greatly degraded.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice recognition method and a voice recognition system, which can perform high-precision voice recognition in a plurality of different voice scenes, and solve the problems of poor recognition precision, low robustness and poor user experience of the existing voice recognition model.
To solve the above technical problem, a first aspect of the present invention discloses a speech recognition method, which includes: collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline voice recognition model of a plurality of scenes; evaluating the offline speech recognition models of the scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model; and responding to the voice recognition instructions of the scenes, and outputting scene voice recognition results by matching the associated optimal offline voice recognition models.
In some embodiments, training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models for a plurality of scenes comprises: the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; parsing the offline instruction training set to generate a speech signal sequence; and using the speech signal sequence processed by the Baum-Welch algorithm as the initial training parameters to construct the offline speech recognition models for the plurality of scenes.
In some embodiments, constructing the offline speech recognition models for the plurality of scenes using the speech signal sequence processed by the Baum-Welch algorithm as the initial model parameters comprises: calculating M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maxima as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene; and generating a plurality of hidden state sequences by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and performing iterative training with the hidden state sequences to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the number of hidden states of the speech recognition model is different for each scene.
In some embodiments, training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models for a plurality of scenes further comprises: decoding each offline instruction training set to generate speech points, and performing posterior-probability lattice pruning on the speech points using maximum mutual information estimation to generate a culled offline instruction training set.
In some embodiments, the offline speech recognition models for the plurality of scenes are evaluated with a preset accuracy evaluation system to determine the optimal offline speech recognition model, where the preset accuracy evaluation system is implemented as follows: calculating the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene; calculating the cross entropy of these hit probabilities, and determining the perplexity of the offline instruction training set of each target scene from the cross entropy; and forming the accuracy evaluation system from the hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene.
In a second aspect, the invention discloses a speech recognition system, the system comprising: a preprocessing module, configured to collect training voice data for a plurality of target scenes and preprocess the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; offline voice recognition models for the multiple scenes, generated by training on the offline instruction training set with an extended Baum-Welch algorithm; an optimal offline voice recognition model, determined by evaluating the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system; and a voice recognition output module, configured to respond to voice recognition instructions from the multiple scenes and output scene voice recognition results by matching against the associated optimal offline voice recognition model.
In some implementations, in the offline speech recognition models for the multiple scenarios: the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the offline speech recognition models for the multiple scenarios are implemented by: calculating M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maxima as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene; and generating a plurality of hidden state sequences by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and performing iterative training with the hidden state sequences to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the number of hidden states of the speech recognition model is different for each scene.
In some embodiments, for the optimal offline speech recognition model, the preset accuracy evaluation system is implemented as: calculating the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene; calculating the cross entropy of these hit probabilities, and determining the perplexity of the offline instruction training set of each target scene from the cross entropy; and forming the accuracy evaluation system from the hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene.
A third aspect of the invention discloses a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a speech recognition method as described.
The fourth aspect of the present invention discloses a speech recognition apparatus, the apparatus comprising: a memory storing executable program code; a processor coupled to the memory; the processor invokes the executable program code stored in the memory to perform the speech recognition method.
Compared with the prior art, the invention has the beneficial effects that:
By implementing the invention, voice recognition models suited to different environments can be constructed for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the characteristics of the different offline voice scenes are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, so that the voice recognition model is under strict accuracy control from the very start of training, which benefits the accuracy of subsequent offline voice recognition in real scenes. In addition, to further control the recognition accuracy of different scenes in the offline state, the invention evaluates the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model. The invention therefore not only performs accuracy control in the early stage of model training, but also screens out interference factors again after the voice recognition model has been generated, greatly improving the accuracy and robustness of the offline voice recognition models in the different target scenes and greatly improving the user's interactive experience.
Drawings
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 4 is an interactive schematic diagram of a speech recognition system according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention.
Detailed Description
For a better understanding and implementation, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the invention disclose a voice recognition method and system that can construct voice recognition models suited to different environments for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the characteristics of the different offline voice scenes are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, so that the voice recognition model is under strict accuracy control from the very start of training, which benefits the accuracy of subsequent offline voice recognition in real scenes. In addition, to further control the recognition accuracy of different scenes in the offline state, the invention evaluates the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model. The invention therefore not only performs accuracy control in the early stage of model training, but also screens out interference factors again after the voice recognition model has been generated, greatly improving the accuracy and robustness of the offline voice recognition models in the different target scenes and greatly improving the user's interactive experience.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a voice recognition method according to an embodiment of the invention. The voice recognition method can be applied to an intelligent home system, an intelligent gymnasium fitness system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition method. As shown in fig. 1, the speech recognition method may include the following operations:
step 101, training voice data of a plurality of target scenes are collected, and preprocessing is performed on the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set.
In today's diverse smart-voice use scenes, the speech data received differs from scene to scene. In a gymnasium, speech related to exercise and training occurs frequently; in a school classroom, speech related to education occurs frequently; in a kitchen, speech related to cooking occurs frequently. Across these scenes, not only do the keywords and their frequencies differ, the background noise is also a major source of interference. The training voice data of each target scene is acquired through that scene's collection devices, such as voice acquisition equipment and microphone/loudspeaker units, and consists of training data gathered, before development of the voice recognition model, according to experience of the speech characteristics of each scene in its various states.
After the training voice data is obtained, it is preprocessed in a differentiated way according to the characteristics of each scene. Differentiation here means selectively denoising and purifying the training voice data in light of the scene characteristics. The speech signal in the training voice data is a non-stationary, time-varying signal that is approximately stationary only over short intervals; it carries many kinds of information, including noise introduced by the various scenes, and may contain silent segments, for example the frequent silences in a classroom. Applying a single, one-size-fits-all denoising procedure to all of the training voice data therefore ignores the scene characteristics and can lead to later recognition errors: the data may retain too much redundant information, or over-processing may strip out key recognition information, and if such data is used directly to train the subsequent voice recognition model, the computation is difficult and recognition efficiency is low. Accordingly, before training, the training voice data is preprocessed by framing, windowing, noise reduction, endpoint detection and similar steps according to the different scene characteristics. This is implemented as follows: the feature parameters for speech recognition in each scene are first extracted, keeping the features useful for recognition in that scene and removing features that convey non-lexical information such as emphasis and emotion. Feature extraction also minimizes the effect of variation caused by the speaker and by recording conditions. Features within a scene should not be strongly correlated, to avoid redundant model parameters; illustratively, this can be realized with an MFCC feature extraction method customized per scene. Based on the different feature parameters, corresponding preprocessing chains are assigned, and scene background noise is divided into 5 processing levels, with the heaviest to lightest noise defined as noise intensity levels 1 to 5. Level 1 noise intensity uses framing, windowing, noise reduction, endpoint detection and pre-emphasis; level 2 uses windowing, noise reduction, endpoint detection and pre-emphasis; level 3 uses noise reduction, endpoint detection and pre-emphasis; level 4 uses endpoint detection and pre-emphasis; level 5 uses pre-emphasis only.
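The mapping from noise level to preprocessing chain described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper functions (frame_signal, apply_window, reduce_noise, detect_endpoints) are hypothetical placeholders standing in for a real speech front end, and only the pre-emphasis filter is spelled out.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    # Standard first-order pre-emphasis: y[t] = x[t] - alpha * x[t-1]
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Hypothetical placeholders for the remaining stages; real implementations
# would come from a speech front-end library.
def frame_signal(signal):     return signal   # framing
def apply_window(signal):     return signal   # windowing (e.g. Hamming)
def reduce_noise(signal):     return signal   # noise reduction
def detect_endpoints(signal): return signal   # endpoint detection (VAD)

# Level 1 = heaviest background noise, level 5 = lightest, as described above.
PIPELINES = {
    1: [frame_signal, apply_window, reduce_noise, detect_endpoints, pre_emphasize],
    2: [apply_window, reduce_noise, detect_endpoints, pre_emphasize],
    3: [reduce_noise, detect_endpoints, pre_emphasize],
    4: [detect_endpoints, pre_emphasize],
    5: [pre_emphasize],
}

def preprocess(signal, noise_level):
    """Apply the scene-specific preprocessing chain to one training utterance."""
    for step in PIPELINES[noise_level]:
        signal = step(signal)
    return signal
```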
Each piece of preprocessed training voice data can be paired with corresponding text information indicating the specific instruction content contained in the audio; this text information can be taken as the voice recognition result.
And 102, training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline speech recognition model of a plurality of scenes.
Model training for speech recognition can use supervised or unsupervised learning. In supervised learning the training data set must consist of labeled samples, the recognition result being represented by a label attached to the data to be recognized; an unsupervised method only needs to analyze the data set itself, with no labels prepared in advance. Manually labeling training data tends to be costly, especially in the field of speech signal recognition. The offline instruction training set generated in step 101 of this embodiment omits the labeling process, and in this step the training of the speech recognition model is completed with an unsupervised method. The Baum-Welch unsupervised learning algorithm is therefore adopted: an initial estimate of the model parameters is made first, then the fitness of the parameters is evaluated against the given training data set, the error caused by the current model parameters is reduced, and the parameters are updated continually so that the error with respect to the given training data keeps shrinking, finally yielding the model parameters that are optimal for the given training data set.
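For orientation, the iterative re-estimation loop just described (initialize parameters, re-estimate against the training set, repeat until the likelihood stops improving) can be sketched roughly as below. This is a schematic outline only, assuming generic e_step/m_step helpers for the forward-backward statistics and parameter updates; it is not the patent's exact procedure.

```python
def baum_welch_train(observations, init_params, max_iters=50, tol=1e-4):
    """Schematic EM loop: re-estimate HMM parameters until the likelihood converges."""
    params = init_params              # lambda = (A, B, pi), however initialized
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step: forward-backward pass yields expected state/transition counts
        stats, log_likelihood = e_step(params, observations)   # assumed helper
        # M-step: update A, B, pi from the accumulated statistics
        params = m_step(stats)                                   # assumed helper
        if log_likelihood - prev_ll < tol:    # stop once improvement is negligible
            break
        prev_ll = log_likelihood
    return params
```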
However, in model training under the Baum-Welch algorithm, repeated experiments show that the choice of initial parameters not only affects the convergence speed of the speech recognition algorithm but also determines whether the model finally converges to the global optimum, and therefore affects the accuracy of the speech recognition model. An improvement that extends the Baum-Welch algorithm was therefore developed. Taking an HMM as the model for illustration, in the traditional Baum-Welch algorithm the HMM is specified by the parameters lambda = (A, B, pi). A, the state transition probability matrix (Transition Probability Matrix), gives the probability of moving from one hidden state to another; element A(i, j) is the probability of transitioning from hidden state i to hidden state j. B, the observation probability matrix (Observation Probability Matrix), gives the probability of generating each observation symbol in each hidden state; element B(i, j) is the probability of generating observation symbol j in hidden state i. Pi, the initial state probability vector (Initial State Probability Vector), gives the probability distribution over the hidden states at the initial time; element pi(i) is the probability that the model is in hidden state i at the initial moment. For the initial values of A, B and pi, an averaging method is generally adopted: the speech signal sequence O = [o1, o2, ..., oT] is first divided into N segments, N being the number of hidden states; each small segment of speech is then assigned to one state of the HMM, segment-wise K-means clustering is performed, and the weight, mean, variance and other parameters are computed for each class. This traditional way of choosing initial model parameters simply divides the speech signal into equal parts; the division is coarse and does not reflect the characteristics of the training speech data under different scene conditions. The pronunciation state of a speech signal is closely related to each pronounced phoneme: different short speech segments representing the same phoneme show strong correlation, while speech segments corresponding to adjacent phonemes are only weakly correlated. The selection of the initial model parameters can therefore be improved using the correlation between adjacent frames of the speech signal. Here, the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes. The specific implementation steps are as follows:
First, the M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data are calculated. If the sequence has M frames, M-1 distance values are obtained in total, recorded as D = [d1, d2, ..., d(M-1)]. Assuming the sequence of feature values is {x1, x2, ..., xM}, then:
dn = |x(n+1) - xn|, n = 1, 2, ..., M-1;
Thereafter, the N largest values are extracted from the M-1 distance values in D, and the speech signal frames corresponding to these maxima are taken as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene;
Finally, a plurality of hidden state sequences are generated by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and iterative training is performed with these hidden state sequences to construct the offline speech recognition models for the plurality of scenes. In this way the initial training parameters are divided according to the characteristics of the signal. The number of hidden states differs between the speech recognition models of the different scenes and can be set according to the different recognition accuracy requirements, so that the model training targets of the different scene requirements are met.
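The initial-parameter selection described in the steps above, cutting the frame sequence where adjacent-frame feature distances are largest, can be sketched as below. It is a minimal illustration under assumptions not fixed by the text: each frame is represented by an MFCC-style feature vector, the Euclidean distance is used, and N - 1 internal cuts are taken so that exactly N segments result (the text speaks of N maxima; this sketch adopts the reading that yields N segments).

```python
import numpy as np

def initial_state_segments(features, n_states):
    """Split an (M, dim) matrix of frame features into n_states segments,
    cutting at the frames where the adjacent-frame distance is largest."""
    # M-1 distances between feature vectors of adjacent frames
    d = np.linalg.norm(features[1:] - features[:-1], axis=1)
    # indices of the largest distances serve as segment boundaries
    cuts = np.sort(np.argsort(d)[-(n_states - 1):]) + 1
    segments = np.split(features, cuts)
    # each segment is then associated with one hidden state of the scene's model
    return segments
```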
And step 103, evaluating the offline speech recognition models of the multiple scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model.
After the offline speech recognition models for the multiple scenes have been built, machine noise may have been introduced during training by processes such as machine learning and neural-network learning, so accuracy control is performed once more after the models are output; in this embodiment a preset accuracy evaluation system is adopted for this judgment. The preset accuracy evaluation system is realized as follows. First, the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene is calculated, where a target speech word is a word that fits the current scene, summarized from experience; target speech words can also be obtained by excluding irrelevant words. Illustratively, in a classroom scene, kitchen-scene phrases such as "I want to wash the dishes" and "where is the soy sauce" are excluded. Under a k-gram language model, the hit probability of the target speech words of the speech data can be calculated as follows. If the test set T of the speech recognition model contains the sentences (s1, s2, ..., sn), each sentence in the test set is evaluated and the results are multiplied:
P(T) = p(s1) * p(s2) * ... * p(sn)
Thereafter, the cross entropy Hp(T) of the hit probabilities of the target speech words of each piece of speech data is calculated, and the perplexity of the offline instruction training set of each target scene is determined from this cross entropy:
Hp(T) = -(1 / WT) * log2 P(T)
where WT denotes the length (word count) of the test text T. The perplexity PPp(T) of the test text is then obtained from the cross entropy:
PPp(T) = 2^Hp(T) = P(T)^(-1/WT)
The hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene together form the accuracy evaluation system. Combining the formulas above shows that the probability P(T) of the test text is inversely related to the perplexity and the cross entropy: when P(T) reaches its maximum, the perplexity and cross entropy are lowest, and the speech recognition model satisfying this condition under the accuracy evaluation system is the optimal offline speech recognition model.
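A minimal sketch of the evaluation quantities above — per-sentence probability product, cross entropy over the test text, and the resulting perplexity — is given below. It assumes sentence probabilities come from a k-gram language model scored elsewhere; the sentence_logprob helper is hypothetical and is assumed to return natural-log probabilities.

```python
import math

def evaluate_test_text(sentences, sentence_logprob, word_count):
    """Return (log P(T), cross entropy Hp(T), perplexity PP(T)) for a test text T."""
    # log P(T) = sum of per-sentence natural-log probabilities
    # (a product in probability space)
    log_p_T = sum(sentence_logprob(s) for s in sentences)
    # cross entropy: negative average log2-probability per word, WT = word_count
    h_p = -log_p_T / (word_count * math.log(2))   # convert nats to bits per word
    # perplexity is 2 to the cross entropy; lower perplexity <=> higher P(T)
    perplexity = 2.0 ** h_p
    return log_p_T, h_p, perplexity
```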
And 104, responding to voice recognition instructions of a plurality of scenes, and outputting scene voice recognition results by matching with the associated optimal offline voice recognition model.
Finally, after a user's voice recognition instruction is received, the instruction is parsed and matched against the optimal offline voice recognition model of the relevant scene determined in the steps above, and the voice recognition result is output directly through the result output module managed by the voice recognition model of that scene. The result may take any visual or audible form, such as text, a voice broadcast converted from the text, or a short message.
Thus, with the method provided by this embodiment, voice recognition models suited to different environments can be constructed for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the different offline scene characteristics are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, and the trained voice recognition models undergo a second round of accuracy screening, so that the models are under strict accuracy control from the initial stage of training, which benefits the accuracy of offline voice recognition results in real scenes.
Example two
Fig. 2 is a flow chart of another speech recognition method according to an embodiment of the present invention. The voice recognition method can be applied to an intelligent home system, an intelligent gymnasium fitness system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition method. As shown in fig. 2, the speech recognition method may include the following operations:
Step 201, training voice data of a plurality of target scenes are collected, and preprocessing is performed on the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set.
And 202, decoding each off-line instruction training set to generate voice points, and performing lattice pruning of posterior probability on the voice points by using maximum mutual information estimation to generate a removed off-line instruction training set.
And 203, training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline speech recognition model of a plurality of scenes.
And 204, evaluating the offline speech recognition models of the multiple scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model.
Step 205, responding to the voice recognition instructions of a plurality of scenes, and outputting scene voice recognition results by matching the associated optimal offline voice recognition models.
Steps 201, 203, 204 and 205 are implemented in the same manner as steps 101 to 104 of Embodiment 1 above, and are not described again here.
Regarding step 202: in the multi-scene interaction of this embodiment, the device may be switched between scenes. For example, the voice recognition method may be built into a portable voice recognition device that is used first in a kitchen and then carried to a gymnasium. The voice recognition models for the multiple scenes contained in the device may then fail to be distinguished from one another, and when easily confused words are recognized, the indistinguishable models produce wrong recognition results. A discriminative culling training criterion is therefore added in this embodiment, whose core principle is to minimize classification error. Maximum mutual information estimation (MMIE) is mainly used for this training: it performs posterior-probability lattice pruning on the speech points to generate a culled offline instruction training set. This is implemented as follows:
First, each offline instruction training set is decoded to generate speech points. A weak language model (Weak Language Model) is used in the decoding process; a weak language model here means a model that, in natural language processing, models certain aspects of a language but with relatively limited capability, and may not accurately capture the language's full complexity and fine semantic information. In this embodiment an N-gram model or another statistics-based model may be adopted; details are not repeated here.
Next, posterior-probability lattice pruning is performed on the speech points: word hypotheses with low posterior probability are pruned and the word hypotheses are formed again, which simplifies the realization of the extended Baum-Welch algorithm.
The posterior probability P(A|B) of a speech point can be calculated from Bayes' theorem:
P(A|B) = (P(B|A) * P(A)) / P(B)
where P(A|B) is the probability that A occurs given that B has occurred, i.e. the posterior probability; P(B|A) is the probability that B occurs given that A has occurred, i.e. the likelihood; and P(A) and P(B) are the prior probabilities of A and B respectively, P(B) being the evidence (marginal likelihood). In this embodiment, A refers to the event that a given speech point occurs, and B refers to the event that the speech point is redundant.
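A rough sketch of the posterior-probability pruning step follows. The data layout is assumed, not taken from the patent: each word hypothesis is a dict carrying the likelihood P(B|A), prior P(A) and evidence P(B), and the pruning threshold is illustrative only.

```python
def prune_lattice(word_hyps, threshold=1e-3):
    """Keep only word hypotheses whose Bayes posterior exceeds the threshold.

    Each hypothesis is a dict with keys 'likelihood' -> P(B|A),
    'prior' -> P(A), 'evidence' -> P(B) (an assumed, illustrative
    representation of a lattice arc).
    """
    kept = []
    for hyp in word_hyps:
        posterior = hyp["likelihood"] * hyp["prior"] / hyp["evidence"]  # P(A|B)
        if posterior >= threshold:       # low-posterior word hypotheses are pruned
            kept.append({**hyp, "posterior": posterior})
    return kept
```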
In addition, the earlier weak language model is used to reassign a language model score to each word. Finally, maximum mutual information estimation training is started: the extended Baum-Welch algorithm is run iteratively to collect statistics, and the model parameters are updated with a normalization routine. In this embodiment a context-dependent model is preferably trained, i.e. all possible combinations of contexts are tried.
Therefore, the mutual interference of the voice recognition models in multiple scenes can be reduced, and the accuracy of the voice recognition method is further improved.
Example III
Fig. 3 is a schematic diagram of a speech recognition system according to an embodiment of the present invention. The voice recognition system can be applied to an intelligent home system, an intelligent gymnasium body-building system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition system. As shown in fig. 3, the speech recognition system may include:
a preprocessing module 31, offline speech recognition models 32, an optimal offline speech recognition model 33, and a speech recognition output module 34. The preprocessing module 31 is configured to collect training voice data for a plurality of target scenes and preprocess the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set. The offline speech recognition models 32 for the plurality of scenes are generated by training on the offline instruction training set with an extended Baum-Welch algorithm. The optimal offline speech recognition model 33 is determined by evaluating the offline speech recognition models of the plurality of scenes with a preset accuracy evaluation system. The speech recognition output module 34 is configured to respond to the speech recognition instructions of the plurality of scenes and output scene speech recognition results by matching against the associated optimal offline speech recognition model.
Specifically, the preprocessing module 31 is implemented as follows. The training voice data of each target scene is acquired through that scene's collection devices, such as voice acquisition equipment and microphone/loudspeaker units, and consists of training data gathered, before development of the voice recognition model, according to experience of the speech characteristics of each scene in its various states.
After the training voice data is obtained, it is preprocessed in a differentiated way according to the characteristics of each scene. Differentiation here means selectively denoising and purifying the training voice data in light of the scene characteristics. The speech signal in the training voice data is a non-stationary, time-varying signal that is approximately stationary only over short intervals; it carries many kinds of information, including noise introduced by the various scenes, and may contain silent segments, for example the frequent silences in a classroom. Applying a single, one-size-fits-all denoising procedure to all of the training voice data therefore ignores the scene characteristics and can lead to later recognition errors: the data may retain too much redundant information, or over-processing may strip out key recognition information, and if such data is used directly to train the subsequent voice recognition model, the computation is difficult and recognition efficiency is low. Accordingly, before training, the training voice data is preprocessed by framing, windowing, noise reduction, endpoint detection and similar steps according to the different scene characteristics. This is implemented as follows: the feature parameters for speech recognition in each scene are first extracted, keeping the features useful for recognition in that scene and removing features that convey non-lexical information such as emphasis and emotion. Feature extraction also minimizes the effect of variation caused by the speaker and by recording conditions. Features within a scene should not be strongly correlated, to avoid redundant model parameters; illustratively, this can be realized with an MFCC feature extraction method customized per scene. Based on the different feature parameters, corresponding preprocessing chains are assigned, and scene background noise is divided into 5 processing levels, with the heaviest to lightest noise defined as noise intensity levels 1 to 5. Level 1 noise intensity uses framing, windowing, noise reduction, endpoint detection and pre-emphasis; level 2 uses windowing, noise reduction, endpoint detection and pre-emphasis; level 3 uses noise reduction, endpoint detection and pre-emphasis; level 4 uses endpoint detection and pre-emphasis; level 5 uses pre-emphasis only.
Each piece of preprocessed training voice data can be paired with corresponding text information indicating the specific instruction content contained in the audio; this text information can be taken as the voice recognition result.
Specifically, the offline speech recognition models 32 are implemented as follows. The speech recognition model is trained through a preset extended Baum-Welch algorithm, where the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes. The specific implementation steps are as follows:
First, the M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data are calculated. If the sequence has M frames, M-1 distance values are obtained in total, recorded as D = [d1, d2, ..., d(M-1)]. Assuming the sequence of feature values is {x1, x2, ..., xM}, then:
dn = |x(n+1) - xn|, n = 1, 2, ..., M-1;
Thereafter, the N largest values are extracted from the M-1 distance values in D, and the speech signal frames corresponding to these maxima are taken as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene;
Finally, a plurality of hidden state sequences are generated by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and iterative training is performed with these hidden state sequences to construct the offline speech recognition models for the plurality of scenes. In this way the initial training parameters are divided according to the characteristics of the signal. The number of hidden states differs between the speech recognition models of the different scenes and can be set according to the different recognition accuracy requirements, so that the model training targets of the different scene requirements are met.
Specifically, the optimal offline speech recognition model 33 is determined by judging with a preset accuracy evaluation system, realized as follows. First, the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene is calculated, where a target speech word is a word that fits the current scene, summarized from experience; target speech words can also be obtained by excluding irrelevant words. Illustratively, in a classroom scene, kitchen-scene phrases such as "I want to wash the dishes" and "where is the soy sauce" are excluded. Under a k-gram language model, the hit probability of the target speech words of the speech data can be calculated as follows. If the test set T of the speech recognition model contains the sentences (s1, s2, ..., sn), each sentence in the test set is evaluated and the results are multiplied:
P(T) = p(s1) * p(s2) * ... * p(sn)
Thereafter, the cross entropy Hp(T) of the hit probabilities of the target speech words of each piece of speech data is calculated, and the perplexity of the offline instruction training set of each target scene is determined from this cross entropy:
Hp(T) = -(1 / WT) * log2 P(T)
where WT denotes the length (word count) of the test text T. The perplexity PPp(T) of the test text is then obtained from the cross entropy:
PPp(T) = 2^Hp(T) = P(T)^(-1/WT)
The hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene together form the accuracy evaluation system. Combining the formulas above shows that the probability P(T) of the test text is inversely related to the perplexity and the cross entropy: when P(T) reaches its maximum, the perplexity and cross entropy are lowest, and the speech recognition model satisfying this condition under the accuracy evaluation system is the optimal offline speech recognition model.
Specifically, the speech recognition output module 34 is implemented as follows: the voice recognition result is output directly through the result output module managed by the voice recognition model of each scene, and may take any visual or audible form, such as text, a voice broadcast converted from the text, or a short message.
In other preferred embodiments, the speech recognition system further comprises a culling module 35, realized as follows. First, each offline instruction training set is decoded to generate speech points. A weak language model (Weak Language Model) is used in the decoding process; a weak language model here means a model that, in natural language processing, models certain aspects of a language but with relatively limited capability, and may not accurately capture the language's full complexity and fine semantic information. In this embodiment an N-gram model or another statistics-based model may be adopted; details are not repeated here.
Next, posterior-probability lattice pruning is performed on the speech points: word hypotheses with low posterior probability are pruned and the word hypotheses are formed again, which simplifies the realization of the extended Baum-Welch algorithm.
The posterior probability P(A|B) of a speech point can be calculated from Bayes' theorem:
P(A|B) = (P(B|A) * P(A)) / P(B)
where P(A|B) is the probability that A occurs given that B has occurred, i.e. the posterior probability; P(B|A) is the probability that B occurs given that A has occurred, i.e. the likelihood; and P(A) and P(B) are the prior probabilities of A and B respectively, P(B) being the evidence (marginal likelihood). In this embodiment, A refers to the event that a given speech point occurs, and B refers to the event that the speech point is redundant.
In addition, the earlier weak language model is used to reassign a language model score to each word. Finally, maximum mutual information estimation training is started: the extended Baum-Welch algorithm is run iteratively to collect statistics, and the model parameters are updated with a normalization routine. In this embodiment a context-dependent model is preferably trained, i.e. all possible combinations of contexts are tried.
It should be noted that, the present speech recognition system may be a speech recognition module embedded in a certain intelligent scene, or may be a speech recognition system that is executed independently, and the present embodiment does not limit the installation carrier and the use form of the speech recognition system.
Example IV
Fig. 4 is an interaction schematic diagram of a speech recognition system according to an embodiment of the present invention. The voice recognition system is applied to an intelligent gymnasium body-building system. As shown in fig. 4:
A user in the gymnasium can issue a voice recognition instruction at any time, for example asking to take a shower, asking how long a running session has lasted, or asking about the trainer of the next training class. The microphone, loudspeaker and other voice acquisition modules of the intelligent gymnasium fitness system collect the voice recognition instruction and send it to the embedded voice recognition system, which generates a high-accuracy voice recognition result based on the voice recognition method of the embodiments above. The result can be presented to the user through voice playback, on-screen text, short-message push and other modes of intelligent voice interaction, improving the user's experience.
Example five
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the invention. The device described in fig. 5 can be applied to an intelligent home system, an intelligent gymnasium fitness system, a classroom education system in schools, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scenario of the voice recognition method. The embodiment of the invention is not limited to the application system of the voice recognition device. As shown in fig. 5, the apparatus may include:
a memory 501 in which executable program codes are stored;
a processor 502 coupled to the memory 501;
the processor 502 invokes executable program code stored in the memory 501 for performing the described speech recognition method.
Example six
The embodiment of the invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the described voice recognition method.
Example seven
Embodiments of the present invention disclose a computer program product comprising a non-transitory computer readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the described speech recognition method.
The embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product that may be stored in a computer-readable storage medium including Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), programmable Read-Only Memory (Programmable Read-Only Memory, PROM), erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), one-time programmable Read-Only Memory (OTPROM), electrically erasable programmable Read-Only Memory (EEPROM), compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc Memory, magnetic disc Memory, tape Memory, or any other medium that can be used for computer-readable carrying or storing data.
Finally, it should be noted that the disclosed voice recognition method and system are only preferred embodiments of the present invention and are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
Claims (7)
1. A method of speech recognition, the method comprising:
collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set;
training the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models of a plurality of scenes, wherein the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training voice data, M being a natural number, and wherein constructing the offline speech recognition models comprises:
analyzing the offline instruction training set to generate a voice signal sequence;
taking the speech signal sequence processed by the extended Baum-Welch algorithm as an initial training parameter, and constructing the offline speech recognition models of the plurality of scenes by: calculating M-1 distance values between the speech feature values of the M adjacent speech signal frames of the training voice data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maximum values as demarcation points, wherein N is the number of hidden states of the offline speech recognition model of each scene; generating a plurality of hidden state sequences in which each of the N speech signal segments corresponds to one training state of the offline speech recognition model of each scene; and performing iterative training with the hidden state sequences to construct the offline speech recognition models of the plurality of scenes;
evaluating the offline speech recognition models of the plurality of scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model; and
responding to voice recognition instructions of the plurality of scenes, and outputting scene voice recognition results by matching the associated optimal offline speech recognition models.
2. The method of claim 1, wherein the number of hidden states of the speech recognition model is different for each scene.
3. The method according to claim 1 or 2, wherein, before training the offline instruction training set with the extended Baum-Welch algorithm to construct the offline speech recognition models of the plurality of scenes, the method further comprises:
decoding each offline instruction training set to generate voice points, and performing posterior-probability lattice pruning on the voice points using maximum mutual information estimation to generate a culled offline instruction training set.
4. The method according to claim 1, wherein the offline speech recognition models of the plurality of scenes are evaluated based on a preset precision evaluation system to determine an optimal offline speech recognition model, and the preset precision evaluation system is implemented as follows:
calculating, for each piece of speech data in the offline instruction training set of each target scene, the hit probability of its target speech word;
calculating the cross entropy of the hit probabilities of the target speech words of the pieces of speech data, and determining the ambiguity of the offline instruction training set of each target scene based on the cross entropy; and
forming the precision evaluation system from the hit probabilities of the target speech words of the pieces of speech data and the ambiguity of the offline instruction training set of each target scene.
5. A speech recognition system, the system comprising:
The preprocessing module is used for collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set;
the offline speech recognition models of the multiple scenes are constructed by training the offline instruction training set based on an extended Baum-Welch algorithm, wherein the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training voice data, M being a natural number;
the offline instruction training set is analyzed to generate a speech signal sequence; the speech signal sequence processed by the extended Baum-Welch algorithm is taken as an initial training parameter, and the offline speech recognition models of the plurality of scenes are constructed by: calculating M-1 distance values between the speech feature values of the M adjacent speech signal frames of the training voice data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maximum values as demarcation points, wherein N is the number of hidden states of the offline speech recognition model of each scene; generating a plurality of hidden state sequences in which each of the N speech signal segments corresponds to one training state of the offline speech recognition model of each scene; and performing iterative training with the hidden state sequences to construct the offline speech recognition models of the plurality of scenes;
the optimal offline speech recognition model is determined by evaluating the offline speech recognition models of the plurality of scenes based on a preset precision evaluation system; and
a voice recognition output module is used for responding to voice recognition instructions of the plurality of scenes and outputting scene voice recognition results by matching the associated optimal offline speech recognition model.
6. The speech recognition system of claim 5, wherein the number of hidden states of the speech recognition model is different for each scene.
7. The speech recognition system of claim 6, wherein, for the optimal offline speech recognition model, the preset precision evaluation system is implemented as follows:
calculating, for each piece of speech data in the offline instruction training set of each target scene, the hit probability of its target speech word;
calculating the cross entropy of the hit probabilities of the target speech words of the pieces of speech data, and determining the ambiguity of the offline instruction training set of each target scene based on the cross entropy; and
forming the precision evaluation system from the hit probabilities of the target speech words of the pieces of speech data and the ambiguity of the offline instruction training set of each target scene.
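As an illustration of the construction step recited in claims 1 and 5, the following Python sketch shows one possible reading of the frame segmentation that seeds the iterative training; it is not the patented implementation. The Euclidean distance between per-frame feature vectors, the choice of N-1 interior demarcation points (so that exactly N segments, one per hidden state, result), and all function and variable names are assumptions made for the sketch only.

```python
import numpy as np

def initial_state_sequence(features: np.ndarray, n_states: int) -> np.ndarray:
    """Assign each speech frame to one of n_states initial hidden states."""
    m = features.shape[0]
    assert n_states >= 2 and m > n_states, "need more frames than states"

    # M-1 distance values between the feature vectors of adjacent frames
    # (Euclidean distance is an assumption; the claims do not fix the metric).
    dists = np.linalg.norm(np.diff(features, axis=0), axis=1)

    # Frames following the largest adjacent-frame distances act as demarcation
    # points; N-1 interior boundaries yield exactly N segments, one per state.
    boundaries = np.sort(np.argsort(dists)[-(n_states - 1):] + 1)

    # Map each of the N segments, in order, to one training state.
    states = np.empty(m, dtype=int)
    for state, (start, end) in enumerate(zip(np.r_[0, boundaries],
                                             np.r_[boundaries, m])):
        states[start:end] = state
    return states

# Toy usage: 200 frames of 13-dimensional features (e.g. MFCCs), 5 hidden states.
frames = np.random.randn(200, 13)
print(initial_state_sequence(frames, n_states=5))
```

The resulting per-frame state assignment can then serve as the initial alignment from which the extended Baum-Welch re-estimation of claims 1 and 5 iterates.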
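Claims 4 and 7 relate the hit probability of each target speech word, the cross entropy of those probabilities, and the ambiguity of a scene's offline instruction training set. The short sketch below is a hedged illustration that reads the claimed "ambiguity" as the perplexity derived from the cross entropy; the base-2 logarithm, the averaging over utterances, and the function name are assumptions, as the claims do not fix them.

```python
import math

def training_set_ambiguity(hit_probabilities):
    """Cross entropy and perplexity of one scene's offline instruction training set."""
    n = len(hit_probabilities)
    # Cross entropy: average negative log-probability of hitting the target word.
    cross_entropy = -sum(math.log2(p) for p in hit_probabilities) / n
    # "Ambiguity" read here as the perplexity derived from the cross entropy.
    perplexity = 2.0 ** cross_entropy
    return cross_entropy, perplexity

# A scene whose target words are recognized with high probability scores a low
# perplexity, which the precision evaluation system can use to rank scene models.
print(training_set_ambiguity([0.9, 0.8, 0.95, 0.7]))
```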
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311627049.5A CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311627049.5A CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117456987A (en) | 2024-01-26
CN117456987B (en) | 2024-06-21
Family
ID=89583770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311627049.5A Active CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117456987B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281745B (en) * | 2008-05-23 | 2011-08-10 | 深圳市北科瑞声科技有限公司 | Interactive system for vehicle-mounted voice |
GB2480084B (en) * | 2010-05-05 | 2012-08-08 | Toshiba Res Europ Ltd | A speech processing system and method |
CN106157950A (en) * | 2016-09-29 | 2016-11-23 | 合肥华凌股份有限公司 | Speech control system and awakening method, Rouser and household electrical appliances, coprocessor |
CN111613212B (en) * | 2020-05-13 | 2023-10-31 | 携程旅游信息技术(上海)有限公司 | Speech recognition method, system, electronic device and storage medium |
CN114530141A (en) * | 2020-11-23 | 2022-05-24 | 北京航空航天大学 | Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof |
- 2023-11-29: CN application CN202311627049.5A, patent CN117456987B (en), status Active
Non-Patent Citations (2)
Title |
---|
Neural networks for statistical recognition of continuous speech; Nelson Morgan et al.; Proceedings of the IEEE; 1995-05-31; Vol. 83, No. 5; full text *
A speech recognition model for power system dispatching control based on a deep neural network; Hu Xiang et al.; Electron Devices (电子器件); 2023-02-28; Vol. 46, No. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117456987A (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111775B (en) | Streaming voice recognition method, device, equipment and storage medium | |
CN109817213B (en) | Method, device and equipment for performing voice recognition on self-adaptive language | |
CN110473531B (en) | Voice recognition method, device, electronic equipment, system and storage medium | |
US6718303B2 (en) | Apparatus and method for automatically generating punctuation marks in continuous speech recognition | |
CN110610707B (en) | Voice keyword recognition method and device, electronic equipment and storage medium | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN108428446A (en) | Audio recognition method and device | |
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary | |
CN112017645B (en) | Voice recognition method and device | |
CN111862942B (en) | Method and system for training mixed speech recognition model of Mandarin and Sichuan | |
CN109036471B (en) | Voice endpoint detection method and device | |
CN107871499B (en) | Speech recognition method, system, computer device and computer-readable storage medium | |
CN110634469B (en) | Speech signal processing method and device based on artificial intelligence and storage medium | |
CN112017690B (en) | Audio processing method, device, equipment and medium | |
CN110853669B (en) | Audio identification method, device and equipment | |
CN114333828A (en) | Quick voice recognition system for digital product | |
CN117765932A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN110570838B (en) | Voice stream processing method and device | |
CN117456987B (en) | Voice recognition method and system | |
CN115512692B (en) | Voice recognition method, device, equipment and storage medium | |
CN111640423A (en) | Word boundary estimation method and device and electronic equipment | |
CN114783410B (en) | Speech synthesis method, system, electronic device and storage medium | |
CN115762500A (en) | Voice processing method, device, equipment and storage medium | |
CN111833869B (en) | Voice interaction method and system applied to urban brain | |
CN114792521A (en) | Intelligent answering method and device based on voice recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |