CN117456987B - Voice recognition method and system - Google Patents
Voice recognition method and system
- Publication number
- CN117456987B (application CN202311627049.5A)
- Authority
- CN
- China
- Prior art keywords
- voice
- offline
- training
- scenes
- voice recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/05—Word boundary detection
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
- Y02T10/40—Engine management systems
Abstract
The invention discloses a voice recognition method comprising the following steps: collecting training voice data for a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline voice recognition models for the plurality of scenes; evaluating the offline voice recognition models of the plurality of scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model; and, in response to voice recognition instructions from the scenes, outputting scene voice recognition results by matching them against the associated optimal offline voice recognition model. The disclosed method and system can perform high-accuracy voice recognition across a number of different voice scenes, and address the poor recognition accuracy, low robustness and poor user experience of existing voice recognition models.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method and system.
Background
With the development of speech recognition technology, speech-recognition-based applications have become increasingly widespread and now reach into home life, office work, entertainment and other areas. Speech is captured through an external or built-in microphone on a mobile phone, smart computer, notebook computer, learning machine, smart-home terminal and the like, and speech-to-text conversion is then completed by a speech recognition model embedded in the input device.
At present, the recognition accuracy of speech recognition models is a key concern. In certain everyday voice scenes, such as gymnasiums, beauty parlors, kitchens and schools, a real-time network connection is often unavailable, so there is a particular need for offline speech recognition. In these offline scenarios, if high-accuracy recognition results cannot be provided, the user's site of use is disturbed and the interactive experience is greatly degraded.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a voice recognition method and a voice recognition system, which can perform high-precision voice recognition in a plurality of different voice scenes, and solve the problems of poor recognition precision, low robustness and poor user experience of the existing voice recognition model.
To solve the above technical problem, a first aspect of the present invention discloses a speech recognition method, which includes: collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline voice recognition model of a plurality of scenes; evaluating the offline speech recognition models of the scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model; and responding to the voice recognition instructions of the scenes, and outputting scene voice recognition results by matching the associated optimal offline voice recognition models.
In some embodiments, training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models for a plurality of scenes comprises: the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; parsing the offline instruction training set to generate a speech signal sequence; and using the speech signal sequence processed by the Baum-Welch algorithm as the initial training parameters to construct the offline speech recognition models for the plurality of scenes.
In some embodiments, constructing the offline speech recognition models for the plurality of scenes using the speech signal sequence processed by the Baum-Welch algorithm as the initial model parameters comprises: calculating M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maxima as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene; and generating a plurality of hidden state sequences by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and performing iterative training with the hidden state sequences to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the number of hidden states of the speech recognition model is different for each scene.
In some embodiments, training on the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models for a plurality of scenes further comprises: decoding each offline instruction training set to generate speech points, and performing posterior-probability lattice pruning on the speech points using maximum mutual information estimation to generate a culled offline instruction training set.
In some embodiments, the offline speech recognition models for the plurality of scenes are evaluated with a preset accuracy evaluation system to determine the optimal offline speech recognition model, where the preset accuracy evaluation system is implemented as follows: calculating the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene; calculating the cross entropy of these hit probabilities, and determining the perplexity of the offline instruction training set of each target scene from the cross entropy; and forming the accuracy evaluation system from the hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene.
In a second aspect, the invention discloses a speech recognition system, the system comprising: a preprocessing module, configured to collect training voice data for a plurality of target scenes and preprocess the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set; offline voice recognition models for the multiple scenes, generated by training on the offline instruction training set with an extended Baum-Welch algorithm; an optimal offline voice recognition model, determined by evaluating the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system; and a voice recognition output module, configured to respond to voice recognition instructions from the multiple scenes and output scene voice recognition results by matching against the associated optimal offline voice recognition model.
In some implementations, in the offline speech recognition models for the multiple scenarios: the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the offline speech recognition models for the multiple scenarios are implemented by: calculating M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maxima as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene; and generating a plurality of hidden state sequences by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and performing iterative training with the hidden state sequences to construct the offline speech recognition models for the plurality of scenes.
In some implementations, the number of hidden states of the speech recognition model is different for each scene.
In some embodiments, for the optimal offline speech recognition model, the preset accuracy evaluation system is implemented as: calculating the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene; calculating the cross entropy of these hit probabilities, and determining the perplexity of the offline instruction training set of each target scene from the cross entropy; and forming the accuracy evaluation system from the hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene.
A third aspect of the invention discloses a computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a speech recognition method as described.
The fourth aspect of the present invention discloses a speech recognition apparatus, the apparatus comprising: a memory storing executable program code; a processor coupled to the memory; the processor invokes the executable program code stored in the memory to perform the speech recognition method.
Compared with the prior art, the invention has the beneficial effects that:
By implementing the invention, voice recognition models suited to different environments can be constructed for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the characteristics of the different offline voice scenes are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, so that the voice recognition model is under strict accuracy control from the very start of training, which benefits the accuracy of subsequent offline voice recognition in real scenes. In addition, to further control the recognition accuracy of different scenes in the offline state, the invention evaluates the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model. The invention therefore not only performs accuracy control in the early stage of model training, but also screens out interference factors again after the voice recognition model has been generated, greatly improving the accuracy and robustness of the offline voice recognition models in the different target scenes and greatly improving the user's interactive experience.
Drawings
FIG. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another speech recognition method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech recognition system according to an embodiment of the present invention;
FIG. 4 is an interactive schematic diagram of a speech recognition system according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention.
Detailed Description
For a better understanding and implementation, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules that are expressly listed or inherent to such process, method, article, or apparatus.
The embodiments of the invention disclose a voice recognition method and system that can construct voice recognition models suited to different environments for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the characteristics of the different offline voice scenes are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, so that the voice recognition model is under strict accuracy control from the very start of training, which benefits the accuracy of subsequent offline voice recognition in real scenes. In addition, to further control the recognition accuracy of different scenes in the offline state, the invention evaluates the offline voice recognition models of the multiple scenes with a preset accuracy evaluation system to determine the optimal offline voice recognition model. The invention therefore not only performs accuracy control in the early stage of model training, but also screens out interference factors again after the voice recognition model has been generated, greatly improving the accuracy and robustness of the offline voice recognition models in the different target scenes and greatly improving the user's interactive experience.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of a voice recognition method according to an embodiment of the invention. The voice recognition method can be applied to an intelligent home system, an intelligent gymnasium fitness system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition method. As shown in fig. 1, the speech recognition method may include the following operations:
step 101, training voice data of a plurality of target scenes are collected, and preprocessing is performed on the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set.
In today's diverse smart-voice use scenes, the speech data received differs from scene to scene. In a gymnasium, speech related to exercise and training occurs frequently; in a school classroom, speech related to education occurs frequently; in a kitchen, speech related to cooking occurs frequently. Across these scenes, not only do the keywords and their frequencies differ, the background noise is also a major source of interference. The training voice data of each target scene is acquired through that scene's collection devices, such as voice acquisition equipment and microphone/loudspeaker units, and consists of training data gathered, before development of the voice recognition model, according to experience of the speech characteristics of each scene in its various states.
After the training voice data is obtained, it is preprocessed in a differentiated way according to the characteristics of each scene. Differentiation here means selectively denoising and purifying the training voice data in light of the scene characteristics. The speech signal in the training voice data is a non-stationary, time-varying signal that is approximately stationary only over short intervals; it carries many kinds of information, including noise introduced by the various scenes, and may contain silent segments, for example the frequent silences in a classroom. Applying a single, one-size-fits-all denoising procedure to all of the training voice data therefore ignores the scene characteristics and can lead to later recognition errors: the data may retain too much redundant information, or over-processing may strip out key recognition information, and if such data is used directly to train the subsequent voice recognition model, the computation is difficult and recognition efficiency is low. Accordingly, before training, the training voice data is preprocessed by framing, windowing, noise reduction, endpoint detection and similar steps according to the different scene characteristics. This is implemented as follows: the feature parameters for speech recognition in each scene are first extracted, keeping the features useful for recognition in that scene and removing features that convey non-lexical information such as emphasis and emotion. Feature extraction also minimizes the effect of variation caused by the speaker and by recording conditions. Features within a scene should not be strongly correlated, to avoid redundant model parameters; illustratively, this can be realized with an MFCC feature extraction method customized per scene. Based on the different feature parameters, corresponding preprocessing chains are assigned, and scene background noise is divided into 5 processing levels, with the heaviest to lightest noise defined as noise intensity levels 1 to 5. Level 1 noise intensity uses framing, windowing, noise reduction, endpoint detection and pre-emphasis; level 2 uses windowing, noise reduction, endpoint detection and pre-emphasis; level 3 uses noise reduction, endpoint detection and pre-emphasis; level 4 uses endpoint detection and pre-emphasis; level 5 uses pre-emphasis only.
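The mapping from noise level to preprocessing chain described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper functions (frame_signal, apply_window, reduce_noise, detect_endpoints) are hypothetical placeholders standing in for a real speech front end, and only the pre-emphasis filter is spelled out.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.97):
    # Standard first-order pre-emphasis: y[t] = x[t] - alpha * x[t-1]
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# Hypothetical placeholders for the remaining stages; real implementations
# would come from a speech front-end library.
def frame_signal(signal):     return signal   # framing
def apply_window(signal):     return signal   # windowing (e.g. Hamming)
def reduce_noise(signal):     return signal   # noise reduction
def detect_endpoints(signal): return signal   # endpoint detection (VAD)

# Level 1 = heaviest background noise, level 5 = lightest, as described above.
PIPELINES = {
    1: [frame_signal, apply_window, reduce_noise, detect_endpoints, pre_emphasize],
    2: [apply_window, reduce_noise, detect_endpoints, pre_emphasize],
    3: [reduce_noise, detect_endpoints, pre_emphasize],
    4: [detect_endpoints, pre_emphasize],
    5: [pre_emphasize],
}

def preprocess(signal, noise_level):
    """Apply the scene-specific preprocessing chain to one training utterance."""
    for step in PIPELINES[noise_level]:
        signal = step(signal)
    return signal
```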
Each piece of preprocessed training voice data can be paired with corresponding text information indicating the specific instruction content contained in the audio; this text information can be taken as the voice recognition result.
And 102, training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline speech recognition model of a plurality of scenes.
Model training for speech recognition can use supervised or unsupervised learning. In supervised learning the training data set must consist of labeled samples, the recognition result being represented by a label attached to the data to be recognized; an unsupervised method only needs to analyze the data set itself, with no labels prepared in advance. Manually labeling training data tends to be costly, especially in the field of speech signal recognition. The offline instruction training set generated in step 101 of this embodiment omits the labeling process, and in this step the training of the speech recognition model is completed with an unsupervised method. The Baum-Welch unsupervised learning algorithm is therefore adopted: an initial estimate of the model parameters is made first, then the fitness of the parameters is evaluated against the given training data set, the error caused by the current model parameters is reduced, and the parameters are updated continually so that the error with respect to the given training data keeps shrinking, finally yielding the model parameters that are optimal for the given training data set.
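For orientation, the iterative re-estimation loop just described (initialize parameters, re-estimate against the training set, repeat until the likelihood stops improving) can be sketched roughly as below. This is a schematic outline only, assuming generic e_step/m_step helpers for the forward-backward statistics and parameter updates; it is not the patent's exact procedure.

```python
def baum_welch_train(observations, init_params, max_iters=50, tol=1e-4):
    """Schematic EM loop: re-estimate HMM parameters until the likelihood converges."""
    params = init_params              # lambda = (A, B, pi), however initialized
    prev_ll = float("-inf")
    for _ in range(max_iters):
        # E-step: forward-backward pass yields expected state/transition counts
        stats, log_likelihood = e_step(params, observations)   # assumed helper
        # M-step: update A, B, pi from the accumulated statistics
        params = m_step(stats)                                   # assumed helper
        if log_likelihood - prev_ll < tol:    # stop once improvement is negligible
            break
        prev_ll = log_likelihood
    return params
```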
However, in model training under the Baum-Welch algorithm, repeated experiments show that the choice of initial parameters not only affects the convergence speed of the speech recognition algorithm but also determines whether the model finally converges to the global optimum, and therefore affects the accuracy of the speech recognition model. An improvement that extends the Baum-Welch algorithm was therefore developed. Taking an HMM as the model for illustration, in the traditional Baum-Welch algorithm the HMM is specified by the parameters lambda = (A, B, pi). A, the state transition probability matrix (Transition Probability Matrix), gives the probability of moving from one hidden state to another; element A(i, j) is the probability of transitioning from hidden state i to hidden state j. B, the observation probability matrix (Observation Probability Matrix), gives the probability of generating each observation symbol in each hidden state; element B(i, j) is the probability of generating observation symbol j in hidden state i. Pi, the initial state probability vector (Initial State Probability Vector), gives the probability distribution over the hidden states at the initial time; element pi(i) is the probability that the model is in hidden state i at the initial moment. For the initial values of A, B and pi, an averaging method is generally adopted: the speech signal sequence O = [o1, o2, ..., oT] is first divided into N segments, N being the number of hidden states; each small segment of speech is then assigned to one state of the HMM, segment-wise K-means clustering is performed, and the weight, mean, variance and other parameters are computed for each class. This traditional way of choosing initial model parameters simply divides the speech signal into equal parts; the division is coarse and does not reflect the characteristics of the training speech data under different scene conditions. The pronunciation state of a speech signal is closely related to each pronounced phoneme: different short speech segments representing the same phoneme show strong correlation, while speech segments corresponding to adjacent phonemes are only weakly correlated. The selection of the initial model parameters can therefore be improved using the correlation between adjacent frames of the speech signal. Here, the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes. The specific implementation steps are as follows:
First, the M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data are calculated. If the sequence has M frames, M-1 distance values are obtained in total, recorded as D = [d1, d2, ..., d(M-1)]. Assuming the sequence of feature values is {x1, x2, ..., xM}, then:
dn = |x(n+1) - xn|, n = 1, 2, ..., M-1;
Thereafter, the N largest values are extracted from the M-1 distance values in D, and the speech signal frames corresponding to these maxima are taken as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene;
Finally, a plurality of hidden state sequences are generated by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and iterative training is performed with these hidden state sequences to construct the offline speech recognition models for the plurality of scenes. In this way the initial training parameters are divided according to the characteristics of the signal. The number of hidden states differs between the speech recognition models of the different scenes and can be set according to the different recognition accuracy requirements, so that the model training targets of the different scene requirements are met.
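The initial-parameter selection described in the steps above, cutting the frame sequence where adjacent-frame feature distances are largest, can be sketched as below. It is a minimal illustration under assumptions not fixed by the text: each frame is represented by an MFCC-style feature vector, the Euclidean distance is used, and N - 1 internal cuts are taken so that exactly N segments result (the text speaks of N maxima; this sketch adopts the reading that yields N segments).

```python
import numpy as np

def initial_state_segments(features, n_states):
    """Split an (M, dim) matrix of frame features into n_states segments,
    cutting at the frames where the adjacent-frame distance is largest."""
    # M-1 distances between feature vectors of adjacent frames
    d = np.linalg.norm(features[1:] - features[:-1], axis=1)
    # indices of the largest distances serve as segment boundaries
    cuts = np.sort(np.argsort(d)[-(n_states - 1):]) + 1
    segments = np.split(features, cuts)
    # each segment is then associated with one hidden state of the scene's model
    return segments
```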
And step 103, evaluating the offline speech recognition models of the multiple scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model.
After the offline speech recognition models for the multiple scenes have been built, machine noise may have been introduced during training by processes such as machine learning and neural-network learning, so accuracy control is performed once more after the models are output; in this embodiment a preset accuracy evaluation system is adopted for this judgment. The preset accuracy evaluation system is realized as follows. First, the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene is calculated, where a target speech word is a word that fits the current scene, summarized from experience; target speech words can also be obtained by excluding irrelevant words. Illustratively, in a classroom scene, kitchen-scene phrases such as "I want to wash the dishes" and "where is the soy sauce" are excluded. Under a k-gram language model, the hit probability of the target speech words of the speech data can be calculated as follows. If the test set T of the speech recognition model contains the sentences (s1, s2, ..., sn), each sentence in the test set is evaluated and the results are multiplied:
P(T) = p(s1) * p(s2) * ... * p(sn)
Thereafter, the cross entropy Hp(T) of the hit probabilities of the target speech words of each piece of speech data is calculated, and the perplexity of the offline instruction training set of each target scene is determined from this cross entropy:
Hp(T) = -(1 / WT) * log2 P(T)
where WT denotes the length (word count) of the test text T. The perplexity PPp(T) of the test text is then obtained from the cross entropy:
PPp(T) = 2^Hp(T) = P(T)^(-1/WT)
The hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene together form the accuracy evaluation system. Combining the formulas above shows that the probability P(T) of the test text is inversely related to the perplexity and the cross entropy: when P(T) reaches its maximum, the perplexity and cross entropy are lowest, and the speech recognition model satisfying this condition under the accuracy evaluation system is the optimal offline speech recognition model.
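A minimal sketch of the evaluation quantities above — per-sentence probability product, cross entropy over the test text, and the resulting perplexity — is given below. It assumes sentence probabilities come from a k-gram language model scored elsewhere; the sentence_logprob helper is hypothetical and is assumed to return natural-log probabilities.

```python
import math

def evaluate_test_text(sentences, sentence_logprob, word_count):
    """Return (log P(T), cross entropy Hp(T), perplexity PP(T)) for a test text T."""
    # log P(T) = sum of per-sentence natural-log probabilities
    # (a product in probability space)
    log_p_T = sum(sentence_logprob(s) for s in sentences)
    # cross entropy: negative average log2-probability per word, WT = word_count
    h_p = -log_p_T / (word_count * math.log(2))   # convert nats to bits per word
    # perplexity is 2 to the cross entropy; lower perplexity <=> higher P(T)
    perplexity = 2.0 ** h_p
    return log_p_T, h_p, perplexity
```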
And 104, responding to voice recognition instructions of a plurality of scenes, and outputting scene voice recognition results by matching with the associated optimal offline voice recognition model.
Finally, after a user's voice recognition instruction is received, the instruction is parsed and matched against the optimal offline voice recognition model of the relevant scene determined in the steps above, and the voice recognition result is output directly through the result output module managed by the voice recognition model of that scene. The result may take any visual or audible form, such as text, a voice broadcast converted from the text, or a short message.
Thus, with the method provided by this embodiment, voice recognition models suited to different environments can be constructed for different offline voice scenes. The existing Baum-Welch algorithm is improved: the training data sets gathered under the different offline scene characteristics are precision-processed and then used as the initial training parameters of the Baum-Welch algorithm, and the trained voice recognition models undergo a second round of accuracy screening, so that the models are under strict accuracy control from the initial stage of training, which benefits the accuracy of offline voice recognition results in real scenes.
Example two
Fig. 2 is a flow chart of another speech recognition method according to an embodiment of the present invention. The voice recognition method can be applied to an intelligent home system, an intelligent gymnasium fitness system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition method. As shown in fig. 2, the speech recognition method may include the following operations:
Step 201, training voice data of a plurality of target scenes are collected, and preprocessing is performed on the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set.
And 202, decoding each off-line instruction training set to generate voice points, and performing lattice pruning of posterior probability on the voice points by using maximum mutual information estimation to generate a removed off-line instruction training set.
And 203, training the offline instruction training set by expanding a Baum-Welch algorithm to construct an offline speech recognition model of a plurality of scenes.
And 204, evaluating the offline speech recognition models of the multiple scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model.
Step 205, responding to the voice recognition instructions of a plurality of scenes, and outputting scene voice recognition results by matching the associated optimal offline voice recognition models.
Steps 201, 203, 204 and 205 are implemented in the same manner as steps 101 to 104 of Embodiment 1 above, and are not described again here.
Regarding step 202: in the multi-scene interaction of this embodiment, the device may be switched between scenes. For example, the voice recognition method may be built into a portable voice recognition device that is used first in a kitchen and then carried to a gymnasium. The voice recognition models for the multiple scenes contained in the device may then fail to be distinguished from one another, and when easily confused words are recognized, the indistinguishable models produce wrong recognition results. A discriminative culling training criterion is therefore added in this embodiment, whose core principle is to minimize classification error. Maximum mutual information estimation (MMIE) is mainly used for this training: it performs posterior-probability lattice pruning on the speech points to generate a culled offline instruction training set. This is implemented as follows:
First, each offline instruction training set is decoded to generate speech points. A weak language model (Weak Language Model) is used in the decoding process; a weak language model here means a model that, in natural language processing, models certain aspects of a language but with relatively limited capability, and may not accurately capture the language's full complexity and fine semantic information. In this embodiment an N-gram model or another statistics-based model may be adopted; details are not repeated here.
Next, posterior-probability lattice pruning is performed on the speech points: word hypotheses with low posterior probability are pruned and the word hypotheses are formed again, which simplifies the realization of the extended Baum-Welch algorithm.
The posterior probability P(A|B) of a speech point can be calculated from Bayes' theorem:
P(A|B) = (P(B|A) * P(A)) / P(B)
where P(A|B) is the probability that A occurs given that B has occurred, i.e. the posterior probability; P(B|A) is the probability that B occurs given that A has occurred, i.e. the likelihood; and P(A) and P(B) are the prior probabilities of A and B respectively, P(B) being the evidence (marginal likelihood). In this embodiment, A refers to the event that a given speech point occurs, and B refers to the event that the speech point is redundant.
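A rough sketch of the posterior-probability pruning step follows. The data layout is assumed, not taken from the patent: each word hypothesis is a dict carrying the likelihood P(B|A), prior P(A) and evidence P(B), and the pruning threshold is illustrative only.

```python
def prune_lattice(word_hyps, threshold=1e-3):
    """Keep only word hypotheses whose Bayes posterior exceeds the threshold.

    Each hypothesis is a dict with keys 'likelihood' -> P(B|A),
    'prior' -> P(A), 'evidence' -> P(B) (an assumed, illustrative
    representation of a lattice arc).
    """
    kept = []
    for hyp in word_hyps:
        posterior = hyp["likelihood"] * hyp["prior"] / hyp["evidence"]  # P(A|B)
        if posterior >= threshold:       # low-posterior word hypotheses are pruned
            kept.append({**hyp, "posterior": posterior})
    return kept
```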
In addition, the earlier weak language model is used to reassign a language model score to each word. Finally, maximum mutual information estimation training is started: the extended Baum-Welch algorithm is run iteratively to collect statistics, and the model parameters are updated with a normalization routine. In this embodiment a context-dependent model is preferably trained, i.e. all possible combinations of contexts are tried.
Therefore, the mutual interference of the voice recognition models in multiple scenes can be reduced, and the accuracy of the voice recognition method is further improved.
Example III
Fig. 3 is a schematic diagram of a speech recognition system according to an embodiment of the present invention. The voice recognition system can be applied to an intelligent home system, an intelligent gymnasium body-building system, a school classroom education system, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scene of the voice recognition system. As shown in fig. 3, the speech recognition system may include:
a preprocessing module 31, offline speech recognition models 32, an optimal offline speech recognition model 33, and a speech recognition output module 34. The preprocessing module 31 is configured to collect training voice data for a plurality of target scenes and preprocess the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set. The offline speech recognition models 32 for the plurality of scenes are generated by training on the offline instruction training set with an extended Baum-Welch algorithm. The optimal offline speech recognition model 33 is determined by evaluating the offline speech recognition models of the plurality of scenes with a preset accuracy evaluation system. The speech recognition output module 34 is configured to respond to the speech recognition instructions of the plurality of scenes and output scene speech recognition results by matching against the associated optimal offline speech recognition model.
Specifically, the preprocessing module 31 is implemented as follows. The training voice data of each target scene is acquired through that scene's collection devices, such as voice acquisition equipment and microphone/loudspeaker units, and consists of training data gathered, before development of the voice recognition model, according to experience of the speech characteristics of each scene in its various states.
After the training voice data is obtained, it is preprocessed in a differentiated way according to the characteristics of each scene. Differentiation here means selectively denoising and purifying the training voice data in light of the scene characteristics. The speech signal in the training voice data is a non-stationary, time-varying signal that is approximately stationary only over short intervals; it carries many kinds of information, including noise introduced by the various scenes, and may contain silent segments, for example the frequent silences in a classroom. Applying a single, one-size-fits-all denoising procedure to all of the training voice data therefore ignores the scene characteristics and can lead to later recognition errors: the data may retain too much redundant information, or over-processing may strip out key recognition information, and if such data is used directly to train the subsequent voice recognition model, the computation is difficult and recognition efficiency is low. Accordingly, before training, the training voice data is preprocessed by framing, windowing, noise reduction, endpoint detection and similar steps according to the different scene characteristics. This is implemented as follows: the feature parameters for speech recognition in each scene are first extracted, keeping the features useful for recognition in that scene and removing features that convey non-lexical information such as emphasis and emotion. Feature extraction also minimizes the effect of variation caused by the speaker and by recording conditions. Features within a scene should not be strongly correlated, to avoid redundant model parameters; illustratively, this can be realized with an MFCC feature extraction method customized per scene. Based on the different feature parameters, corresponding preprocessing chains are assigned, and scene background noise is divided into 5 processing levels, with the heaviest to lightest noise defined as noise intensity levels 1 to 5. Level 1 noise intensity uses framing, windowing, noise reduction, endpoint detection and pre-emphasis; level 2 uses windowing, noise reduction, endpoint detection and pre-emphasis; level 3 uses noise reduction, endpoint detection and pre-emphasis; level 4 uses endpoint detection and pre-emphasis; level 5 uses pre-emphasis only.
Each piece of preprocessed training voice data can be paired with corresponding text information indicating the specific instruction content contained in the audio; this text information can be taken as the voice recognition result.
Specifically, the offline speech recognition models 32 are implemented as follows. The speech recognition model is trained through a preset extended Baum-Welch algorithm, where the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training speech data, M being a natural number; the offline instruction training set is parsed to generate a speech signal sequence; and the speech signal sequence processed by the Baum-Welch algorithm is used as the initial training parameters to construct the offline speech recognition models for the plurality of scenes. The specific implementation steps are as follows:
First, the M-1 distance values between the speech feature values of M adjacent speech signal frames of the training speech data are calculated. If the sequence has M frames, M-1 distance values are obtained in total, recorded as D = [d1, d2, ..., d(M-1)]. Assuming the sequence of feature values is {x1, x2, ..., xM}, then:
dn = |x(n+1) - xn|, n = 1, 2, ..., M-1;
Thereafter, the N largest values are extracted from the M-1 distance values in D, and the speech signal frames corresponding to these maxima are taken as demarcation points, where N is the number of hidden states of the offline speech recognition model for each scene;
Finally, a plurality of hidden state sequences are generated by associating each of the N speech signal segments with one training state of the offline speech recognition model for each scene, and iterative training is performed with these hidden state sequences to construct the offline speech recognition models for the plurality of scenes. In this way the initial training parameters are divided according to the characteristics of the signal. The number of hidden states differs between the speech recognition models of the different scenes and can be set according to the different recognition accuracy requirements, so that the model training targets of the different scene requirements are met.
Specifically, the optimal offline speech recognition model 33 is determined by judging with a preset accuracy evaluation system, realized as follows. First, the hit probability of the target speech words of each piece of speech data in the offline instruction training set of each target scene is calculated, where a target speech word is a word that fits the current scene, summarized from experience; target speech words can also be obtained by excluding irrelevant words. Illustratively, in a classroom scene, kitchen-scene phrases such as "I want to wash the dishes" and "where is the soy sauce" are excluded. Under a k-gram language model, the hit probability of the target speech words of the speech data can be calculated as follows. If the test set T of the speech recognition model contains the sentences (s1, s2, ..., sn), each sentence in the test set is evaluated and the results are multiplied:
P(T) = p(s1) * p(s2) * ... * p(sn)
Thereafter, the cross entropy Hp(T) of the hit probabilities of the target speech words of each piece of speech data is calculated, and the perplexity of the offline instruction training set of each target scene is determined from this cross entropy:
Hp(T) = -(1 / WT) * log2 P(T)
where WT denotes the length (word count) of the test text T. The perplexity PPp(T) of the test text is then obtained from the cross entropy:
PPp(T) = 2^Hp(T) = P(T)^(-1/WT)
The hit probability of the target speech words of each piece of speech data and the perplexity of the offline instruction training set of each target scene together form the accuracy evaluation system. Combining the formulas above shows that the probability P(T) of the test text is inversely related to the perplexity and the cross entropy: when P(T) reaches its maximum, the perplexity and cross entropy are lowest, and the speech recognition model satisfying this condition under the accuracy evaluation system is the optimal offline speech recognition model.
Specifically, the speech recognition output module 34 is implemented as follows: the voice recognition result is output directly through the result output module managed by the voice recognition model of each scene, and may take any visual or audible form, such as text, a voice broadcast converted from the text, or a short message.
In other preferred embodiments, the speech recognition system further comprises a culling module 35, realized as follows. First, each offline instruction training set is decoded to generate speech points. A weak language model (Weak Language Model) is used in the decoding process; a weak language model here means a model that, in natural language processing, models certain aspects of a language but with relatively limited capability, and may not accurately capture the language's full complexity and fine semantic information. In this embodiment an N-gram model or another statistics-based model may be adopted; details are not repeated here.
Next, posterior-probability lattice pruning is performed on the speech points: word hypotheses with low posterior probability are pruned and the word hypotheses are formed again, which simplifies the realization of the extended Baum-Welch algorithm.
The posterior probability P(A|B) of a speech point can be calculated from Bayes' theorem:
P(A|B) = (P(B|A) * P(A)) / P(B)
where P(A|B) is the probability that A occurs given that B has occurred, i.e. the posterior probability; P(B|A) is the probability that B occurs given that A has occurred, i.e. the likelihood; and P(A) and P(B) are the prior probabilities of A and B respectively, P(B) being the evidence (marginal likelihood). In this embodiment, A refers to the event that a given speech point occurs, and B refers to the event that the speech point is redundant.
In addition, the earlier weak language model is used to reassign a language model score to each word. Finally, maximum mutual information estimation training is started: the extended Baum-Welch algorithm is run iteratively to collect statistics, and the model parameters are updated with a normalization routine. In this embodiment a context-dependent model is preferably trained, i.e. all possible combinations of contexts are tried.
It should be noted that, the present speech recognition system may be a speech recognition module embedded in a certain intelligent scene, or may be a speech recognition system that is executed independently, and the present embodiment does not limit the installation carrier and the use form of the speech recognition system.
Example IV
Fig. 4 is an interaction schematic diagram of a speech recognition system according to an embodiment of the present invention. The voice recognition system is applied to an intelligent gymnasium body-building system. As shown in fig. 4:
A user in the gymnasium can issue a voice recognition instruction at any time, for example asking to take a shower, asking how long a running session has lasted, or asking about the trainer of the next training class. The microphone, loudspeaker and other voice acquisition modules of the intelligent gymnasium fitness system collect the voice recognition instruction and send it to the embedded voice recognition system, which generates a high-accuracy voice recognition result based on the voice recognition method of the embodiments above. The result can be presented to the user through voice playback, on-screen text, short-message push and other modes of intelligent voice interaction, improving the user's experience.
Example five
Referring to fig. 5, fig. 5 is a schematic structural diagram of a voice recognition device according to an embodiment of the invention. The device described in fig. 5 can be applied to an intelligent home system, an intelligent gymnasium fitness system, a classroom education system in schools, a KTV singing system and the like, and the embodiment of the invention is not limited to the application scenario of the voice recognition method. The embodiment of the invention is not limited to the application system of the voice recognition device. As shown in fig. 5, the apparatus may include:
a memory 501 in which executable program codes are stored;
a processor 502 coupled to the memory 501;
the processor 502 invokes executable program code stored in the memory 501 for performing the described speech recognition method.
Example six
The embodiment of the invention discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the described voice recognition method.
Example seven
Embodiments of the present invention disclose a computer program product comprising a non-transitory computer readable storage medium storing a computer program, and the computer program is operable to cause a computer to perform the described speech recognition method.
The embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above detailed description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course by means of hardware. Based on such understanding, the foregoing technical solutions may be embodied essentially or in part in the form of a software product that may be stored in a computer-readable storage medium including Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), programmable Read-Only Memory (Programmable Read-Only Memory, PROM), erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), one-time programmable Read-Only Memory (OTPROM), electrically erasable programmable Read-Only Memory (EEPROM), compact disc Read-Only Memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc Memory, magnetic disc Memory, tape Memory, or any other medium that can be used for computer-readable carrying or storing data.
Finally, it should be noted that the disclosed voice recognition method and system are only preferred embodiments of the present invention and are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.
Claims (7)
1. A method of speech recognition, the method comprising:
collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set;
training the offline instruction training set with an extended Baum-Welch algorithm to construct offline speech recognition models of a plurality of scenes, wherein the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training voice data, M being a natural number, and wherein constructing the offline speech recognition models comprises:
analyzing the offline instruction training set to generate a voice signal sequence;
taking the speech signal sequence processed by the extended Baum-Welch algorithm as an initial training parameter, and constructing the offline speech recognition models of the plurality of scenes by: calculating M-1 distance values between the speech feature values of the M adjacent speech signal frames of the training voice data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maximum values as demarcation points, wherein N is the number of hidden states of the offline speech recognition model of each scene; generating a plurality of hidden state sequences in which each of the N speech signal segments corresponds to one training state of the offline speech recognition model of each scene; and performing iterative training with the hidden state sequences to construct the offline speech recognition models of the plurality of scenes;
evaluating the offline speech recognition models of the plurality of scenes based on a preset precision evaluation system to determine an optimal offline speech recognition model; and
responding to voice recognition instructions of the plurality of scenes, and outputting scene voice recognition results by matching the associated optimal offline speech recognition models.
2. The method of claim 1, wherein the number of hidden states of the speech recognition model is different for each scene.
3. The method according to claim 1 or 2, wherein, before training the offline instruction training set with the extended Baum-Welch algorithm to construct the offline speech recognition models of the plurality of scenes, the method further comprises:
decoding each offline instruction training set to generate voice points, and performing posterior-probability lattice pruning on the voice points using maximum mutual information estimation to generate a culled offline instruction training set.
4. The method according to claim 1, wherein the offline speech recognition models of the plurality of scenes are evaluated based on a preset precision evaluation system to determine an optimal offline speech recognition model, and the preset precision evaluation system is implemented as follows:
calculating, for each piece of speech data in the offline instruction training set of each target scene, the hit probability of its target speech word;
calculating the cross entropy of the hit probabilities of the target speech words of the pieces of speech data, and determining the ambiguity of the offline instruction training set of each target scene based on the cross entropy; and
forming the precision evaluation system from the hit probabilities of the target speech words of the pieces of speech data and the ambiguity of the offline instruction training set of each target scene.
5. A speech recognition system, the system comprising:
The preprocessing module is used for collecting training voice data of a plurality of target scenes, and preprocessing the training voice data based on the environmental characteristics of each target scene to generate at least one offline instruction training set;
the offline speech recognition models of the multiple scenes are constructed by training the offline instruction training set based on an extended Baum-Welch algorithm, wherein the extended Baum-Welch algorithm is a Baum-Welch algorithm extended with the correlation between M adjacent speech signal frames of the training voice data, M being a natural number;
the offline instruction training set is analyzed to generate a speech signal sequence; the speech signal sequence processed by the extended Baum-Welch algorithm is taken as an initial training parameter, and the offline speech recognition models of the plurality of scenes are constructed by: calculating M-1 distance values between the speech feature values of the M adjacent speech signal frames of the training voice data; extracting the N largest values from the M-1 distance values and taking the speech signal frames corresponding to these maximum values as demarcation points, wherein N is the number of hidden states of the offline speech recognition model of each scene; generating a plurality of hidden state sequences in which each of the N speech signal segments corresponds to one training state of the offline speech recognition model of each scene; and performing iterative training with the hidden state sequences to construct the offline speech recognition models of the plurality of scenes;
the optimal offline speech recognition model is determined by evaluating the offline speech recognition models of the plurality of scenes based on a preset precision evaluation system; and
a voice recognition output module is used for responding to voice recognition instructions of the plurality of scenes and outputting scene voice recognition results by matching the associated optimal offline speech recognition model.
6. The speech recognition system of claim 5, wherein the number of hidden states of the speech recognition model is different for each scene.
7. The speech recognition system of claim 6, wherein, for the optimal offline speech recognition model, the preset precision evaluation system is implemented as follows:
calculating, for each piece of speech data in the offline instruction training set of each target scene, the hit probability of its target speech word;
calculating the cross entropy of the hit probabilities of the target speech words of the pieces of speech data, and determining the ambiguity of the offline instruction training set of each target scene based on the cross entropy; and
forming the precision evaluation system from the hit probabilities of the target speech words of the pieces of speech data and the ambiguity of the offline instruction training set of each target scene.
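As an illustration of the construction step recited in claims 1 and 5, the following Python sketch shows one possible reading of the frame segmentation that seeds the iterative training; it is not the patented implementation. The Euclidean distance between per-frame feature vectors, the choice of N-1 interior demarcation points (so that exactly N segments, one per hidden state, result), and all function and variable names are assumptions made for the sketch only.

```python
import numpy as np

def initial_state_sequence(features: np.ndarray, n_states: int) -> np.ndarray:
    """Assign each speech frame to one of n_states initial hidden states."""
    m = features.shape[0]
    assert n_states >= 2 and m > n_states, "need more frames than states"

    # M-1 distance values between the feature vectors of adjacent frames
    # (Euclidean distance is an assumption; the claims do not fix the metric).
    dists = np.linalg.norm(np.diff(features, axis=0), axis=1)

    # Frames following the largest adjacent-frame distances act as demarcation
    # points; N-1 interior boundaries yield exactly N segments, one per state.
    boundaries = np.sort(np.argsort(dists)[-(n_states - 1):] + 1)

    # Map each of the N segments, in order, to one training state.
    states = np.empty(m, dtype=int)
    for state, (start, end) in enumerate(zip(np.r_[0, boundaries],
                                             np.r_[boundaries, m])):
        states[start:end] = state
    return states

# Toy usage: 200 frames of 13-dimensional features (e.g. MFCCs), 5 hidden states.
frames = np.random.randn(200, 13)
print(initial_state_sequence(frames, n_states=5))
```

The resulting per-frame state assignment can then serve as the initial alignment from which the extended Baum-Welch re-estimation of claims 1 and 5 iterates.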
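Claims 4 and 7 relate the hit probability of each target speech word, the cross entropy of those probabilities, and the ambiguity of a scene's offline instruction training set. The short sketch below is a hedged illustration that reads the claimed "ambiguity" as the perplexity derived from the cross entropy; the base-2 logarithm, the averaging over utterances, and the function name are assumptions, as the claims do not fix them.

```python
import math

def training_set_ambiguity(hit_probabilities):
    """Cross entropy and perplexity of one scene's offline instruction training set."""
    n = len(hit_probabilities)
    # Cross entropy: average negative log-probability of hitting the target word.
    cross_entropy = -sum(math.log2(p) for p in hit_probabilities) / n
    # "Ambiguity" read here as the perplexity derived from the cross entropy.
    perplexity = 2.0 ** cross_entropy
    return cross_entropy, perplexity

# A scene whose target words are recognized with high probability scores a low
# perplexity, which the precision evaluation system can use to rank scene models.
print(training_set_ambiguity([0.9, 0.8, 0.95, 0.7]))
```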
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311627049.5A CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311627049.5A CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117456987A (en) | 2024-01-26
CN117456987B (en) | 2024-06-21
Family
ID=89583770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311627049.5A Active CN117456987B (en) | 2023-11-29 | 2023-11-29 | Voice recognition method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117456987B (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101281745B (en) * | 2008-05-23 | 2011-08-10 | 深圳市北科瑞声科技有限公司 | Interactive system for vehicle-mounted voice |
GB2480084B (en) * | 2010-05-05 | 2012-08-08 | Toshiba Res Europ Ltd | A speech processing system and method |
CN106157950A (en) * | 2016-09-29 | 2016-11-23 | 合肥华凌股份有限公司 | Speech control system and awakening method, Rouser and household electrical appliances, coprocessor |
CN111613212B (en) * | 2020-05-13 | 2023-10-31 | 携程旅游信息技术(上海)有限公司 | Speech recognition method, system, electronic device and storage medium |
CN114530141A (en) * | 2020-11-23 | 2022-05-24 | 北京航空航天大学 | Chinese and English mixed offline voice keyword recognition method under specific scene and system implementation thereof |
- 2023-11-29: CN application CN202311627049.5A, patent CN117456987B (en), status Active
Non-Patent Citations (2)
Title |
---|
Neural networks for statistical recognition of continuous speech; Nelson Morgan et al.; Proceedings of the IEEE; 1995-05-31; Vol. 83, No. 5; full text *
A speech recognition model for power system dispatching control based on a deep neural network; Hu Xiang et al.; Electron Devices (电子器件); 2023-02-28; Vol. 46, No. 1; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117456987A (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111775B (en) | Streaming voice recognition method, device, equipment and storage medium | |
CN109817213B (en) | Method, device and equipment for performing voice recognition on self-adaptive language | |
CN110473531B (en) | Voice recognition method, device, electronic equipment, system and storage medium | |
US6718303B2 (en) | Apparatus and method for automatically generating punctuation marks in continuous speech recognition | |
CN110610707B (en) | Voice keyword recognition method and device, electronic equipment and storage medium | |
EP4018437B1 (en) | Optimizing a keyword spotting system | |
CN108428446A (en) | Audio recognition method and device | |
CN106935239A (en) | The construction method and device of a kind of pronunciation dictionary | |
CN112017645B (en) | Voice recognition method and device | |
CN111862942B (en) | Method and system for training mixed speech recognition model of Mandarin and Sichuan | |
CN109036471B (en) | Voice endpoint detection method and device | |
CN107871499B (en) | Speech recognition method, system, computer device and computer-readable storage medium | |
CN110634469B (en) | Speech signal processing method and device based on artificial intelligence and storage medium | |
CN112017690B (en) | Audio processing method, device, equipment and medium | |
CN110853669B (en) | Audio identification method, device and equipment | |
CN114333828A (en) | Quick voice recognition system for digital product | |
CN117765932A (en) | Speech recognition method, device, electronic equipment and storage medium | |
CN110570838B (en) | Voice stream processing method and device | |
CN117456987B (en) | Voice recognition method and system | |
CN115512692B (en) | Voice recognition method, device, equipment and storage medium | |
CN111640423A (en) | Word boundary estimation method and device and electronic equipment | |
CN114783410B (en) | Speech synthesis method, system, electronic device and storage medium | |
CN115762500A (en) | Voice processing method, device, equipment and storage medium | |
CN111833869B (en) | Voice interaction method and system applied to urban brain | |
CN114792521A (en) | Intelligent answering method and device based on voice recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |